An Empirical Comparison of Machine-Learning Methods on Bank Client Credit Assessments

Machine learning and artificial intelligence have achieved human-level performance in many application domains, including image classification, speech recognition and machine translation. In the financial domain, however, expert-based credit risk models still dominate. Establishing meaningful benchmarks and comparisons between machine-learning approaches and human expert-based models is a prerequisite for introducing novel methods. Therefore, our main goal in this study is to establish a new benchmark using real consumer data and to provide machine-learning approaches that can serve as a baseline on this benchmark. We performed an extensive comparison between the machine-learning approaches and a human expert-based model, the FICO credit scoring system, using Survey of Consumer Finances (SCF) data. As the SCF data is non-synthetic and consists of a large number of real variables, we applied and compared two variable-selection methods to select the most representative features for effective modelling: the first uses hypothesis tests, correlation and random forest-based feature importance measures, and the second is a random forest-based new approach (NAP). We then built regression models based on various machine-learning algorithms, ranging from logistic regression and support vector machines to an ensemble of gradient boosted trees and deep neural networks. Our results demonstrate that if lending institutions in 2001 had used their own credit scoring model constructed by the machine-learning methods explored in this study, their expected credit losses would have been lower, and they would have been more sustainable. In addition, the deep neural networks and XGBoost algorithms trained on the subset selected by NAP achieve the highest area under the curve (AUC) and accuracy, respectively.


Introduction
For lending institutions, credit scoring systems aim to provide the probability of default (PD) for their clients and to satisfy a minimum-loss principle for their sustainability. A credit scoring system therefore supports decision making for credit applications, manages credit risks and influences the amount of non-performing loans, which are likely to lead to bankruptcy and financial crisis and to affect environmental sustainability.
In the last decade, although credit officers or expert-based credit scoring models have determined whether borrowers can fulfill their requirements, this has changed over time with technological advances. This change calls for an automated credit decision-making system that avoids lost opportunities and credit losses, reducing potential loss for each lending institution. In recent years, automated credit scoring has therefore become crucial because of the growing number of financial services that operate without human involvement. An example of such a service is the recent establishment of the first internet-only banking firm in South Korea [1]. In other words, the use of technology and automation to reduce operating costs for modern lending institutions requires the development of an accurate credit scoring model. Although it is extremely difficult to build an efficient model for estimating clients' creditworthiness, machine learning now plays a vital role in credit scoring applications. A line of work has studied automated credit scoring as a binary classification problem in the machine-learning context. Existing studies have used data-mining techniques and machine-learning algorithms such as discriminant analysis [2], neural networks [3], support vector machines [4], decision trees [5], logistic regression [6], fuzzy logic [7], genetic algorithms [8], Bayesian networks [9], hybrid methods [10,11] and ensemble methods [12]. In addition, numerous authors have proposed feature-selection methods for credit scoring, such as wrapper feature-selection algorithms [13], the Wald statistic using a chi-square test [14], evolutionary feature selection with correlation [15], hybrid feature-selection methods [16] and multi-stage feature selection based on a genetic algorithm [17].
Unfortunately, prior work focused only on performance in binary credit classification, which is inefficient and impractical from the perspective of banking risk management. The predictive accuracy of the estimated PD can be more valuable and expressive than the output of a binary classifier, i.e., credible or not credible clients [18]. Furthermore, the regulatory organizations for lending institutions require PDs with internal ratings or credit ratings rather than performance in simple binary credit classification. For example, lending institutions that follow the International Financial Reporting Standards (IFRS) have to perform a multi-class credit rating to assess the PD and loss given default (LGD) for loan loss provisions on each credit rating [19]. Likewise, the Basel Committee, the international committee of banking supervisory authorities, recommends performing internal credit ratings [20].
In addition, previous studies were mainly built upon the German (1994), Australian (1992), Japanese (1992) and other available datasets [21]. Louzada [22] found that nearly 45% of all reviewed papers relating to the theory and application of binary credit scoring used the Australian or German credit dataset. Although these datasets can be viewed as benchmarks in artificial intelligence, they do not represent a realistic setup because they have a limited number of variables. Without realistic data established for benchmarking, it is nontrivial to provide a direct comparison between machine-learning and expert-based models. More recently, Xia [23] also highlighted that finding other public datasets for the credit scoring problem is still difficult. This indicates how difficult it is to obtain credit scoring datasets, since credit scoring databases must be kept confidential.
A small number of studies have used real-life credit scoring datasets, but these datasets are not available for retrieval and analysis [24][25][26][27]. For example, drawing on the expert opinions of bank managers in Taiwan, Chen [24] discussed the evaluation and selection factors for client credit granting quality and adopted the Decision-Making Trial and Evaluation Laboratory method to compare and analyze the similarities and differences in a bank's evaluation of client traits, abilities, financial resources, collaterals and other dimensions (criteria). Dinh [25] developed an econometric credit scoring model using a dataset from Vietnam's commercial banks. Jacobson [26] proposed a method to estimate portfolio credit risk using bivariate probit regression based on a Swedish consumer credit dataset.
To summarize, although many studies have applied various machine-learning algorithms to credit scoring, none of them compared their performance to human expert-based models, because benchmark datasets for this comparison are rare. However, establishing meaningful benchmarks and comparisons between machine-learning approaches and human expert-based models is a prerequisite for introducing novel methods.
In this study, our main goal is to establish a new benchmark using real consumer data and to provide machine-learning approaches that can serve as a baseline on this benchmark. The contribution of this study is to introduce a more realistic setting in order to fill the gap between experimental studies from the literature and the demanding needs of lending institutions. To this end, we explore an open-source dataset as a benchmark for comparing credit scoring applications in the real world, and we describe the existing credit scoring system and the evaluation metric used for comparison. More specifically, we explored the Survey of Consumer Finances (SCF) data, a survey of U.S. families retrieved from The Federal Reserve [28]. The SCF dataset contains a large number of variables carrying a variety of useful information that can be directly interpreted in a credit scoring system, such as types of credit used, credit history, demographics, attitudes, income, capital gains, expenditures and assets [29]. A description of the variables is given in the Supplementary Material.
Since we use SCF data drawn from the U.S. population to construct machine-learning-based credit scoring models, they can be compared with FICO credit scores, the industry standard for measuring consumer credit risk in the U.S. [30]. However, in order to perform this empirical comparison, we have to consider a few limitations as follows:

• The distribution of SCF data and FICO credit scores may be slightly different. Therefore, we resampled several times from the test dataset to generate a distribution equivalent to that of FICO credit scores.

• The estimated PD for the overall population of FICO credit scores is not necessarily the same as for those who have debt in the sampled SCF data. To address this issue, Arezzo [31] introduced response-based sampling schemes in the context of binary response models with sample selection. This study, however, did not use them due to the lack of data. Instead, we grouped the clients into eight ratings, the same as FICO credit scores, based on their estimated PD and computed the average PD for each credit rating, which may reduce the bias.
We used a variety of machine-learning methods, namely Logistic Regression (LR), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Artificial Neural Network (ANN), for comparison with the FICO credit scoring system [32][33][34][35][36][37]. In addition, two variable-selection algorithms were used to extract informative features from the high-dimensional social survey data. The first is a two-stage filter feature selection (TSFFS) consisting of the t-test, chi-square test, correlation and random-forest feature ranking. The second is a random forest-based new approach (NAP), introduced by Hapfelmeier [38] as an extension of the random forest variable-selection approach; it is based on the theoretical framework of permutation tests and meets important statistical properties. Model performance on the test set was evaluated with five measures: AUC, h-measure, true positive rate (TPR), false positive rate (FPR) and accuracy [39].
To compare machine-learning models with FICO credit scores empirically, we then calculated the cumulative Expected Credit Loss (ECL) according to IFRS 9 on each credit rating [40]. The cumulative ECL is a practical measure for estimating average credit losses from the probability of default. The experimental results show that if lending institutions in 2001 had used their own credit scoring model constructed by machine-learning approaches, their expected credit losses would have been lower, and they would have been more sustainable. The prediction performance of deep neural networks and the XGBoost algorithm is superior to that of the other models on the subset selected by the NAP method. This confirms that those models and the NAP feature-selection method are effective and appropriate for credit scoring systems.
This paper is organized as follows. In Section 2, we introduce our proposed framework, the SCF dataset and the strategy for comparison with FICO credit scores. Section 2 also covers the methods, including feature-selection algorithms, machine-learning approaches and the cumulative ECL evaluation metric. Section 3 presents data pre-processing, the results of the feature-selection algorithms and the empirical comparison of performance. Finally, Sections 4 and 5 present the discussion and summarize the general findings of this study.

Proposed Framework
The overall architecture of our proposed credit scoring system consists of three phases (Figure 1). First, the SCF data is pre-processed. In the second phase, we apply the TSFFS and NAP algorithms to choose the most representative feature subsets, containing the most effective and least redundant variables. In the final phase, the selected feature subsets are used to train machine-learning algorithms that construct credit scoring models. We then perform an extensive comparison between the machine-learning models and a human expert-based model to determine whether those algorithms can be used in a credit scoring system. Machine-learning models trained on the two subsets selected by the TSFFS and NAP feature-selection methods are also compared with each other to find the appropriate algorithms for credit scoring.

SCF Dataset
The dataset is retrieved from The Federal Reserve's normally triennial cross-sectional survey of U.S. families [28]. The SCF contains information about families' balance sheets, pensions, income, demographic characteristics and the borrower's attitudes. Zhang [29] noted that the SCF dataset has established an excellent foundation for the household payment problem. This dataset is therefore well suited to investigating credit scoring techniques and methodologies. We used SCF (1998) as the training set and SCF (2001) as the test set to build credit scoring models. The SCF (1998) and SCF (2001) datasets are summarized in Table 1; the training and test datasets contain 4113 and 4245 observations, respectively. Each observation contains 345 variables and a dependent variable. Notably, from 1983, the SCF survey started to provide information obtained from borrowers about their debt repayment behavior. Prior to the SCF, most information about delinquent debt repayment came from lenders [41]. Therefore, we chose the delinquent debt repayment variable (LATE) as the dependent variable. If a household had no late debt payments, LATE is "no" (coded 0); otherwise, LATE is "yes" (coded 1). In addition, these two survey waves offer the advantage that a model trained on the SCF 1998 dataset can be evaluated on the SCF 2001 dataset, and they make it possible to discover new variables that can explain a household's creditworthiness.

In this study, the regression-type variants of well-known classification methods are used to estimate the borrowers' PD. Since the predicted dependent variable is expressed as a probability reflecting the borrower's creditworthiness, clients can be grouped into any number of categories based on the estimated PD. Machine-learning models and FICO credit scores can be compared when their distributions are equivalent; since the FICO credit score results (Table 2) and the SCF data come from the same population, such a comparison is possible [29]. The PDs predicted by the machine-learning approaches were then grouped into eight categories equivalent to the standard grouping of FICO scores, which has become the de facto standard. This grouping was made by Fair, Isaac and Company, a well-known data analytics company focused on consumer credit scoring in the U.S. [42]. To compare our results with FICO credit scores, we use the percentage of FICO's population (column 3 of Table 2) to determine the cut-off values that separate credit categories. The cumulative ECL of each credit category is then calculated on the test set.
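As an illustration of the grouping step described above, the following minimal sketch turns population shares into rating assignments and per-rating average PDs. The shares and PDs are made-up placeholders (four ratings instead of eight, and not the FICO figures), and `assign_ratings` and `average_pd_by_rating` are hypothetical helper names rather than the authors' code.

```python
from itertools import accumulate

def assign_ratings(pds, shares):
    """Assign rating 0 (lowest PD) .. len(shares)-1 (highest PD) so that
    each rating holds roughly the given share of clients."""
    n = len(pds)
    order = sorted(range(n), key=lambda i: pds[i])   # safest clients first
    # Cumulative shares become index boundaries in the sorted list.
    bounds = [round(c * n) for c in accumulate(shares)]
    bounds[-1] = n                                   # guard against rounding
    ratings = [0] * n
    start = 0
    for rating, end in enumerate(bounds):
        for i in order[start:end]:
            ratings[i] = rating
        start = end
    return ratings

def average_pd_by_rating(pds, ratings, n_ratings):
    """Average estimated PD of the clients in each rating (None if empty)."""
    sums = [0.0] * n_ratings
    counts = [0] * n_ratings
    for p, r in zip(pds, ratings):
        sums[r] += p
        counts[r] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical predicted PDs for ten clients and four illustrative shares.
pds = [0.02, 0.40, 0.10, 0.05, 0.80, 0.01, 0.30, 0.15, 0.60, 0.03]
shares = [0.4, 0.3, 0.2, 0.1]
ratings = assign_ratings(pds, shares)
avg_pd = average_pd_by_rating(pds, ratings, len(shares))
```

Sorting once and cutting at cumulative shares is equivalent to choosing PD cut-off values at the corresponding quantiles, which keeps the rating distribution aligned with the reference population.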

Feature-Selection Algorithms
The investigated survey dataset is high dimensional. Accordingly, we used feature-selection algorithms to reduce the computational cost and choose the most informative variables. In this study, we present the TSFFS algorithm and adapt the NAP method for variable selection.
A two-stage filter feature selection (TSFFS): The TSFFS algorithm is implemented in two main steps. In the first step, to remove redundant and irrelevant variables, we assess the significance of each variable using two hypothesis tests: the t-test for continuous variables [43] and the chi-square test for categorical variables [44]. Hypothesis tests are a standard tool of quantitative research in the social sciences. Here they assess whether independent variables provide statistically significant information about clients' creditworthiness. In other words, for the tested variable, rejecting the null hypothesis means that the distributions of good and bad borrowers differ, so the variable is believed to have a significant effect on clients' creditworthiness.
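The first TSFFS step can be sketched in plain Python as below. This is only an illustration under stated assumptions: the data are invented, the 0.05 significance level is assumed, the t-test uses Welch's statistic with a large-sample normal approximation, and the chi-square test is restricted to a 2x2 table, where the p-value has a closed form. In practice a statistics library (for example `scipy.stats.ttest_ind` and `scipy.stats.chi2_contingency`) would normally be used instead.

```python
import math
from statistics import mean, variance

def welch_t_pvalue(good, bad):
    """Two-sided p-value for a difference in means between two groups.
    Welch's t statistic with a normal approximation, which is only
    reasonable for large samples."""
    se = math.sqrt(variance(good) / len(good) + variance(bad) / len(bad))
    t = (mean(good) - mean(bad)) / se
    return math.erfc(abs(t) / math.sqrt(2))

def chi2_pvalue_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.
    With 1 degree of freedom the p-value equals erfc(sqrt(stat / 2))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical data: a continuous variable for good vs. bad borrowers, and
# a 2x2 table of a binary attribute (rows) against LATE = 0/1 (columns).
income_good = [52, 61, 58, 70, 66, 59, 63, 68, 57, 64]
income_bad = [41, 45, 39, 50, 44, 47, 42, 48, 40, 46]
p_cont = welch_t_pvalue(income_good, income_bad)
p_cat = chi2_pvalue_2x2([[30, 10], [15, 25]])

# A variable survives the first step when the null hypothesis is rejected.
keep_cont = p_cont < 0.05
keep_cat = p_cat < 0.05
```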
In the second step, we eliminate the less important variable from each group of similar variables based on random forest feature importance and correlation, as shown in Figure 2. Random forest-based variable importance is an appropriate way to determine which variables are most relevant to the dependent variable, for both discrete and continuous variables. The correlation coefficient indicates the similarity between two variables. In this step, if two explanatory variables are highly correlated with each other, we compare their random-forest feature importance and keep the more important one. This step helps avoid multicollinearity, a situation in which two or more explanatory variables in a multiple regression model are highly linearly related [45]. After selecting variables, the variance inflation factor (VIF) is used to quantify the severity of multicollinearity; it measures how much the variance of an estimated regression coefficient is inflated because of multicollinearity in the model [46].
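The second TSFFS step might look like the sketch below. The variable names, the importance scores and the 0.8 correlation threshold are all assumptions made for illustration; in the paper the importance scores come from a random forest, whereas here they are simply given as a dictionary.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, importance, threshold=0.8):
    """Visit variables from most to least important and keep a variable
    only if it is not highly correlated with one that is already kept."""
    kept = []
    for name in sorted(columns, key=importance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical variables; "wages" nearly duplicates "income".
columns = {
    "income": [30, 45, 50, 62, 75, 80],
    "wages": [28, 44, 47, 60, 70, 79],
    "n_late_bills": [3, 0, 2, 1, 0, 1],
}
# Hypothetical importance scores standing in for random forest importance.
importance = {"income": 0.40, "wages": 0.25, "n_late_bills": 0.35}
selected = drop_correlated(columns, importance)

# With exactly two predictors, the VIF reduces to 1 / (1 - r^2).
r = pearson(columns["income"], columns["n_late_bills"])
vif = 1.0 / (1.0 - r ** 2)
```

Here `selected` keeps `income` and `n_late_bills` and drops `wages`, because `wages` is almost perfectly correlated with the more important `income`.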
Sustainability 2019, 11, x FOR PEER REVIEW 6 of 23
A random forest-based new approach (NAP): This approach to variable selection was presented by Hapfelmeier [38] as an extension of the random forest feature-selection algorithm. Although random forest measures variable importance, it cannot answer the question "Which variables are related to some other independent variables or to the dependent variable?" NAP uses a permutation test framework to assess a null hypothesis of independence between the dependent variable Y and the multidimensional vector of variables X, in order to distinguish relevant from irrelevant variables. The NAP algorithm is implemented as follows:
1. Compute the random forest importance measure using the training set.
2. To assess the empirical distribution of each variable's random forest importance measure under the null hypothesis, permute each variable separately and several times.
3. Assess the p-value for each variable by means of the empirical distributions and the random forest importance measures.
4. Choose the variables whose Bonferroni-adjusted p-value is lower than a certain threshold.
The authors compared NAP with eight other popular variable-selection methods in three simulation studies and four real data applications. The results showed that NAP provided higher power to distinguish relevant from irrelevant variables and led to models that are among the very best performing ones.
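The four steps above can be sketched as follows. To keep the example self-contained, the importance measure is swapped for the absolute Pearson correlation with the response; this is a stand-in and not the random forest importance used by NAP. The function names, the 200 permutations, the 0.05 threshold and the simulated data are likewise assumptions for illustration.

```python
import math
import random

def abs_corr(x, y):
    """Stand-in importance measure (NOT random forest importance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)

def permutation_select(columns, y, n_perm=200, alpha=0.05, seed=0):
    rng = random.Random(seed)
    p_values = {}
    for name, x in columns.items():
        observed = abs_corr(x, y)                 # step 1: observed importance
        exceed = 0
        for _ in range(n_perm):                   # step 2: null distribution
            shuffled = x[:]
            rng.shuffle(shuffled)
            if abs_corr(shuffled, y) >= observed:
                exceed += 1
        # step 3: empirical p-value (the +1 avoids a p-value of exactly zero)
        p_values[name] = (exceed + 1) / (n_perm + 1)
    m = len(columns)
    # step 4: Bonferroni adjustment multiplies each p-value by the number of tests
    selected = [k for k, p in p_values.items() if min(1.0, p * m) < alpha]
    return selected, p_values

# Simulated data: "signal" drives y, "noise" is unrelated to it.
data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(100)]
noise = [data_rng.gauss(0, 1) for _ in range(100)]
y = [s + 0.1 * data_rng.gauss(0, 1) for s in signal]
selected, p_values = permutation_select({"signal": signal, "noise": noise}, y)
```

With 200 permutations the smallest attainable p-value is 1/201, so the number of permutations bounds how many variables can survive a Bonferroni correction at a given threshold.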

Machine-Learning Approaches
Following Louzada [22], the LR, MARS, SVM, RF, XGBoost and ANN machine-learning approaches were chosen for comparison with the FICO credit scoring system.
Logistic Regression (LR): Most previous studies compared their proposed method with LR in order to demonstrate its strengths and achievements [6,11,23,47,48,49]. This indicates that LR can serve as a benchmark in the credit scoring problem [47]. LR estimates the conditional probability of a borrower's default and explains the relationship between clients' creditworthiness and the explanatory variables. To build a model, LR estimates a linear combination of the explanatory variables X for the binary dependent variable Y and converts the resulting log-odds to a probability using the logistic function. The LR formula is:
$$p(x) = P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}$$
Maximum likelihood estimation is usually used to estimate the regression coefficients. For each data point, we have the explanatory vector x and the binary dependent variable y. The probability of the dependent variable is either p(x), if y = 1, or 1 − p(x), if y = 0. The likelihood is then written as:
$$L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$
Advanced machine-learning techniques are quickly gaining applications throughout the financial services industry, transforming the treatment of large and complex datasets, but there is a huge gap between their ability to build powerful predictive models and the ability to understand and manage those models [50]. LR is commonly used in practice because it bridges this gap. However, the predictive power of LR seems to be weaker than that of more advanced machine-learning algorithms.
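To make the logistic function and the likelihood concrete, here is a minimal sketch that evaluates the log-likelihood for given coefficients. The data, the coefficient values and the function names `predict_pd` and `log_likelihood` are hypothetical; no fitting is performed, only evaluation.

```python
import math

def predict_pd(x, beta0, beta):
    """Logistic function applied to the linear predictor beta0 + beta . x"""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, beta0, beta):
    """Sum of log p(x) for y = 1 and log(1 - p(x)) for y = 0."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_pd(xi, beta0, beta)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

# Hypothetical data: two explanatory variables, four borrowers.
X = [[0.2, 1.0], [1.5, 0.0], [0.7, 1.0], [2.0, 0.0]]
y = [0, 1, 0, 1]

ll_zero = log_likelihood(X, y, 0.0, [0.0, 0.0])    # every PD equals 0.5
ll_fit = log_likelihood(X, y, -1.0, [1.5, -1.0])   # coefficients that separate the classes better
```

Maximum likelihood estimation searches for the coefficients with the largest log-likelihood; in this toy example the second coefficient set scores higher than the all-zero one.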
Multivariate Adaptive Regression Splines (MARS): This approach has been widely used for prediction and classification problems [51,52]. Lee [11] first introduced a two-stage hybrid credit scoring model using MARS. Although MARS demonstrated the capability of identifying important features, its classification capability was weaker than that of the MLP neural network. Chuang [53] compared five commonly used credit scoring approaches and demonstrated the advantages of MARS, ANNs and Case-Based Reasoning (CBR) for credit analysis. The combination of MARS, ANN and CBR methods showed better performance than each individual method, linear discriminant analysis (LDA), LR, classification and regression tree (CART) and ANN.
MARS is a nonlinear and non-parametric regression technique introduced by Friedman [33] for prediction and classification problems. The modelling process of MARS consists of two phases, the forward and the backward pass. This two-stage approach is based on a "divide and conquer" strategy in which the training set is partitioned into separate piecewise linear segments (splines) of differing gradients (slopes). For explanatory variables X and binary dependent variable Y, the MARS model, which is a linear combination of basis functions $B_i(x)$ and their interactions, is expressed as:
$$f(x) = \sum_{i=1}^{k} c_i B_i(x)$$
where each $B_i(x)$ is a basis function, k is the number of basis functions, and each $c_i$ is a constant coefficient.
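A small sketch can show how such a model is evaluated. It assumes the usual MARS choice of hinge functions max(0, x − t) and max(0, t − x) as basis functions, with interactions formed as products of hinges; the knots, coefficients and helper names are invented, and the intercept is treated as the coefficient of a constant basis function.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) for direction +1,
    max(0, knot - x) for direction -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate a MARS-style model. `terms` is a list of
    (coefficient, [(variable index, knot, direction), ...]); a term with
    several hinges is their product, i.e. an interaction."""
    total = intercept
    for coef, hinges in terms:
        basis = 1.0
        for var, knot, direction in hinges:
            basis *= hinge(x[var], knot, direction)
        total += coef * basis
    return total

# Hypothetical fitted model with two main-effect terms and one interaction.
terms = [
    (0.04, [(0, 2.0, +1)]),                # 0.04 * max(0, x0 - 2)
    (-0.03, [(0, 2.0, -1)]),               # -0.03 * max(0, 2 - x0)
    (0.01, [(0, 2.0, +1), (1, 1.0, +1)]),  # 0.01 * max(0, x0 - 2) * max(0, x1 - 1)
]
pred = mars_predict([3.0, 4.0], 0.10, terms)
```

For the input (3, 4) the active terms contribute 0.04 and 0.03 on top of the intercept 0.10, giving 0.17.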
In the forward pass, MARS repeatedly adds the basis function that yields the maximum reduction in the sum-of-squares residual error. After the forward pass, to build a model with better generalization ability, a backward procedure prunes the model by removing the least useful basis functions one by one until it finds the best sub-model. The Generalized Cross-Validation (GCV) error is the criterion used to compare the performance of sub-models. It is described as:
$$GCV = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2}{\left(1 - \frac{C}{n}\right)^2}$$
where n is the number of instances in the dataset, C is equal to 1 + cd, d is the effective degrees of freedom (the number of independent basis functions) and c is the penalty for adding a basis function.
Support Vector Machine (SVM): SVM has recently been applied in several financial applications, mainly in the area of time-series prediction and classification. Several studies have applied SVM with various feature-selection methods and hyper-parameter tuning algorithms to the credit scoring problem [4,54,55,56]. However, Huang [54] observed that SVMs classify credit applications no more accurately than ANN, decision trees or genetic algorithms (GA), and compared the relative importance of features selected by GA and SVM along with ANN and genetic programming. That study used datasets far smaller and with fewer features than would be used by a financial institution. In this study, we apply SVM to a high-dimensional dataset and compare it with alternative approaches.
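The SVM finds a function that deviates by at most ε from the observed binary dependent variable for each data point, using the ε-insensitive loss [34]. We briefly describe the case of a linear function f(x) for the SVM problem:
$$f(x) = \langle w, x \rangle + b$$
where $x_i$ denotes the independent variables of the i-th of n instances and $y_i$ is its observed binary dependent variable. We can write this as a convex optimization problem that minimizes the error while identifying the hyperplane that maximizes the margin:
$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i - \langle w, x_i \rangle - b \le \varepsilon, \quad \langle w, x_i \rangle + b - y_i \le \varepsilon$$
However, there may be no function f(x) that satisfies these constraints for all observations. Analogously to the "soft margin" loss function [57], one can add slack variables $\xi_i, \xi_i^*$ to cope with otherwise infeasible constraints of the optimization problem:
$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to} \quad y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \quad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$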
Parameter C determines the tradeoff between the model complexity and the degree to which deviations larger than ε are tolerated in the optimization formulation. This optimization problem can be transformed into the dual problem using Lagrange multipliers, and its solution is given by:
$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \, K(x_i, x) + b$$
where $\alpha_i, \alpha_i^*$ are Lagrange multipliers and K is the kernel function (the inner product $\langle x_i, x \rangle$ in the linear case). We use the radial basis function (RBF) kernel for SVM regression in this study.
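The ingredients of this formulation can be illustrated with a few lines of Python. The sketch only evaluates the ε-insensitive loss, the RBF kernel and a prediction from given dual coefficients; it does not solve the optimization problem. The support vectors, coefficients, bias and the kernel width `gamma` are invented values, and the function names are hypothetical.

```python
import math

def eps_insensitive(y_true, y_pred, eps=0.1):
    """Loss is zero inside the eps-tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, b, gamma=0.5):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b, where
    dual_coefs[i] holds the difference alpha_i - alpha_i*."""
    return b + sum(c * rbf_kernel(sv, x, gamma)
                   for sv, c in zip(support_vectors, dual_coefs))

# Hypothetical solution of the dual problem.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.3, -0.2]
pred = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=0.5)

inside_tube = eps_insensitive(1.0, 0.95)   # deviation 0.05 is tolerated
outside_tube = eps_insensitive(1.0, 0.7)   # deviation 0.3 exceeds eps by 0.2
```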
Ensemble Methods: Ensemble procedures combine classifiers, employing multiple techniques to solve the same problem in order to improve credit scoring performance. There are three popular ensemble approaches: bagging [58], boosting [59] and stacking [60]. Bagging (bootstrap aggregating) is a technique in which multiple training sets are generated by bootstrapping, a classifier is trained on each training set, and the predicted class is determined by combining the classification results of the individual classifiers. RF is a bagging algorithm that uses decision trees as the member classifiers.
For the credit scoring problem, numerous studies have also proposed ensemble classifiers, including RF classification [23,61,62,63]. RF often demonstrates better results than other machine-learning methods. To estimate the borrower's PD, RF regression is used in this study. This ensemble regression method averages the results of individual regression trees trained on diversified subsets drawn from the training dataset by bagging. For any numerical predictor h(X), the mean-squared generalization error is:
$$PE^* = E_{X,Y} (Y - h(X))^2$$
where X, Y are random vectors from the training set and h(X) is any numerical predictor. We can define the average generalization error of a tree as:
$$PE^*(\text{tree}) = E_{\Theta} E_{X,Y} (Y - h(X, \Theta))^2$$
where Θ is the random vector associated with a tree. Additionally, the generalization error of the forest is bounded as:
$$PE^*(\text{forest}) \le \bar{\rho} \, PE^*(\text{tree})$$
where $\bar{\rho}$ is the weighted correlation between the residuals $Y - h(X, \Theta)$ and $Y - h(X, \Theta')$, and $\Theta, \Theta'$ are independent.
RF can also be used to rank the importance of variables in a regression using internal out-of-bag (OOB) estimates. As mentioned above, this study used OOB estimates to choose the most important variable among similar variables in the feature-selection procedure.
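The bagging idea behind RF regression can be sketched with one-split regression trees ("stumps") on a single feature. This is a deliberately reduced illustration: a real random forest grows deeper trees and also samples a random subset of features at each split, which this sketch omits. The data, the number of trees and the function names are assumptions.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression tree on one feature (least squares)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                 # all x identical: always predict the mean
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    t, ml, mr = stump
    return ml if x < t else mr

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Regression ensembles average the predictions of their trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Hypothetical data: the response jumps from 0 to 1 between x = 0.4 and 0.6.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
forest = bagged_stumps(xs, ys)
low, high = predict_forest(forest, 0.15), predict_forest(forest, 0.85)
```

Because every stump sees a different bootstrap sample, the averaged prediction lies between 0 and 1 and can be read as an estimated PD.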
Furthermore, Xia [23] recently used the XGBoost algorithm with a Bayesian hyper-parameter optimization method to construct a credit scoring model and reported better classification performance than other machine-learning methods on different benchmark credit scoring datasets. XGBoost is a boosting ensemble algorithm; it optimizes an objective function in which the size of the tree and the magnitude of the weights are controlled by standard regularization parameters. This method uses CART [64] as the base learner. Mathematically, K additive functions $f_k(x)$ are used in tree ensemble models to approximate the function $F_K(x)$, which can be written:
$$F_K(x_i) = \sum_{k=1}^{K} f_k(x_i)$$
where K is the number of trees, $x_i$ is the i-th training instance and $f_k$ represents the decision rules of the k-th tree and the weights of its leaf scores.
The objective function to be optimized is represented by:

$$L(F_K) = \sum_{i=1}^{n} \Psi\big(y_i, F_K(x_i)\big) + \sum_{k=1}^{K} \Omega(f_k)$$

where n is the number of training instances, $y_i$ is the observed target of the i-th instance, $F_K(x_i)$ is the prediction on the i-th instance at the K-th boost, $\Psi(*)$ is a specified loss function, which for regression can be the mean-squared error, and $\Omega(f) = \gamma T + 0.5 \times \lambda \lVert\omega\rVert^{2}$ is the regularization term that penalizes the complexity of the model to avoid overfitting. In the regularization term, γ is the complexity parameter, λ is a constant coefficient, $\lVert\omega\rVert^{2}$ is the L2 norm of the leaf weights and T denotes the number of leaves. Since XGBoost is trained in an additive manner, the prediction $F_k(x_i)$ of the i-th instance at the k-th iteration can be written as:

$$F_k(x_i) = F_{k-1}(x_i) + f_k(x_i)$$

The goal of XGBoost is to find the $f_k$ that minimizes the above objective function using a gradient descent optimization method.
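To make the roles of γ, λ and T concrete, the following is a minimal sketch of the regularized objective for a toy ensemble. It is an illustration under stated assumptions, not the XGBoost implementation: each tree is represented only by its list of leaf weights, the loss is the plain sum of squared errors, and the function names and toy numbers are hypothetical.

```python
def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w^2) for a single tree."""
    num_leaves = len(leaf_weights)  # T, the number of leaves
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees, gamma, lam):
    """Squared-error loss over all instances plus the penalty of every tree."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(regularization(tree, gamma, lam) for tree in trees)
    return loss + penalty


# Hypothetical toy values: two trees described only by their leaf weights.
trees = [[0.5, -0.5], [0.2, 0.1, -0.3]]
value = objective([1.0, 0.0], [0.7, 0.1], trees, gamma=1.0, lam=2.0)
print(round(value, 2))
```

In this toy case the penalty dominates the loss, which shows how larger γ or λ pushes the optimizer toward trees with fewer leaves and smaller leaf weights.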
Artificial Neural Network: Neural networks have been widely used for the credit scoring problem [3,11,48]. West [3] first applied five different neural network architectures to the credit scoring problem and showed that the mixture-of-experts and radial basis function neural network models should be considered for credit scoring applications. More recently, different ANNs have been suggested to tackle the credit scoring problem, namely the probabilistic neural network [65], partial logistic ANN [66], artificial metaplasticity neural network [67] and hybrid neural networks [68]. On some datasets, neural networks achieve the highest average correct classification rate when compared with traditional techniques such as discriminant analysis and LR, although the results were very close [69]. In this study, a multilayer perceptron (MLP) neural network is used to construct the credit scoring model. MLP is a general ANN architecture developed to resemble human brain function (the basic concept of a single perceptron was introduced by Rosenblatt [37]).
MLP consists of three types of layers with distinct roles: input, hidden and output layers. Each layer contains a given number of nodes with an activation function, and nodes in neighboring layers are linked by weights. The optimal weights are obtained by optimizing an objective (loss) function using the backpropagation algorithm. For a single node, the model output and the regularized objective can be written in the general form:

$$\hat{y} = f(\omega^{\top} x + b), \qquad \min_{\omega,\, b} \; \sum_{i} \ell(y_i, \hat{y}_i) + \lambda\Omega(\omega)$$

where ω denotes the vector of weights, x is the vector of inputs, b is the bias, $f(*)$ is the activation function, ℓ is the loss function and $\lambda\Omega(\omega)$ is a regularizer. Several hyper-parameters need to be determined in advance for training the model, such as the number of hidden layers, the number of nodes in each layer, the learning rate, the batch size and the epoch number.
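The layer structure described above can be sketched as a forward pass in a few lines. This is only an illustrative sketch: the helper names `dense` and `forward` are hypothetical, every layer uses the sigmoid activation, and the weights are placeholders rather than trained values.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Pass the input through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = dense(x, weights, biases, sigmoid)
    return x


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer with zero weights
    ([[0.0, 0.0]], [0.0]),                   # output layer with zero weights
]
output = forward([1.0, 2.0], layers)
print(output)
```

With all weights and biases at zero, every node outputs sigmoid(0) = 0.5; backpropagation would then adjust the weights to reduce the loss.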
In a neural network, the choice of optimization algorithm has a significant impact on the training dynamics and task performance. There are many techniques for improving gradient descent optimization, and one of the most widely used optimizers is Adam [70]. Adam computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients and combines the benefits of both the Adaptive Gradient Algorithm and Root Mean Square Propagation. Adam is therefore considered one of the best gradient descent optimization algorithms in deep learning, because it often achieves good results faster than other optimizers [71].
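The moment estimates mentioned above can be illustrated with a single-parameter version of the Adam update. This is a minimal sketch of the published update rule applied to a hypothetical quadratic loss, not the training code used in this study; the function name and the step count are assumptions.

```python
import math


def adam_minimize(grad, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical loss (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0,
                            lr=0.1, steps=2000)
print(theta_final)
```

Dividing by the square root of the second-moment estimate makes the step size roughly independent of the gradient scale, which is the adaptive behavior the text refers to.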
In addition, an Early Stopping algorithm is used to find the optimal epoch number given the other hyper-parameters. This algorithm stops training at the optimal epoch, when the validation error starts to increase, which also helps to avoid overfitting [72]. However, overfitting remains a challenging issue when the neural networks being trained are extremely large or when the domain offers very small amounts of data. If the neural network is extremely large, the model becomes too complex and may no longer be trustworthy.
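The stopping rule can be sketched as follows. This is an assumption-laden illustration: the `patience` parameter (how many non-improving epochs are tolerated) is not specified in the text, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch with the lowest validation loss seen
    before training is stopped."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error stopped improving
    return best_epoch


# Hypothetical validation losses per epoch.
history = [0.70, 0.55, 0.48, 0.47, 0.49, 0.52, 0.45]
print(early_stopping_epoch(history, patience=1))
print(early_stopping_epoch(history, patience=3))
```

A patience of 1 stops at the first increase in validation error, while a larger patience tolerates temporary increases at the cost of longer training.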

A Cumulative Expected Credit Loss
The goal of an evaluation metric is to assess the goodness of fit between a given model and the data; it is used both to build the model and to compare different machine-learning methods in the context of model selection. The AUC, h-measure, TPR, FPR and accuracy are used to evaluate the ability of a machine-learning algorithm to distinguish good and bad borrowers [39].
In practice, however, lending institutions reject or approve borrowers' credit applications depending on their credit scores. For example, people with a credit score of 300-500 on the FICO scale of 300-850 are unlikely to be approved for credit cards and other loans, because their credit score expresses a high credit risk. Therefore, the cumulative ECL can be an important evaluation metric for measuring the performance of a credit scoring model [40]. We used the cumulative ECL to compare our credit scoring models with FICO scores. In addition, using ECL for model comparison makes it possible to choose the credit scoring model with the lowest loss and supports decisions about cut-off credit categories. The cumulative ECL is estimated as follows:

1. We estimate PD for each credit category.
The PD is the most important measurement used in credit risk modelling to assess credit losses [73]. It depends on the borrower's individual characteristics and on macroeconomic factors such as the business cycle, per capita income and unemployment. Furthermore, the PD determines the interest rate for each credit rating as shown in Table 2, creating a link between interest rates and credit risk.
The PD is simply computed as the number of defaulted borrowers divided by the total number of borrowers.

$$PD = \frac{\text{Default borrowers}}{\text{Total number of borrowers}}$$

2. We can write the formula of ECL for each credit rating as:

$$ECL = PD \times EAD \times LGD$$

where EAD is exposure at default and LGD is loss given default. We assume that EAD is expressed as a percentage of the population (percent of portfolio) at each credit rating, and that LGD equals 1 for consumer loans.

3. The cumulative ECL for a credit rating is the sum of the ECLs of all higher credit ratings:

$$CUM\_ECL_k = \sum_{i=1}^{k} ECL_i$$

where $CUM\_ECL_k$ is the k-th credit rating's cumulative ECL, $ECL_i$ is the i-th credit rating's ECL and k is the number of credit categories.
Assuming that a poor credit scoring model raises PD by mispredicting borrowers' creditworthiness, the cumulative ECL increases accordingly. Under this assumption, a lower cumulative ECL indicates better expected performance of the borrowers and suggests that a credit scoring model is more profitable and sustainable.
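The three steps above can be sketched in a few lines. The numbers below are hypothetical (three ratings ordered from best to worst, not the values of Table 2), EAD is taken as the share of the portfolio in each rating and LGD is set to 1, as assumed in the text; the function names are illustrative.

```python
def probability_of_default(defaults, total):
    """PD = defaulted borrowers / total borrowers in a credit rating."""
    return defaults / total


def cumulative_ecl(pd_by_rating, ead_by_rating, lgd=1.0):
    """Return per-rating ECL = PD * EAD * LGD and its running sum,
    with ratings ordered from the best to the worst."""
    ecl = [pd * ead * lgd for pd, ead in zip(pd_by_rating, ead_by_rating)]
    cumulative, running = [], 0.0
    for value in ecl:
        running += value
        cumulative.append(running)
    return ecl, cumulative


# Hypothetical portfolio: (defaults, borrowers) for three ratings.
counts = [(2, 200), (10, 100), (30, 100)]
total_borrowers = sum(n for _, n in counts)
pds = [probability_of_default(d, n) for d, n in counts]
eads = [n / total_borrowers for _, n in counts]  # share of the portfolio
ecl, cum_ecl = cumulative_ecl(pds, eads)
print(ecl)
print(cum_ecl)
```

Reading the cumulative values from the best rating downward shows how much expected loss a lender accepts at each possible cut-off rating.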

Results
In this section, we summarize the data pre-processing, the experimental setup, the results of the TSFFS algorithm and the comparison of experimental results. In particular, Section 3.4 presents an empirical comparison between machine-learning models and FICO credit scores.

Data Pre-Processing and Experimental Setup
The data pre-processing, the results of the TSFFS algorithm and the experimental setup are described in this section. Section 3.1.1 provides the results of data pre-processing, such as variable transformation, variable creation and outlier detection. Section 3.1.2 then introduces the process of hyper-parameter tuning for each machine-learning method.

Data Pre-Processing
In data pre-processing, 1159 instances were dropped from the training set and 1196 from the test set because they have no debt. In addition, 21 new variables were created because they could help explain credit scores, such as total balance of household loans, total number of loans, total number of vehicles, etc. At the outlier detection step, the standard deviation-based outlier detection method was used [74]. In total, 70 and 127 outliers were dropped from the training and test sets, respectively, because the observed values of those instances were higher than the critical value of the log-normal distribution (p-value > 0.05). Finally, the training set contained 2889 (93.9%) good and 187 (6.1%) bad instances, the test set contained 2924 (93.8%) good and 193 (6.2%) bad instances, and both datasets consisted of 361 explanatory variables, as shown in Table 3.

Experimental Setup
The SVM, RF, XGBoost, and MLP methods require hyper-parameter tuning to prevent overfitting and improve model performance. Grid search with 10-fold cross-validation (GS with 10-fold CV) is used to find the optimal hyper-parameters for the SVM, RF and XGBoost algorithms, over the search space summarized in Table 4. For XGBoost and MLP, an Early Stopping algorithm is used to find the optimal epoch number given the other hyper-parameters.
For MLP, the learning rate, batch size and epoch number must be defined in advance to train the model. Since an Early Stopping algorithm is used to find the optimal epoch number, we set the learning rate to 0.0001 and the maximum epoch number to 1000, and use a mini-batch of 32 instances at each iteration. If training stops early, the chosen learning rate and maximum epoch number are adequate for the model, because the objective (loss) function of the neural network has converged before the maximum epoch number is reached.
In this study, we compared six neural network architectures consisting of different numbers of hidden layers and various activation functions. The first three neural networks used the sigmoid activation function and were built with one, three, and five hidden layers of eight nodes each. The other three neural networks used the ReLU activation function for each hidden layer and the softmax function for the output layer; they were also built with one, three, and five hidden layers of eight nodes each.

TSFFS Algorithm
In the first step of the TSFFS algorithm, we considered statistically significant variables based on the t-test and the chi-square test. A two-sample t-test was applied to continuous variables. For example, the total value of aggregate loan balance for home improvement is not related to a client's creditworthiness because there is no statistically significant difference between the means of bad and good borrowers (p-value = 0.127). For categorical variables, the chi-square test of independence was used to compare frequencies between bad and good borrowers. For example, the frequencies of information used for investing decisions (categories: material in mail, TV, radio, advertisements and telemarketer) are similar for both bad and good borrowers (p-value = 0.067). Figure 3 shows the results of these hypothesis tests.
In the second step of the TSFFS algorithm, correlation and random forest feature importance were used to choose the most relevant variable from similar variables. As shown in Figure 4, selected variables are denoted by cyan points. In total, 116 variables were dropped because they have lower importance than other similar variables. In other words, those variables have similar characteristics to the remaining 106 variables. For example, the correlation between total value of financial assets (FIN) and total value of assets (ASSET) is 0.8287, but the random forest importances of FIN and ASSET are 10.40 and 4.13, respectively. In this case FIN is chosen because it is more important for explaining borrowers' creditworthiness.
Accordingly, the 106 variables were retained to train the machine-learning models. Those variables match part of the information that the FICO credit score requires to evaluate a borrower's credit score, such as types of credit, payment history, amounts owed, etc.
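The second-step rule (among highly correlated variables, keep the one with the higher random forest importance) can be sketched as below. The FIN and ASSET values are taken from the example above; the correlation threshold of 0.8, the function name and the extra variable OTHER are assumptions made only for illustration.

```python
def drop_redundant(importance, correlations, threshold=0.8):
    """Keep, from each highly correlated pair, the more important variable.

    importance:   {variable name: random forest importance}
    correlations: {(name_a, name_b): correlation coefficient}
    """
    dropped = set()
    for (a, b), r in correlations.items():
        if abs(r) >= threshold and a not in dropped and b not in dropped:
            # Drop the less important member of the correlated pair.
            dropped.add(a if importance[a] < importance[b] else b)
    return [name for name in importance if name not in dropped]


importance = {"FIN": 10.40, "ASSET": 4.13, "OTHER": 1.00}  # OTHER is hypothetical
correlations = {("FIN", "ASSET"): 0.8287, ("FIN", "OTHER"): 0.10}
selected = drop_redundant(importance, correlations)
print(selected)
```

ASSET is dropped because it is strongly correlated with FIN and less important, while the weakly correlated variable is kept regardless of its importance.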
Finally, we assessed multicollinearity among the selected variables using the VIF based on logistic regression. The VIF measures the collinearity of an independent variable with the other independent variables in the model. In the literature, multicollinearity is not considered an issue in a regression model when VIF values are below a threshold of 5 or 10 [81]. Figure 5 shows the VIF for each selected variable; according to these values, there is no multicollinearity in the model.
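As a reminder of how the thresholds of 5 and 10 relate to collinearity, the sketch below uses the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. This is an assumption about the usual definition, not a description of the exact procedure in this study; with only two predictors R² equals the squared correlation, and the correlation of 0.6 is a hypothetical value.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given coefficient of determination."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def vif_two_predictors(correlation):
    """With exactly two predictors, R^2 is the squared correlation."""
    return vif_from_r_squared(correlation ** 2)


print(vif_two_predictors(0.6))   # hypothetical correlation of 0.6
print(vif_from_r_squared(0.8))   # R^2 = 0.8 reaches the threshold of 5
```

Under this definition, a VIF of 5 corresponds to R² = 0.8 and a VIF of 10 to R² = 0.9.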

NAP Algorithm
Regarding NAP feature selection, since the SCF dataset consists of a large number of variables, each variable was permuted 10 times to measure random forest importance. The Bonferroni-adjusted p-value was then assessed, and the variables with a p-value of less than 0.05 were chosen. The NAP method tests a null hypothesis that answers the question: "Which variables are related to other independent variables or to the dependent variable?" Accordingly, selected variables provide two abilities: a higher predictive strength and being uncorrelated with some other variables. In total, 76 variables were selected by the NAP algorithm; the variables selected by TSFFS and NAP are listed in Tables S1 and S2 in the Supplementary Material.
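The Bonferroni adjustment used in the selection rule can be sketched as follows. Only the adjustment and the 0.05 cut-off come from the text; the raw p-values and variable names are hypothetical, and the permutation step that would produce them is not shown.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Multiply each raw p-value by the number of tests (capped at 1)
    and keep the variables whose adjusted p-value is below alpha."""
    m = len(p_values)
    adjusted = {name: min(1.0, p * m) for name, p in p_values.items()}
    selected = [name for name, p in adjusted.items() if p < alpha]
    return adjusted, selected


# Hypothetical raw p-values from permutation importance tests.
raw = {"v1": 0.001, "v2": 0.02, "v3": 0.4}
adjusted, selected = bonferroni_select(raw)
print(adjusted)
print(selected)
```

Note that `v2` would pass an unadjusted 0.05 threshold but is rejected after adjustment, which is how the correction limits false selections when many variables are tested.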

Evaluation Results
This section presents the evaluation of models built with various machine-learning algorithms and two feature-selection algorithms. The experiments were repeated 10 times to improve robustness, and the evaluation measures are averaged in the comparison of results.
The main aims of the comparison were to evaluate the effectiveness of the various machine-learning algorithms and to determine the best-performing algorithm. These objectives apply to the variable-selection algorithms as well. The LR, MARS, SVM, RF, XGBoost, and MLP approaches were used to train credit scoring models in the experiment. To achieve the best performance, hyper-parameters were optimized for each machine-learning method. The results are summarized in Table 5, with the highest value for each evaluation metric shown in bold.
For the subset selected by the TSFFS algorithm, the MLP with sigmoid model showed the best performance in terms of AUC and h-measure. This model achieved 86.81% AUC and 0.4336 h-measure, which are 0.09% and 0.01 higher than the second-ranked model, MLP with softmax, trained on the same subset. The AUC indicates the ability to classify borrowers as good or bad, whereas the h-measure is better at dealing with cost assumptions among credit classes. In addition, the RF and XGBoost models showed the best performance in terms of TPR, FPR and accuracy. The RF model achieved a TPR of 85.34%, and the XGBoost model achieved an FPR of 13.61% and an accuracy of 93.81%, thereby outperforming the MLP with sigmoid model by 2.8%, 7.5% and 5.7%, respectively.
The NAP variable-selection method performs better than TSFFS because it improved most evaluation metrics of the best TSFFS models, by 0.7% AUC, 2.5% TPR, 2.0% FPR and 1.6% accuracy. In terms of h-measure, NAP could not improve on TSFFS, but it achieved comparable performance for the MLP with softmax model. In addition, the AUC indicates that the Deeper MLP with sigmoid model is the best method, with good separation ability among credit classes. As with the TSFFS subset, the XGBoost model outperformed other models in terms of FPR and accuracy.
Overall, MLP neural networks with sigmoid activation and the XGBoost model showed promising results on most evaluation metrics, indicating that these methods, combined with the NAP variable-selection method, are an appropriate approach to credit scoring.

ROC Curve Analysis
The ability of models to distinguish between good and bad borrowers is evaluated using Receiver Operating Characteristic (ROC) curve analysis. The ROC curve is constructed by plotting the TPR against the FPR over various thresholds. Figure 6 illustrates the ROC curves for the models trained on the subset selected by NAP variable selection. In addition, the calculated TPR and FPR for classifiers can be expressed as an opportunity cost or loan loss for lending institutions [63]. In other words, if lending institutions misclassify good borrowers as bad and refuse to grant loans, this leads to an opportunity cost. In contrast, when bad borrowers are classified as good borrowers, this creates a loss. In our results, the ROC curve of the MLP with sigmoid model is higher, which means lower false positive and false negative rates than all other algorithms. In other words, if lending institutions used the Deeper MLP with sigmoid activation function trained on the subset selected by the NAP variable-selection method to estimate borrowers' credit scores, their opportunity cost and loan losses would be lower than with other models.
According to ROC curve analysis, the best credit scoring model should meet two objectives: maximizing TPR (correctly classifying all good borrowers) and minimizing FPR (incorrectly classifying bad borrowers as good). Since these two objectives cannot both be fully met, lending institutions use multi-class credit scoring, such as FICO credit scores, to balance their profits and losses.
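The quantities used in the ROC analysis can be computed directly from scores and labels. In this sketch, good borrowers are the positive class (following the convention in the text), the score is assumed to be a predicted probability of being a good borrower, and the five labelled scores are hypothetical.

```python
def tpr_fpr(labels, scores, threshold):
    """labels: 1 = good borrower (positive class), 0 = bad borrower.
    A borrower is predicted good when the score reaches the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_auc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule over all thresholds."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = tpr_fpr(labels, scores, threshold)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0]            # hypothetical borrowers
scores = [0.9, 0.8, 0.7, 0.6, 0.2]  # hypothetical predicted probabilities
print(tpr_fpr(labels, scores, 0.7))
print(roc_auc(labels, scores))
```

Lowering the threshold raises both TPR and FPR, which is the trade-off between opportunity cost and loan loss described above.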

Empirical Comparison between Machine-Learning Models and FICO
One of the primary objectives of this study is to build a multi-class credit scoring model comparable to FICO credit scores and to compare the two. In this set of experiments, to compare our results with FICO, we applied the FICO population percentages described in Section 2.1.3 to determine cut-off values for each credit category. In addition, we re-sampled the test dataset 10 times to construct a distribution equivalent to that of the FICO credit scores shown in Table 2. The averaged cumulative ECL was then estimated for each model. Table 6 presents the cumulative ECL of our models built on the two subsets and of the FICO credit scores. For the variable-selection algorithms, the average cumulative ECL of the NAP method outperformed that of the TSFFS method, in line with the theoretical evaluation metrics. Concerning machine-learning methods, the XGBoost model could not outperform other models in terms of cumulative ECL. The Deeper MLP with sigmoid model, however, achieved the lowest cumulative ECL from the C1 rating to the C4 rating. In addition, the cumulative ECL of FICO credit scores was higher than that of most machine-learning models between the C1 and C7 credit ratings. For both FICO scores and machine-learning models, the cumulative ECLs from C1 to C8 are equal because the models are evaluated on the same distribution. This means that machine-learning models assign more bad borrowers to the C8 credit rating than FICO credit scores do.
To summarize, if lending institutions approved loan requests in the C1 to C7 credit ratings predicted by machine-learning models with NAP variable selection using SCF data, their cumulative ECL would be lower than with FICO credit scores (Figure 7). For borrowers in the C1-C7 credit ratings, the MLP with sigmoid activation function trained on the NAP subset showed the lowest cumulative ECL, 7.37%. For other scenarios, the Deeper MLP with sigmoid activation function achieves the lowest cumulative ECL. Overall, this study shows that if lenders in the early 2000s had used their own credit scoring model built by machine-learning methods instead of FICO credit scores, their cumulative ECL would have been lower.
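The way population percentages translate into rating cut-offs can be sketched as follows. This is an assumed reading of the procedure: borrowers are ranked by predicted PD and the ratings are filled in order, with C1 as the best rating. The three shares and ten scores are hypothetical and do not reproduce Table 2, and the function name is illustrative.

```python
def assign_ratings(pd_scores, shares):
    """Rank borrowers from lowest to highest predicted PD and fill the
    rating categories in order, so that category k holds roughly
    shares[k - 1] of all borrowers."""
    n = len(pd_scores)
    order = sorted(range(n), key=lambda i: pd_scores[i])
    ratings = [None] * n
    start, cumulative = 0, 0.0
    for k, share in enumerate(shares, start=1):
        cumulative += share
        # The last category takes every remaining borrower.
        end = n if k == len(shares) else round(cumulative * n)
        for i in order[start:end]:
            ratings[i] = f"C{k}"
        start = end
    return ratings


# Hypothetical predicted PDs and population shares for three ratings.
pd_scores = [0.30, 0.02, 0.10, 0.01, 0.05, 0.50, 0.03, 0.20, 0.04, 0.08]
ratings = assign_ratings(pd_scores, shares=[0.2, 0.5, 0.3])
print(ratings)
```

Because every model is cut at the same population shares, differences in cumulative ECL between models reflect only how well each model orders borrowers by risk.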

Discussion
Credit risk is one of the fundamental risks that a bank or any other financial institution faces when operating in the markets. In particular, lending institutions can manage their sustainability as well as their financial stability by controlling their credit risk. To substantially reduce the potential for credit loss, it is crucial to have a well-designed credit scoring model for bank client credit assessment. In the past, such models were developed by credit experts. Nowadays, machine-learning algorithms are successfully applied to consumer credit scoring without credit experts. However, previous papers that built credit scoring models based on machine-learning approaches have not compared their models with human expert-based credit scoring models. Therefore, this study compared various machine-learning algorithms with FICO credit scores in order to fill this gap in the experimental literature. To do this, we developed a credit scoring model based on the SCF dataset using regression-type machine-learning approaches. We also performed data pre-processing, which consisted of variable creation, transformation and outlier detection.
Most importantly, this study contributed by investigating a more practical model and suggesting an effective evaluation metric for comparisons in real-life applications that focus on consumer credit scoring services. According to the results, machine-learning-based credit scoring models outperformed FICO credit scores in terms of cumulative ECL. It was observed that if lending institutions in the early 2000s had used their own credit scoring model constructed by machine-learning approaches with the NAP variable-selection algorithm on the SCF dataset instead of FICO credit scores, their credit losses would have been lower and they would have been more sustainable. However, the results of the empirical comparison may have a slight bias, because the estimated PD for the overall population of FICO credit scores generally differs from that of those who have debt in the sampled SCF data.

Conclusions
One of the main focuses of lending institutions is an efficient credit scoring model. In the past, such models were developed by human experts, requiring a lot of resources and time. Machine-learning algorithms and artificial intelligence can be used to help the experts and reduce labour. This study compared state-of-the-art machine-learning approaches, combined with two variable-selection algorithms, with FICO credit scores by establishing U.S. family survey data as a practical benchmark dataset.

Figure 1. System design for credit scoring. SCF: Survey of Consumer Finances; RF: Random Forest.
Figure 3 indicates the result of the hypothesis tests; from left to right, the significance level increases and the p-value decreases. In the first step, we retained 222 variables that differ statistically significantly between bad and good borrowers in terms of mean value or frequency. Those variables are represented by cyan points; the non-significant variables are represented by red points.


Figure 3. The results of hypothesis tests.

Figure 4. The result of feature selection using RF feature importance and correlation.

Figure 5. The variance inflation factor for selected variables.

Figure 6. ROC curve comparing the model performances on the subset selected by NAP variable selection method.

Figure 7. Cumulative ECL comparing machine-learning model performances on NAP subset and FICO.

Table 2. U.S. distribution of FICO credit scores and probability of default by FICO credit scores from 2000 to 2002.

Table 4. Searching space of hyper-parameters.

Table 5. The result of machine-learning algorithms.

Table 6. Cumulative ECL for each credit rating, the comparison between FICO and machine-learning models.