A Partially Interpretable Adaptive Softmax Regression for Credit Scoring

Abstract: Credit scoring is the process of determining whether a borrower will be successful or unsuccessful in repaying a loan, using the borrower's qualitative and quantitative characteristics. In recent years, machine learning algorithms have been widely studied for the development of credit scoring models. Although efficiently classifying good and bad borrowers is a core objective of a credit scoring model, there is still a need for models that can explain the relationship between input and output. In this work, we propose a novel partially interpretable adaptive softmax (PIA-Soft) regression model that achieves both state-of-the-art predictive performance and marginal interpretability between input and output. We augment softmax regression with neural networks to make it adaptive for each borrower. Our PIA-Soft model consists of two main components: a linear part (softmax regression) and a non-linear part (neural network). The linear part explains the fundamental relationship between the input and output variables. The non-linear part serves to improve the prediction performance by identifying the non-linear relationships between features for each borrower. Experimental results on public benchmark datasets show that our proposed model not only outperformed the machine learning baselines but also produced explanations that are logically related to the real world.


Introduction
Credit scoring is a numerical expression of a borrower's creditworthiness, estimated by credit experts from applicant information using statistical analysis or machine learning models. In recent years, many machine learning models have been developed to achieve higher predictive accuracy in classifying borrowers as good or bad [1,2]. However, the inability to explain these machine learning models is one of their notable disadvantages. Financial institutions usually want to understand the decision-making process of machine learning models in order to trust them [3,4]. Therefore, there is still a need for a credit scoring model that combines strong predictive performance with interpretability [5,6]. Without model explanations, machine learning algorithms cannot be adopted by financial institutions and would likely not be accepted by consumers [7].
From a machine learning perspective, the credit scoring problem is considered an imbalanced binary classification task because, in real life, the number of bad borrowers tends to be much lower than the number of good borrowers [8][9][10][11]. Because bad borrowers occur infrequently, standard machine learning models misclassify bad borrowers more often than good borrowers.
In this work, we aim to overcome these issues by proposing a novel partially interpretable adaptive softmax regression (PIA-Soft) model augmented by deep neural networks to make its estimated probabilities adaptive for each class (see Figure 1). We first compute a linear transformation of the input variables based on softmax regression to obtain logits for each borrower. Secondly, we apply a neural network (the non-linear part) to augment the logit of each class, making the logits adaptive in order to deal with the imbalance problem. Finally, the sum of the linear and non-linear (neural network output) transformations is fed into the softmax function to compute the probability of each class. The linear part partially explains the fundamental relationship between the input and output variables, and the non-linear part serves to improve the prediction performance by identifying the non-linear relationships between features for each borrower. The PIA-Soft architecture we propose is similar to the residual neural network (ResNet), with the linear transformation acting as a residual block [12]. However, its advantage over the ResNet architecture is that the PIA-Soft model is partially explainable.
To demonstrate the merits of the proposed model, we compare it to high-performance machine learning benchmarks: Logistic Regression, Random Forest, AdaBoost, XGBoost, Neural Network, LightGBM, CatBoost, and TabNet [13][14][15][16][17][18][19][20]. We apply our proposed model to four benchmark real-world credit scoring datasets. The model performance on the test set is evaluated with four measures: area under the curve (AUC), F-score, g-mean, and accuracy [21]. Our proposed model significantly outperformed the machine learning baselines in terms of predictive performance.
In order to evaluate the interpretation of the PIA-Soft model, we compare our result to logistic regression, the most popular white-box approach commonly used in credit scoring applications. Several properties make logistic regression a major benchmark: good predictive accuracy, high-level interpretability, and a fast and easy modeling process [22]. Therefore, we can utilize it to verify the trustworthiness of our proposed model by comparing its unbiased estimated coefficients for the input variables.
In the end, the main contributions of this paper are as follows:

• To achieve high predictive accuracy, model complexity is usually increased, so machine learning models often trade predictive performance against interpretable predictions. We propose a model that offers both high predictive ability and partial explainability.
• Our proposed model is designed to handle the class imbalance problem without sampling techniques.
• We extensively evaluate the PIA-Soft model on four benchmark credit scoring datasets. The experimental results show that PIA-Soft achieves state-of-the-art predictive accuracy against the machine learning baselines.
• Experiments on real-world datasets show that our proposed model can explore the partial relationship between the input and target variables.
This work is organized as follows: Section 2 presents previous research on machine learning models for credit scoring. We introduce the concepts behind the methods explored in this paper and critically evaluate the tools and methodologies available to date. Section 3 describes our proposed model in more detail. Section 4 presents the benchmark datasets and a comparison of experimental results, covering both predictive performance and a comparison of PIA-Soft with logistic regression for model interpretability. Finally, Section 5 concludes and discusses the general findings of this work.

Related Work
During the past decades, machine learning models have been widely used in many real-life applications such as speech recognition, object detection, healthcare, genomics, and many other domains [23]. In credit scoring applications, researchers have applied many types of machine learning algorithms, such as discriminant analysis, logistic regression, linear and quadratic programming, decision trees, and neural networks [1][2][3][4][5][6][7][8][9][10]. We review the machine learning classification algorithms that have been proposed for credit scoring. We also summarize the strengths and weaknesses of current credit scoring models based on machine learning and draw out some practical issues that serve as a foundation for this work.

Benchmark Classification Algorithms
Advanced machine learning techniques are quickly gaining applications throughout the financial services industry, transforming the treatment of large and complex datasets. Still, there is a massive gap between institutions' ability to build robust predictive models and their ability to understand and manage those models [25][26][27][28][29][30]. Logistic regression is a powerful technique that is commonly used in practice because it bridges this gap. Its only major disadvantage is that its predictive ability tends to be weaker than that of other state-of-the-art machine learning models.
Another benchmark machine learning model in this field is the neural network. West [31] first applied five different neural network architectures to the credit scoring problem and showed that the mixture-of-experts and radial basis function neural network models should be considered for credit scoring. Since then, many neural network models have been suggested to tackle the credit scoring problem, such as the probabilistic neural network [32], the partial logistic neural network model [33], the artificial metaplasticity neural network [34], and hybrid neural networks [28]. Neural network models achieved the highest average correct classification rate compared to traditional techniques such as discriminant analysis and logistic regression [35]. Although neural network models achieve higher predictive accuracy on borrowers' creditworthiness, their decision-making process is rarely understood because of the models' black-box nature.
Recently, many ensemble and hybrid techniques with high predictive performance have been proposed for credit scoring applications [36][37][38][39][40]. Ensemble procedures combine classifiers, employing multiple techniques on the same problem in order to boost credit scoring performance. An early work is that of Maher and Abbod [36], who introduced a new classifier combination technique based on a consensus of different machine learning algorithms during the ensemble modeling phase; their technique significantly improved prediction performance over baseline classifiers. Another work proposed an ensemble classification approach based on a supervised clustering algorithm [37]. The authors applied supervised clustering to partition the data samples of each class into several clusters and constructed a specific base classifier for each subset; the outputs of these base classifiers were then combined by weighted voting. The results showed that, compared to other ensemble methods, this approach generates base classifiers with higher diversity and local accuracy and improves the accuracy of credit scoring. In addition, combining deep learning and ensemble techniques improved the predictive performance of credit scoring [38]. Many researchers have also proposed effective imbalanced learning approaches based on multi-stage ensemble frameworks [39,40]. These frameworks usually balance the data in the first stage, after which the ensemble models learn to obtain superior predictions adapted to different imbalance ratios. In our proposed model, a neural network produces an additional logit for each class to make the logits adaptive for dealing with the imbalance problem during the training phase.

Explainable Credit Scoring Model
Another line of research concerns explainable credit scoring models, which aim to clarify how a borrower's score is calculated. Recently, state-of-the-art machine learning models have achieved human-level performance in many fields, making them very popular [3]. Although these models reach high predictive performance, the inability to explain them decreases human trust. Therefore, explainable artificial intelligence (XAI), which aims to make models understandable and trustworthy, has become very popular in the credit scoring field.
Many researchers have made great efforts to improve model understandability and increase human trust. Ribeiro et al. [41] proposed LIME, short for Local Interpretable Model-agnostic Explanations, in an attempt to explain any decision made by a black-box model. LIME explains any classifier's predictions in an interpretable and faithful manner by learning an interpretable model locally around the prediction. The disadvantage of LIME, however, is that its reliance on surrogate models can critically reduce the quality of the explanations provided. Another popular method for explaining black-box models is SHapley Additive exPlanations (SHAP) [42]. SHAP uses Shapley values, which represent the feature importance for a local prediction and are calculated by combining insights from six local feature attribution methods. Shapley values can be misinterpreted, because the Shapley value of a variable is not the difference in the predicted value after removing the variable from the dataset. Many researchers have applied these two methods together with state-of-the-art machine learning algorithms to build explainable models for credit scoring [4,7,43-45].
In addition, the Fair Isaac Corporation (FICO) announced the Explainable Machine Learning Challenge, aiming to generate new research on model explainability in the credit scoring domain [46]. The winners proposed Boolean Rules via Column Generation (BRCG), a new interpretable model for binary classification in which Boolean rules in disjunctive normal form (DNF) or conjunctive normal form (CNF) are learned [47]. Although this model achieved both good classification accuracy and explainability, the authors noted limitations including performance variability and degraded solution quality on large datasets. With regard to credit scoring applications, we first need to understand what kind of model an explainable model is [48]. Although the requirements of an explainable model depend directly on its user, an explainable credit scoring model should answer the following questions: (1) loan officers often want to understand how the borrower's indicators, such as age, income, etc., affect the borrower's credit score; (2) rejected loan applicants want to know why they could not satisfy the lender's requirements; (3) regulators want to understand the general logic used by the model when making its predictions. In order to answer these questions, it is important to measure the impact of each variable on the borrower's default probability. By determining this impact, we can explain the behavior of models by capturing the relationship between the input variables and their direction. To provide these explanations marginally, we attempt to obtain a partial explanation of the model without degrading its predictive performance.

Softmax Regression
Softmax regression is a generalization of logistic regression that handles multiple classes [49]. In this work, in order to produce a linear logit for each class, we use softmax regression for the binary classification task. Assuming the classes are binary, y ∈ {1, 2}, the hypothesis takes the form

h_θ(x) = [ P(y = 1 | x; θ), P(y = 2 | x; θ) ]ᵀ = (1 / Σ_{j=1}^{2} exp(θ^{(j)ᵀ} x)) [ exp(θ^{(1)ᵀ} x), exp(θ^{(2)ᵀ} x) ]ᵀ,

where θ^{(1)}, θ^{(2)} ∈ R^m are the weight parameters of the softmax regression. From here, the cost function is the negative log-likelihood

J(θ) = − Σ_{i=1}^{n} Σ_{j=1}^{2} 1{ y^{(i)} = j } log( exp(θ^{(j)ᵀ} x^{(i)}) / Σ_{l=1}^{2} exp(θ^{(l)ᵀ} x^{(i)}) ).

In our proposed model, we make the linear transformation (logit) θ^{(j)ᵀ} x adaptive using neural networks.
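As a minimal sketch (with illustrative function names, not the authors' code), the hypothesis and cost function above can be written in NumPy:

```python
import numpy as np

def softmax_regression_probs(X, theta):
    """Hypothesis h_theta: softmax over the linear logits theta^{(j)T} x.

    X: (n, m) feature matrix; theta: (m, 2) weights, one column per class.
    """
    logits = X @ theta                                   # (n, 2) linear logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cost(probs, y):
    """Negative log-likelihood: -sum 1{y=j} log p_j, averaged over borrowers."""
    n = len(y)
    return -np.log(probs[np.arange(n), y]).mean()

# With zero weights the two classes are equally likely (p = 0.5 each).
probs = softmax_regression_probs(np.array([[1.0, 0.5], [0.2, -1.0]]), np.zeros((2, 2)))
```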

Neural Networks
We apply a multilayer perceptron (MLP) as the adaptation model that updates the logit of the softmax regression. The MLP is the most commonly used type of feed-forward artificial neural network, developed in loose analogy to human brain function; the basic concept of a single perceptron was introduced by Rosenblatt [17]. The network consists of three layers with different roles: the input, hidden, and output layers. Each layer contains weight parameters that link a given number of neurons, through an activation function, to the neurons in neighboring layers. An MLP with a single hidden layer can be represented as

f(x) = G( ω^{(2)} H( ω^{(1)} x + b^{(1)} ) + b^{(2)} ),

where ω^{(1)}, ω^{(2)} are weight parameters, b^{(1)}, b^{(2)} are bias parameters, and G and H are activation functions. The MLP obtains the optimal weight and bias parameters by optimizing the objective function with the backpropagation algorithm.
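The single-hidden-layer form above can be sketched as follows (a toy illustration; the weight values and the ReLU/identity choices for H and G are assumptions, not the paper's configuration):

```python
import numpy as np

def relu(z):                       # hidden activation H
    return np.maximum(0.0, z)

def mlp_forward(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: f(x) = G(w2 @ H(w1 @ x + b1) + b2),
    with H = ReLU and G = identity, so the output is a raw logit adjustment."""
    hidden = relu(w1 @ x + b1)
    return w2 @ hidden + b2

# Example: the second input is negative, so ReLU zeroes its hidden unit.
out = mlp_forward(np.array([1.0, -2.0]),
                  np.eye(2), np.zeros(2),               # w1, b1
                  np.array([[1.0, 1.0]]), np.zeros(1))  # w2, b2
print(out)  # [1.]
```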

A Partially Interpretable Adaptive Softmax Regression (PIA-Soft)
The overall architecture of adaptive softmax regression for credit scoring is shown in Figure 2. We first compute a linear transformation of the input variables with the weight parameters of the softmax regression to obtain a logit for each observation. We then apply a neural network that augments the logit, adapting it to each observation in order to deal with the imbalance problem. Finally, the sum of the linear transformation and the output of the deep neural network is fed into the softmax function to estimate each class's probability.
The estimated probability of each class is

ŷ = softmax( h_θ(x) + f_ω(x) ),

where h_θ(x) defines the linear transformation (softmax regression) and f_ω(x) defines the non-linear transformation (neural network). In addition, we jointly optimize the softmax regression and the neural network in an end-to-end framework. The loss function for adaptive softmax regression is the cross-entropy over the combined logits:

L(θ, ω) = − Σ_{i=1}^{n} Σ_{j=1}^{2} 1{ y^{(i)} = j } log ŷ_j^{(i)}.
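Putting the two parts together, a minimal forward pass of PIA-Soft can be sketched as below (illustrative shapes and names; the real model is trained end-to-end by backpropagation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def pia_soft_forward(x, theta, w1, b1, w2, b2):
    """PIA-Soft: interpretable linear logits plus an adaptive MLP adjustment."""
    linear_logits = theta.T @ x                        # softmax regression part
    hidden = np.maximum(0.0, w1 @ x + b1)              # MLP hidden layer (ReLU)
    nonlinear_logits = w2 @ hidden + b2                # per-class adjustment
    probs = softmax(linear_logits + nonlinear_logits)
    return probs, linear_logits, nonlinear_logits

# Sanity check: with all weights zero, PIA-Soft reduces to a uniform prediction.
m, h = 4, 8                                            # features, hidden units
probs, lin, non = pia_soft_forward(np.ones(m),
                                   np.zeros((m, 2)),
                                   np.zeros((h, m)), np.zeros(h),
                                   np.zeros((2, h)), np.zeros(2))
```

Keeping the linear and non-linear logits separate is what makes the model partially interpretable: the linear term can be read off per variable, while the MLP term is a per-borrower correction.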

Dataset
Our adaptive softmax regression model is compared with benchmark machine learning algorithms on four real-world credit datasets. Three datasets, namely German, Australian, and Taiwan, are from the UCI repository [50], and the fourth, FICO, is from FICO's explainable machine learning challenge [47]. A summary of all the datasets is presented in Table 1.

Machine Learning Baselines and Hyperparameter Setting
For the PIA-Soft model, we used the same neural network architecture for all datasets: two hidden layers with 32 neurons each. The hyper-parameters, namely the learning rate, batch size, and number of epochs, must be pre-defined to train the model. We set the learning rate to 0.001 and the maximum number of training epochs to 3000, and we use mini-batches of 32 instances at each iteration. An early stopping algorithm is used to find the optimal number of epochs given the other hyper-parameters.
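The early stopping procedure can be sketched as follows (a generic patience-based version; the patience value and tolerance are assumptions, as the paper does not specify them):

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=3000, patience=50, tol=1e-6):
    """Stop when the validation loss has not improved by `tol` for
    `patience` consecutive epochs; return the best epoch and its loss."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)                 # one pass over the mini-batches
        loss = val_loss(epoch)            # evaluate on the validation set
        if loss < best_loss - tol:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break                     # validation loss has plateaued
    return best_epoch, best_loss
```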
For the benchmark models, we consider the following. Logistic regression has been the most widely used method for binary classification tasks [13].
Random Forest classification [14] is an ensemble learning method defined as an aggregation of multiple decision tree classifiers.
AdaBoost classification [15] is a boosting algorithm that focuses on classification problems and aims to combine a set of weak classifiers into a strong one. We use a decision tree classifier as the base estimator.
XGBoost classification [16] is a boosting ensemble algorithm that optimizes the objective function while the size of the trees and the magnitude of the weights are controlled by standard regularization parameters. This method uses Classification and Regression Trees (CART).
LightGBM [17] and CatBoost [18] are fast, distributed, high-performance gradient boosting models based on decision tree algorithms, used for classification and many other machine learning tasks.
The TabNet [19] model is similar to simpler tree-based models while benefiting from high performance almost identical to that of deep neural networks.
For the neural network benchmark, we use an architecture identical to that of the adaptive softmax regression. The hyper-parameters of these baseline classifiers are optimized by random search with 10-fold cross-validation over the parameter settings shown in Table 2.
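As an illustration (not the authors' exact setup; the search space, dataset, and iteration count here are placeholders), a random search with 10-fold cross-validation can be run with scikit-learn's `RandomizedSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a credit dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {          # illustrative search space
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=3,                    # random parameter settings to try
    cv=10,                       # 10-fold cross-validation
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```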
In addition, we apply the most widely used re-sampling techniques to the machine learning baselines on the public datasets. The resampling techniques are as follows. SMOTE (Synthetic Minority Oversampling Technique), the most popular method in this area, generates synthetic samples for the minority class using the k-nearest neighbor (KNN) algorithm [51].
ADASYN (Adaptive Synthetic Sampling) [52] uses a weighted distribution over the minority class instances according to their level of learning difficulty: more synthetic data is generated for minority class instances that are harder to learn than for those that are easier to learn. ROS (Random Oversampling) picks instances from the minority class by random sampling with replacement until the dataset is balanced.
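The simplest of these, random oversampling, can be sketched in a few lines (an illustrative implementation, not the library version used in the experiments):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """ROS: replicate minority-class rows, sampled with replacement,
    until both classes have the same count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    need = counts.max() - counts.min()           # extra rows required
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=need, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# 8 good vs. 2 bad borrowers -> 8 vs. 8 after oversampling.
X_bal, y_bal = random_oversample(np.arange(20.0).reshape(10, 2),
                                 np.array([0] * 8 + [1] * 2))
```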

Comparison of Predictive Performance
This empirical evaluation aims to show that our proposed PIA-Soft model leads to better performance than the industry-benchmark machine learning models across different evaluation metrics. Table 3 displays the performance of the machine learning models on the German dataset. For the German dataset (see Table 3), our model shows the best performance in terms of the AUC metric, achieving 0.798 AUC, 0.781 accuracy, 0.795 F-score, and 0.795 g-mean. AUC, F-score, and accuracy indicate the ability to classify borrowers as good or bad, while g-mean better reflects performance under an imbalanced ratio between the credit classes. The German results indicate that our model is a suitable approach for small credit scoring datasets. On the other evaluation metrics, the neural network model with the ADASYN sampling technique achieved the highest performance.
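For reference, the g-mean used above is the geometric mean of the class-wise recalls; a small helper (illustrative, assuming label 1 marks bad borrowers) makes the definition concrete:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity: sqrt(TPR * TNR).
    Unlike accuracy, it stays low if either class is poorly classified."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = (y_pred[y_true == 1] == 1).mean()   # recall on bad borrowers
    tnr = (y_pred[y_true == 0] == 0).mean()   # recall on good borrowers
    return np.sqrt(tpr * tnr)

# Half the bad borrowers are missed, so g-mean = sqrt(0.5 * 1.0).
score = g_mean([1, 1, 0, 0], [1, 0, 0, 0])
```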
In addition, our model achieved performance similar to the state-of-the-art machine learning benchmarks on the Australian dataset, as shown in Table 4. The CatBoost model with no sampling technique showed the best performance on the AUC metric, and the same model with the SMOTE sampling method achieved the highest performance on the other evaluation metrics. Our model improves over the Logistic regression, Random forest, AdaBoost, Neural Network, and TabNet models by around 0.07 AUC, 0.002 accuracy, and 0.004 g-mean. For the Taiwan dataset (see Table 5), the CatBoost model achieved the highest performance: 0.753 AUC, 0.734 accuracy, 0.734 F-score, and 0.734 g-mean. Our proposed model showed the third-best performance, achieving 0.744 AUC, 0.725 accuracy, 0.726 F-score, and 0.726 g-mean. Since this dataset is balanced, we did not use the sampling techniques.
Regarding the FICO dataset (see Table 6), our model achieved the best predictive performance on all evaluation metrics. The Neural Network model with the ROS sampling technique achieved the second-best AUC, and the logistic regression model with ROS achieved the second-best performance on the other metrics. Our model improved on the second-best performance by around 0.008 AUC, 0.021 accuracy, 0.021 F-score, and 0.021 g-mean.
In the end, our model achieves the best predictive performance on most of the datasets. These experiments therefore provide evidence that our proposed PIA-Soft model, equipped with a neural network, works better than the benchmark machine learning models on public credit scoring datasets. The next part of the experiments shows the interpretability of the PIA-Soft model.

Model Interpretability
In this section, we show how to interpret the PIA-Soft model. As explained, our model produces linear and non-linear logits for each borrower. Figure 3 shows the predicted linear and non-linear logits for borrowers A and B from the German dataset. For borrower A, since the logit for class-1 is higher than for class-0, we predict that this borrower belongs to class-1. Within class-1's total logit, the linear logit is larger than the non-linear logit, and we can explain how the linear logit depends on the explanatory variables; in other words, we can explain and understand most of borrower A's score. On the contrary, for borrower B, the linear logit is a very small percentage of the total logit, so we cannot explain most of the borrower's score. For this reason, our proposed PIA-Soft model is partially interpretable. For all datasets, the linear and non-linear logits for each borrower are shown in Figures A1-A4.
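This notion of "how much of the score is explainable" can be quantified as the linear part's share of the total logit magnitude (an illustrative measure introduced here for exposition; the paper reads it off the logit proportions in Figure 3):

```python
def linear_share(linear_logit, nonlinear_logit):
    """Fraction of the total logit magnitude contributed by the
    interpretable linear part; close to 1 means a mostly explainable score."""
    total = abs(linear_logit) + abs(nonlinear_logit)
    return abs(linear_logit) / total if total > 0 else 0.0

# A class-1 logit of 3.0 (linear) + 1.0 (non-linear): 75% of the
# score is attributable to the interpretable part, as for borrower A.
share = linear_share(3.0, 1.0)
```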
In addition, our model can compute the impact of each variable on the model output. Figure 3 shows the impact of variables for each class on the German dataset. We observe that if the amount of the most valuable available asset increases, the logit for class-0 (good borrower) increases more than the logit for class-1 (bad borrower). In other words, if the borrower has a large amount of valuable available assets, the borrower's credit risk decreases. We also display how the other variables affect the credit score for the German dataset in Figure 4. The estimated coefficients of the PIA-Soft model are logically consistent with real-world expectations and with logistic regression (see Figure 5). Since logistic regression estimates coefficients only for class-1, we compare the weight parameters of the PIA-Soft model for class-1 to the logistic regression coefficients. The impact of variables for each class and the comparison of the PIA-Soft model with logistic regression on the other datasets are displayed in Figures A5-A10. In the end, our experimental results show that the PIA-Soft model suggests a promising direction for a partially interpretable machine learning model that combines softmax regression and a neural network via end-to-end training.

Discussion
For credit scoring application, the model interpretability is one of the most critical features, and financial institutions want to understand how the borrower's credit risk depends on the borrower's characteristics. Recently, machine learning models have been successfully used to establish credit scoring models with high predictive performance. However, the machine learning model's ambiguous decision-making process indicates the need to develop an explainable model with a high-predictive performance.
In this work, we aimed to propose an interpretable credit scoring model that achieves state-of-the-art predictive performance using softmax regression and neural network models. Our proposed model consists of two main components: a linear part (softmax regression) and a non-linear part (neural network). The linear part explains the fundamental relationship between the input and output variables. The non-linear part serves to improve the prediction performance by identifying the non-linear relationships between features for each borrower. In order to show the merits of our proposed model, we compared it to high-performance machine learning benchmarks on four public credit scoring datasets. In addition, to show that our model can handle the class imbalance problem without sampling techniques, we compared it against machine learning baselines combined with oversampling techniques. As bad borrowers occur infrequently, standard deep learning architectures tend to misclassify the minority class (bad borrowers) more than the majority class (good borrowers) [11]. Therefore, we used the softmax function as the output of our model. Since the softmax computes a probability distribution over the potential outcomes, and we update the logit (the input of the softmax function) for each class using the neural network and linear models, our PIA-Soft model can handle the class imbalance problem.
Experimental results showed that our proposed model significantly outperformed the machine learning baselines in terms of predictive performance. We also compared our proposed model to logistic regression to evaluate the model's interpretation. The estimated coefficients of the PIA-Soft model are logically consistent with real-world expectations and with logistic regression. Unlike logistic regression, our proposed model measures the impact of the variables on each class, so we can estimate toward which class the borrower moves faster as each variable changes. For example, the "duration of credit" variable has an insignificant effect on class 1 (bad borrower) and a substantial impact on class 0 (good borrower) for the German dataset.
Finally, our proposed model suggests a promising direction for a partially interpretable machine learning model that can combine the softmax regression and neural network by end-to-end training.
However, since we use bank clients' data to construct the credit scoring model, this sample may differ from the overall population distribution. Therefore, a limitation is that the trained machine learning models may not be robust to the overall population distribution. To address this problem, we anticipate future work on developing adaptive machine learning algorithms for unseen data based on generative models such as variational auto-encoders, generative adversarial networks, etc.