Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

Credit risk assessment, i.e., correctly deciding whether or not a borrower will fail to repay a loan, is important for the financial health of lenders and institutions. It not only supports the approval or denial of loan applications but also aids in managing the non-performing loan (NPL) trend. In this study, experiments were conducted on a dataset provided by the LendingClub company, based in San Francisco, CA, USA, covering 2007 to 2020 and consisting of 2,925,492 records and 141 attributes. The loan status was categorized as "Good" or "Risk". To yield highly effective credit risk predictions, experiments were performed using three widely adopted supervised machine learning techniques: logistic regression, random forest, and gradient boosting. In addition, to address the imbalanced data problem, three sampling algorithms, namely under-sampling, over-sampling, and combined sampling, were employed. The results show that the gradient boosting technique achieves nearly perfect Accuracy, Precision, Recall, and F1 score values, all better than 99.92%, with MCC values greater than 99.77%. All three imbalanced data handling approaches can enhance the performance of models trained by the three algorithms. Moreover, an experiment on reducing the number of features based on mutual information revealed only slightly decreased performance for 50 data features, with Accuracy values greater than 99.86%. For 25 data features, the smallest size, the random forest supervised model yielded 99.15% Accuracy. Both sampling strategies and feature selection help to improve the supervised models for accurately predicting credit risk, which may be beneficial in the lending business.


Introduction
Machine learning techniques have several benefits in various applications, especially in predicting a trend or outcome. Hence, machine learning models can accurately assess credit default probabilities and improve credit risk prediction [1]. Focusing on financial services such as personal loans, accurately predicting the risk of non-performing loans (NPLs) in peer-to-peer (P2P) lending is crucial for lenders such as P2P lending platforms. When borrowers fail to repay (or default on) their loans, this creates an NPL for the lenders. Generally, an NPL is a major obstacle to the stability and profitability of not only financial institutions [2] but also P2P platforms. Thus, risk assessment measures, diversification strategies, and collection processes are always performed to minimize the NPL issue. These P2P platforms, which are widely used in many countries, involve higher risk than traditional lending because they depend on individuals [3]. However, they offer many advantages over banking credit, i.e., direct interaction between lenders and borrowers, detailed credit scoring [4], and the opportunity to gather and analyze large amounts of data which can be used to assess trustworthiness and reduce risks [5]. Therefore, several previous research works have studied how to build an efficient model to predict the risk of lending [6][7][8]. Still, there are several challenges, including selecting important features, coping with imbalanced data, handling data quality, and performing in-depth model evaluations. Lending datasets often contain imbalanced data, i.e., a higher proportion of good loans than risky ones, which can bias the model's predictions. In addition, resolving missing values in the data needs careful consideration of whether to impute, remove, or ignore them. Furthermore, selecting a relevant and informative input feature set is a substantial step for avoiding model overfitting or underfitting. Apart from that, one way to ensure a model's real-world performance is to validate it on many lending datasets. In summary, addressing these challenges can result in a reliable lending risk prediction model.
In this research, various approaches were implemented to overcome the challenges associated with building a machine learning model for this lending risk prediction problem. Firstly, exploratory data analysis (EDA) was conducted to explore and clean the data, aiming to improve data quality before initiating the model creation process. Secondly, logistic regression (LR), random forest (RF), and gradient boosting (GB), which are supervised machine learning approaches, were used for model building experiments. Thirdly, over-sampling, under-sampling, and combined sampling techniques were comparatively employed to mitigate the imbalanced data problem. Lastly, an experiment on reducing the number of features according to their importance, computed by mutual information, was also performed.
The remaining sections of this paper are organized as follows. A brief literature review of the machine learning approaches and imbalanced data handling techniques utilized in this study is provided in Section 2. The methodology, including the material data description, data preparation, experimental setup, and performance evaluations, is outlined in Section 3. In Section 4, the results and discussions are reported. Finally, in Section 5, the conclusion and future works are summed up.

Literature Review
Lately, various machine learning algorithms have been applied to the lending risk assessment problem [9], for example, logistic regression, variants of decision trees [10], neural networks and deep learning [11], as well as ensemble approaches [12][13][14]. One important issue is the imbalanced data problem. Commonly, the number of good credit customers is much greater than that of bad ones. This problem needs to be mitigated, since many machine learning algorithms cannot handle it well, leading to biased predictive models. Consequently, many wrong predictions bring about lenders' financial losses. Therefore, researchers have proposed various techniques to handle imbalanced data. Some examples are as follows. Ref. [15] offered the under-sampling method in their resampling ensemble model called REMDD for imbalanced credit risk evaluation in P2P lending. In the work [16], ADASYN (adaptive synthetic sampling approach) [17] was adopted for reducing the class imbalance problem. Meanwhile, ref. [18] produced fairly balanced datasets by employing the under-sampling technique for creating models to predict the default risk of P2P lending. Focusing on datasets previously used in this research domain, the LendingClub dataset is a famously public one. It is from a lending platform in the United States. Several versions of the LendingClub dataset have been used in many works, as exemplified in Table 1.
In research communities, the lending prediction problem is currently active. One major challenge of this problem is how to effectively solve imbalanced data for machine learning model training. Due to the large number of attributes in lending datasets, another challenge is how to effectively reduce the data dimension. Recently, random forest classifiers combined with either a feature selection method [22] or an imbalanced data handling technique [23] showed good predictive results. Apart from that, the LendingClub dataset is still a widely used public dataset in numerous research studies. It presents several challenges such as its big volume, rapidly increasing size, high data dimensionality, missing data, and massive imbalanced data. To create efficient models for loan status prediction and eventually decrease risks within the lending system, rigorous data exploration should be performed to cope with such a messy dataset.

Machine Learning Approaches
In this study, three machine learning approaches, i.e., logistic regression, random forest, and gradient boosting, were applied to create models for predicting loan statuses. A brief overview of each algorithm is provided below.

Logistic Regression (LR)
Logistic regression [24] is a statistical approach primarily designed for solving binary classification problems, where the output has two categorical classes. The probability of an input belonging to a specific class is calculated using the logistic function (sigmoid function) via Equation (1):

h_θ(x) = 1 / (1 + e^−(θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n))    (1)

where x_0, x_1, ..., x_n are the values of the input features and θ_0, θ_1, ..., θ_n are the associated parameters updated during the learning process. These parameters are then returned as the model for prediction. The output is in the range of 0 to 1, indicating the probability of the input belonging to the positive class. Logistic regression is simple to interpret and efficient, especially in situations where the relationship between the features and the binary output is assumed to be linear. Logistic regression is currently popular for building predictive analyses in financial research [25][26][27][28].
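As a minimal illustration (not the implementation used in this study), the logistic function and the class probability of Equation (1) can be sketched in Python; the parameter and feature values below are arbitrary examples:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta):
    """Probability that x belongs to the positive class, i.e.,
    sigmoid(theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n)."""
    return float(sigmoid(np.dot(theta, x)))

# Toy example with an intercept term (x_0 = 1) and two features.
theta = np.array([0.5, -1.2, 2.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 0.3, 0.8])        # x_0 = 1 (bias), x_1, x_2
p = predict_proba(x, theta)          # probability in (0, 1)
```

A probability above 0.5 would typically be mapped to the positive class.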

Random Forest (RF)
The random forest algorithm [29] is an ensemble learning method widely used for classification and regression tasks. The data are separated into M subsets for creating M decision trees, with several parameters involved in tree creation, such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the minimum number of samples required at a leaf node. A forest of decision trees is constructed during the training phase. Each decision tree is created using a subset of the training data and a random subset of features at each split, introducing diversity among the trees. Random forest is based on the bagging technique, in which multiple subsets of the training data (drawn with replacement) are used to train individual trees, thereby reducing overfitting and improving generalization. Additionally, random feature selection at each split ensures that the trees are less correlated, resulting in a more robust ensemble. For classification tasks, the voting process involves counting the votes for each class from all the decision trees, and the class with the most votes is chosen as the final prediction. Mathematically, if M is the number of trees in the random forest and V_ij is the vote count for class j by tree i, the final predicted class y_pred is determined via Equation (2):

y_pred = argmax_j Σ_{i=1}^{M} V_ij    (2)

where y_pred is the predicted class and argmax_j returns the class j that maximizes the sum of votes across all trees. Random forest has been widely applied to various problems in the lending domain [30][31][32].
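The voting rule of Equation (2) can be sketched as follows; the vote matrix here is a made-up toy example, not output from an actual forest:

```python
import numpy as np

def majority_vote(votes):
    """Final random forest prediction as in Equation (2):
    votes[i, j] = 1 if tree i votes for class j, else 0.
    Returns the class with the largest total vote count."""
    return int(np.argmax(votes.sum(axis=0)))

# Three trees, two classes ("Good" = 0, "Risk" = 1).
votes = np.array([[1, 0],
                  [0, 1],
                  [1, 0]])
pred = majority_vote(votes)   # class 0 wins with 2 of 3 votes
```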

Gradient Boosting (GB)
Gradient boosting [33] is a machine learning algorithm that operates by sequentially improving the performance of weak learners, typically decision trees, to create a strong predictive model. The algorithm works in an iterative manner, adding new weak learners to correct the errors made by the existing ensemble. An initial prediction for each class is often set by assigning balanced probabilities to each class. Subsequently, the pseudo-residuals for data input i and class j, denoted as r_ij, are calculated via Equation (3):

r_ij = y_ij − F_{m−1}(x_i)    (3)

where y_ij is the true class label for data input i and class j, and F_{m−1}(x_i) is the predicted class probability for data input i from the model at iteration m − 1. The pseudo-residuals r_ij represent the disparity between the true class labels y_ij and the current predicted class probabilities. The iterative update of the class probabilities is derived via Equation (4):

F_m(x_i) = F_{m−1}(x_i) + η γ_m h_m(x_i)    (4)

where η is the learning rate, controlling the contribution of each weak learner to the overall model; h_m(x_i) represents the prediction made by the weak learner for data input i at iteration m; and γ_m is the weight assigned to the output of the weak learner at iteration m. This weight is determined during the training process and is chosen to minimize the overall loss of the model. The final prediction for a given input in a classification task is determined by selecting the class with the highest cumulative probability over all weak learners. Mathematically, the predicted output is expressed as:

ŷ_i = argmax_j F_M(x_i)

where ŷ_i is the predicted class for data input i and F_M(x_i) is the cumulative sum of contributions from all weak learners up to the final iteration M for data input i. Gradient boosting has recently gained popularity for risk prediction in the financial domain [10,[34][35][36][37]].
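The iterative updates of Equations (3) and (4) can be sketched with a simplified variant that fits regression trees to pseudo-residuals under a squared-error loss. This is a toy illustration on synthetic data with γ_m folded into the tree outputs, not the exact gradient boosting configuration used in this study:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # binary labels

eta, M = 0.1, 50            # learning rate and number of boosting rounds
F = np.full(len(y), 0.5)    # initial balanced probability for each sample
for m in range(M):
    r = y - F                                      # pseudo-residuals, Equation (3)
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F = F + eta * h.predict(X)                     # additive update, Equation (4)

pred = (F >= 0.5).astype(int)
acc = (pred == y).mean()                           # training accuracy of the ensemble
```

Production implementations use a log-loss gradient and an optimized line search for each tree's weight, but the additive residual-correcting structure is the same.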

Resampling Imbalanced Data
Three widely used approaches to handle imbalanced data were applied: the over-sampling, under-sampling, and combined sampling approaches. Their details are explained as follows.

Over-Sampling Approach
In the first approach, SMOTE (Synthetic Minority Over-sampling Technique) [38] was employed to generate synthetic data for the minority class, creating a more balanced dataset. The basic idea behind SMOTE is to create synthetic data by interpolating between existing data points of the minority class. Let x_i be a data point from the minority class and x_zi be one of its k nearest neighbors (selected randomly). Also, let λ be a random number between 0 and 1. The synthetic data point x_new is created using the formula:

x_new = x_i + λ (x_zi − x_i)

This equation represents a linear interpolation between the original minority class data point x_i and one of its k nearest neighbors x_zi. The parameter λ determines the amount of interpolation and is randomly chosen for each synthetic data point. In summary, the steps of SMOTE are as follows.
(1) Select a minority class data point x_i.
(2) Find the k nearest neighbors of x_i within the minority class.
(3) Randomly select one of the neighbors, x_zi.
(4) Generate a random number λ between 0 and 1.
(5) Use the formula to create a synthetic instance x_new.
(6) Repeat steps (1)-(5) for the desired number of synthetic data points.
This process helps balance the class distribution by creating synthetic data points along the line segments connecting existing minority class data points, consequently mitigating the class imbalance issue in the dataset.
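The steps above can be sketched as a minimal SMOTE implementation. This is an illustrative sketch on synthetic data; real experiments would typically use a library implementation such as imbalanced-learn's SMOTE:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch following steps (1)-(6): interpolate between
    a minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))                     # step (1)
        d = np.linalg.norm(X_min - X_min[i], axis=1)     # step (2)
        neighbors = np.argsort(d)[1:k + 1]               # exclude the point itself
        zi = rng.choice(neighbors)                       # step (3)
        lam = rng.random()                               # step (4)
        synthetic[s] = X_min[i] + lam * (X_min[zi] - X_min[i])  # step (5)
    return synthetic                                     # step (6): loop above

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))
X_new = smote_sample(X_minority, n_new=30)   # 30 synthetic minority points
```

Each synthetic point lies on a line segment between two existing minority points, so no synthetic value falls outside the minority class's per-feature range.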

Under-Sampling Approach
Under-sampling for handling imbalanced data involves reducing the size of the majority class to balance it with the minority class. In this approach, data are randomly selected from the majority class to achieve a more balanced class distribution [39]. Unlike SMOTE, which involves creating synthetic data, random under-sampling simply removes examples from the majority class at random. We assume x_i is a data point from the majority class, N is the total number of data points in the majority class, and N_new is the desired number of data points after under-sampling. The basic idea is to randomly select N_new data points from the majority class without replacement. The process of random under-sampling is as follows.
(1) Calculate the sampling ratio: ratio = N_new / N.
(2) For each data point x_i in the majority class:
(2.1) With probability ratio, keep x_i.
(2.2) With probability 1 − ratio, discard x_i.
This process is repeated until N_new data points are selected, achieving the desired class distribution.
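This procedure can be sketched as follows; the label values and sample sizes in the toy example are arbitrary:

```python
import numpy as np

def random_undersample(X, y, majority_label, n_new, seed=0):
    """Randomly keep n_new majority-class points (without replacement)
    together with all minority-class points."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    keep = rng.choice(maj, size=n_new, replace=False)
    idx = np.sort(np.concatenate([keep, np.flatnonzero(y != majority_label)]))
    return X[idx], y[idx]

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)              # 15 majority vs. 5 minority samples
X_b, y_b = random_undersample(X, y, majority_label=0, n_new=5)
# y_b now contains 5 samples of each class
```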

Combined Sampling Approach
For the combined sampling approach, we use SMOTEENN, which combines over-sampling using SMOTE with under-sampling using edited nearest neighbors (ENN) [40]. The goal is to address imbalanced data by first generating synthetic data points with SMOTE and then cleaning the dataset using edited nearest neighbors to remove potentially noisy examples. After applying SMOTE to generate synthetic data points, edited nearest neighbors is used to remove data points that are considered noisy or misclassified.
(1) Identify data points in the dataset that are misclassified.
(2) For each misclassified data point, check its k nearest neighbors.
(2.1) If the majority of the neighbors have a different class label, remove the misclassified data point.
This process helps to improve the overall quality of the dataset by eliminating noisy points introduced during the over-sampling process.
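The ENN cleaning step can be sketched as follows. This is an illustrative version using scikit-learn's nearest-neighbor search; the two-cluster data and the value k = 3 are arbitrary choices for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, k=3):
    """Edited nearest neighbors cleaning: drop any point whose k nearest
    neighbors mostly carry a different class label than its own."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # The first neighbor of each training point is the point itself; drop it.
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    votes = y[neigh]
    keep = np.array([(votes[i] == y[i]).sum() > k / 2 for i in range(len(y))])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
y[0] = 1   # inject one mislabelled point inside the class-0 cluster
X_c, y_c = enn_clean(X, y)   # the mislabelled point is removed
```

The imbalanced-learn library provides SMOTEENN, which chains SMOTE and this ENN cleaning in one resampler.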

Data Description and Preprocessing
In this work, financial data provided by the LendingClub company from 2007 to 2020Q3 [41] were used. The data consist of 2,925,493 records which are divided into various loan statuses as in Table 2. The loan statuses are categorized as "Good" or "Risk" users. The "Fully Paid" status is categorized as "Good", whereas "Charged Off", "In Grace Period", "Late (16-30 days)", "Late (31-120 days)", and "Default" are grouped as "Risk". The "Current" status is not explicitly categorized as "Good" or "Risk" since it represents the current state of ongoing payments. "Issued" is also not classified, as it may refer to loans that are approved but not yet active. "Does not meet the credit policy" users were excluded from this study. For the experiment, the data contain 1,497,783 samples labeled as "Good" and 391,882 samples labeled as "Risk", totaling 1,889,665 samples. This is a two-class dataset with an imbalance ratio (IR) equal to 3.82, which indicates a mild class imbalance, as displayed in Figure 1. The IR is the majority class size divided by the minority class size. A high IR value may affect model performance in some machine learning algorithms, i.e., the majority class is predicted more correctly than the minority class because imbalanced training data cause model bias. The algorithms selected in this paper, such as random forest and gradient boosting, are fairly robust to mild class imbalance. Nevertheless, it is still helpful to address the imbalance before model training, as this increases the chance of improving model performance. The imbalanced data handling methods used in this work are explained in Section 2.3. There are 141 attributes in the original data, many of which contain high percentages of missing values, as illustrated in Figure 2. Some columns need to be dropped or transformed before training the models. The data were preprocessed through the following steps.
(1) Drop the column "id" because it serves as a unique identifier for each row, and including it as a feature could lead the model to learn patterns specific to certain ids rather than generalizing well to new data.
(2) Drop "url" because it does not provide meaningful information for the model.
(3) Drop the columns "pymnt_plan" and "policy_code" because every record in "pymnt_plan" has the value "n" and every record in "policy_code" has the value 1. These columns contain constant values, so the model cannot use them to differentiate between data inputs.
(4) Drop columns whose missing values exceed 50%. The selected dataset then comprises 101 columns, including 100 features and the loan status.
(5) In the "int_rate" and "revol_util" columns, convert the percentage values from string format to float.
(6) For categorical data, fill the missing values with the mode and transform them into numerical values.
(7) For real-valued data, fill the missing values with the mean of the existing values.
Figure 2 summarizes the missing values of each attribute, excluding the "id", "url", "pymnt_plan", and "policy_code" attributes. After preprocessing, the dataset comprises 100 features. The relationship between each feature and the target variable (class label) was explored to rank the importance of the features. Mutual information (MI) can identify informative features based on both linear and non-linear relationships between features and target variables. In feature selection, a feature with a higher mutual information value is considered more important and is typically selected into the training feature set. The importance of these features can be represented by mutual information, as defined in Equation (7).

MI(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )    (7)

where p(x), p(y), and p(x, y) represent the marginal probabilities of the features X and the target variable Y and their joint distribution, respectively. The mutual information values for all features are presented in Figure 3. These were used to investigate the impact of feature selection on the performance of the models. The correlation matrix for the 25 features with the highest mutual information and the class label is depicted in Figure 4.
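For illustration, the preprocessing steps (1)-(7) described above can be sketched with pandas. This is a simplified sketch applied to a tiny made-up frame; the column names follow the LendingClub schema described above:

```python
import pandas as pd

def preprocess(df):
    """Sketch of preprocessing steps (1)-(7)."""
    # Steps (1)-(3): drop identifier, URL, and constant-valued columns.
    df = df.drop(columns=["id", "url", "pymnt_plan", "policy_code"], errors="ignore")
    # Step (4): drop columns with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.5].copy()
    # Step (5): convert percentage strings such as "13.5%" to floats.
    for col in ("int_rate", "revol_util"):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col].str.rstrip("%"), errors="coerce")
    for col in df.columns:
        if df[col].dtype == object:
            # Step (6): fill categorical gaps with the mode, then encode.
            df[col] = df[col].fillna(df[col].mode()[0]).astype("category").cat.codes
        else:
            # Step (7): fill numeric gaps with the column mean.
            df[col] = df[col].fillna(df[col].mean())
    return df

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "int_rate": ["10.5%", "13.2%", None],
    "grade": ["A", None, "B"],
    "loan_amnt": [1000.0, None, 3000.0],
})
clean = preprocess(toy)   # no "id" column and no missing values remain
```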
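As a small illustration of ranking features by the mutual information of Equation (7), scikit-learn's mutual_info_classif can be applied to synthetic data in which only the first feature carries information about the label:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)     # drives the class label
noise = rng.normal(size=n)           # unrelated to the class label
y = (informative > 0).astype(int)
X = np.column_stack([informative, noise])

mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]       # feature indices in descending MI order
```

The informative feature receives a clearly higher MI estimate than the noise feature, so it ranks first.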

Model Creations and Evaluations
An overview of the processes in this work is depicted in Figure 5. The raw dataset was explored for characteristics such as data types and missing values. Subsequently, the data were preprocessed to handle missing values. The dataset was then separated into training and testing sets. Two data splitting protocols were experimented with, i.e., hold-out cross-validation with a 70:30 ratio of training and testing sets, and 4-fold cross-validation. Next, the training data were prepared in four versions based on the imbalanced data handling methods: original (no sampling), over-sampling, under-sampling, and combined sampling training data. Each training dataset version was used to create three models using the logistic regression, random forest, and gradient boosting approaches. The testing dataset was employed to evaluate model performance by calculating Accuracy, Precision, Recall, F1 score, and the Matthews Correlation Coefficient (MCC). In the context of imbalanced data, where one class may dominate the others, using macro-averaging for Precision, Recall, F1 score, and MCC can provide a more balanced evaluation across the different classes. Then, the confusion matrix was displayed, which is a tabular representation commonly employed to assess the effectiveness of a classification algorithm. This matrix provides a concise overview of the model's performance by detailing the distribution of predicted and actual class labels (Figure 6). Note that green and red cells stand for the numbers of correctly and wrongly predicted samples, respectively. TP and TN are the numbers of samples correctly classified into the positive and negative classes, respectively, while FP and FN are the numbers of samples wrongly classified into the positive and negative classes, respectively. Subsequently, the key performance metrics Accuracy, Precision, Recall, F1 score, and MCC were computed as follows. Accuracy is a fundamental metric that measures the overall correctness of a classification model by assessing the proportion of testing data that are correctly predicted out of the total testing data size:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision (Macro-Averaged) is a metric used to evaluate the Precision of a classification model when dealing with imbalanced datasets. In macro-averaging, Precision is calculated individually for each class and then averaged across all classes:

Precision_macro = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FP_i)

where C is the number of classes and TP_i and FP_i are the true positives and false positives for class i.
Recall (Macro-Averaged) is a metric used to evaluate the Recall of a classification model in the context of imbalanced datasets. In macro-averaging, Recall is calculated individually for each class and then averaged across all classes:

Recall_macro = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FN_i)

where FN_i represents the false negatives for class i. The F1 score (Macro-Averaged) is a metric that combines both Precision and Recall, offering a balanced assessment of a model's performance on imbalanced datasets. In macro-averaging, the F1 score is calculated individually for each class and then averaged across all classes:

F1_macro = (1/C) Σ_{i=1}^{C} 2 · Precision_i · Recall_i / (Precision_i + Recall_i)

where Precision_i and Recall_i are the Precision and Recall for class i.
The Matthews Correlation Coefficient (MCC) is a metric suitable for evaluating binary classification models, especially models trained on imbalanced datasets, because true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are all taken into account in its formula. The MCC is defined as:

MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The MCC value ranges between −1 and 1. The best and worst MCC values are 1 and −1, respectively. An MCC value of 0 means that the model performs no better than random guessing.
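The metrics above can be computed directly from the confusion-matrix counts. A small sketch for the two-class case, with arbitrary example counts:

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Overall proportion of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def macro_precision(tp, tn, fp, fn):
    """Two-class macro Precision: average the per-class Precision,
    treating each class in turn as the positive one."""
    return 0.5 * (tp / (tp + fp) + tn / (tn + fn))

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den else 0.0

# Example confusion matrix: TP = 90, TN = 80, FP = 10, FN = 20.
acc = accuracy(90, 80, 10, 20)    # 0.85
m = mcc(90, 80, 10, 20)           # about 0.70
```

A perfect classifier (FP = FN = 0) yields MCC = 1, while a classifier no better than chance yields MCC near 0.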

Four-Fold Cross-Validation
The average confusion matrices for the 4-fold cross-validation results are illustrated in Figure 9a-d. The performance metrics for logistic regression, random forest, and gradient boosting are shown in Tables 4, 5, and 6, respectively. All three imbalanced data handling approaches, including over-sampling, under-sampling, and combined sampling, can improve the performance of models trained with the logistic regression, random forest, and gradient boosting algorithms. Considering only the performance of the logistic regression models, those with the combined sampling approach outperform the others. For the random forest and gradient boosting models, the under-sampling approach gave better model performance than the other sampling approaches. In general, across all 4-fold cross-validation results, gradient boosting models with the under-sampling method gave superior performance. Additional comparisons of MCC and F1 score are depicted in Figures 10 and 11, respectively.
Overall, the results from both cross-validation experiments indicate that the gradient boosting algorithm, combined with an appropriate imbalanced data handling technique for supervised model training, produces models that correctly classify both "Good" and "Risk" instances with very impressive ability.
Next, the feature selection method was applied, i.e., computing and ranking the mutual information (MI) values of each feature, in order to select the important features for a smaller training feature set of size k. The training data with the best imbalanced data handling technique for each model were further explored by preparing a smaller number of k features via their MI values, to assess the trade-offs between different feature set sizes and their impact on model performance. The features were ranked based on their computed mutual information values. Three feature sizes, i.e., k = 25, 50, and 100, were experimented with. The results for the logistic regression, random forest, and gradient boosting models on both the 70:30 hold-out cross-validation and the 4-fold cross-validation with the three different feature sizes are shown in Figures 12, 13, and 14, respectively. Generally, the performance of all three supervised models was reduced only slightly. For k = 25 and 50 important features as training data, gradient boosting models showed better results than logistic regression and random forest models. Focusing on k = 50 important features, random forest and gradient boosting models yielded all five performance values, i.e., Accuracy, Precision, Recall, F1 score, and MCC, greater than 99%, whereas logistic regression models gave four performance values, excepting MCC, higher than 95%. For k = 25 important features, gradient boosting models still yielded Accuracy, Precision, Recall, and F1 score values of not less than 99%, but the MCC values dropped to around 97.5%. These results show that when the number of features was reduced by half (k = 50), the performance values decreased by less than 1%. Even when the number of features was reduced by approximately 75% (k = 25), the performance values decreased by only 1-2%. Apart from that, the performance of gradient boosting models using k = 100 important features was better than that of the others in both the 70:30 hold-out cross-validation and 4-fold cross-validation experiments. To compare our results with previous research, the performance comparison of the proposed methods with other existing works on various versions of the LendingClub data is shown in Figure 15. Based on Accuracy, the proposed methods outperform the others.
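The feature-size reduction described above can be sketched with scikit-learn's SelectKBest using mutual information as the scoring function. This is a toy illustration on synthetic data, not the actual experimental pipeline:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only features 0 and 1 are informative

# Fix the MI estimator's random state so the ranking is reproducible.
score = lambda X, y: mutual_info_classif(X, y, random_state=0)
selector = SelectKBest(score_func=score, k=2).fit(X, y)
X_small = selector.transform(X)           # keep only the k highest-MI features
kept = selector.get_support(indices=True)
```

In the study's setting, the same pattern with k = 25, 50, or 100 would reduce the 100-feature training set before model fitting.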

Conclusions and Future Work
This study provided a very efficient solution to the problem of credit risk prediction. To investigate improved predictive model results that could surpass those of previous works, three popular machine learning methods, including logistic regression, random forest, and gradient boosting, were employed. Additionally, the imbalanced data problem was resolved by experimenting with various sampling strategies: under-sampling, over-sampling, and combined sampling. Based on our best model performance outcomes, the over-sampling and under-sampling methods robustly manage class-imbalanced data, especially when the model is trained with the gradient boosting method. In addition, the number of data features was reduced by selecting only important features for the training set according to their ranks computed by mutual information. Another experiment was performed using two reduced feature sets, of half and one-fourth of the original feature size. The resulting model performance barely decreased. Remarkably, both the random forest and gradient boosting models created with the half-size reduced feature set showed impressive Accuracy values, higher than 99%.
This comprehensive analysis enhances the understanding of credit risk prediction using supervised learning methods combined with various imbalanced data handling strategies. Furthermore, the importance of features based on mutual information was addressed in order to maintain model performance with a smaller training feature set. Our proposed method and results offer a simple way to select important features of a reduced size by ranking the mutual information values of each feature. Although this method does not necessarily provide the optimal size with the best performance, it can be applied to other large credit risk datasets with different feature sets without significantly decreasing performance, though better methods may exist. In future work, it may be beneficial to further investigate parameter optimization, particularly in handling imbalanced data, and to explore alternative feature selection methods beyond mutual information, such as correlation and symmetrical uncertainty, to improve model performance. In addition, ensemble techniques could offer performance improvements for those small feature sizes. Apart from that, real-time data streams and dynamic model updating may increase the adaptability of credit risk prediction systems.


Figure 1 .
Figure 1. Class imbalance of our experimental dataset from the LendingClub dataset.

Figure 4.
Each cell in the table shows the correlation between two variables.It is often used to understand the relationships between different variables in a dataset.The values range from −1 to 1, where −1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation.

Figure 3 .
Figure 3. A summary of mutual information (MI) across the 100 features used.

Figure 4 .
Figure 4. Correlation matrix on the first 25 highest mutual information features.

Figure 7 .
Figure 7. Comparison of training data sizes on various sampling methods.

4.1. Hold-Out Cross-Validation with 70:30 Ratio of Training and Testing Sets

Figure 8 .
Confusion matrices of 70:30 hold-out cross-validation results. (a) No sampling testing data (original data). (b) Over-sampling testing data. (c) Under-sampling testing data. (d) Combined sampling testing data.

Figure 9 .
Average confusion matrices of four-fold cross-validation results. (a) No sampling testing data (original data). (b) Over-sampling testing data. (c) Under-sampling testing data. (d) Combined sampling testing data.

Figure 11 .
Figure 11. Average F1 score comparison for different sampling methods.

Figure 12 .
Figure 12. Logistic regression model performance on five metrics for different numbers of features, i.e., k = 25, 50, and 100.

Figure 13 .
Figure 13. Random forest model performance on five metrics for different numbers of features, i.e., k = 25, 50, and 100.

Figure 14 .
Figure 14. Gradient boosting model performance on five metrics for different numbers of features, i.e., k = 25, 50, and 100.

Figure 15 .
Figure 15. Accuracy of the proposed method compared with existing works on various versions of the LendingClub data. Note that the ^ and * symbols stand for different datasets or experiments within the same work.

Table 2 .
Dataset from LendingClub company from 2007 to 2020Q3 and loan status distribution.

Table 3 .
The performance of three different machine learning techniques with various sampling approaches in the 70:30 hold-out cross-validation experiment. The superscript numbers in the brackets denote the performance ranking based on the evaluation measure in each column.

Table 4.
Performance metrics for logistic regression with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.

Table 5 .
Performance metrics for random forest with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.

Table 6.
Performance metrics for gradient boosting with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.
Figure 10. Average MCC comparison for different sampling methods.