Mathematics | Article | Open Access | 28 February 2024

Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction

1 School of Statistics and Mathematics, Yunnan University of Finance and Economics, No. 237, LongQuan Rd., Kunming 650221, China
2 School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China
3 School of Business, Ningbo University, 818 Fenghua Road, Ningbo 315211, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Quantitative Finance with Mathematical Modelling

Abstract

Credit risk prediction relies heavily on historical data provided by financial institutions. The goal is to identify commonalities among defaulting users based on existing information. However, data on defaulters are often limited, leading to credit data in which positive samples (defaults) are significantly fewer than negative samples (nondefaults). This imbalance poses a serious challenge known as the class imbalance problem, which can substantially degrade data quality and predictive model effectiveness. To address the problem, various resampling techniques have been proposed and studied extensively. However, despite ongoing research, there is no consensus on the most effective technique. The choice of resampling technique is closely related to the dataset size and imbalance ratio, and its effectiveness varies across classifiers. Moreover, there is a notable gap in research concerning suitable techniques for extremely imbalanced datasets. Therefore, this study compares popular resampling techniques across different datasets and classifiers and proposes a novel hybrid sampling method tailored for extremely imbalanced datasets. Our experimental results demonstrate that this new technique significantly enhances classifier predictive performance, shedding light on effective strategies for managing the class imbalance problem in credit risk prediction.

1. Introduction

Credit, as defined by financial institutions such as banks and lending companies [1], represents a vital loan certificate issued to individuals or businesses. This certification mechanism plays a pivotal role in ensuring the smooth functioning of the financial sector, contingent upon comprehensive evaluations of creditworthiness. The evaluation process inherently gives rise to concerns regarding credit risk, encompassing the potential default risk associated with borrowers. Assessing credit risk entails the utilization of credit scoring, a method aimed at distinguishing between “good” and “bad” customers [2]. This process is often referred to as credit risk prediction in numerous studies [3,4,5,6,7]. Presently, the predominant approaches to classifying credit risk involve traditional statistical models and machine learning models, typically addressing binary or multiple classification problems.
Credit data often exhibit a high number of negative samples and a scarcity of positive samples (default samples), a phenomenon known as the class imbalance (CI) problem [8]. Failure to address this issue may result in significant classifier bias [9], diminished accuracy and recall [10], and weak predictive capabilities, ultimately leading to financial institutions experiencing losses due to customer defaults [11]. For instance, in a dataset comprising 1000 observations labeled as normal customers and only 10 labeled as default customers, a classifier could achieve 99% accuracy without correctly identifying any defaults. Clearly, such a classifier lacks the robustness required. To mitigate the CI problem, various balancing techniques are employed, either at the dataset level or algorithmically. Dataset-level approaches include random oversampling (ROS), random undersampling (RUS), and the synthetic minority oversampling technique (SMOTE) [12], while algorithmic methods mainly involve cost-sensitive algorithms. Additionally, ensemble algorithms [13] and deep learning techniques, such as generative adversarial networks (GANs) [14], are gradually gaining traction for addressing CI issues.
Indeed, there is no one-size-fits-all solution to the CI problem that universally applies to all credit risk prediction models [15,16,17]. On the one hand, the efficacy of approaches is constrained by various dataset characteristics such as size, feature dimensions, user profiles, and imbalance ratio (IR). Notably, higher IR and feature dimensions often correlate with poorer classification performance [18]. On the other hand, existing balancing techniques exhibit their own limitations. For instance, the widely used oversampling technique, SMOTE, has faced criticism for its failure to consider data distribution comprehensively. It solely generates new minority samples along the path from the nearest minority class to the boundary. Conversely, some undersampling methods are deemed outdated as they discard a substantial number of majority class samples, potentially leading to inadequately trained models due to small datasets. Additionally, cost-sensitive learning hinges on class weight adjustment, which lacks interpretability and scalability [11].
In this study, we address the dataset size and IR considerations, and evaluate the performance of oversampling, undersampling, and combined sampling methods across various machine learning classifiers. Notably, we include CatBoost [19], a novel classifier that has been under-represented in prior comparisons. Furthermore, we introduce a novel hybrid resampling framework, strategic hybrid SMOTE with double edited nearest neighbors (SH-SENN), which demonstrates superior performance in handling extremely imbalanced datasets and substantially enhances the predictive capabilities of ensemble learning classifiers.
The contributions made in this paper can be summarized as follows: First, most credit risk prediction studies within the CI context utilize widely adopted benchmark credit datasets, which lack real data representation. Real data tend to be noisier and more complex. This study integrates benchmark and real private datasets, enhancing the practical relevance of our proposed new framework and making it more likely to create practical value for financial institutions. Second, although advanced ensemble classifiers such as LightGBM [20] and CatBoost, which have emerged in recent years, are gradually being adopted in credit risk prediction due to their advantages of speed and ability to handle categorical variables more effectively, they still fail to provide new solutions to the CI problem. Moreover, there are few studies assessing their adaptability to traditional balancing methods. Our study confirms that SH-SENN significantly enhances the performance of such new classifiers, especially on real datasets with severe CI issues. We contribute to the real-world applicability of these new classifiers. Third, the fundamental concept of SH-SENN in dealing with extreme IR datasets is to improve data quality by oversampling to delineate clearer decision boundaries and by undersampling multiple times to address boundary noise and overlapping samples of categories, thereby enhancing the classifier’s predictive power. The sampling strategy is tailored to IR considerations rather than blindly striving for category balance. Hence, SH-SENN is suitable for large, extremely unbalanced credit datasets with numerous features. For instance, credit data from large financial institutions often comprise millions of entries and hundreds of features, posing challenges beyond binary classification. SH-SENN effectively enhances dataset quality to tackle such complex real-world problems.
In Section 2, we provide a comprehensive review of different resampling techniques (RTs), highlighting their effectiveness, advantages, and disadvantages in addressing credit risk prediction problems. Following this, in Section 3, we employ various RTs on four distinct datasets, comparing and analyzing their performances across (1) different machine learning classifiers and (2) datasets varying in size and IR. Concurrently, we introduce and demonstrate the effectiveness of a novel approach, SH-SENN. Lastly, the research findings are summarized in the concluding section.

3. Methodology

3.1. Framework and Evaluation Metrics

To explore the performance of various sampling techniques and machine learning models across diverse datasets, we meticulously cleaned, encoded, and feature-screened each dataset, partitioning them into training and test sets, as shown in Figure 1. To maintain consistent prior probabilities, the test and training sets were crafted to possess identical IR as the original dataset [10]. Notably, the test set remained untouched by any balancing techniques to ensure its purity.
Figure 1. The framework of RTs’ comparison. The dataset at the start place represents any one of four datasets.
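As an illustration of this partitioning step, the sketch below uses a stratified split so that the training and test sets retain the original IR; the DataFrame `df`, the target column name "default", and the 70/30 split ratio are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch of the data-partitioning step, assuming a pandas
# DataFrame `df` with a hypothetical binary target column "default"
# (both names are placeholders, not taken from the paper).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["default"])
y = df["default"]

# stratify=y keeps the imbalance ratio (IR) of the original dataset
# identical in the training and test partitions; resampling techniques
# are later applied to (X_train, y_train) only, never to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```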
The training set underwent RTs via ADASYN (ADA), SMOTE (SMO), BorderlineSMOTE (BSMO), SVMSMOTE (SSMO), cluster centroids (CC), ENN, SMOTETomek (STOM), and SMOTEENN (SENN), resulting in nine distinct training sets: eight resampled sets plus the original unprocessed training set. Each training set was further divided into five validation subsets for cross-validation, facilitating the determination of classifier hyperparameters. Classifiers were categorized into three groups: individual classifiers (logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), naïve Bayes (NB), decision tree (DT)), ensemble classifiers (random forest (RF), XGBoost (XGB), LightGBM (LGBM), CatBoost (CAT)), and balanced classifiers (balanced bagging classifier (BBC), balanced random forest (BRF)). Classifier selection follows the summary in [43], which reviewed 281 articles on credit risk models (our experiment did not include deep learning algorithms such as artificial neural networks [44] and convolutional neural networks; although the summary mentions them, they were excluded because they are more complex than individual classifiers, carry a higher risk of overfitting, and do not outperform the other classifiers [45]). Subsequently, all training sets were evaluated using these 11 classifiers. We employed the tree-structured Parzen estimator (TPE) for Bayesian optimization to identify optimal hyperparameters and fit the models. Finally, the test set was introduced to the trained classifiers to derive their final evaluation scores. The evaluation metrics were as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F_1\text{-score} = \frac{2\,TP}{2\,TP + FP + FN},$$
$$\mathrm{AUC} = \frac{1}{2}\left(1 + \frac{TP}{TP + FN} - \frac{FP}{FP + TN}\right),$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$
$$\text{G-mean} = \sqrt{\mathrm{Recall} \times \mathrm{TNR}}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP},$$
where TP, FP, FN, and TN are defined by the confusion matrix in Table 1.
Table 1. Confusion matrix.
Accuracy and recall stand as pivotal metrics in credit risk scenarios, yet their efficacy can be significantly impacted by a CI problem. We retained these metrics to scrutinize the effectiveness of various advanced sampling techniques. Additionally, as previously stated, we assert that minority class samples hold equal importance to majority class samples, and that the ramifications of overlooking a substantial number of good customers mirror those of missing a bad customer. Hence, we also introduce G-mean, F1-score, and AUC as evaluation metrics, as suggested in [10,46]. Specifically, AUC will be utilized for subsequent hypothesis testing, as indicated by [10,31].
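For reference, the following sketch shows how the five metrics above can be computed from a fitted classifier's predictions; the variable names are placeholders, and note that scikit-learn's `roc_auc_score` integrates over all thresholds, whereas the formula given above is its single-threshold counterpart.

```python
# Illustrative computation of the evaluation metrics defined above,
# assuming y_true holds the test labels, y_pred the hard predictions,
# and y_score the predicted default probabilities of a fitted classifier.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             recall_score, roc_auc_score)

def evaluation_scores(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = recall_score(y_true, y_pred)            # TP / (TP + FN)
    tnr = tn / (tn + fp)                             # true negative rate
    return {
        "Recall": recall,
        "F1-score": f1_score(y_true, y_pred),        # 2TP / (2TP + FP + FN)
        # roc_auc_score uses the full ROC curve; the text's formula is the
        # single-threshold approximation 0.5 * (1 + TPR - FPR).
        "AUC": roc_auc_score(y_true, y_score),
        "Accuracy": accuracy_score(y_true, y_pred),
        "G-mean": np.sqrt(recall * tnr),             # geometric mean of Recall and TNR
    }
```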

3.2. Datasets

We chose two benchmark credit datasets, following the recommendation in [10]: the German credit dataset and the Taiwan credit dataset, along with two application-oriented private datasets sourced from Prosper Company and Lending Club Company, as shown in Table 2. The benchmark datasets, extensively utilized in credit risk research over decades, were readily accessible, facilitating the replication of our experiment. In contrast, private datasets, notably those from Lending Club, have gained popularity in the last 5 years due to their richer feature sets and larger data volumes [47], enabling the training of more complex models. The four datasets differ in size, feature dimension, and IR. Notably, the LC dataset stands out as an example of an extremely imbalanced and large dataset. For this dataset, we intend to employ the SH-SENN technique.
Table 2. Dataset description.

3.3. SH-SENN

The process is depicted in Figure 2. First, we apply strategic SMOTE oversampling, in which minority class samples are oversampled to represent 10% to 90% of the majority class. Subsequently, ENN searches each sample’s nearest neighbors and removes samples whose class disagrees with the majority of their neighbors. Within the 10 new datasets obtained, the ENN proximity strategy is further adjusted to identify new neighboring samples, which undergo a second deletion pass. Finally, we evaluate the effectiveness of these 10 datasets on the validation set, select the optimal strategy, and apply it to the test set. The key departure from the traditional SMOTEENN lies in our utilization of diverse sampling strategies coupled with cross-validation during the SMOTE process. Moreover, following the initial ENN phase, the dataset undergoes re-evaluation, and the strategy for selecting the nearest neighbor is adjusted during the second ENN iteration.
Figure 2. The process of SH-SENN. (a) Original dataset. (b) SMOTE oversampling to 10–90%. (c) Delete the nearest misclassified examples. (d) Obtain a clearer boundary and repeat ENN as in step (c).
Similar to SENN, SH-SENN also utilizes SMOTE for oversampling, followed by ENN for undersampling. However, there are two significant distinctions between them: First, SENN employs a single ENN undersampling process, whereas SH-SENN utilizes double ENN, meaning ENN is applied again after the regular SENN procedure. Second, while SENN is a mature and combined RT that can be directly applied to any imbalanced dataset, SH-SENN is a framework designed to address imbalance issues. Its approach varies according to the degree of imbalance within datasets. The strategy involves determining the proportion of minority class samples oversampled to majority class samples after the initial SENN step, denoted as $\alpha = N_{\mathrm{nmin}} / N_{\mathrm{maj}}$, where $N_{\mathrm{nmin}}$ represents the number of minority class samples after sampling and $N_{\mathrm{maj}}$ is the number of majority class samples. Typically, α falls within the range of [0.1, 1]. SH-SENN emphasizes treating the strategy as a hyperparameter of the classifier, participating in cross-validation to identify the most suitable α (note: this α is the outcome of the initial ENN participation). Subsequently, after determining the α value, ENN is employed for a second time. This treats the resampled dataset as new data, where ENN undersampling, adept at handling boundary noise, is utilized again to refine the decision boundary. From a technical standpoint, SH-SENN’s primary advantage over SENN lies in its ability to further reduce the introduction of new noise and interference items brought about by SENN. This advantage is particularly evident in extremely imbalanced datasets. These will be validated and presented in subsequent experiments. Lastly, SH-SENN is also an extensible framework; it will be explored in the final section.
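A minimal sketch of the SH-SENN framework is given below, built on imbalanced-learn's SMOTE and EditedNearestNeighbours. The α grid, the LightGBM classifier used for scoring, and the cross-validation details are illustrative assumptions; in particular, scoring each candidate α by cross-validation on the resampled data is a simplification of the validation procedure described above.

```python
# A minimal sketch of the SH-SENN framework, assuming imbalanced-learn and
# LightGBM are installed; the alpha grid, classifier, and scoring choices
# are illustrative, not the authors' exact settings.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

def sh_senn(X_train, y_train, alphas=np.arange(0.1, 1.0, 0.1), cv=5):
    """Strategic hybrid SMOTE with double ENN: choose the oversampling
    ratio alpha by cross-validated AUC, then apply ENN a second time."""
    best_alpha, best_score, best_data = None, -np.inf, None
    for alpha in alphas:
        # Step 1: strategic SMOTE - oversample the minority class to a
        # fraction `alpha` of the majority class.
        X_s, y_s = SMOTE(sampling_strategy=float(alpha),
                         random_state=0).fit_resample(X_train, y_train)
        # Step 2: first ENN pass removes samples misclassified by their neighbors.
        X_e, y_e = EditedNearestNeighbours().fit_resample(X_s, y_s)
        # Treat alpha as a hyperparameter: score the resampled set by CV AUC.
        score = cross_val_score(LGBMClassifier(), X_e, y_e,
                                cv=cv, scoring="roc_auc").mean()
        if score > best_score:
            best_alpha, best_score, best_data = alpha, score, (X_e, y_e)
    # Step 3: second ENN pass on the winning dataset to further clean the boundary.
    X_final, y_final = EditedNearestNeighbours().fit_resample(*best_data)
    return best_alpha, X_final, y_final
```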

3.4. Hypothesis Test

3.4.1. Friedman Test and Nemenyi’s Post Hoc Test

Given that the AUC scores of various RTs across different datasets and classifiers do not adhere to a normal distribution, we employed the Friedman test [48] to assess the hypothesis and determine if there exists a significant difference. Once the null hypothesis is rejected, Nemenyi’s post hoc [48] test is subsequently conducted to delve deeper into the differences between specific pairings.
For n RTs and N observations, the results of each RT on each observation are ranked from best to worst and assigned the ordinal values $1, 2, 3, \ldots, n$. If several RTs perform equally well, they share the average of the corresponding ordinal values. Let the average ordinal value of the i-th RT be $k_i$; then $k_i$ approximately follows a normal distribution, and
$$\tau_{\chi^2} = \frac{n-1}{n} \cdot \frac{12N}{n^2 - 1} \sum_{i=1}^{n} \left(k_i - \frac{n+1}{2}\right)^2,$$
where $\tau_{\chi^2}$ is the chi-square statistic. An improved statistic $\tau_F$ is given by
$$\tau_F = \frac{(N-1)\,\tau_{\chi^2}}{N(n-1) - \tau_{\chi^2}},$$
where $\tau_F$ follows the F distribution with $n-1$ and $(n-1)(N-1)$ degrees of freedom.
The differences between RTs can then be represented by Nemenyi’s post hoc test, and a critical difference diagram is generated. If the two line segments do not overlap, there is a significant difference between the corresponding RTs.
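The following sketch illustrates this testing workflow on a toy score matrix (rows are observations, columns are RTs), using scipy for the Friedman test and the scikit-posthocs package, assumed to be available, for Nemenyi's post hoc test.

```python
# Illustrative Friedman + Nemenyi workflow on a matrix of AUC scores
# (rows = observations, e.g. dataset/classifier combinations; columns = RTs).
# scikit-posthocs is assumed to be available for the Nemenyi step.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

rng = np.random.default_rng(0)
auc = rng.uniform(0.6, 0.9, size=(44, 9))   # toy scores: 44 observations x 9 RTs

# Friedman test across the 9 RT columns.
stat, p_value = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    # Nemenyi post hoc test returns a matrix of pairwise p-values between RTs.
    pairwise_p = sp.posthoc_nemenyi_friedman(auc)
    print(pairwise_p.round(3))
```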

3.4.2. Kruskal–Wallis Test and Mann–Whitney U Test

The Kruskal–Wallis test [49] is a nonparametric test used to compare three or more independent samples. It can be used to check whether the results of different datasets and different classifiers come from the same population. It uses the H statistic to test the following:
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1),$$
where k is the number of groups, N is the total number of samples, $R_j$ is the rank sum of the j-th group, and $n_j$ is the number of samples in the j-th group. If the null hypothesis is rejected, the Mann–Whitney U test [50] is used for pairwise comparisons between small samples that do not follow a normal distribution. The result shows whether a particular RT scores differently across different datasets or different classifiers.
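As an illustration, the sketch below runs the Kruskal–Wallis test on toy groups of AUC scores (one group per dataset for a given RT) and, if the null hypothesis is rejected, follows up with pairwise Mann–Whitney U tests; the numbers are placeholders, not results from the paper.

```python
# Illustrative Kruskal-Wallis test across groups of AUC scores, followed by
# pairwise Mann-Whitney U tests when the null hypothesis is rejected.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

groups = {                                  # toy AUC scores per dataset
    "German":  [0.70, 0.72, 0.69, 0.71],
    "Taiwan":  [0.76, 0.78, 0.77, 0.75],
    "Prosper": [0.93, 0.94, 0.95, 0.94],
    "LC":      [0.66, 0.68, 0.70, 0.67],
}

h_stat, p_value = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    # Pairwise Mann-Whitney U tests between all dataset groups.
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        u_stat, p_ab = mannwhitneyu(a, b, alternative="two-sided")
        print(f"{name_a} vs {name_b}: U = {u_stat:.1f}, p = {p_ab:.4f}")
```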

4. Results

4.1. Overall Comparison of RTs

Overall, RTs are significantly different when predicting credit risk, as shown in Table 3. Table 4 shows the pairwise comparisons among the RTs with the largest differences. SENN and ENN emerge as the most effective among all RTs, as shown in Figure 3. Compared with not utilizing any RT, SENN showcases notable enhancements, boosting AUC, recall, F1-score, and G-mean by up to 20%, 20%, 30%, and 40%, respectively. For instance, in the LC dataset, the G-mean without RTs stands at 0.1168, while after SENN utilization, the G-mean escalates to 0.5715. This demonstrates a notable improvement in both accuracy and recall rates, leading to a more balanced state.
Table 3. Friedman test for RTs.
Table 4. Post hoc test for RTs.
Figure 3. The critical difference diagram of RTs on all datasets.

4.2. Comparison of Resampling Techniques in Terms of Different Datasets

In the results for the German dataset (Table 5), it is evident that all RTs, along with the ensemble balanced classifiers BBC and BRF, significantly contribute to the enhancement of the recall score. This indicates a notable improvement in the classifiers’ ability to identify defaulting customers. Particularly noteworthy is the combined effect of CC, which emerges as the top-ranking approach. This can be attributed to the fact that, with an IR of 2.33, the training set still retains 240 minority class samples after undersampling. Consequently, it is sufficiently equipped to handle the test set comprising only 60 minority class samples, thus resulting in a relatively low prediction difficulty. However, according to the Friedman test, no significant difference is observed between several resampling methods. Furthermore, there is no discernible disparity with the NONE dataset.
Table 5. The AUC score of the German dataset by RTs on different classifiers.
The outcomes obtained from Taiwan (Table 6) and Prosper (Table 7) underscore a considerable degree of variability among the different RTs. Through Nemenyi’s post hoc tests, we confirm that SENN and ENN emerge as the most effective resampling methods, as Figure 4 and Figure 5 demonstrate. While their efficacy may vary across different classifiers, there is no doubt regarding their applicability across all classifiers. Conversely, CC, which demonstrates dominance in small datasets, ranks last in medium and large datasets. Surprisingly, leaving the dataset untouched (NONE) outperforms the mass removal of majority class samples by CC.
Table 6. The AUC score of the Taiwan dataset by RTs on different classifiers.
Table 7. The AUC score of the Prosper dataset by RTs on different classifiers.
Figure 4. The critical difference diagram of RTs on the Taiwan dataset.
Figure 5. The critical difference diagram of RTs on the Prosper dataset.
In LC datasets characterized by high IR, there is substantial variability among different resampling methods (Table 8). Notably, CC, SMOTE, ADA, and SSMO are found to be ineffective, failing to surpass the performance of NONE (Figure 6). This is particularly evident in terms of F1-score, highlighting the inability of these techniques to enhance model stability and achieve a balanced trade-off between recall and precision. Only SENN emerges as a reliable technique under such circumstances.
Table 8. The AUC score of the LC dataset by RTs on different classifiers.
Figure 6. The critical difference diagram of RTs on the LC dataset.
Further examination of the relationship between different RTs and datasets can be conducted through the Kruskal–Wallis test (Table 9). The hypothesis posits that all techniques exhibit significant differences attributable to variations in datasets. For instance, in the case of SENN, its application yields significantly different effects across all subgroups, except for the absence of a significant difference between the German and LC datasets (Table 10). Moreover, the disparity between Prosper and Taiwan at the 1% significance level suggests that SENN’s efficacy varies depending on the dataset size, while the significant difference between Prosper and LC at the 1% level indicates that SENN’s effectiveness is influenced by the IR of the dataset.
Table 9. Kruskal–Wallis test for RTs in terms of different datasets.
Table 10. Post hoc test for SENN with different datasets.

4.3. Comparison of Resampling Techniques in Terms of Different Classifiers

We categorized all classifiers into three groups: individual classifiers, ensemble classifiers, and balanced classifiers. The Kruskal–Wallis test reveals no significant difference between the different types of classifiers except for NONE (Table 11). This suggests that the utilization of RTs can effectively bridge the performance gap between individual and ensemble classifiers, significantly enhancing the effectiveness of weaker classifiers. However, the extent of enhancement provided by different RTs does not exhibit a significant difference.
Table 11. Kruskal–Wallis test for RTs in terms of different classifiers.
According to the results of the two independent samples Mann–Whitney U test on grouped variables, the p-value of single classifiers and balanced classifiers on NONE is 0.025 **, indicating statistical significance at the 5% level (Table 12). This signifies a significant difference between single classifiers and balanced classifiers when no RTs are utilized. The magnitude of the difference, as indicated by a Cohen’s d value of 0.692, falls within the medium range. Conversely, ensemble classifiers and balanced classifiers exhibit a smaller magnitude of difference. This implies that balanced classifiers can effectively address CI problems in the absence of RTs, making them a valuable balancing technique to consider.
Table 12. Post hoc test for NONE with classifier groups.

4.4. Experiments with SH-SENN

Significant outcomes were observed with SH-SENN on datasets characterized by high IR (Table 13). Irrespective of the strategy ratio α employed, SH-SENN consistently enhances the prediction performance of ensemble classifiers and balanced classifiers. This enhancement reaches its peak when α approaches 0.5, followed by a diminishing trend. However, for individual classifiers, SH-SENN proves to be less effective than SENN when α is below 0.4. Notably, SH-SENN outperforms SENN when α falls between 0.4 and 0.7. Overall, the results highlight SH-SENN (0.5) as the most advantageous and effective new RT, as shown in Figure 7.
Table 13. The AUC score of SENN and SH-SENN by different classifiers on the LC dataset.
Figure 7. The box plot of the AUC score for SENN and SH-SENN on the LC dataset. The red dots are outliers, indicating the lowest or highest AUC value that each strategy can achieve.
The Friedman test results reveal that SH-SENN exhibits varying degrees of significant differences and advantages over NONE, ADA, SMO, BSMO, SSMO, CC, and STOM (Table 14). Notably, the AUC can reach a maximum of 0.784. Post hoc tests confirm that although there is no significant difference between SH-SENN (0.5) and the previous champion SENN (Figure 8), it still secures the top rank among all RTs with a substantial advantage.
Table 14. Post hoc test for SH-SENN with RTs on the LC dataset.
Figure 8. The critical difference diagram of RTs with SH-SENN (0.5) on the LC dataset.
To compare the performance of SH-SENN under different IR conditions, we also included the Prosper dataset in this experiment. This dataset is the same size as the LC dataset and has an IR of 4.97, representing a normally imbalanced dataset. Since positive samples already comprise more than 20% of the negative samples in the Prosper training set, the strategy ratio α for this round ranged from 0.3 to 0.9 (an α below 0.2 would require undersampling as the first step, whereas SMOTE, the first step of SH-SENN, is an oversampling method).
Results were compared with SENN, which ranked first on the Prosper dataset in the previous experiment. The findings (Table 15) reveal that, with SH-SENN applied, all classifiers except LR and RF achieved higher AUC values. This suggests that SH-SENN can outperform SENN even on normally imbalanced datasets, albeit with a smaller improvement than on extreme IR datasets.
Table 15. The AUC score of SENN and SH-SENN by different classifiers on the Prosper dataset.
A significant difference from the results in the LC dataset was that the SH-SENN (0.9) strategy yielded the best results, contrary to the previous SH-SENN (0.5) strategy. The average AUC of SH-SENN (0.9) across all classifiers reached 0.9465, slightly higher than SENN’s average AUC of 0.9464 (Figure 9). This discrepancy arises because minority samples in normally imbalanced datasets contain more information than extremely imbalanced minority samples, and there is less noise when oversampling to 90% of the majority sample. Experiments demonstrate that the strategy ratio of SH-SENN should decrease with an increasing IR to achieve better results.
Figure 9. The box plot of the AUC score for SENN and SH-SENN on the Prosper dataset. The red dots are outliers, indicating the lowest or highest AUC value that each strategy can achieve.
Similarly, the Friedman test results indicate significant differences between SH-SENN (0.9) and NONE, ADA, SMO, BSMO, SSMO, CC, and STOM in the Prosper dataset. Subsequent post hoc tests revealed that SH-SENN (0.9) surpassed all other RTs by a considerable margin (Figure 10). These two sets of experiments collectively demonstrate that SH-SENN can achieve comparable effectiveness to other RTs across various IR datasets, albeit requiring different strategy ratios (Table 16). Moreover, the impact of SH-SENN on classifier enhancement is more pronounced in high IR datasets compared with those with common IR levels.
Figure 10. The critical difference diagram of RTs with SH-SENN (0.9) on the Prosper dataset.
Table 16. Post hoc test for SH-SENN with RTs on the Prosper dataset.

4.5. Discussion and Limitation

Overall, the experimental results align with our discussion in Section 2. We contend that the CI problem does not directly impact the classifier’s predictive ability but rather obscures the decision boundary, thereby weakening its performance. Certain RTs, such as SMOTE, ADASYN, and cluster centroids, address the CI problem by either oversampling or undersampling to bring the classes closer to equilibrium. However, they merely generate new minority class samples or remove majority class samples without specifically optimizing the decision boundary or addressing the issue of overlapping samples from different classes. Consequently, these techniques prove ineffective across various datasets. SVMSMOTE and BorderlineSMOTE achieve superior results because they focus on generating minority class samples along the border, thereby enhancing the clarity of the decision boundary. Moreover, in SMOTEENN, the ENN method supplements the limitations of the SMOTE-only oversampling approach, which tends to marginalize data distribution. Unlike SMOTETomek, which combines oversampling and undersampling but removes entire pairs, SMOTEENN selectively eliminates examples that do not align with neighboring categorizations. As a result, ENN preserves more information than Tomek link, rendering SMOTEENN more stable and effective across diverse datasets. While SMOTETomek excels in datasets with pronounced class overlap, SMOTEENN tends to yield better results in credit datasets characterized by higher feature dimensions and greater diversity of feature types.
From the dataset perspective, small datasets characterized by limited sample sizes, data structures, and small training sets exhibit no significant disparities among different resampling RTs. Even with oversampling, the potential for expanding the dataset’s information content remains limited. Conversely, medium and large datasets demonstrate similarities in their RT selection. This can be attributed to the classifiers’ capacity to learn sufficient predictive information when the training set is relatively ample. However, further improvement in prediction ability necessitates optimizing decision boundaries in addition to resampling, making combined sampling more effective. On the contrary, large datasets with a high IR demand cautious treatment. These datasets feature high-dimensional features, complex data structures, and sparse minority class samples. Consequently, classifiers may struggle to recognize positive samples unseen during training, resulting in low recall scores. Oversampling the minority class to excessively high proportions, such as with a SMOTE ratio of 1:1 or 1:0.9, may lead to an accumulation of minority class samples without clarifying the decision boundary, potentially generating new noise. In such scenarios, random oversampling may outperform SMOTE [30]. To address extreme CI, it is advisable to employ a smaller ratio oversampling strategy to prevent an abundance of minority class samples from becoming noise. Emphasizing the handling of boundary points and class overlapping is crucial. SH-SENN stands out as a superior technique due to its strategic oversampling approach and dual handling of boundary points, which contribute to achieving better results compared with other RTs.
From the classifier’s standpoint, ensemble classifiers tend to benefit more from RTs compared with individual classifiers due to their enhanced learning capabilities. This is particularly evident in the boosting family of classifiers, such as XGBoost. When employing oversampling or combined sampling RTs, the ensemble classifier can capitalize on the increased effective information within the dataset, thereby maximizing its predictive performance. Conversely, when using RTs like cluster centroids, which drastically reduce the amount of information, the ensemble classifier’s performance remains robust, and even regular individual classifiers can achieve satisfactory results. While ensemble classifiers offer hyperparameters to adjust class weights, predicting these weights for the minority class in real-world scenarios is impractical. Furthermore, presetting these weights is not feasible; in this study, the test set retains the same IR as the training set to ensure that different RTs can be compared fairly and that prediction difficulty on the test set remains consistent. For instance, suppose a bank encounters 10 defaulting credit customers monthly, with the bank approving 10 applicants daily. These defaulters may be distributed over 30 days or concentrated on a single day. Processing the training set before model fitting therefore allows a robust classifier to be constructed in advance.
Lastly, concerning the credit risk prediction problem, real credit data often exhibit high feature dimensions and diverse data types. Existing studies frequently concentrate solely on comparing balancing methods and integrating them with classification models, overlooking the practical significance of aiding credit institutions in addressing the CI problem. Their findings tend to be purely theoretical and algorithmic, failing to address the core issue. For instance, clustering-based undersampling may perform well only with specific datasets and models. If a credit institution modifies user profiles, incorporates new features, or includes audio and video features, the original sampling approach may no longer be suitable for the updated dataset. Thus, applicability becomes a concern. The SH-SENN sampling technique proposed in this study is not a rigid algorithm but rather a versatile framework. The undersampling technique within this framework is ENN, but it can be substituted with other undersampling methods to create new algorithms while remaining within the framework’s conceptual scope. Additionally, SH-SENN employs the more representative Lending Club dataset for experimentation and achieves promising results. This illustrates the framework’s potential for extension to analogous credit datasets and its applicability to classification tasks facing CI challenges. For instance, food safety regulation represents a critical concern in Africa, where establishing an early warning system to oversee food quality is imperative. Given the high stakes involved in safeguarding human life, the testing of positive samples becomes more exacting, necessitating robust and balanced datasets for constructing monitoring models. Even a marginal enhancement in model performance achieved by methods like SH-SENN could potentially safeguard the health of numerous individuals. Similar applications can also be extrapolated to medical disease detection and the machinery industry.
SH-SENN also possesses potential limitations. In comparison with other RTs, SH-SENN requires a longer time for resampling due to its utilization of double ENN. This extended duration arises because both SMOTE and ENN must compute nearest neighbors, which involves distance-based computations. Consequently, the process consumes more time, particularly in large datasets with high-dimensional features. Moreover, our experiments did not explore the compatibility of SH-SENN with deep learning algorithms such as artificial neural networks. As previously mentioned, deep learning algorithms currently do not offer significant advantages in credit risk prediction and are questioned by stakeholders because of their black-box nature. Nonetheless, deep learning undoubtedly represents a crucial avenue for future research. We expect that combining SH-SENN with deep learning algorithms will yield superior results.

5. Conclusions

We conducted a comprehensive comparison and analysis of various RTs in the experiment, introducing a novel RT tailored for extremely imbalanced datasets. The conclusion can be summarized as follows:
  • SMOTEENN significantly enhances dataset quality before classifier training, consistently improving prediction performance across all selected credit datasets. Compared with the original training set without SMOTEENN preprocessing, AUC values see an increase of 2–4%, and recall values show enhancements of up to 30%. The effectiveness of RTs varies significantly depending on dataset size and IR. For small-sized datasets with a low IR, the choice of RT does not yield significant differences. However, for medium and large-sized datasets with varying IRs, RTs capable of managing decision boundaries and class coincidence points yield superior results. Notably, SMOTEENN stands out in this regard. RTs particularly boost ensemble classifiers over single classifiers. Balanced ensemble classifiers can perform reasonably well without preprocessing RTs, although their predictive power is not as stable as classifiers with RTs applied in advance. Moreover, the choice of classifiers for credit approval should consider interpretability, as high-performance ensemble classifiers may lack interpretability, posing challenges for acceptance by stakeholders. Thus, classifiers like logistic regression and decision trees, known for their interpretability and fast execution, remain widely used despite their potentially poorer performance. Computational cost and interpretability should be key factors when selecting RTs.
  • For SH-SENN, first, the new SH-SENN demonstrates outstanding performance in handling extremely imbalanced large datasets. This is attributed to its focus on addressing decision boundary points and noise points after oversampling. In real-world scenarios, credit data are often intricate, featuring time-varying attributes and numerous sparse categorical variables. Datasets exhibiting high IR and significant noise, such as those from Lending Club Inc., are commonplace. SH-SENN emerges as the optimal solution meeting these realistic requirements. Second, as credit datasets grow increasingly complex, there arises a need for high-performing classifiers to replace traditional scorecard and logistic regression (LR) algorithms. Emerging classifiers like CatBoost prove to be suitable candidates for credit datasets owing to their improved handling of categorical variables. However, they still fall short in effectively addressing CI concerns alone. CatBoost also requires complementary techniques to enhance its efficacy. Our experiments demonstrate that SH-SENN significantly enhances the predictive capabilities of ensemble classifiers compared with individual classifiers. SH-SENN outperforms all other strategies ranging from 0.1 to 0.9, including established techniques like SMOTEENN. This enhancement results in a 1–5% improvement in AUC for various classifiers. Notably, the improvement for CatBoost alone can nearly reach 2%. Such enhancements are highly appealing for credit bureaus, where even a 1% improvement can potentially help them avoid millions of dollars in losses.
In light of our research, future directions include employing additional model evaluation metrics for comprehensive comparisons, exploring relationships between IR and RTs using simulated datasets, and investigating RT combinations tailored for high-performance ensemble classifiers.

Author Contributions

Conceptualization, Z.Z. and T.C.; methodology, Z.Z. and T.C.; validation, Z.Z. and T.C.; formal analysis, Z.Z.; investigation, Z.Z., T.C. and S.D.; resources, T.C., J.L. and A.G.B.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z. and T.C.; visualization, Z.Z. and S.D.; supervision, T.C.; project administration, T.C.; funding acquisition, T.C. and A.G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This Project is supported by Yunnan University of Finance and Economics Scientific Research Fund Project of China (Grant number 2021B01). This project is also supported by Ningbo Natural Science Foundation, China (Project ID 2023J194), by the Ningbo Government, China (Project ID 2021B-008-C), and by University of Nottingham Ningbo China (UNNC) Education Foundation (Project ID LDS202303).

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at https://archive.ics.uci.edu (accessed on 15 September 2023), https://www.prosper.com/credit-card (accessed on 15 September 2023), and https://www.lendingclub.com (accessed on 15 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CI: class imbalance
RT: resampling technique
LR: logistic regression
KNN: k-nearest neighbors
NB: naïve Bayes
SVC: support vector machine classifier
DT: decision tree
RF: random forest classifier
XGB: XGBoost classifier
LGBM: LightGBM classifier
CAT: CatBoost classifier
BBC: BalancedBaggingClassifier
BRF: BalancedRandomForestClassifier
ADA: training set applied with ADASYN
SMO: training set applied with SMOTE
BSMO: training set applied with BorderlineSMOTE
SSMO: training set applied with SVMSMOTE
CC: training set applied with cluster centroids
ENN: training set applied with ENN
STOM: training set applied with SMOTETomek
SENN: training set applied with SMOTEENN
NONE: training set without any balancing technique
SH-SENN: training set applied with SH-SENN

References

  1. Henley, W.; Hand, D.J. A k-nearest-neighbour classifier for assessing consumer credit risk. J. R. Stat. Soc. 1996, 45, 77–95. [Google Scholar] [CrossRef]
  2. Abellán, J.; Castellano, J.G. A comparative study on base classifiers in ensemble methods for credit scoring. Expert Syst. Appl. 2017, 73, 1–10. [Google Scholar] [CrossRef]
  3. Tsai, C.F.; Wu, J.W. Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst. Appl. 2008, 34, 2639–2649. [Google Scholar] [CrossRef]
  4. Andrés Alonso, J.M.C. Machine Learning in Credit Risk: Measuring the Dilemma between Prediction and Supervisory Cost; Banco de España: Madrid, Spain, 2020. [Google Scholar]
  5. Ding, S.; Cui, T.; Bellotti, A.; Abedin, M.; Lucey, B. The role of feature importance in predicting corporate financial distress in pre and post COVID periods: Evidence from China. Int. Rev. Financ. Anal. 2023, 90, 102851. [Google Scholar] [CrossRef]
  6. Wang, L. Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Appl. Soft Comput. 2022, 114, 108153. [Google Scholar] [CrossRef]
  7. Moscato, V.; Picariello, A.; Sperlí, G. A benchmark of machine learning approaches for credit score prediction. Expert Syst. Appl. 2021, 165, 113986. [Google Scholar] [CrossRef]
  8. García, V.; Marqués, A.I.; Sánchez, J.S. Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction. Inf. Fusion 2019, 47, 88–101. [Google Scholar] [CrossRef]
  9. Haixiang, G.; Li, Y.; Shang, J.; Mingyun, G.; Yuanyue, H.; Gong, B. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2016, 73, 220–239. [Google Scholar] [CrossRef]
  10. García, V.; Marqués, A.I.; Sánchez, J.S. An insight into the experimental design for credit risk and corporate bankruptcy prediction systems. J. Intell. Inf. Syst. 2015, 44, 159–189. [Google Scholar] [CrossRef]
  11. Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134. [Google Scholar] [CrossRef]
  12. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  13. Cui, T.; Li, J.; John, W.; Andrew, P. An ensemble based Genetic Programming system to predict English football premier league games. In Proceedings of the 2013 IEEE Symposium Series on Computational Intelligence (SSCI2013), Singapore, 16–19 April 2013; pp. 138–143. [Google Scholar] [CrossRef]
  14. Fiore, U.; De Santis, A.; Perla, F.; Zanetti, P.; Palmieri, F. Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Inf. Sci. 2019, 479, 448–455. [Google Scholar] [CrossRef]
  15. Jiang, C.; Lu, W.; Wang, Z.; Ding, Y. Benchmarking state-of-the-art imbalanced data learning approaches for credit scoring. Expert Syst. Appl. 2023, 213, 118878. [Google Scholar] [CrossRef]
  16. Ding, S.; Cui, T.; Zhang, Y. Incorporating the RMB internationalization effect into its exchange rate volatility forecasting. N. Am. J. Econ. Financ. 2020, 54, 101103. [Google Scholar] [CrossRef]
  17. Ding, S.; Cui, T.; Zheng, D.; Du, M. The effects of commodity financialization on commodity market volatility. Resour. Policy. 2021, 73, 102220. [Google Scholar] [CrossRef]
  18. Zhu, R.; Guo, Y.; Xue, J.H. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recognit. Lett. 2020, 133, 217–223. [Google Scholar] [CrossRef]
  19. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar]
  20. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  21. Caouette, J.; Altman, E.; Narayanan, P.; Nimmo, R. Managing Credit Risk: The Great Challenge for the Global Financial Markets, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2011; pp. 349–365. [Google Scholar] [CrossRef]
  22. Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  23. Xia, Y.; Liu, C.; Liu, N. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electron. Commer. Res. Appl. 2017, 24, 30–49. [Google Scholar]
  24. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-Imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 539–550. [Google Scholar] [CrossRef]
  25. Liu, B.; Chen, K. Loan risk prediction method based on SMOTE and XGBoost. Comput. Mod. 2020, 2, 26–30. [Google Scholar]
  26. Zięba, M.; Tomczak, J.M. Boosted SVM with active learning strategy for imbalanced data. Soft Comput. 2015, 19, 3357–3368. [Google Scholar] [CrossRef]
  27. Ding, S.; Cui, T.; Wu, X.; Du, M. Supply chain management based on volatility clustering: The effect of CBDC volatility. Res. Int. Bus. Financ. 2022, 62, 101690. [Google Scholar] [CrossRef]
  28. Yen, S.J.; Lee, Y.S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  29. García, V.; Marqués, A.I.; Sánchez, J.S. Improving Risk Predictions by Preprocessing Imbalanced Credit Data. In Neural Information Processing; Huang, T., Zeng, Z., Li, C., Leung, C.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 68–75. [Google Scholar]
  30. Xiao, J.; Wang, Y.; Chen, J.; Xie, L.; Huang, J. Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf. Sci. 2021, 569, 508–526. [Google Scholar] [CrossRef]
  31. Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453. [Google Scholar] [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  33. Ma, X.; Sha, J.; Wang, D.; Yu, Y.; Yang, Q.; Niu, X. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 2018, 31, 24–39. [Google Scholar] [CrossRef]
  34. Kou, G.; Chen, H.; Hefni, M.A. Improved hybrid resampling and ensemble model for imbalance learning and credit evaluation. J. Manag. Sci. Eng. 2022, 7, 511–529. [Google Scholar] [CrossRef]
  35. Haibo, H.; Yang, B.; Garcia, E.A.; Shutao, L. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  36. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  37. Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 2009, 3, 4–21. [Google Scholar] [CrossRef]
  38. Han, J.; Kamber, M.; Pei, J. (Eds.) 3—Data Preprocessing. In Data Mining, 3rd ed.; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 83–124. [Google Scholar] [CrossRef]
  39. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar] [CrossRef]
  40. Batista, G.E.A.P.A.; Bazzan, A.L.C.; Monard, M.C. Balancing Training Data for Automated Annotation of Keywords: A Case Study. WOB 2003, 3, 1–9. [Google Scholar]
  41. Tomek, I. Two Modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 769–772. [Google Scholar] [CrossRef]
  42. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  43. Zhang, X.; Yu, L. Consumer credit risk assessment: A review from the state-of-the-art classification algorithms, data traits, and learning methods. Expert Syst. Appl. 2024, 237, 121484. [Google Scholar] [CrossRef]
  44. Chai, E.; Wei, Y.; Cui, T.; Ren, J.; Ding, S. An Efficient Asymmetric Nonlinear Activation Function for Deep Neural Networks. Symmetry 2022, 14, 1027. [Google Scholar] [CrossRef]
  45. Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl. Soft Comput. 2020, 91, 106263. [Google Scholar] [CrossRef]
  46. Ferri, C.; Hernández-Orallo, J.; Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognit. Lett. 2009, 30, 27–38. [Google Scholar] [CrossRef]
  47. Markov, A.; Seleznyova, Z.; Lapshin, V. Credit scoring methods: Latest trends and points to consider. J. Financ. Data Sci. 2022, 8, 180–201. [Google Scholar] [CrossRef]
  48. Pereira, D.; Afonso, A.; Medeiros, F. Overview of Friedman’s Test and Post-hoc Analysis. Commun. Stat.-Simul. Comput. 2015, 44, 2636–2653. [Google Scholar] [CrossRef]
  49. McKight, P.E.; Najab, J. Kruskal-Wallis Test. In The Concise Encyclopedia of Statistics; Springer: New York, NY, USA, 2008; pp. 288–290. [Google Scholar] [CrossRef]
  50. Meléndez, R.; Giraldo, R.; Leiva, V. Sign, Wilcoxon and Mann-Whitney Tests for Functional Data: An Approach Based on Random Projections. Mathematics 2021, 9, 44. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
