Article

Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction

1 School of Statistics and Mathematics, Yunnan University of Finance and Economics, No. 237, LongQuan Rd., Kunming 650221, China
2 School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China
3 School of Business, Ningbo University, 818 Fenghua Road, Ningbo 315211, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(5), 701; https://doi.org/10.3390/math12050701
Submission received: 29 January 2024 / Revised: 23 February 2024 / Accepted: 26 February 2024 / Published: 28 February 2024
(This article belongs to the Special Issue Quantitative Finance with Mathematical Modelling)

Abstract

Credit risk prediction heavily relies on historical data provided by financial institutions. The goal is to identify commonalities among defaulting users based on existing information. However, data on defaulters are often limited, leading to a concentration of credit data where positive samples (defaults) are significantly fewer than negative samples (nondefaults). This poses a serious challenge known as the class imbalance problem, which can substantially impact data quality and predictive model effectiveness. To address the problem, various resampling techniques have been proposed and studied extensively. However, despite ongoing research, there is no consensus on the most effective technique. The choice of resampling technique is closely related to the dataset size and imbalance ratio, and its effectiveness varies across different classifiers. Moreover, there is a notable gap in research concerning suitable techniques for extremely imbalanced datasets. Therefore, this study compares popular resampling techniques across different datasets and classifiers and proposes a novel hybrid sampling method tailored for extremely imbalanced datasets. Our experimental results demonstrate that this new technique significantly enhances classifier predictive performance, shedding light on effective strategies for managing the class imbalance problem in credit risk prediction.

1. Introduction

Credit, as defined by financial institutions such as banks and lending companies [1], represents a vital loan certificate issued to individuals or businesses. This certification mechanism plays a pivotal role in ensuring the smooth functioning of the financial sector, contingent upon comprehensive evaluations of creditworthiness. The evaluation process inherently gives rise to concerns regarding credit risk, encompassing the potential default risk associated with borrowers. Assessing credit risk entails the utilization of credit scoring, a method aimed at distinguishing between “good” and “bad” customers [2]. This process is often referred to as credit risk prediction in numerous studies [3,4,5,6,7]. Presently, the predominant approaches to classifying credit risk involve traditional statistical models and machine learning models, typically addressing binary or multiple classification problems.
Credit data often exhibit a high number of negative samples and a scarcity of positive samples (default samples), a phenomenon known as the class imbalance (CI) problem [8]. Failure to address this issue may result in significant classifier bias [9], diminished accuracy and recall [10], and weak predictive capabilities, ultimately leading to financial institutions experiencing losses due to customer defaults [11]. For instance, in a dataset comprising 1000 observations labeled as normal customers and only 10 labeled as default customers, a classifier could achieve 99% accuracy without correctly identifying any defaults. Clearly, such a classifier lacks the robustness required. To mitigate the CI problem, various balancing techniques are employed, either at the dataset level or algorithmically. Dataset-level approaches include random oversampling (ROS), random undersampling (RUS), and the synthetic minority oversampling technique (SMOTE) [12], while algorithmic methods mainly involve cost-sensitive algorithms. Additionally, ensemble algorithms [13] and deep learning techniques, such as generative adversarial networks (GANs) [14], are gradually gaining traction for addressing CI issues.
Indeed, there is no one-size-fits-all solution to the CI problem that universally applies to all credit risk prediction models [15,16,17]. On the one hand, the efficacy of approaches is constrained by various dataset characteristics such as size, feature dimensions, user profiles, and imbalance ratio (IR). Notably, higher IR and feature dimensions often correlate with poorer classification performance [18]. On the other hand, existing balancing techniques exhibit their own limitations. For instance, the widely used oversampling technique, SMOTE, has faced criticism for its failure to consider data distribution comprehensively. It solely generates new minority samples along the path from the nearest minority class to the boundary. Conversely, some undersampling methods are deemed outdated as they discard a substantial number of majority class samples, potentially leading to inadequately trained models due to small datasets. Additionally, cost-sensitive learning hinges on class weight adjustment, which lacks interpretability and scalability [11].
In this study, we address the dataset size and IR considerations, and evaluate the performance of oversampling, undersampling, and combined sampling methods across various machine learning classifiers. Notably, we include CatBoost [19], a novel classifier that has been under-represented in prior comparisons. Furthermore, we introduce a novel hybrid resampling framework, strategic hybrid SMOTE with double edited nearest neighbors (SH-SENN), which demonstrates superior performance in handling extremely imbalanced datasets and substantially enhances the predictive capabilities of ensemble learning classifiers.
The contributions made in this paper can be summarized as follows: First, most credit risk prediction studies within the CI context rely on widely adopted benchmark credit datasets, which do not represent real data well; real data tend to be noisier and more complex. This study integrates benchmark and real private datasets, enhancing the practical relevance of the proposed framework and making it more likely to create practical value for financial institutions. Second, although advanced ensemble classifiers such as LightGBM [20] and CatBoost, which have emerged in recent years, are gradually being adopted in credit risk prediction due to their speed and their ability to handle categorical variables more effectively, they still fail to provide new solutions to the CI problem. Moreover, few studies assess their adaptability to traditional balancing methods. Our study confirms that SH-SENN significantly enhances the performance of such new classifiers, especially on real datasets with severe CI issues, contributing to their real-world applicability. Third, the fundamental concept of SH-SENN in dealing with extreme IR datasets is to improve data quality by oversampling to delineate clearer decision boundaries and by undersampling multiple times to address boundary noise and overlapping samples between classes, thereby enhancing the classifier's predictive power. The sampling strategy is tailored to the IR rather than blindly striving for class balance. Hence, SH-SENN is suitable for large, extremely imbalanced credit datasets with numerous features. For instance, credit data from large financial institutions often comprise millions of entries and hundreds of features, posing challenges beyond simple binary classification. SH-SENN effectively enhances dataset quality to tackle such complex real-world problems.
In Section 2, we provide a comprehensive review of different resampling techniques (RTs), highlighting their effectiveness, advantages, and disadvantages in addressing credit risk prediction problems. Following this, in Section 3, we employ various RTs on four distinct datasets, comparing and analyzing their performances across (1) different machine learning classifiers and (2) datasets varying in size and IR. Concurrently, we introduce and demonstrate the effectiveness of a novel approach, SH-SENN. Lastly, the research findings are summarized in the concluding section.

2. Background and Related Works

In the domain of credit risk prediction, accurately identifying potential defaulting users holds paramount importance [21]. Banks meticulously gather user characteristics and devise scoring systems to scrutinize customers and allocate loan amounts judiciously. Upon identifying potential risks, they may either reduce the loan quota or decline lending altogether. This dynamic is evident in the data, where positive samples (minority class) bear greater significance than negative samples (majority class). This poses a dilemma, as the classifier requires substantial information about the minority class to effectively identify positive samples, yet it inevitably tends to be more influenced by the majority class [22]. Consequently, oversampling and cost-sensitive algorithms have been favored in addressing credit risk prediction scenarios. The former directly enhances the proportion of samples in the minority class, while the latter factors in that misclassifying negatives is less detrimental than misclassifying positives [23].
To mitigate the risk of underfitting arising from the potential omission of vital information by undersampling techniques, algorithms such as EasyEnsemble and BalanceCascade [24] have been developed. These algorithms aim to minimize the probability of discarding crucial information during the undersampling process. EasyEnsemble combines the anti-underfitting capacity of boosting with the anti-overfitting capability of bagging. Conversely, to alleviate the risk of overfitting associated with oversampling, distance-based k-neighborhood methods for resampling are considered more effective. Notably, the synthetic minority oversampling technique (SMOTE) has garnered attention in recent years, particularly in credit scenarios characterized by an imbalance between “good and bad customers”.
In the realm of loan default prediction, researchers have utilized the SMOTE algorithm in various ways, emphasizing the criticality of information within the minority class. Studies suggest that SMOTE, or more boundary-point-oriented adaptive oversampling techniques like adaptive integrated oversampling, can yield superior results when modeling with such data [25]. Moreover, combining oversampling techniques with integrated learning has been proposed to mitigate overfitting risks. For instance, sampling combined with boosting methods and support vector machines, as well as a combination of adaptive integrated oversampling with support vector machines and boosting, have demonstrated promising results in empirical analyses [26].
Nonetheless, subsequent studies caution against excessively tightening criteria due to potential default risks, as rejecting numerous creditworthy users can significantly diminish bank earnings, sometimes surpassing losses incurred from a single defaulting user [11]. Over-reliance on oversampling techniques could exacerbate this inverse risk. However, this does not imply superiority of undersampling techniques, which exhibit distinct drawbacks, notably information loss from the majority class, particularly with clustering-based undersampling methods [27,28]. To harness the full potential of minority class samples while retaining information from majority class samples, comprehensive techniques combining oversampling and undersampling have emerged. Examples include SMOTE with Tomek links and SMOTE with edited nearest neighbors (ENNs), both of which have demonstrated enhancements in dataset quality and classifier performance [15]. In a comprehensive study conducted as early as 2012, ref. [29] designed a detailed examination of RTs. The study evaluated four undersampling, three oversampling, and one composite resampling technique across five datasets to ascertain the potential benefits for intelligent classifiers such as the multilayer perceptron (MLP) when using these techniques. The comparative analysis revealed that there is no one-size-fits-all solution with respect to the effectiveness of sampling techniques across all classifiers. However, it was observed that undersampling methods like neighborhood clean rule (NCL) and oversampling techniques like SMOTE and SMOTE + ENN consistently demonstrated stable performance. Notably, oversampling imparted a significant performance enhancement, particularly benefiting higher-performing intelligent classifiers.
On the other hand, prevailing class balancing experiments often strive to equalize the proportions of majority and minority classes, yet few studies have delved into addressing datasets exhibiting extreme imbalances. The IR, denoting the ratio of majority to minority samples, serves as a gauge for assessing the extent of class imbalance. Commonly used benchmark credit datasets typically exhibit IRs ranging from 2 to 10, such as the German credit dataset (IR: 2.33) and the Australia credit dataset (IR: 1.24), while certain private datasets may escalate to IRs of 10 to 30 [30]. Typically, larger sample sizes correlate with higher IRs. However, there exists no standardized criterion for defining extreme imbalance. An IR above 5 implies that merely 16.6% of positive samples are available, posing a considerable challenge for classifiers. Ref. [31] advocated for the use of gradient boosting and random forest algorithms to effectively handle datasets with extreme imbalance. Through experimentation with oversampling techniques, it was observed that an optimal class distribution should encompass 50% to 90% of the minority classes. In other words, it suffices to moderate the extreme imbalances to achieve a mild imbalance without necessitating an IR of 1. Conversely, ref. [18] employed simulation datasets to simulate varying IRs and found that higher IRs do not consistently lead to poorer classifier performance; rather, performance is significantly influenced by the feature dimensions of the dataset. Indeed, IR serves as one of the statistical features of the dataset, alongside feature dimension, dataset size, feature type, and resampling method, collectively impacting the final prediction outcome [30]. However, high IR alone does not inherently account for prediction difficulty; rather, it is the indistinct decision boundary stemming from too few minority class samples, overlapping due to resampling, and excessive noise that pose the primary challenges [15]. Thus, the primary objective of balancing techniques should focus on clarifying classification boundaries rather than merely striving for dataset balance. Ref. [15] echoes the sentiments of the aforementioned study, emphasizing the collective influence of IR on the efficacy of various RTs. Following a comparative analysis involving methods such as Tomek-link removal (Tomek), ENN, BorderlineSMOTE, adaptive integrated oversampling (ADASYN), and SMOTE + ENN, it was concluded that the complexity of RTs does not necessarily correlate with their ability to address datasets with higher IR. Importantly, it was observed that no single RT emerged as universally effective across all classification and CI problems.
RTs proactively address the CI problem during the data preprocessing stage. While numerous studies propose resolving the CI problem through adjustments within machine learning classifiers or by integrating balancing strategies directly into ensemble models, recent advancements in algorithms such as eXtreme Gradient Boosting (XGBoost) [32], LightGBM, and CatBoost offer hyperparameters capable of fine-tuning the weights of positive samples. Even amidst imbalanced datasets, these algorithms enable the objective function to prioritize information gleaned from minority class samples. Furthermore, incorporating resampling techniques within ensemble learning to balance each training subset yields models with heightened robustness compared with classifiers solely adjusting sample weights. For instance, bagging classifiers and random forests can be augmented with balancing techniques to ensure a portion of minority class samples in each training subset [33]. To compare the effectivenesses of various classifiers, ref. [31] conducted experiments across five datasets, incorporating various IRs. The study evaluated the performances of classifiers such as logistic regression, decision tree (C4.5), neural network, gradient boosting, k-nearest neighbors, support vector machines, and random forest, considering positive sample proportions ranging from 1% to 30%. Results from the experiments revealed that gradient boosting and random forest exhibited exceptional performance, particularly when handling datasets with extreme IR. Conversely, support vector machines, k-nearest neighbors, and decision tree (C4.5) struggled to effectively manage the CI problem. In conclusion, the study suggests that ensemble learning methods, specifically boosting and bagging, outperform individual classifiers when addressing imbalanced credit datasets, highlighting their efficacy in handling CI challenges.
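As an illustration of this weight-adjustment mechanism, the sketch below (a toy dataset and illustrative settings, not the configuration of this study) sets the scale_pos_weight hyperparameter of XGBoost, LightGBM, and CatBoost from the imbalance ratio so that the training loss emphasizes the minority (default) class:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Toy imbalanced data (~5% positives) standing in for a credit training set.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
ratio = (y == 0).sum() / (y == 1).sum()   # IR = N_maj / N_min

# scale_pos_weight up-weights positive (default) samples in the loss,
# an algorithm-level alternative to resampling the training data.
models = {
    "XGB":  XGBClassifier(scale_pos_weight=ratio),
    "LGBM": LGBMClassifier(scale_pos_weight=ratio),
    "CAT":  CatBoostClassifier(scale_pos_weight=ratio, verbose=0),
}
for name, model in models.items():
    model.fit(X, y)
```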
However, the effectiveness of solely relying on model weights to address the CI problem diminishes if RTs are not applied to the dataset beforehand [34]. Moreover, the embedding of resampling techniques within ensemble models significantly escalates computational costs, rendering it less efficient and more constrained when handling large datasets [4]. To address this issue, ref. [30] conducted a comprehensive comparison between various pairs of classifiers and RTs. Their objective was to identify dependable combinations of advanced RTs and classifiers capable of handling datasets with differing IR levels effectively. By conducting paired experiments involving nine RTs and nine classifiers, their findings revealed that the combination of RUS and random subspace consistently achieved satisfactory performance across most cases. Following closely behind was the combination of SMOTE + ENN and logistic regression. Interestingly, these results deviate from previous studies that tended to favor ensemble classifiers. Ref. [30] argue that even simple classifiers can achieve commendable performance, provided that suitable RTs are employed.

2.1. Oversampling

The synthetic minority oversampling technique (SMOTE) [12] is a distance-based method utilized to create synthetic samples within the minority class. Its fundamental principle revolves around the concept that samples sharing the same label within the feature space are considered neighbors. Leveraging the original positions of minority class samples, SMOTE establishes connections between these samples and their neighboring data points, subsequently generating new minority class samples along the connecting lines:
$x_{new} = x + random(0,1) \times |x - x_n|,$
where $x$ is an arbitrary minority class sample, $x_n$ is a nearest neighbor selected according to the set sampling ratio $n$, and $random(0,1)$ is a random factor that prevents duplicated samples.
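As a minimal illustration (hypothetical borrower features; the interpolation is written in the standard SMOTE form x + random(0,1) × (x_n − x)), the synthesis step can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_synthesize(x, x_n):
    """Create one synthetic minority sample on the segment joining a minority
    sample x and one of its minority-class nearest neighbors x_n."""
    gap = rng.uniform(0, 1)        # the random(0,1) factor in the formula above
    return x + gap * (x_n - x)     # interpolate along the connecting line

# Hypothetical borrower features (e.g., age, income) for two minority samples.
x, x_n = np.array([35.0, 12000.0]), np.array([41.0, 15000.0])
print(smote_synthesize(x, x_n))
```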
Adaptive integrated oversampling (ADASYN) [35] is similar to SMOTE in that new samples are synthesized from nearest neighbors, but it additionally assigns a weight to each minority class sample. The method is as follows:
  • Calculate the total number of new samples to be synthesized:
    $G = (N_{maj} - N_{min}) \times random(0,1),$
    where $N_{maj}$ and $N_{min}$ are the numbers of majority and minority class samples, respectively.
  • Calculate the weights of the minority class samples $\hat{w}_i$:
    $\hat{w}_i = \frac{w_i}{\sum_{i=1}^{N_{min}} w_i}, \quad w_i = \frac{k_{maj}}{k},$
    where $k$ is the number of nearest neighbors and $k_{maj}$ is the number of majority class samples among those $k$ neighbors.
  • Calculate the number of new samples to be synthesized from each minority class sample, $g_i = \hat{w}_i \times G$, and generate the new samples in the same way as SMOTE (a sketch of this allocation step follows the list).
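A small sketch of the allocation step above (assuming binary labels with 1 for the minority/default class; the random(0,1) factor appears here as the parameter beta) is:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, k=5, beta=1.0):
    """Allocation step of ADASYN: decide how many synthetic samples each
    minority point should receive, proportional to the share of majority
    neighbors around it (harder points near the boundary get more)."""
    X_min = X[y == 1]
    G = int((np.sum(y == 0) - np.sum(y == 1)) * beta)            # total new samples
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)                                # first neighbor is the point itself
    w = np.array([np.mean(y[neigh[1:]] == 0) for neigh in idx])  # w_i = k_maj / k
    w_hat = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
    return np.round(w_hat * G).astype(int)                       # g_i per minority sample
```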
BorderlineSMOTE [36] represents an enhancement over the conventional SMOTE technique, addressing its limitation of synthesizing minority class samples without regard for data distribution [11]. Specifically, BorderlineSMOTE selects only those minority class samples situated on the decision boundary for synthesis. Let the set of boundary points of the minority class $p$ be DANGER, with sample points $p_1, p_2, \ldots, p_n$. For each $p_i$, its $s$ nearest neighbors are calculated, and $dif_j$, $j = 1, 2, \ldots, s$, is the gap between $p_i$ and its $j$th nearest neighbor. The new synthesis formula is therefore as follows:
$x_{new} = p_i + random(0,1) \times dif_j, \quad j = 1, 2, \ldots, s.$
SVMSMOTE [37] is an improved variant of BorderlineSMOTE that uses an SVM for the boundary point decision making. Its synthesis formula is as follows:
$x_{new}^{+} = sv_i^{+} + \rho \times (nn[i][j] - sv_i^{+}),$
where $x_{new}^{+}$ is the newly synthesized positive sample, $sv_i^{+}$ is a positive class support vector, $\rho$ is a random number in the range $[0, 1]$, $nn$ is an array containing the $k$ positive nearest neighbors, and $nn[i][j]$ is the $j$th positive nearest neighbor of $sv_i^{+}$.
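All four oversamplers discussed above have implementations in the imbalanced-learn library; the sketch below applies them to a toy training set (default parameters, for illustration only):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data (~9:1) standing in for a credit training set.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

samplers = {
    "SMO":  SMOTE(random_state=42),
    "ADA":  ADASYN(random_state=42),
    "BSMO": BorderlineSMOTE(random_state=42),
    "SSMO": SVMSMOTE(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))   # class counts after oversampling
```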

2.2. Undersampling

Cluster centroids [38]. It is a class of undersampling techniques based on the k-means clustering algorithm. Cluster centers $k_1, k_2, k_3, \ldots$ are generated by repeated iteration, where $k_2$ is the center newly generated after deleting a portion of the majority class samples closest to $k_1$, and so on, until the numbers of positive and negative samples reach equilibrium. It is not difficult to see that this method requires the majority class samples to be sufficiently concentrated; otherwise, the sampling effect deteriorates.
Edited nearest neighbors (ENN) [39]. It is an undersampling method based on a modified k-nearest neighbors rule: for a given majority class sample $x_i$, its $k$ nearest neighbors $\{x_1, x_2, x_3, \ldots, x_n\}$ are found, and $x_i$ is removed if it is misclassified by those neighbors. This has the effect of removing majority class and minority class samples that lie close to each other.
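A brief usage sketch of these two undersamplers from imbalanced-learn (toy data and illustrative parameters) is given below:

```python
from collections import Counter
from imblearn.under_sampling import ClusterCentroids, EditedNearestNeighbours
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Cluster centroids: replaces majority samples with k-means centroids until balanced.
cc = ClusterCentroids(random_state=42)
X_cc, y_cc = cc.fit_resample(X, y)

# ENN: removes majority samples whose label disagrees with their k nearest neighbors.
enn = EditedNearestNeighbours(n_neighbors=3, kind_sel="all")
X_enn, y_enn = enn.fit_resample(X, y)

print("CC :", Counter(y_cc))    # balanced, but majority class heavily reduced
print("ENN:", Counter(y_enn))   # only boundary/noisy samples removed
```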

2.3. Combined Resampling

SMOTETomek [40]. It is an integrated sampling technique combining SMOTE oversampling with Tomek links [41]. Tomek links can be formulated in a binary classification task as follows: $x_{maj}$ and $x_{min}$ denote a majority class and a minority class sample, respectively, and $d(x_{maj}, x_{min})$ denotes the distance between them. If there is no observation $z$ such that $d(x_{maj}, z) < d(x_{maj}, x_{min})$ or $d(z, x_{min}) < d(x_{maj}, x_{min})$, then the pair $(x_{maj}, x_{min})$ is called a Tomek link. Oversampling with SMOTE first and then deleting all Tomek links makes the decision boundary clearer.
SMOTEENN [42]. It is an integrated sampling technique that combines SMOTE oversampling and ENN undersampling. It resembles SMOTETomek in its approach, both prioritizing the synthesis of minority class samples initially. However, unlike SMOTETomek, it opts not to delete samples in pairs; instead, it selectively removes majority class samples encircled by minority class instances. Due to the requirement for computing neighbors, SMOTEENN carries higher computational costs.
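Both combined techniques are available in imbalanced-learn; a brief usage sketch on toy data (default parameters, not our tuned experimental settings) is:

```python
from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

stom = SMOTETomek(random_state=42)   # SMOTE, then drop every Tomek-link pair
senn = SMOTEENN(random_state=42)     # SMOTE, then ENN cleaning of disagreeing samples

X_stom, y_stom = stom.fit_resample(X, y)
X_senn, y_senn = senn.fit_resample(X, y)
print("STOM:", Counter(y_stom))
print("SENN:", Counter(y_senn))
```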

3. Methodology

3.1. Framework and Evaluation Metrics

To explore the performance of various sampling techniques and machine learning models across diverse datasets, we meticulously cleaned, encoded, and feature-screened each dataset, partitioning them into training and test sets, as shown in Figure 1. To maintain consistent prior probabilities, the test and training sets were crafted to possess identical IR as the original dataset [10]. Notably, the test set remained untouched by any balancing techniques to ensure its purity.
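The IR-preserving split can be reproduced with a stratified partition, as in the sketch below (the toy dataset and the 70/30 split ratio are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for a cleaned and encoded credit dataset (~5% defaults).
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=42)

# A stratified split keeps the test set's IR identical to the original dataset;
# resampling is later applied to the training set only, never to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```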
The training set underwent RTs via ADASYN (ADA), SMOTE (SMO), BorderlineSMOTE (BSMO), SVMSMOTE (SSMO), cluster centroids (CCs), ENN, SMOTETomek (STOM), and SMOTEENN (SENN), resulting in nine distinct training sets alongside the original unprocessed training set. Each training set was further divided into five validation subsets for cross-validation, facilitating the determination of classifier hyperparameters. Classifiers were categorized into three groups: individual classifiers (logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), naïve Bayes (NB), decision tree (DT)), ensemble classifiers (random forest (RF), XGBoost (XGB), LightGBM (LGBM), CatBoost (CAT)), and balanced classifiers (balanced bagging classifier (BBC), balanced random forest (BRF)). The criterion for selecting classifiers is based on the summary in [43], which reviewed 281 articles related to credit risk models (deep learning algorithms such as artificial neural networks [44] and convolutional neural networks, although mentioned in that summary, were excluded from our experiment because of their complexity relative to individual classifiers and the higher risk of overfitting they entail, while their performance does not surpass that of other classifiers [45]). Subsequently, all training sets were subjected to evaluation using these 11 classifiers. We employed the tree-structured Parzen estimator (TPE) for Bayesian optimization to identify optimal hyperparameters and fit the models. Finally, the test set was introduced to the trained classifiers to derive their final evaluation scores. The evaluation metrics were as follows:
$\text{Recall} = \frac{TP}{TP + FN},$
$F_1\text{-}score = \frac{2TP}{2TP + FP + FN},$
$AUC = \frac{1}{2}\left(1 + \frac{TP}{TP + FN} - \frac{FP}{FP + TN}\right),$
$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$
$G\text{-}mean = \sqrt{\text{Recall} \times TNR}, \quad TNR = \frac{TN}{TN + FP},$
where TP, FP, FN, and TN are calculated according to Table 1.
Accuracy and recall stand as pivotal metrics in credit risk scenarios, yet their efficacy can be significantly impacted by the CI problem. We retained these metrics to scrutinize the effectiveness of various advanced sampling techniques. Additionally, as previously stated, we assert that minority class samples hold equal importance to majority class samples, and that the ramifications of overlooking a substantial number of good customers mirror those of missing a bad customer. Hence, we also introduce the G-mean, F1-score, and AUC as evaluation metrics, as suggested in [10,46]. Specifically, AUC will be utilized for subsequent hypothesis testing, as indicated by [10,31].
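For reference, the metrics above can be computed from a classifier's predictions as in the following sketch (a minimal helper assuming binary labels with 1 denoting default; roc_auc_score returns the full ROC AUC rather than the single-threshold approximation given above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Evaluation metrics of Section 3.1, computed from the confusion matrix
    (binary labels, 1 = default); AUC uses the predicted scores."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall   = tp / (tp + fn)
    f1       = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tnr      = tn / (tn + fp)
    g_mean   = np.sqrt(recall * tnr)
    auc      = roc_auc_score(y_true, y_score)
    return {"Recall": recall, "F1": f1, "Accuracy": accuracy,
            "G-mean": g_mean, "AUC": auc}
```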

3.2. Datasets

We chose two benchmark credit datasets, following the recommendation in [10]: the German credit dataset and the Taiwan credit dataset, along with two application-oriented private datasets sourced from Prosper Company and Lending Club Company, as shown in Table 2; each dataset is distinct in terms of size, feature dimension, and IR. The benchmark datasets, extensively utilized in credit risk research over decades, are readily accessible, facilitating the replication of our experiment. In contrast, private datasets, notably those from Lending Club, have gained popularity in the last 5 years due to their richer feature sets and larger data volumes [47], enabling the training of more complex models. Notably, the LC dataset stands out as an example of an extremely imbalanced and large dataset. For this dataset, we intend to employ the SH-SENN technique.

3.3. SH-SENN

The process is depicted in the accompanying Figure 2. First, we initiate SMOTE strategic oversampling, wherein minority class samples are sampled to represent 10% to 90% of the majority class. Subsequently, we conduct a search for neighboring samples, which are then removed. Within the 10 new datasets obtained, the ENN proximity strategy is further adjusted to identify new neighboring samples and undergo a subsequent deletion process. Finally, we evaluate the effectiveness of these 10 datasets on the validation set, select the optimal strategy, and apply it to the test set. The key departure from the traditional SMOTEENN lies in our utilization of diverse sampling strategies coupled with cross-verification during the SMOTE process. Moreover, following the initial ENN phase, the dataset undergoes re-evaluation, and adjustments are made to the strategy for selecting the nearest neighbor during the second ENN iteration.
Similar to SENN, SH-SENN also utilizes SMOTE for oversampling, followed by ENN for undersampling. However, there are two significant distinctions between them: First, SENN employs a single ENN undersampling process, whereas SH-SENN utilizes double ENN, meaning ENN is applied again after the regular SENN procedure. Second, while SENN is a mature and combined RT that can be directly applied to any imbalanced dataset, SH-SENN is a framework designed to address imbalance issues. Its approach varies according to the degree of imbalance within datasets. The strategy involves determining the proportion of minority class samples oversampled to majority class samples after the initial SENN step, denoted as α = N n m i n / N m a j , where N n m i n represents the number of minority class samples after sampling, and N m a j is the number of majority class samples. Typically, α falls within the range of [0.1, 1]. SH-SENN emphasizes treating the strategy as a hyperparameter of the classifier, participating in cross-validation to identify the most suitable α (note: this α is the outcome of the initial ENN participation). Subsequently, after determining the α value, ENN is employed for a second time. This treats the resampled dataset as new data, where ENN undersampling, adept at handling boundary noise, is utilized again to refine the decision boundary. From a technical standpoint, SH-SENN’s primary advantage over SENN lies in its ability to further reduce the introduction of new noise and interference items brought about by SENN. This advantage is particularly evident in extremely imbalanced datasets. These will be validated and presented in subsequent experiments. Lastly, SH-SENN is also an extensible framework; it will be explored in the final section.
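A minimal sketch of this procedure (an illustrative implementation assembled from imbalanced-learn components; the random forest validation classifier, the five folds, and the ENN neighbor settings are our assumptions for demonstration rather than the exact experimental configuration) resamples a training set for one candidate α and selects α by cross-validated AUC:

```python
import numpy as np
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def sh_senn_resample(X, y, alpha, random_state=42):
    """One SH-SENN candidate: SMOTE up to a minority/majority ratio alpha,
    ENN cleaning (together: SMOTEENN), then a second ENN pass on the result."""
    senn = SMOTEENN(
        smote=SMOTE(sampling_strategy=alpha, random_state=random_state),
        enn=EditedNearestNeighbours(n_neighbors=3, kind_sel="all"),
        random_state=random_state,
    )
    X_res, y_res = senn.fit_resample(X, y)
    enn2 = EditedNearestNeighbours(n_neighbors=3, kind_sel="all")
    return enn2.fit_resample(X_res, y_res)

def select_alpha(X, y, clf=None, alphas=np.arange(0.1, 1.0, 0.1), n_splits=5):
    """Treat alpha as a hyperparameter: choose the strategy ratio with the best
    mean AUC on stratified validation folds that are never resampled."""
    if clf is None:
        clf = RandomForestClassifier(random_state=42)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    best_alpha, best_auc = None, -np.inf
    for alpha in alphas:
        aucs = []
        for tr, va in skf.split(X, y):
            X_res, y_res = sh_senn_resample(X[tr], y[tr], alpha)
            model = clf.fit(X_res, y_res)
            aucs.append(roc_auc_score(y[va], model.predict_proba(X[va])[:, 1]))
        if np.mean(aucs) > best_auc:
            best_alpha, best_auc = alpha, float(np.mean(aucs))
    return best_alpha, best_auc
```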

3.4. Hypothesis Test

3.4.1. Friedman Test and Nemenyi’s Post Hoc Test

Given that the AUC scores of various RTs across different datasets and classifiers do not adhere to a normal distribution, we employed the Friedman test [48] to assess the hypothesis and determine if there exists a significant difference. Once the null hypothesis is rejected, Nemenyi’s post hoc [48] test is subsequently conducted to delve deeper into the differences between specific pairings.
For $n$ RTs and $N$ observations, the test results of the RTs on each observation are ranked from best to worst and assigned the ordinal values $1, 2, 3, \ldots, n$. If several RTs perform equally well, they share the averaged ordinal value. Suppose the average ordinal value of the $i$th RT is $k_i$; then $k_i$ approximately obeys a normal distribution and
$\tau_{\chi^2} = \frac{n-1}{n} \times \frac{12N}{n^2 - 1} \sum_{i=1}^{n} \left( k_i - \frac{n+1}{2} \right)^2,$
where $\tau_{\chi^2}$ is the chi-square statistic. There is an improved statistic $\tau_F$:
$\tau_F = \frac{(N-1)\tau_{\chi^2}}{N(n-1) - \tau_{\chi^2}},$
where $\tau_F$ follows the F distribution with $n-1$ and $(n-1)(N-1)$ degrees of freedom.
The differences between the RTs can then be represented by Nemenyi's post hoc test, and a Friedman test plot is generated. If the two line segments of a pair of RTs do not overlap, there is a significant difference between them.
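As an illustration, the two statistics can be computed as below (toy AUC values are assumed; SciPy's friedmanchisquare applies a tie correction, so it coincides with the formula above when no ties occur):

```python
import numpy as np
from scipy.stats import friedmanchisquare, f as f_dist

# auc[i, j]: AUC of RT i on observation j (toy values; real scores come from Section 4).
auc = np.array([[0.78, 0.74, 0.81, 0.69, 0.73],
                [0.75, 0.72, 0.79, 0.66, 0.70],
                [0.71, 0.70, 0.76, 0.64, 0.69]])
n, N = auc.shape                                        # n RTs, N observations

tau_chi2, _ = friedmanchisquare(*auc)                   # one score vector per RT
tau_F = (N - 1) * tau_chi2 / (N * (n - 1) - tau_chi2)   # improved statistic
p_F = f_dist.sf(tau_F, n - 1, (n - 1) * (N - 1))        # upper-tail F probability
print(f"tau_chi2 = {tau_chi2:.3f}, tau_F = {tau_F:.3f}, p = {p_F:.4f}")
```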

3.4.2. Kruskal–Wallis Test and Mann–Whitney U Test

The Kruskal–Wallis test [49] is a nonparametric test used to compare three or more independent samples. It can be used to check whether the results of different datasets and different classifiers come from the same population. It uses the H statistic to test the following:
$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1),$
where $k$ is the number of groups, $N$ is the total number of samples, $R_j$ is the rank sum of the $j$th group, and $n_j$ is the number of samples in the $j$th group. If the null hypothesis is rejected, the Mann–Whitney U test [50] is used to perform pairwise tests between small samples that do not satisfy the normal distribution. The result shows whether a particular RT scores differently across datasets or across classifiers.
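A small sketch of this two-stage test on hypothetical AUC scores (the group values below are illustrative, not results from Section 4) is:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# AUC scores of one RT grouped by dataset (toy values for illustration).
groups = {
    "German":  [0.77, 0.75, 0.78, 0.76],
    "Taiwan":  [0.72, 0.74, 0.73, 0.71],
    "Prosper": [0.94, 0.95, 0.93, 0.94],
}

H, p = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {H:.3f}, p = {p:.4f}")

if p < 0.05:
    # Pairwise follow-up with two-sided Mann-Whitney U tests.
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        U, p_pair = mannwhitneyu(a, b, alternative="two-sided")
        print(f"{name_a} vs {name_b}: U = {U:.1f}, p = {p_pair:.4f}")
```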

4. Results

4.1. Overall Comparison of RTs

Overall, the RTs differ significantly when predicting credit risk, as shown in Table 3. Table 4 shows the pairwise comparisons among the RTs with the largest differences. SENN and ENN emerge as the most effective of all RTs, as shown in Figure 3. Compared with using no RTs, SENN delivers notable enhancements, boosting AUC, recall, F1-score, and G-mean by up to 20%, 20%, 30%, and 40%, respectively. For instance, in the LC dataset, the G-mean without RTs stands at 0.1168, while after applying SENN, the G-mean rises to 0.5715. This demonstrates a notable improvement in both accuracy and recall rates, leading to a more balanced state.

4.2. Comparison of Resampling Techniques in Terms of Different Datasets

In the results for the German dataset (Table 5), it is evident that all RTs, along with the ensemble balanced classifiers BBC and BRF, significantly contribute to the enhancement of the recall score. This indicates a notable improvement in the classifiers’ ability to identify defaulting customers. Particularly noteworthy is the combined effect of CC, which emerges as the top-ranking approach. This can be attributed to the fact that, with an IR of 2.33, the training set still retains 240 minority class samples after undersampling. Consequently, it is sufficiently equipped to handle the test set comprising only 60 minority class samples, thus resulting in a relatively low prediction difficulty. However, according to the Friedman test, no significant difference is observed between several resampling methods. Furthermore, there is no discernible disparity with the NONE dataset.
The outcomes obtained from Taiwan (Table 6) and Prosper (Table 7) underscore a considerable degree of variability among the different RTs. Through Nemenyi's post hoc tests, we confirm that SENN and ENN emerge as the most effective resampling methods, as Figure 4 and Figure 5 demonstrate. While their efficacy may vary across different classifiers, there is no doubt regarding their applicability across all classifiers. Conversely, CC, which demonstrates dominance in small datasets, ranks last in medium and large datasets. Surprisingly, leaving the dataset untouched (NONE) outperforms the mass removal of majority class samples with CC.
In LC datasets characterized by high IR, there is substantial variability among different resampling methods (Table 8). Notably, CC, SMOTE, ADA, and SSMO are found to be ineffective, failing to surpass the performance of NONE (Figure 6). This is particularly evident in terms of F1-score, highlighting the inability of these techniques to enhance model stability and achieve a balanced trade-off between recall and precision. Only SENN emerges as a reliable technique under such circumstances.
Further examination of the relationship between different RTs and datasets can be conducted through the Kruskal–Wallis test (Table 9). The hypothesis posits that all techniques exhibit significant differences attributable to variations in datasets. For instance, in the case of SENN, its application yields significantly different effects across all subgroups, except for the absence of a significant difference between the German and LC datasets (Table 10). Moreover, the disparity between Prosper and Taiwan at the 1% significance level suggests that SENN’s efficacy varies depending on the dataset size, while the significant difference between Prosper and LC at the 1% level indicates that SENN’s effectiveness is influenced by the IR of the dataset.

4.3. Comparison of Resampling Techniques in Terms of Different Classifiers

We categorized all classifiers into three groups: individual classifiers, ensemble classifiers, and balanced classifiers. The Kruskal–Wallis test reveals no significant difference between the different types of classifiers except for NONE (Table 11). This suggests that the utilization of RTs can effectively bridge the performance gap between individual and ensemble classifiers, significantly enhancing the effectiveness of weaker classifiers. However, the extent of enhancement provided by different RTs does not exhibit a significant difference.
According to the results of the two independent samples Mann–Whitney U test on grouped variables, the p-value of single classifiers and balanced classifiers on NONE is 0.025 **, indicating statistical significance at the 5% level (Table 12). This signifies a significant difference between single classifiers and balanced classifiers when no RTs are utilized. The magnitude of the difference, as indicated by a Cohen’s d value of 0.692, falls within the medium range. Conversely, ensemble classifiers and balanced classifiers exhibit a smaller magnitude of difference. This implies that balanced classifiers can effectively address CI problems in the absence of RTs, making them a valuable balancing technique to consider.

4.4. Experiments with SH-SENN

Significant outcomes were observed with SH-SENN on datasets characterized by high IR (Table 13). Irrespective of the strategy ratio α employed, SH-SENN consistently enhances the prediction performance of ensemble classifiers and balanced classifiers. This enhancement reaches its peak when α approaches 0.5, followed by a diminishing trend. However, for individual classifiers, SH-SENN proves to be less effective than SENN when α is below 0.4. Notably, SH-SENN outperforms SENN when α falls between 0.4 and 0.7. Overall, the results highlight SH-SENN (0.5) as the most advantageous and effective new RT, as shown in Figure 7.
The Friedman test results reveal that SH-SENN exhibits varying degrees of significant differences and advantages over NONE, ADA, SMO, BSMO, SSMO, CC, and STOM (Table 14). Notably, the AUC can reach a maximum of 0.784. Post hoc tests confirm that although there is no significant difference between SH-SENN (0.5) and the previous champion SENN (Figure 8), it still secures the top rank among all RTs with a substantial advantage.
Under various IR conditions, we included the Prosper dataset in the experiment to compare the performance of SH-SENN. This dataset is of the same size as the LC dataset and has an IR of 4.97, representing a normally imbalanced dataset. With positive samples comprising more than 20% of the negative samples in the Prosper dataset's training set, the strategy ratio α for this round ranged from 0.3 to 0.9 (a ratio below 0.2 would require undersampling as the first step, whereas the first step of SH-SENN, SMOTE, is an oversampling method).
Results were compared with SENN, which ranked first on the Prosper dataset in a previous result. The findings reveal that (Table 15), under the interference of SH-SENN, all classifiers except LR and RF achieved higher AUC values. This suggests that SH-SENN can outperform SENN even in normal IR datasets, albeit with a smaller improvement compared with extreme IR datasets.
A significant difference from the results in the LC dataset was that the SH-SENN (0.9) strategy yielded the best results, contrary to the previous SH-SENN (0.5) strategy. The average AUC of SH-SENN (0.9) across all classifiers reached 0.9465, slightly higher than SENN’s average AUC of 0.9464 (Figure 9). This discrepancy arises because minority samples in normally imbalanced datasets contain more information than extremely imbalanced minority samples, and there is less noise when oversampling to 90% of the majority sample. Experiments demonstrate that the strategy ratio of SH-SENN should decrease with an increasing IR to achieve better results.
Similarly, the Friedman test results indicate significant differences between SH-SENN (0.9) and NONE, ADA, SMO, BSMO, SSMO, CC, and STOM in the Prosper dataset. Subsequent post hoc tests revealed that SH-SENN (0.9) surpassed all other RTs by a considerable margin (Figure 10). These two sets of experiments collectively demonstrate that SH-SENN can achieve comparable effectiveness to other RTs across various IR datasets, albeit requiring different strategy ratios (Table 16). Moreover, the impact of SH-SENN on classifier enhancement is more pronounced in high IR datasets compared with those with common IR levels.

4.5. Discussion and Limitation

Overall, the experimental results align with our discussion in Section 2. We contend that the CI problem does not directly impact the classifier's predictive ability but rather obscures the decision boundary, thereby weakening its performance. Certain RTs, such as SMOTE, ADASYN, and cluster centroids, address the CI problem by either oversampling or undersampling to bring the classes closer to equilibrium. However, they merely generate new minority class samples or remove majority class samples without specifically optimizing the decision boundary or addressing the issue of overlapping samples from different classes. Consequently, these techniques prove ineffective across various datasets. SVMSMOTE and BorderlineSMOTE achieve superior results because they focus on generating minority class samples along the border, thereby enhancing the clarity of the decision boundary. Moreover, in SMOTEENN, the ENN method supplements the limitations of the SMOTE-only oversampling approach, which tends to marginalize data distribution. Unlike SMOTETomek, which combines oversampling and undersampling but removes entire pairs, SMOTEENN selectively eliminates examples that do not align with neighboring categorizations. As a result, ENN preserves more information than the Tomek link, rendering SMOTEENN more stable and effective across diverse datasets. While SMOTETomek excels in datasets with pronounced class overlap, SMOTEENN may yield better results in credit datasets characterized by higher feature dimensions and greater diversity of feature types.
From the dataset perspective, small datasets characterized by limited sample sizes, data structures, and small training sets exhibit no significant disparities among different resampling RTs. Even with oversampling, the potential for expanding the dataset’s information content remains limited. Conversely, medium and large datasets demonstrate similarities in their RT selection. This can be attributed to the classifiers’ capacity to learn sufficient predictive information when the training set is relatively ample. However, further improvement in prediction ability necessitates optimizing decision boundaries in addition to resampling, making combined sampling more effective. On the contrary, large datasets with a high IR demand cautious treatment. These datasets feature high-dimensional features, complex data structures, and sparse minority class samples. Consequently, classifiers may struggle to recognize positive samples unseen during training, resulting in low recall scores. Oversampling the minority class to excessively high proportions, such as with a SMOTE ratio of 1:1 or 1:0.9, may lead to an accumulation of minority class samples without clarifying the decision boundary, potentially generating new noise. In such scenarios, random oversampling may outperform SMOTE [30]. To address extreme CI, it is advisable to employ a smaller ratio oversampling strategy to prevent an abundance of minority class samples from becoming noise. Emphasizing the handling of boundary points and class overlapping is crucial. SH-SENN stands out as a superior technique due to its strategic oversampling approach and dual handling of boundary points, which contribute to achieving better results compared with other RTs.
From the classifier’s standpoint, ensemble classifiers tend to benefit more from RTs compared with individual classifiers due to their enhanced learning capabilities. This is particularly evident in the boosting family of classifiers, such as XGBoost. When employing oversampling or integrated sampling RTs, the ensemble classifier can capitalize on the increased effective information within the dataset, thereby maximizing its predictive performance. Conversely, when using RTs like cluster centroids, which drastically reduce the amount of information, the ensemble classifier’s performance remains robust, and even regular individual classifiers can achieve satisfactory results. While ensemble classifiers offer hyperparameters to adjust category weights, predicting these weights for a few sample categories in real-world scenarios is impractical. Furthermore, presetting these weights is not feasible, considering the test set’s consistent IR with the training set in this study to ensure comparable effects of different RTs and avoid inconsistent prediction difficulty in the test set. For instance, suppose a bank encounters 10 defaulting credit customers monthly, with the bank approving 10 applicants daily. These defaulters may be distributed over 30 days or concentrated on a single day. Processing the training set beforehand allows for the construction of a robust classifier prior to model fitting.
Lastly, concerning the credit risk prediction problem, real credit data often exhibit high feature dimensions and diverse data types. Existing studies frequently concentrate solely on comparing balancing methods and integrating them with classification models, overlooking the practical significance of aiding credit institutions in addressing the CI problem. Their findings tend to be purely theoretical and algorithmic, failing to address the core issue. For instance, clustering-based undersampling may perform well only with specific datasets and models. If a credit institution modifies user profiles, incorporates new features, or includes audio and video features, the original sampling approach may no longer be suitable for the updated dataset. Thus, applicability becomes a concern. The SH-SENN sampling technique proposed in this study is not a rigid algorithm but rather a versatile framework. The undersampling technique within this framework is ENN, but it can be substituted with other undersampling methods to create new algorithms while remaining within the framework’s conceptual scope. Additionally, SH-SENN employs the more representative Lending Club dataset for experimentation and achieves promising results. This illustrates the framework’s potential for extension to analogous credit datasets and its applicability to classification tasks facing CI challenges. For instance, food safety regulation represents a critical concern in Africa, where establishing an early warning system to oversee food quality is imperative. Given the high stakes involved in safeguarding human life, the testing of positive samples becomes more exacting, necessitating robust and balanced datasets for constructing monitoring models. Even a marginal enhancement in model performance achieved by methods like SH-SENN could potentially safeguard the health of numerous individuals. Similar applications can also be extrapolated to medical disease detection and the machinery industry.
SH-SENN also possesses potential limitations. In comparison with other RTs, SH-SENN requires a longer time for resampling due to its utilization of double ENN. This extended duration arises from the necessity for both SMOTE and ENN to compute their nearest neighbors to facilitate distance-based computations. Consequently, the process consumes more time, particularly in large datasets with high-dimensional features. Moreover, our experiments did not explore the compatibility of SH-SENN with deep learning algorithms such as artificial neural networks. As previously mentioned, deep learning algorithms currently do not offer significant advantages in credit risk prediction and are questioned by stakeholders because of their black-box nature. Nonetheless, deep learning undoubtedly represents a crucial avenue for future research. We expect that combining SH-SENN with deep learning algorithms will yield superior results.

5. Conclusions

We conducted a comprehensive comparison and analysis of various RTs in the experiment, introducing a novel RT tailored for extremely imbalanced datasets. The conclusion can be summarized as follows:
  • SMOTEENN significantly enhances dataset quality before classifier training, consistently improving prediction performance across all selected credit datasets. Compared with the original training set without SMOTEENN preprocessing, AUC values see an increase of 2–4%, and recall values show enhancements of up to 30%. The effectiveness of RTs varies significantly depending on dataset size and IR. For small-sized datasets with a low IR, the choice of RT does not yield significant differences. However, for medium and large-sized datasets with varying IRs, RTs capable of managing decision boundaries and class coincidence points yield superior results. Notably, SMOTEENN stands out in this regard. RTs particularly boost ensemble classifiers over single classifiers. Balanced ensemble classifiers can perform reasonably well without preprocessing RTs, although their predictive power is not as stable as classifiers with RTs applied in advance. Moreover, the choice of classifiers for credit approval should consider interpretability, as high-performance ensemble classifiers may lack interpretability, posing challenges for acceptance by stakeholders. Thus, classifiers like logistic regression and decision trees, known for their interpretability and fast execution, remain widely used despite their potentially poorer performance. Computational cost and interpretability should be key factors when selecting RTs.
  • For SH-SENN, first, the new SH-SENN demonstrates outstanding performance in handling extremely imbalanced large datasets. This is attributed to its focus on addressing decision boundary points and noise points after oversampling. In real-world scenarios, credit data are often intricate, featuring time-varying attributes and numerous sparse categorical variables. Datasets exhibiting high IR and significant noise, such as those from Lending Club Inc., are commonplace. SH-SENN emerges as the optimal solution meeting these realistic requirements. Second, as credit datasets grow increasingly complex, there arises a need for high-performing classifiers to replace traditional scorecard and logistic regression (LR) algorithms. Emerging classifiers like CatBoost prove to be suitable candidates for credit datasets owing to their improved handling of categorical variables. However, they still fall short in effectively addressing CI concerns alone. CatBoost also requires complementary techniques to enhance its efficacy. Our experiments demonstrate that SH-SENN significantly enhances the predictive capabilities of ensemble classifiers compared with individual classifiers. SH-SENN outperforms all other strategies ranging from 0.1 to 0.9, including established techniques like SMOTEENN. This enhancement results in a 1–5% improvement in AUC for various classifiers. Notably, the improvement for CatBoost alone can nearly reach 2%. Such enhancements are highly appealing for credit bureaus, where even a 1% improvement can potentially help them avoid millions of dollars in losses.
In light of our research, future directions include employing additional model evaluation metrics for comprehensive comparisons, exploring relationships between IR and RTs using simulated datasets, and investigating RT combinations tailored for high-performance ensemble classifiers.

Author Contributions

Conceptualization, Z.Z. and T.C.; methodology, Z.Z. and T.C.; validation, Z.Z. and T.C.; formal analysis, Z.Z.; investigation, Z.Z., T.C. and S.D.; resources, T.C., J.L. and A.G.B.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z. and T.C.; visualization, Z.Z. and S.D.; supervision, T.C.; project administration, T.C.; funding acquisition, T.C. and A.G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This Project is supported by Yunnan University of Finance and Economics Scientific Research Fund Project of China (Grant number 2021B01). This project is also supported by Ningbo Natural Science Foundation, China (Project ID 2023J194), by the Ningbo Government, China (Project ID 2021B-008-C), and by University of Nottingham Ningbo China (UNNC) Education Foundation (Project ID LDS202303).

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at https://archive.ics.uci.edu (accessed on 15 September 2023), https://www.prosper.com/credit-card (accessed on 15 September 2023), and https://www.lendingclub.com (accessed on 15 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CI	class imbalance
RT	resampling techniques
LR	logistic regression
KNN	k-nearest neighbors
NB	naïve Bayes
SVC	support vector machines classifier
DT	decision tree
RF	random forest classifier
XGB	XGBoost classifier
LGBM	LightGBM classifier
CAT	CatBoost classifier
BBC	BalancedBaggingClassifier
BRF	BalancedRandomForestClassifier
ADA	training set applied with ADASYN
SMO	training set applied with SMOTE
BSMO	training set applied with BorderlineSMOTE
SSMO	training set applied with SVMSMOTE
CC	training set applied with cluster centroids
ENN	training set applied with ENN
STOM	training set applied with SMOTETomek
SENN	training set applied with SMOTEENN
NONE	training set without any balancing techniques
SH-SENN	training set applied with SH-SENN

References

  1. Henley, W.; Hand, D.J. A k-nearest-neighbour classifier for assessing consumer credit risk. J. R. Stat. Soc. 1996, 45, 77–95. [Google Scholar] [CrossRef]
  2. Abellán, J.; Castellano, J.G. A comparative study on base classifiers in ensemble methods for credit scoring. Expert Syst. Appl. 2017, 73, 1–10. [Google Scholar] [CrossRef]
  3. Tsai, C.F.; Wu, J.W. Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst. Appl. 2008, 34, 2639–2649. [Google Scholar] [CrossRef]
  4. Andrés Alonso, J.M.C. Machine Learning in Credit Risk: Measuring the Dilemma between Prediction and Supervisory Cost; Banco de España: Madrid, Spain, 2020. [Google Scholar]
  5. Ding, S.; Cui, T.; Bellotti, A.; Abedin, M.; Lucey, B. The role of feature importance in predicting corporate financial distress in pre and post COVID periods: Evidence from China. Int. Rev. Financ. Anal. 2023, 90, 102851. [Google Scholar] [CrossRef]
  6. Wang, L. Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Appl. Soft Comput. 2022, 114, 108153. [Google Scholar] [CrossRef]
  7. Moscato, V.; Picariello, A.; Sperlí, G. A benchmark of machine learning approaches for credit score prediction. Expert Syst. Appl. 2021, 165, 113986. [Google Scholar] [CrossRef]
  8. García, V.; Marqués, A.I.; Sánchez, J.S. Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction. Inf. Fusion 2019, 47, 88–101. [Google Scholar] [CrossRef]
  9. Haixiang, G.; Li, Y.; Shang, J.; Mingyun, G.; Yuanyue, H.; Gong, B. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2016, 73, 220–239. [Google Scholar] [CrossRef]
  10. García, V.; Marqués, A.I.; Sánchez, J.S. An insight into the experimental design for credit risk and corporate bankruptcy prediction systems. J. Intell. Inf. Syst. 2015, 44, 159–189. [Google Scholar] [CrossRef]
  11. Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134. [Google Scholar] [CrossRef]
  12. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. Artif. Intell. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  13. Cui, T.; Li, J.; John, W.; Andrew, P. An ensemble based Genetic Programming system to predict English football premier league games. In Proceedings of the 2013 IEEE Symposium Series on Computational Intelligence (SSCI2013), Singapore, 16–19 April 2013; pp. 138–143. [Google Scholar] [CrossRef]
  14. Fiore, U.; De Santis, A.; Perla, F.; Zanetti, P.; Palmieri, F. Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Inf. Sci. 2019, 479, 448–455. [Google Scholar] [CrossRef]
  15. Jiang, C.; Lu, W.; Wang, Z.; Ding, Y. Benchmarking state-of-the-art imbalanced data learning approaches for credit scoring. Expert Syst. Appl. 2023, 213, 118878. [Google Scholar] [CrossRef]
  16. Ding, S.; Cui, T.; Zhang, Y. Incorporating the RMB internationalization effect into its exchange rate volatility forecasting. N. Am. J. Econ. Financ. 2020, 54, 101103. [Google Scholar] [CrossRef]
  17. Ding, S.; Cui, T.; Zheng, D.; Du, M. The effects of commodity financialization on commodity market volatility. Resour. Policy. 2021, 73, 102220. [Google Scholar] [CrossRef]
  18. Zhu, R.; Guo, Y.; Xue, J.H. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recognit. Lett. 2020, 133, 217–223. [Google Scholar] [CrossRef]
  19. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar]
  20. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  21. Caouette, J.; Altman, E.; Narayanan, P.; Nimmo, R. Managing Credit Risk: The Great Challenge for the Global Financial Markets, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2011; pp. 349–365. [Google Scholar] [CrossRef]
  22. Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  23. Xia, Y.; Liu, C.; Liu, N. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electron. Commer. Res. Appl. 2017, 24, 30–49. [Google Scholar]
  24. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 539–550. [Google Scholar] [CrossRef]
  25. Liu, B.; Chen, K. Loan risk prediction method based on SMOTE and XGBoost. Comput. Mod. 2020, 2, 26–30. [Google Scholar]
  26. Zięba, M.; Tomczak, J.M. Boosted SVM with active learning strategy for imbalanced data. Soft Comput. 2015, 19, 3357–3368. [Google Scholar] [CrossRef]
  27. Ding, S.; Cui, T.; Wu, X.; Du, M. Supply chain management based on volatility clustering: The effect of CBDC volatility. Res. Int. Bus. Financ. 2022, 62, 101690. [Google Scholar] [CrossRef]
  28. Yen, S.J.; Lee, Y.S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  29. García, V.; Marqués, A.I.; Sánchez, J.S. Improving Risk Predictions by Preprocessing Imbalanced Credit Data. In Neural Information Processing; Huang, T., Zeng, Z., Li, C., Leung, C.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 68–75. [Google Scholar]
  30. Xiao, J.; Wang, Y.; Chen, J.; Xie, L.; Huang, J. Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf. Sci. 2021, 569, 508–526. [Google Scholar] [CrossRef]
  31. Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453. [Google Scholar] [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  33. Ma, X.; Sha, J.; Wang, D.; Yu, Y.; Yang, Q.; Niu, X. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 2018, 31, 24–39. [Google Scholar] [CrossRef]
  34. Kou, G.; Chen, H.; Hefni, M.A. Improved hybrid resampling and ensemble model for imbalance learning and credit evaluation. J. Manag. Sci. Eng. 2022, 7, 511–529. [Google Scholar] [CrossRef]
  35. Haibo, H.; Yang, B.; Garcia, E.A.; Shutao, L. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  36. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  37. Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 2009, 3, 4–21. [Google Scholar] [CrossRef]
  38. Han, J.; Kamber, M.; Pei, J. (Eds.) 3—Data Preprocessing. In Data Mining, 3rd ed.; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 83–124. [Google Scholar] [CrossRef]
  39. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar] [CrossRef]
  40. Batista, G.E.A.P.A.; Bazzan, A.L.C.; Monard, M.C. Balancing Training Data for Automated Annotation of Keywords: A Case Study. WOB 2003, 3, 1–9. [Google Scholar]
  41. Tomek, I. Two Modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 769–772. [Google Scholar] [CrossRef]
  42. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  43. Zhang, X.; Yu, L. Consumer credit risk assessment: A review from the state-of-the-art classification algorithms, data traits, and learning methods. Expert Syst. Appl. 2024, 237, 121484. [Google Scholar] [CrossRef]
  44. Chai, E.; Wei, Y.; Cui, T.; Ren, J.; Ding, S. An Efficient Asymmetric Nonlinear Activation Function for Deep Neural Networks. Symmetry 2022, 14, 1027. [Google Scholar] [CrossRef]
  45. Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl. Soft Comput. 2020, 91, 106263. [Google Scholar] [CrossRef]
  46. Ferri, C.; Hernández-Orallo, J.; Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognit. Lett. 2009, 30, 27–38. [Google Scholar] [CrossRef]
  47. Markov, A.; Seleznyova, Z.; Lapshin, V. Credit scoring methods: Latest trends and points to consider. J. Financ. Data Sci. 2022, 8, 180–201. [Google Scholar] [CrossRef]
  48. Pereira, D.; Afonso, A.; Medeiros, F. Overview of Friedman’s Test and Post-hoc Analysis. Commun. Stat.-Simul. Comput. 2015, 44, 2636–2653. [Google Scholar] [CrossRef]
  49. McKight, P.E.; Najab, J. Kruskal-Wallis Test. In The Concise Encyclopedia of Statistics; Springer: New York, NY, USA, 2008; pp. 288–290. [Google Scholar] [CrossRef]
  50. Meléndez, R.; Giraldo, R.; Leiva, V. Sign, Wilcoxon and Mann-Whitney Tests for Functional Data: An Approach Based on Random Projections. Mathematics 2021, 9, 44. [Google Scholar] [CrossRef]
Figure 1. The framework of RTs’ comparison. The dataset at the start place represents any one of four datasets.
Figure 2. The process of SH-SENN. (a) Original dataset. (b) SMOTE oversampling of the minority class to a 10–90% ratio. (c) Deleting the misclassified examples identified among their nearest neighbours. (d) Repeating the ENN step in (c) to obtain a clearer class boundary.
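The pipeline in Figure 2 can be assembled from standard imbalanced-learn building blocks. The sketch below is a minimal illustration of that idea rather than the authors' implementation: the function name sh_senn, the ratio argument (the 10–90% oversampling level), and the enn_rounds count are assumptions made for clarity.

# Minimal sketch of the SH-SENN idea in Figure 2 (assumed helper, not the authors' code):
# oversample the minority class to a chosen fraction of the majority with SMOTE,
# then repeatedly clean misclassified boundary points with Edited Nearest Neighbours.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

def sh_senn(X, y, ratio=0.5, enn_rounds=2, random_state=42):
    # Step (b): SMOTE up to `ratio` (0.1-0.9) of the majority-class size.
    X_res, y_res = SMOTE(sampling_strategy=ratio,
                         random_state=random_state).fit_resample(X, y)
    # Steps (c)-(d): delete the nearest misclassified examples; repeat for a clearer boundary.
    enn = EditedNearestNeighbours(n_neighbors=3)
    for _ in range(enn_rounds):
        X_res, y_res = enn.fit_resample(X_res, y_res)
    return X_res, y_res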
Figure 3. The critical difference diagram of RTs on all datasets.
Figure 4. The critical difference diagram of RTs on the Taiwan dataset.
Figure 5. The critical difference diagram of RTs on the Prosper dataset.
Figure 6. The critical difference diagram of RTs on the LC dataset.
Figure 7. The box plot of the AUC score for SENN and SH-SENN on the LC dataset. The red dots are outliers, indicating the lowest or highest AUC value that each strategy can achieve.
Figure 8. The critical difference diagram of RTs with SH-SENN(0.5) on the LC dataset.
Figure 9. The box plot of the AUC score for SENN and SH-SENN on the Prosper dataset. The red dots are outliers, indicating the lowest or highest AUC value that each strategy can achieve.
Figure 10. The critical difference diagram of RTs with SH-SENN(0.9) on the Prosper dataset.
Table 1. Confusion matrix.
	Predicted Positive	Predicted Negative
Truth Positive	TP	FN
Truth Negative	FP	TN
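For reference, the recall, F1, and G-mean values reported in Tables 5–8 follow from the four counts in Table 1 using their standard definitions. The helper below is an illustrative sketch of those formulas, not the study's code (which may instead rely on scikit-learn's metric functions).

# Standard confusion-matrix metrics used alongside AUC in Tables 5-8.
from math import sqrt

def confusion_metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)              # true positive rate (sensitivity)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)         # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = sqrt(recall * specificity)  # geometric mean of the two class-wise rates
    return {"recall": recall, "f1": f1, "g_mean": g_mean}

# Example with illustrative counts: 60 defaults caught, 20 missed, 30 false alarms.
print(confusion_metrics(tp=60, fn=20, fp=30, tn=890))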
Table 2. Dataset description.
Source	Data Size	Feature	IR (Majority/Minority)
German 1	1000	12	2.33
Taiwan 1	29,995	11	3.52
Prosper 2	113,934	29	4.97
Lending Club (LC) 3	128,400	42	69.24
1 UCI Machine Learning Repository, https://archive.ics.uci.edu (accessed on 15 September 2023); 2 Prosper Company, https://www.prosper.com/credit-card (accessed on 15 September 2023); 3 Lending Club Company, https://www.lendingclub.com (accessed on 15 September 2023).
Table 3. Friedman test for RTs.
RTs	Median	Statistic	p-Value	Cohen's f Value
NONE	0.660	58.135	0.000 *** 1	0.126
ADA	0.674
SMO	0.672
BSMO	0.678
SSMO	0.678
CC	0.635
ENN	0.691
STOM	0.680
SENN	0.697
1 *** represents 1% significance level.
Table 4. Post hoc test for RTs.
Pairs	Statistic	p-Value	Cohen's d Value
NONE vs. ENN	5.092	0.010 ***	0.137
NONE vs. SENN	7.294	0.001 ***	0.175
ADA vs. SENN	5.367	0.005 ***	0.104
SMO vs. SENN	5.505	0.003 ***	0.085
BSMO vs. CC	4.156	0.080 * 1	0.374
BSMO vs. SENN	5.037	0.011 ** 2	0.075
SSMO vs. CC	5.312	0.005 *** 3	0.341
CC vs. ENN	6.991	0.001 ***	0.408
CC vs. STOM	5.312	0.005 ***	0.394
CC vs. SENN	9.193	0.001 ***	0.456
1 * represents 10% significance level. 2 ** represents 5% significance level. 3 *** represents 1% significance level.
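The comparison summarized in Tables 3 and 4 follows the usual Friedman-plus-post-hoc workflow [48]. The snippet below sketches that workflow with SciPy on synthetic AUC scores; it is illustrative only, and the exact post hoc procedure and correction behind Table 4 are not reproduced here.

# Omnibus Friedman test across resampling techniques, then pairwise follow-up tests.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Rows: classifier/dataset combinations; columns: resampling techniques (synthetic scores only).
rts = ["NONE", "ADA", "SMO", "BSMO", "SSMO", "CC", "ENN", "STOM", "SENN"]
auc = {rt: rng.uniform(0.60, 0.75, size=44) for rt in rts}

stat, p = stats.friedmanchisquare(*auc.values())
print(f"Friedman statistic = {stat:.3f}, p = {p:.4f}")

# Pairwise Wilcoxon signed-rank tests on the paired scores (uncorrected here).
for a, b in combinations(rts, 2):
    w, p_pair = stats.wilcoxon(auc[a], auc[b])
    if p_pair < 0.05:
        print(f"{a} vs. {b}: p = {p_pair:.4f}")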
Table 5. The AUC score of the German dataset by RTs on different classifiers.
Classifiers	Metrics	NONE	ADA	SMO	BSMO	SSMO	CC	ENN	STOM	SENN
LR	AUC	0.6131	0.6512	0.6607	0.6821	0.6679	0.6833	0.6429	0.6607	0.6357
	Recall	0.3333	0.7167	0.7000	0.7500	0.7000	0.7667	0.7500	0.7000	0.8000
	F1	0.4211	0.5342	0.5419	0.5660	0.5490	0.5679	0.5294	0.5419	0.5275
	G-mean	0.5455	0.6479	0.6595	0.6788	0.6671	0.6782	0.6339	0.6595	0.6141
KNN	AUC	0.6179	0.6143	0.6417	0.6357	0.6321	0.6333	0.6274	0.6536	0.6476
	Recall	0.4000	0.6500	0.6833	0.7000	0.6500	0.7167	0.7333	0.7000	0.8667
	F1	0.4486	0.4937	0.5223	0.5185	0.5098	0.5181	0.5146	0.5350	0.5417
	G-mean	0.5782	0.6132	0.6403	0.6325	0.6319	0.6278	0.6184	0.6519	0.6094
SVC	AUC	0.6369	0.6726	0.6560	0.6714	0.6786	0.6702	0.6238	0.6595	0.6655
	Recall	0.3667	0.6167	0.5833	0.6000	0.6500	0.7333	0.7333	0.5833	0.8167
	F1	0.4632	0.5481	0.5263	0.5455	0.5571	0.5535	0.5116	0.5303	0.5537
	G-mean	0.5767	0.6703	0.6519	0.6676	0.6780	0.6673	0.6141	0.6551	0.6481
NB	AUC	0.6524	0.6702	0.6690	0.7012	0.6619	0.7190	0.6595	0.6726	0.6440
	Recall	0.4833	0.7833	0.7667	0.7667	0.7167	0.7667	0.7833	0.7667	0.8167
	F1	0.5088	0.5562	0.5542	0.5860	0.5443	0.6053	0.5465	0.5576	0.5355
	G-mean	0.6300	0.6606	0.6619	0.6981	0.6596	0.7175	0.6478	0.6660	0.6205
DT	AUC	0.5893	0.5619	0.6631	0.6357	0.6012	0.6310	0.6369	0.7000	0.6012
	Recall	0.5000	0.4167	0.6333	0.5500	0.4667	0.7333	0.7667	0.6500	0.5667
	F1	0.4444	0.3968	0.5390	0.5000	0.4480	0.5176	0.5257	0.5821	0.4690
	G-mean	0.5825	0.5428	0.6624	0.6299	0.5859	0.6226	0.6235	0.6982	0.6002
RF	AUC	0.6702	0.6512	0.6690	0.6619	0.6619	0.6952	0.6738	0.6738	0.6690
	Recall	0.4833	0.4667	0.5167	0.5167	0.5167	0.7833	0.7833	0.5333	0.7167
	F1	0.5321	0.5045	0.5344	0.5254	0.5254	0.5802	0.5595	0.5424	0.5513
	G-mean	0.6437	0.6245	0.6515	0.6458	0.6458	0.6896	0.6648	0.6590	0.6674
XGB	AUC	0.6607	0.6429	0.6345	0.6214	0.6321	0.6869	0.7024	0.6381	0.6857
	Recall	0.5000	0.5000	0.4833	0.4500	0.5000	0.8167	0.8334	0.4833	0.7500
	F1	0.5217	0.5000	0.4874	0.4655	0.4878	0.5731	0.5882	0.4915	0.5696
	G-mean	0.6409	0.6268	0.6162	0.5973	0.6182	0.6745	0.6901	0.6190	0.6827
LGBM	AUC	0.6417	0.6833	0.6583	0.6417	0.6357	0.6190	0.6750	0.6560	0.6512
	Recall	0.4333	0.5667	0.5167	0.4833	0.5000	0.7667	0.8000	0.5333	0.7167
	F1	0.4860	0.5574	0.5210	0.4957	0.4918	0.5111	0.5614	0.5203	0.5342
	G-mean	0.6069	0.6733	0.6429	0.6218	0.6211	0.6012	0.6633	0.6444	0.6479
CAT	AUC	0.6964	0.6821	0.6738	0.6929	0.7036	0.7238	0.6833	0.7000	0.6571
	Recall	0.5000	0.5500	0.5333	0.5500	0.6000	0.8333	0.8167	0.5500	0.7500
	F1	0.5714	0.5546	0.5424	0.5690	0.5854	0.6098	0.5698	0.5789	0.5422
	G-mean	0.6682	0.6692	0.6590	0.6780	0.6959	0.7155	0.6702	0.6837	0.6505
BBC	AUC	0.6595	0.6702	0.6167	0.6429	0.6143	0.6714	0.6667	0.6667	0.7024
	Recall	0.5833	0.5333	0.4833	0.4500	0.4500	0.7000	0.7333	0.5333	0.6333
	F1	0.5303	0.5378	0.4677	0.4909	0.4576	0.5526	0.5500	0.5333	0.5846
	G-mean	0.6551	0.6561	0.6021	0.6132	0.5919	0.6708	0.6633	0.6532	0.6990
BRF	AUC	0.7071	0.6524	0.6702	0.6857	0.6583	0.6810	0.6762	0.6821	0.7083
	Recall	0.7500	0.4833	0.5333	0.5500	0.5167	0.7833	0.8167	0.5500	0.7167
	F1	0.5921	0.5088	0.5378	0.5593	0.5210	0.5663	0.5632	0.5546	0.5931
	G-mean	0.7058	0.6301	0.6561	0.6721	0.6429	0.6732	0.6614	0.6692	0.7083
The number in bold in each row indicates the best performance in terms of different metrics.
Table 6. The AUC score of the Taiwan dataset by RTs on different classifiers.
Classifiers	Metrics	NONE	ADA	SMO	BSMO	SSMO	CC	ENN	STOM	SENN
LR	AUC	0.6466	0.7053	0.7021	0.7069	0.7032	0.6131	0.6997	0.7029	0.7064
	Recall	0.3346	0.6029	0.5705	0.6157	0.5629	0.8206	0.5381	0.5727	0.6285
	F1	0.4521	0.5289	0.5292	0.5293	0.5326	0.4194	0.5311	0.5302	0.5263
	G-mean	0.5664	0.6978	0.6896	0.7010	0.6891	0.5769	0.6808	0.6907	0.7021
KNN	AUC	0.6456	0.6419	0.6429	0.6503	0.6576	0.6142	0.6823	0.6456	0.6669
	Recall	0.3677	0.6149	0.5772	0.5968	0.5554	0.8063	0.5901	0.5825	0.6579
	F1	0.4491	0.4423	0.4434	0.4522	0.4626	0.4199	0.4951	0.4467	0.4701
	G-mean	0.5827	0.6413	0.6395	0.6481	0.6496	0.5834	0.6761	0.6425	0.6668
SVC	AUC	0.6580	0.7003	0.7013	0.6995	0.7020	0.6196	0.6999	0.7019	0.7019
	Recall	0.3647	0.6375	0.6066	0.6368	0.6081	0.8365	0.5057	0.6066	0.6338
	F1	0.4747	0.5159	0.5219	0.5148	0.5227	0.4247	0.5383	0.5227	0.5188
	G-mean	0.5890	0.6975	0.6949	0.6967	0.6957	0.5805	0.6724	0.6954	0.6986
NB	AUC	0.6131	0.6961	0.6977	0.6931	0.6976	0.6159	0.7017	0.6993	0.6961
	Recall	0.2683	0.6820	0.6390	0.6722	0.6262	0.8304	0.5365	0.6488	0.6692
	F1	0.3787	0.5047	0.5119	0.5018	0.5134	0.4217	0.5351	0.5130	0.5061
	G-mean	0.5069	0.6959	0.6953	0.6928	0.6940	0.5773	0.6820	0.6975	0.6956
DT	AUC	0.6032	0.6125	0.6162	0.6035	0.6276	0.5861	0.6661	0.6115	0.6701
	Recall	0.3821	0.5207	0.5004	0.4876	0.4951	0.7476	0.6307	0.4951	0.6044
	F1	0.3819	0.4065	0.4095	0.3939	0.4232	0.3962	0.4704	0.4037	0.4770
	G-mean	0.5612	0.6056	0.6052	0.5922	0.6134	0.5634	0.6651	0.6003	0.6669
RF	AUC	0.6544	0.6748	0.6763	0.6745	0.6792	0.6141	0.6973	0.6785	0.6977
	Recall	0.3745	0.5403	0.5185	0.5396	0.5109	0.8093	0.5787	0.5207	0.5893
	F1	0.4664	0.4886	0.4932	0.4882	0.4989	0.4199	0.5198	0.4966	0.5187
	G-mean	0.5915	0.6613	0.6576	0.6609	0.6580	0.5823	0.6872	0.6599	0.6892
XGB	AUC	0.6545	0.6736	0.6768	0.6775	0.6897	0.5992	0.7062	0.6809	0.6971
	Recall	0.3595	0.5546	0.5305	0.5825	0.5456	0.8425	0.5735	0.5335	0.6036
	F1	0.4676	0.4855	0.4927	0.4888	0.5122	0.4100	0.5357	0.4991	0.5158
	G-mean	0.5842	0.6630	0.6608	0.6708	0.6745	0.5476	0.6936	0.6648	0.6908
LGBM	AUC	0.6613	0.6933	0.6973	0.6858	0.7069	0.6127	0.7133	0.7009	0.6986
	Recall	0.3745	0.6142	0.5901	0.6473	0.5870	0.8478	0.5727	0.5916	0.6262
	F1	0.4809	0.5086	0.5180	0.4945	0.5345	0.4198	0.5487	0.5235	0.5149
	G-mean	0.5959	0.6888	0.6890	0.6847	0.6967	0.5658	0.6993	0.6923	0.6948
CAT	AUC	0.6584	0.6841	0.6925	0.6812	0.6952	0.6141	0.7092	0.6907	0.7017
	Recall	0.3677	0.5652	0.5463	0.5916	0.5479	0.8523	0.5690	0.5411	0.6164
	F1	0.4754	0.5005	0.5167	0.4932	0.5211	0.4209	0.5422	0.5145	0.5210
	G-mean	0.5908	0.6737	0.6769	0.6752	0.6794	0.5660	0.6952	0.6743	0.6965
BBC	AUC	0.6778	0.6451	0.6542	0.6480	0.6482	0.6136	0.6900	0.6545	0.6972
	Recall	0.5524	0.4936	0.4913	0.4876	0.4665	0.7679	0.6338	0.4868	0.5780
	F1	0.4921	0.4468	0.4601	0.4510	0.4517	0.4183	0.5015	0.4608	0.5196
	G-mean	0.6661	0.6271	0.6336	0.6278	0.6222	0.5939	0.6877	0.6326	0.6869
BRF	AUC	0.7050	0.6788	0.6771	0.6758	0.6774	0.6129	0.6911	0.6761	0.6962
	Recall	0.6594	0.5456	0.5170	0.5373	0.5064	0.8063	0.6888	0.5177	0.5712
	F1	0.5198	0.4944	0.4948	0.4904	0.4965	0.4189	0.4977	0.4930	0.5192
	G-mean	0.7035	0.6656	0.6579	0.6614	0.6555	0.5816	0.6911	0.6573	0.6849
The number in bold in each row indicates the best performance in terms of different metrics.
Table 7. The AUC score of the Prosper dataset by RTs on different classifiers.
Classifiers	Metrics	NONE	ADA	SMO	BSMO	SSMO	CC	ENN	STOM	SENN
LR	AUC	0.9341	0.9468	0.9445	0.9468	0.9423	0.9387	0.9417	0.9447	0.9504
	Recall	0.8692	0.9130	0.8954	0.9030	0.9122	0.8865	0.8886	0.8957	0.9143
	F1	0.9274	0.9089	0.9294	0.9264	0.8903	0.9177	0.9285	0.9295	0.9230
	G-mean	0.9318	0.9462	0.9433	0.9458	0.9418	0.9372	0.9402	0.9434	0.9497
KNN	AUC	0.8223	0.8561	0.8667	0.8655	0.8471	0.8681	0.8434	0.8667	0.8567
	Recall	0.6579	0.8642	0.8435	0.8514	0.7976	0.7924	0.7159	0.8435	0.8710
	F1	0.7631	0.6596	0.7055	0.6949	0.6899	0.7647	0.7696	0.7056	0.6563
	G-mean	0.8057	0.8560	0.8664	0.8654	0.8456	0.8647	0.8337	0.8664	0.8566
SVC	AUC	0.9265	0.9469	0.9447	0.9454	0.9472	0.9422	0.9333	0.9446	0.9447
	Recall	0.8550	0.9072	0.8983	0.9025	0.9135	0.8936	0.8763	0.8983	0.9101
	F1	0.9171	0.9193	0.9248	0.9206	0.9098	0.9214	0.9108	0.9246	0.9044
	G-mean	0.9238	0.9461	0.9436	0.9444	0.9466	0.9409	0.9316	0.9435	0.9441
NB	AUC	0.9422	0.9421	0.9422	0.9418	0.9406	0.9420	0.9441	0.9420	0.9445
	Recall	0.8886	0.8886	0.8889	0.8896	0.8907	0.8847	0.8991	0.8886	0.9043
	F1	0.9309	0.9300	0.9301	0.9268	0.9194	0.9371	0.9208	0.9299	0.9131
	G-mean	0.9407	0.9405	0.9407	0.9403	0.9393	0.9402	0.9431	0.9405	0.9436
DT	AUC	0.9507	0.9527	0.9512	0.9501	0.9490	0.8347	0.9526	0.9514	0.9525
	Recall	0.9190	0.9277	0.9250	0.9245	0.9227	0.9526	0.9250	0.9256	0.9308
	F1	0.9160	0.9101	0.9080	0.9038	0.9021	0.5669	0.9144	0.9079	0.9041
	G-mean	0.9502	0.9524	0.9508	0.9497	0.9486	0.8264	0.9522	0.9510	0.9523
RF	AUC	0.9487	0.9582	0.9558	0.9565	0.9586	0.9032	0.9525	0.9562	0.9600
	Recall	0.8988	0.9242	0.9180	0.9201	0.9250	0.9520	0.9085	0.9193	0.9287
	F1	0.9433	0.9415	0.9418	0.9410	0.9418	0.7114	0.9433	0.9414	0.9417
	G-mean	0.9474	0.9576	0.9551	0.9558	0.9580	0.9018	0.9515	0.9555	0.9594
XGB	AUC	0.9608	0.9604	0.9607	0.9613	0.9634	0.8052	0.9623	0.9611	0.9618
	Recall	0.9245	0.9242	0.9248	0.9261	0.9313	0.9785	0.9287	0.9256	0.9298
	F1	0.9538	0.9519	0.9527	0.9529	0.9536	0.5138	0.9529	0.9530	0.9484
	G-mean	0.9601	0.9597	0.9601	0.9606	0.9629	0.7864	0.9617	0.9604	0.9612
LGBM	AUC	0.9591	0.9615	0.9611	0.9622	0.9625	0.8045	0.9605	0.9614	0.9618
	Recall	0.9206	0.9298	0.9282	0.9308	0.9324	0.9798	0.9253	0.9290	0.9313
	F1	0.9528	0.9471	0.9479	0.9485	0.9473	0.5125	0.9505	0.9482	0.9458
	G-mean	0.9583	0.9610	0.9605	0.9617	0.9621	0.7852	0.9598	0.9609	0.9614
CAT	AUC	0.9581	0.9594	0.9585	0.9606	0.9611	0.8005	0.9593	0.9579	0.9608
	Recall	0.9187	0.9224	0.9203	0.9250	0.9258	0.9785	0.9227	0.9195	0.9271
	F1	0.9514	0.9508	0.9505	0.9519	0.9524	0.5075	0.9497	0.9491	0.9485
	G-mean	0.9573	0.9587	0.9578	0.9600	0.9604	0.7805	0.9586	0.9572	0.9602
BBC	AUC	0.9631	0.9588	0.9562	0.9563	0.9587	0.8796	0.9629	0.9563	0.9586
	Recall	0.9358	0.9242	0.9182	0.9185	0.9245	0.9505	0.9355	0.9187	0.9261
	F1	0.9434	0.9443	0.9431	0.9432	0.9433	0.6552	0.9431	0.9428	0.9400
	G-mean	0.9627	0.9582	0.9554	0.9556	0.9581	0.8768	0.9625	0.9556	0.9580
BRF	AUC	0.9618	0.9577	0.9562	0.9568	0.9585	0.9061	0.9617	0.9567	0.9586
	Recall	0.9368	0.9237	0.9187	0.9201	0.9248	0.9533	0.9368	0.9198	0.9258
	F1	0.9355	0.9401	0.9420	0.9424	0.9422	0.7181	0.9352	0.9426	0.9407
	G-mean	0.9614	0.9571	0.9554	0.9561	0.9579	0.9049	0.9614	0.9560	0.9581
The number in bold in each row indicates the best performance in terms of different metrics.
Table 8. The AUC score of the LC dataset by RTs on different classifiers.
Classifiers	Metrics	NONE	ADA	SMO	BSMO	SSMO	CC	ENN	STOM	SENN
LR	AUC	0.5707	0.7766	0.7745	0.7659	0.7615	0.7686	0.5937	0.7745	0.7784
	Recall	0.1421	0.7486	0.7350	0.6284	0.5792	0.8087	0.1885	0.7350	0.7869
	F1	0.2402	0.0981	0.1007	0.1511	0.2117	0.0786	0.2968	0.1007	0.0889
	G-mean	0.3768	0.7761	0.7734	0.7534	0.7393	0.7675	0.4339	0.7734	0.7783
KNN	AUC	0.5065	0.6067	0.6105	0.5879	0.5873	0.5741	0.5275	0.6105	0.6118
	Recall	0.0137	0.3470	0.3525	0.2213	0.2186	0.1831	0.0574	0.3525	0.3934
	F1	0.0258	0.0655	0.0675	0.1014	0.1027	0.1019	0.0940	0.0675	0.0599
	G-mean	0.1168	0.5483	0.5533	0.4596	0.4571	0.4203	0.2393	0.5533	0.5715
SVC	AUC	0.5164	0.6826	0.6887	0.6787	0.7020	0.7313	0.5454	0.6887	0.7021
	Recall	0.0328	0.4262	0.4426	0.3880	0.4344	0.8579	0.0929	0.4426	0.4918
	F1	0.0632	0.1509	0.1488	0.2214	0.2454	0.0588	0.1504	0.1488	0.1302
	G-mean	0.1811	0.6326	0.6433	0.6133	0.6490	0.7203	0.3045	0.6433	0.6699
NB	AUC	0.5000	0.6484	0.6521	0.6455	0.5452	0.6364	0.5000	0.6521	0.6551
	Recall	0.0000	0.5710	0.5683	0.4235	0.0984	0.7104	0.0000	0.5683	0.6175
	F1	0.0000	0.0556	0.0573	0.0800	0.1190	0.0444	0.0000	0.0573	0.0540
	G-mean	0.0000	0.6438	0.6467	0.6061	0.3124	0.6321	0.0000	0.6467	0.6540
DT	AUC	0.6617	0.6213	0.6364	0.6470	0.6387	0.5009	0.6824	0.6364	0.6805
	Recall	0.3361	0.2678	0.2978	0.3142	0.2951	0.9836	0.3798	0.2978	0.4016
	F1	0.3037	0.1782	0.1971	0.2312	0.2338	0.0281	0.3152	0.1971	0.1907
	G-mean	0.5760	0.5109	0.5389	0.5548	0.5384	0.1338	0.6117	0.5389	0.6207
RF	AUC	0.5983	0.5680	0.5748	0.5678	0.5582	0.5005	0.6194	0.5748	0.5872
	Recall	0.1967	0.1366	0.1503	0.1366	0.1175	1.0000	0.2404	0.1503	0.1776
	F1	0.3258	0.2320	0.2506	0.2273	0.1963	0.0281	0.3548	0.2506	0.2539
	G-mean	0.4435	0.3695	0.3875	0.3694	0.3426	0.0327	0.4899	0.3875	0.4207
XGB	AUC	0.6566	0.6485	0.6484	0.6607	0.6617	0.5009	0.6738	0.6484	0.6822
	Recall	0.3142	0.2978	0.2978	0.3224	0.3251	1.0000	0.3497	0.2978	0.3689
	F1	0.4519	0.4395	0.4369	0.4627	0.4508	0.0282	0.4655	0.4369	0.4390
	G-mean	0.5602	0.5455	0.5455	0.5675	0.5697	0.0435	0.5907	0.5455	0.6060
LGBM	AUC	0.6656	0.6405	0.6405	0.6626	0.6680	0.4996	0.6757	0.6405	0.6676
	Recall	0.3333	0.2814	0.2814	0.3279	0.3388	0.9973	0.3552	0.2814	0.3388
	F1	0.4494	0.4301	0.4310	0.4340	0.4436	0.0281	0.4407	0.4310	0.4261
	G-mean	0.5767	0.5304	0.5304	0.5718	0.5813	0.0444	0.5949	0.5304	0.5810
CAT	AUC	0.6541	0.6527	0.6487	0.6568	0.6539	0.5003	0.6698	0.6487	0.6677
	Recall	0.3087	0.3060	0.2978	0.3142	0.3087	1.0000	0.3415	0.2978	0.3388
	F1	0.4566	0.4562	0.4476	0.4618	0.4484	0.0281	0.4647	0.4476	0.4321
	G-mean	0.5555	0.5530	0.5456	0.5604	0.5554	0.0259	0.5839	0.5456	0.5811
BBC	AUC	0.7641	0.5962	0.6013	0.6354	0.6210	0.5004	0.7798	0.6013	0.6350
	Recall	0.6148	0.1995	0.2104	0.2760	0.2459	0.9836	0.6475	0.2104	0.2814
	F1	0.1618	0.2362	0.2399	0.3378	0.3232	0.0281	0.1676	0.2399	0.2721
	G-mean	0.7494	0.4450	0.4569	0.5240	0.4949	0.1300	0.7685	0.4569	0.5275
BRF	AUC	0.7901	0.5707	0.5680	0.5624	0.5583	0.5006	0.7945	0.5680	0.5929
	Recall	0.7705	0.1421	0.1366	0.1257	0.1175	1.0000	0.7842	0.1366	0.1885
	F1	0.1032	0.2380	0.2309	0.2120	0.1991	0.0281	0.1027	0.2309	0.2738
	G-mean	0.7899	0.3768	0.3695	0.3544	0.3426	0.0344	0.7945	0.3695	0.4336
The number in bold in each row indicates the best performance in terms of different metrics.
Table 9. Kruskal–Wallis test for RTs in terms of different datasets.
RTs	Datasets	Median	Statistic	p-Value	Cohen's f Value
NONE	German	0.652	24.546	0.000 *** 1	0.289
	Taiwan	0.654
	Pros	0.951
	LC	0.654
	Total	0.66
ADA	German	0.652	27.712	0.000 ***	0.302
	Taiwan	0.679
	Pros	0.958
	LC	0.64
	Total	0.674
SMO	German	0.661	28.913	0.000 ***	0.304
	Taiwan	0.677
	Pros	0.956
	LC	0.64
	Total	0.672
BSMO	German	0.662	26.744	0.000 *** 1	0.303
	Taiwan	0.678
	Pros	0.956
	LC	0.647
	Total	0.678
SSMO	German	0.658	27.717	0.000 ***	0.3
	Taiwan	0.69
	Pros	0.958
	LC	0.639
	Total	0.678
CC	German	0.681	31.928	0.000 ***	0.281
	Taiwan	0.614
	Pros	0.88
	LC	0.501
	Total	0.635
ENN	German	0.667	29.498	0.000 ***	0.29
	Taiwan	0.7
	Pros	0.953
	LC	0.67
	Total	0.691
STOM	German	0.667	28.805	0.000 ***	0.304
	Taiwan	0.681
	Pros	0.956
	LC	0.64
	Total	0.68
SENN	German	0.657	28.258	0.000 ***	0.303
	Taiwan	0.697
	Pros	0.959
	LC	0.668
	Total	0.697
1 *** represents 1% significance level.
Table 10. Post hoc test for SENN with different datasets.
Group A	Group B	Statistic	p-Value	Cohen's d Value
SENN-German	SENN-Taiwan	24	0.033 ** 2	1.399
SENN-German	SENN-Pros	0	0.000 *** 3	9.341
SENN-German	SENN-LC	64	1.636	0.015
SENN-Taiwan	SENN-Pros	0	0.000 ***	10.821
SENN-Taiwan	SENN-LC	94	0.056 * 1	0.85
SENN-Pros	SENN-LC	121	0.000 ***	6.496
1 * represents 10% significance level. 2 ** represents 5% significance level. 3 *** represents 1% significance level.
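Tables 9 and 10 compare a single technique across datasets rather than across techniques, which is why a Kruskal–Wallis test [49] with pairwise Mann–Whitney comparisons [50] is used. The sketch below illustrates that setup on placeholder AUC arrays (one value per classifier and dataset); it does not reproduce the study's numbers or its exact correction.

# Kruskal-Wallis across datasets for one technique, then pairwise Mann-Whitney U tests.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder AUC scores of SENN for the 11 classifiers on each dataset.
senn_auc = {
    "German":  rng.uniform(0.60, 0.72, size=11),
    "Taiwan":  rng.uniform(0.65, 0.72, size=11),
    "Prosper": rng.uniform(0.93, 0.97, size=11),
    "LC":      rng.uniform(0.58, 0.78, size=11),
}

h, p = stats.kruskal(*senn_auc.values())
print(f"Kruskal-Wallis H = {h:.3f}, p = {p:.4f}")

# Pairwise Mann-Whitney U tests as the post hoc step (uncorrected here).
for a, b in combinations(senn_auc, 2):
    u, p_pair = stats.mannwhitneyu(senn_auc[a], senn_auc[b])
    print(f"{a} vs. {b}: U = {u:.1f}, p = {p_pair:.4f}")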
Table 11. Kruskal–Wallis test for RTs in terms of different classifiers.
RTs	Classifier Group	Median	Statistic	p-Value	Cohen's f Value
NONE	Single	0.641	10.306	0.006 *** 2	0.079
	Ensemble	0.661
	Balanced	0.736
	Total	0.66
ADA	Single	0.678	0.762	0.683	0.014
	Ensemble	0.678
	Balanced	0.661
	Total	0.674
SMO	Single	0.679	0.65	0.722	0.02
	Ensemble	0.675
	Balanced	0.662
	Total	0.672
BSMO	Single	0.688	0.492	0.782	0.012
	Ensemble	0.676
	Balanced	0.662
	Total	0.678
SSMO	Single	0.688	1.446	0.485	0.021
	Ensemble	0.684
	Balanced	0.653
	Total	0.678
CC	Single	0.653	1.857	0.395	0.043
	Ensemble	0.617
	Balanced	0.643
	Total	0.635
ENN	Single	0.674	4.88	0.087 * 1	0.063
	Ensemble	0.7
	Balanced	0.735
	Total	0.691
STOM	Single	0.694	0.304	0.859	0.016
	Ensemble	0.68
	Balanced	0.671
	Total	0.68
SENN	Single	0.688	0.518	0.772	0.008
	Ensemble	0.691
	Balanced	0.7
	Total	0.697
1 * represents 10% significance level. 2 *** represents 1% significance level.
Table 12. Post hoc test for NONE with classifier groups.
Group A	Group B	Statistic	p-Value	Cohen's d Value
NONE-Single	NONE-Ensemble	87	0.040 ** 2	0.355
NONE-Single	NONE-Balanced	31	0.025 **	0.692
NONE-Ensemble	NONE-Balanced	30	0.076 * 1	0.361
1 * represents 10% significance level. 2 ** represents 5% significance level.
Table 13. The AUC score of SENN and SH-SENN by different classifiers on the LC dataset.
Classifiers	SENN	SH-SENN (0.1)	SH-SENN (0.2)	SH-SENN (0.3)	SH-SENN (0.4)	SH-SENN (0.5)	SH-SENN (0.6)	SH-SENN (0.7)	SH-SENN (0.8)	SH-SENN (0.9)
LR	0.7784	0.7174	0.7494	0.7651	0.784	0.7848	0.7815	0.7784	0.7768	0.7767
KNN	0.6118	0.59	0.6088	0.5961	0.6126	0.6121	0.6097	0.6052	0.6092	0.6073
SVM	0.7021	0.6701	0.6956	0.7065	0.7071	0.7093	0.7161	0.7136	0.7122	0.6992
NB	0.6551	0.5481	0.5934	0.6161	0.631	0.6545	0.6499	0.6576	0.6573	0.6546
DT	0.6805	0.68	0.6863	0.6745	0.6597	0.6703	0.6604	0.646	0.6829	0.6329
RF	0.5872	0.6237	0.6398	0.6301	0.6302	0.6315	0.617	0.6165	0.6021	0.5984
XGB	0.6822	0.6863	0.6834	0.6874	0.6861	0.6956	0.6889	0.6833	0.6833	0.6824
LGBM	0.6676	0.6808	0.6768	0.6714	0.6754	0.6783	0.6742	0.6671	0.6781	0.6703
CAT	0.6677	0.685	0.6823	0.685	0.685	0.6797	0.6852	0.6825	0.6797	0.6745
BBC	0.635	0.7169	0.7270	0.703	0.6933	0.6917	0.6777	0.6811	0.6739	0.6358
BRF	0.5929	0.7194	0.6916	0.6617	0.6355	0.6308	0.6112	0.6139	0.606	0.5996
The number in bold in each row indicates the best performance in terms of different metrics.
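The columns SH-SENN (0.1) through SH-SENN (0.9) in Table 13 correspond to sweeping the SMOTE ratio of the hybrid method. A sweep of this kind could be scripted as below, reusing the hypothetical sh_senn helper sketched after Figure 2; the classifier and scoring choices here are placeholders rather than the experimental protocol.

# Sketch of a ratio sweep: resample the training split at each SMOTE ratio,
# fit a classifier, and score AUC on the untouched test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def sweep_sh_senn(X_train, y_train, X_test, y_test):
    # sh_senn is the hypothetical SMOTE + repeated-ENN helper defined earlier.
    scores = {}
    for ratio in np.arange(0.1, 1.0, 0.1):
        X_res, y_res = sh_senn(X_train, y_train, ratio=ratio)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        proba = clf.predict_proba(X_test)[:, 1]
        scores[round(float(ratio), 1)] = roc_auc_score(y_test, proba)
    return scores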
Table 14. Post hoc test for SH-SENN with RTs on the LC dataset.
Pairs	Statistic	p-Value	Cohen's d Value
NONE vs. SH-SENN (0.5)	4.97	0.046 ** 2	0.654
ADA vs. SH-SENN (0.4)	5.027	0.040 **	0.66
ADA vs. SH-SENN (0.5)	5.789	0.005 *** 3	0.731
ADA vs. SH-SENN (0.6)	4.857	0.060 * 1	0.598
SMO vs. SH-SENN (0.5)	5.281	0.022 **	0.686
BSMO vs. SH-SENN (0.4)	4.801	0.068 *	0.568
BSMO vs. SH-SENN (0.5)	5.563	0.010 **	0.639
BSMO vs. SH-SENN (0.6)	4.631	0.096 *	0.508
SSMO vs. SH-SENN (0.4)	5.14	0.031 **	0.693
SSMO vs. SH-SENN (0.5)	5.902	0.004 ***	0.758
SSMO vs. SH-SENN (0.6)	4.97	0.046 **	0.637
SSMO vs. SH-SENN (0.8)	4.688	0.086 *	0.618
CC vs. SH-SENN (0.2)	5.196	0.027 **	1.404
CC vs. SH-SENN (0.3)	4.857	0.060 *	1.358
CC vs. SH-SENN (0.4)	5.93	0.004 ***	1.356
CC vs. SH-SENN (0.5)	6.693	0.001 ***	1.405
CC vs. SH-SENN (0.6)	5.761	0.006 ***	1.311
CC vs. SH-SENN (0.7)	4.801	0.068 *	1.284
CC vs. SH-SENN (0.8)	5.478	0.013 **	1.296
STOM vs. SH-SENN (0.5)	5.281	0.022 **	0.686
1 * represents 10% significance level. 2 ** represents 5% significance level. 3 *** represents 1% significance level.
Table 15. The AUC score of SENN and SH-SENN by different classifiers on the Prosper dataset.
Classifiers	SENN	SH-SENN (0.3)	SH-SENN (0.4)	SH-SENN (0.5)	SH-SENN (0.6)	SH-SENN (0.7)	SH-SENN (0.8)	SH-SENN (0.9)
LR	0.9504	0.9324	0.9410	0.9447	0.9464	0.9476	0.9492	0.9503
KNN	0.8567	0.8464	0.8612	0.8621	0.8631	0.8630	0.8585	0.8568
SVM	0.9447	0.9198	0.9306	0.9370	0.9385	0.9414	0.9414	0.9448
NB	0.9445	0.9444	0.9449	0.9450	0.9450	0.9448	0.9443	0.9447
DT	0.9525	0.9486	0.9517	0.9531	0.9532	0.9542	0.9528	0.9515
RF	0.9600	0.9458	0.9483	0.9544	0.9570	0.9583	0.9598	0.9587
XGB	0.9618	0.9469	0.9524	0.9565	0.9589	0.9604	0.9631	0.9626
LGB	0.9618	0.9473	0.9514	0.9557	0.9581	0.9602	0.9611	0.9623
CAT	0.9608	0.9462	0.9491	0.9534	0.9568	0.9583	0.9598	0.9612
BBC	0.9586	0.9487	0.9527	0.9576	0.9587	0.9574	0.9587	0.9599
BRF	0.9586	0.9489	0.9547	0.9584	0.9589	0.9596	0.9592	0.9585
The number in bold in each row indicates the best performance in terms of different metrics.
Table 16. Post hoc test for SH-SENN with RTs on the Prosper dataset.
Pairs	Statistic	p-Value	Cohen's d Value
ADA vs. SH-SENN (0.3)	5.098	0.028 ** 2	0.374
BSMO vs. SH-SENN (0.3)	4.686	0.071 * 1	0.401
SSMO vs. SH-SENN (0.3)	5.13	0.026 **	0.324
CC vs. SH-SENN (0.7)	4.845	0.050 *	1.588
CC vs. SH-SENN (0.8)	5.605	0.007 *** 3	1.575
CC vs. SH-SENN (0.9)	6.301	0.001 ***	1.575
SENN vs. SH-SENN (0.3)	6.365	0.001 ***	0.403
SENN vs. SH-SENN (0.4)	4.591	0.087 *	0.229
1 * represents 10% significance level. 2 ** represents 5% significance level. 3 *** represents 1% significance level.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
