4.2. Evaluation Metrics
To comprehensively evaluate both the baseline classifiers (logistic regression, random forest, and XGBoost) and the proposed ensemble framework (Stacking), a set of six well-established performance metrics was employed: accuracy, recall, F1-Score, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), G-Mean, and Kolmogorov–Smirnov (KS) statistic. These metrics collectively capture both overall correctness and the discriminative capability of the models under class-imbalanced conditions.
- (1) Multi-dimensional Evaluation Metrics
Accuracy, recall, and the F1-Score were adopted as core scalar indicators of classification performance. Accuracy measures the overall proportion of correctly classified instances, recall quantifies the model's ability to identify actual defaulters, and the F1-Score provides a balanced summary by combining precision and recall through their harmonic mean. Their definitions are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
- (2) Curve-based Evaluation Metrics
To further assess discriminative power, the AUC–ROC was employed. The ROC curve illustrates the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across varying classification thresholds:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

The AUC–ROC aggregates this relationship into a single scalar value:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR}),$$

where higher values indicate stronger separation between defaulters and non-defaulters.
- (3) Geometric Mean (G-Mean)
The G-Mean metric evaluates the balance between sensitivity (recall) and specificity (true negative rate). It is particularly useful in imbalanced datasets, ensuring that the model performs well for both positive and negative classes:

$$\text{G-Mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}},$$

where $\mathrm{Sensitivity} = TP/(TP + FN)$ and $\mathrm{Specificity} = TN/(TN + FP)$. A higher G-Mean indicates that the classifier maintains consistent accuracy across both majority and minority classes.
- (4) KS Statistic
The Kolmogorov–Smirnov (KS) statistic measures the maximum distance between the cumulative distributions of predicted scores for defaulters and non-defaulters:

$$\mathrm{KS} = \max_{s} \left| F_{\mathrm{good}}(s) - F_{\mathrm{bad}}(s) \right|,$$

where $F_{\mathrm{good}}(s)$ and $F_{\mathrm{bad}}(s)$ denote the cumulative distribution functions of scores for non-defaulters and defaulters, respectively. A larger KS value reflects better discriminatory ability, with values above 0.3 generally regarded as acceptable in practical credit scoring applications.
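To make the computation of these metrics concrete, the following is a minimal sketch using scikit-learn and NumPy; the names y_true and y_prob and the 0.5 decision threshold are illustrative assumptions rather than details taken from this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             recall_score, roc_auc_score, roc_curve)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the six metrics used for binary default prediction."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)               # recall (TPR)
    specificity = tn / (tn + fp)               # true negative rate
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall":   recall_score(y_true, y_pred),
        "f1":       f1_score(y_true, y_pred),
        "auc_roc":  roc_auc_score(y_true, y_prob),
        "g_mean":   np.sqrt(sensitivity * specificity),
        "ks":       float(np.max(tpr - fpr)),  # max gap between score CDFs
    }
```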
Overall, these six complementary metrics provide a balanced and interpretable assessment of model performance, enabling a robust comparison between single learners and the proposed stacking ensemble.
4.3. Exploratory Data Analysis and Pre-Processing Results
This section presents the exploratory data analysis (EDA) conducted specifically on the GMSC dataset, along with the corresponding pre-processing strategies applied to improve data quality and ensure robust model training. The GMSC dataset comprises eleven predictive features (see Table 1) in addition to the binary target label. The EDA results reveal several important characteristics, including pronounced class imbalance, the existence of outliers, and heterogeneous patterns of missing values across features. These empirical observations directly guided the design of the subsequent pre-processing procedures and modeling pipeline, ensuring that the derived features are both statistically sound and operationally meaningful for credit default prediction.
Target distribution. As shown in Figure 3a, the dataset exhibits a pronounced class imbalance, where non-defaulters substantially outnumber defaulters. To alleviate this skew, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training set during pre-processing, generating synthetic minority samples to achieve a more balanced class distribution while preventing data leakage.
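A minimal sketch of this leakage-free resampling step is shown below, using imbalanced-learn's SMOTE; the synthetic stand-in data and split parameters are illustrative assumptions, not the study's settings.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the GMSC features and labels (roughly 7% defaulters).
X, y = make_classification(n_samples=10_000, weights=[0.93, 0.07], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority class in the training split only; the test split
# keeps its original imbalanced distribution, which prevents data leakage.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```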
Age. The distribution of Age (Figure 3b) approximates a normal shape but contains clear anomalies. Implausible records include borrowers with non-positive ages as well as ages above 96 years, with the maximum observed at 109. Such outliers are considered noise and are removed. Table 2 further shows the default rates across age intervals: individuals aged 18–40 exhibit the highest default rate (10.08%), and the rate declines monotonically with increasing age.
Credit utilization (Revolving Utilization Of Unsecured Lines). As a percentage, this variable should theoretically lie in [0, 1], with values exceeding one indicating overdrawn accounts. To facilitate visualization, the horizontal axis in Figure 3c was scaled by multiplying by 10,000. A substantial fraction of data points fall above one, which may stem from missing denominators or reporting errors. The anomalous region was therefore stratified into a series of bins, and Table 3 reports the default rates across them; the peak default rate (57.14%) suggests potential error thresholds. Consequently, values exceeding 20 are treated as outliers.
Debt ratio. Empirical evidence indicates that values of debt ratio above two are highly anomalous. As shown in Figure 3d and Table 4, the data distribution and corresponding default rates indicate that default risk peaks within an intermediate interval but stabilizes thereafter. Accordingly, values greater than two are considered outliers.
Real estate loans (Number Real Estate Loans Or Lines). The distribution (Figure 4a) shows that values of 50 or more are clear anomalies. Hence, an upper cutoff of 50 was applied.
Dependents (Number Of Dependents). This variable contains 3924 missing values (2.62%). Considering the relatively small proportion of missing entries, both deletion and simple imputation were initially feasible options. To determine the most appropriate strategy, we conducted a comparative experiment evaluating the effects of direct deletion versus median imputation under the same modeling pipeline.
Monthly income (Monthly Income). This feature exhibits 29,731 missing values (19.82%). Since the proportion approaches 20%, simple deletion would severely distort the dataset. Therefore, to preserve the predictive information contained in this variable, a random forest regression–based imputation method was employed to estimate the missing values using the remaining observed features.
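The following is a hedged sketch of such a random forest regression–based imputation on the public GMSC schema; the predictor set and hyperparameters are assumptions for illustration, not the study's exact configuration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_monthly_income(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Monthly Income values via random forest regression."""
    target = "MonthlyIncome"
    # Exclude the label and the other incomplete column so predictors contain no NaNs.
    predictors = [c for c in df.columns
                  if c not in (target, "SeriousDlqin2yrs", "NumberOfDependents")]
    known, unknown = df[df[target].notna()], df[df[target].isna()]
    rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
    rf.fit(known[predictors], known[target])
    df.loc[unknown.index, target] = rf.predict(unknown[predictors])
    return df
```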
Delinquency counts (Number Of Times 90 Days Late, Number Of Time 30–59 Days Past Due Not Worse, Number Of Time 60–89 Days Past Due Not Worse). As illustrated in Figure 4b and Table 5 and Table 6, extreme values exceeding 90 are implausible and were treated as anomalies. These adjustments ensure that the processed dataset is both statistically reliable and suitable for subsequent model training.
As detailed in the preceding EDA analysis, all thresholds for outlier detection and discretization were determined empirically based on the observed feature distributions and default-rate patterns. Following these observations, a systematic pre-processing pipeline was developed to enhance data quality and ensure stable model training.
To determine the most reliable missing-value strategy, a controlled experiment was conducted comparing two alternatives: (i) deletion of samples containing missing entries, and (ii) median imputation, both implemented under an identical pipeline comprising binning, WOE transformation, and subsequent model construction. Predictive performance was assessed using ROC–AUC and complementary metrics. As shown in Table 7, the deletion-based approach consistently outperformed median imputation, indicating that removing the limited number of missing samples yields more stable downstream performance. Consequently, this study adopts the deletion method for variables with few missing entries, while applying random forest–based imputation for variables with high missingness.
Following the exploratory analysis, additional data-cleaning procedures were applied to remove inconsistencies and implausible records. An unused index column (Unnamed: 0) was deleted, and duplicate rows were dropped. Rule-based filtering was used to exclude unrealistic values, including samples with non-positive ages, extreme delinquency counts (Number Of Time 30–59 Days Past Due Not Worse, Number Of Time 60–89 Days Past Due Not Worse, or Number Of Times 90 Days Late exceeding 90), and excessively large values of Number Real Estate Loans Or Lines (≥50). The variable Monthly Income was imputed via a random forest regressor trained on the remaining numerical features, while records missing Number Of Dependents were removed. These pre-processing steps ensured that the dataset was internally consistent and suitable for subsequent modeling.
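A minimal sketch of these cleaning rules is given below, assuming the public GMSC column names and the thresholds identified in the EDA above; the exact ordering of the steps is an assumption.

```python
import pandas as pd

def clean_gmsc(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the rule-based filters described in the pre-processing pipeline."""
    df = df.drop(columns=["Unnamed: 0"], errors="ignore").drop_duplicates()
    df = df[df["age"] > 0]                                      # non-positive ages
    late_cols = ["NumberOfTime30-59DaysPastDueNotWorse",
                 "NumberOfTime60-89DaysPastDueNotWorse",
                 "NumberOfTimes90DaysLate"]
    df = df[(df[late_cols] <= 90).all(axis=1)]                  # extreme delinquency counts
    df = df[df["NumberRealEstateLoansOrLines"] < 50]            # real estate loan cutoff
    df = df[df["RevolvingUtilizationOfUnsecuredLines"] <= 20]   # utilization outliers
    df = df[df["DebtRatio"] <= 2]                               # debt-ratio outliers
    return df.dropna(subset=["NumberOfDependents"])             # few missing entries
```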
4.5. Model Performance Evaluation
Experiment I: Baseline Performance Comparison. In this experiment, testing was conducted separately on two benchmark datasets, and the results are presented in Table 8 and Table 9. All reported values represent the mean ± 95% confidence interval obtained over five independent runs under identical experimental settings. The best results in each column are highlighted in bold.
The experimental findings indicate that the proposed Stacking Ensemble consistently achieves the highest overall performance across both datasets. For the GMSC dataset, the ensemble model attains the best scores in AUC–ROC (0.8725), F1-Score (0.9761), and KS (0.587), demonstrating enhanced discriminative capability in distinguishing defaulters from non-defaulters. The random forest exhibits the highest recall (0.9880), highlighting its sensitivity to minority (default) samples, while logistic regression provides stable AUC and precision values. The XGBoost model also demonstrates strong predictive capability, confirming the effectiveness of boosting-based methods for structured financial data. Its competitive performance complements the strengths of logistic regression and random forest, jointly contributing to the ensemble’s stable and balanced predictive behavior. A similar trend is observed on the German Credit dataset, where the proposed approach achieves superior results in AUC–ROC (0.8951), accuracy (0.892), and KS (0.6012). The relatively narrow confidence intervals across all models indicate that the observed performance differences are statistically consistent rather than caused by random variation.
Overall, the Stacking Ensemble effectively integrates the complementary strengths of base learners—including linear (logistic regression), bagging-based (random forest), and boosting-based (XGBoost) algorithms—providing stable and reliable predictive performance across two credit default datasets.
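For reference, a minimal sketch of such a two-layer stacking configuration in scikit-learn is shown below; the hyperparameters and cross-validation settings are illustrative assumptions rather than the study's tuned values.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Base learners: linear (LR), bagging-based (RF), and boosting-based (XGBoost);
# the meta-learner is a logistic regression fitted on their out-of-fold probabilities.
stacking_model = StackingClassifier(
    estimators=[
        ("lr",  LogisticRegression(max_iter=1000)),
        ("rf",  RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
# Usage: stacking_model.fit(X_train, y_train); stacking_model.predict_proba(X_test)[:, 1]
```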
The fitted meta-learner coefficients give an overview of how important the different models are within the ensemble. To further examine their relative contributions, we analyzed the learned coefficients across the two benchmark datasets, as summarized in Table 10. For the GMSC dataset, the logistic regression base learner receives a negative weight, whereas both random forest and XGBoost obtain strong positive coefficients. This suggests that the meta-level model counterbalances the linear learner's overconfident predictions by emphasizing the nonlinear corrections provided by the tree-based models. In contrast, for the German Credit dataset, all base learners exhibit positive coefficients, with random forest again showing the largest contribution. This indicates that after domain-constrained pre-processing, such as WOE transformation and monotonic binning, the data become more linearly separable, allowing all models to contribute synergistically to the final prediction. Overall, the results confirm that while stacking ensembles consistently benefit from nonlinear learners, the relative importance and sign of each base model remain data-dependent. The meta-learner adaptively allocates weights according to the intrinsic structure and separability of each dataset, achieving a well-balanced trade-off between interpretability and predictive performance.
Experiment II: Model Uncertainty Estimation. Traditional classifiers output a probability score for each class, yet such scores often fail to express the model’s confidence in its predictions. This limitation becomes particularly critical in financial risk management. For example, a predicted probability of 0.6 may carry different implications depending on whether it is consistently estimated or varies widely across models. Uncertainty estimation thus complements conventional performance metrics by quantifying prediction reliability and stability.
Gal and Ghahramani [47] introduced dropout-based Bayesian approximations for uncertainty estimation, leveraging predictive variance or entropy as proxies. Inspired by their framework, this study applied uncertainty estimation in a structured-data setting to enhance model interpretability and robustness.
- (1) Random Forest Ensembles
For random forest models, uncertainty was quantified by the dispersion of predicted probabilities across individual trees. A high degree of consensus among trees implies low uncertainty, whereas divergent outputs indicate high uncertainty.
Let $M$ denote the number of independently trained random forest sub-models. For a test sample $x$, the $m$-th sub-model outputs the probability $p_m(x)$ of belonging to the positive class. The mean prediction and standard deviation are given by the following equations:

$$\bar{p}(x) = \frac{1}{M} \sum_{m=1}^{M} p_m(x), \qquad \sigma(x) = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left( p_m(x) - \bar{p}(x) \right)^2},$$

where $\bar{p}(x)$ reflects the consensus probability, while $\sigma(x)$ quantifies the model's predictive uncertainty. A higher $\sigma(x)$ implies greater disagreement among base learners and lower prediction confidence. The distribution of $\sigma(x)$ values across the dataset is presented in Figure 6.
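A sketch of this dispersion-based estimate is given below: $M$ random forests trained with different seeds, with the per-sample mean and standard deviation of their positive-class probabilities; the value of M and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_uncertainty(X_train, y_train, X_test, M=10):
    """Mean prediction and dispersion across M independently seeded forests."""
    probs = []
    for m in range(M):
        rf = RandomForestClassifier(n_estimators=200, random_state=m, n_jobs=-1)
        rf.fit(X_train, y_train)
        probs.append(rf.predict_proba(X_test)[:, 1])
    probs = np.vstack(probs)             # shape: (M, n_samples)
    p_mean = probs.mean(axis=0)          # consensus probability
    p_std = probs.std(axis=0)            # predictive uncertainty (sigma)
    return p_mean, p_std
```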
- (2) XGBoost Model
Gradient-boosting models such as XGBoost iteratively reduce residual errors through sequential additive learning. However, they typically output point estimates without explicit uncertainty quantification. To assess prediction reliability, this study adopted a bootstrapped ensemble inference strategy for XGBoost. Multiple models were trained on resampled subsets of the training data, and the variability among their predicted probabilities represented epistemic uncertainty.
Let $T$ denote the number of bootstrap models. For each sample $x$, the $t$-th model yields $p_t(x)$, and the aggregated mean and dispersion are computed as follows:

$$\bar{p}(x) = \frac{1}{T} \sum_{t=1}^{T} p_t(x), \qquad \sigma(x) = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( p_t(x) - \bar{p}(x) \right)^2},$$

where $\bar{p}(x)$ denotes the average predicted probability, while $\sigma(x)$ captures model disagreement. Figure 7 illustrates the resulting uncertainty distribution, where a sharp concentration near zero indicates that most XGBoost predictions are highly confident. Compared with random forest, XGBoost exhibits a slightly narrower uncertainty spread, reflecting smoother probabilistic calibration and stronger regularization effects.
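The corresponding bootstrapped-ensemble procedure for XGBoost can be sketched as follows; the number of bootstrap models T and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import resample
from xgboost import XGBClassifier

def xgb_bootstrap_uncertainty(X_train, y_train, X_test, T=10):
    """Epistemic uncertainty from T XGBoost models trained on bootstrap resamples."""
    probs = []
    for t in range(T):
        X_bs, y_bs = resample(X_train, y_train, random_state=t)   # bootstrap sample
        model = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=t)
        model.fit(X_bs, y_bs)
        probs.append(model.predict_proba(X_test)[:, 1])
    probs = np.vstack(probs)
    return probs.mean(axis=0), probs.std(axis=0)   # average probability, disagreement
```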
- (3) Logistic Regression and Stacking Model

Unlike ensemble methods, single models such as logistic regression cannot exploit variance across sub-models. Instead, their uncertainty is quantified using the entropy of the predicted probability $\hat{p}(x)$:

$$H(x) = -\left[ \hat{p}(x) \log \hat{p}(x) + \left(1 - \hat{p}(x)\right) \log \left(1 - \hat{p}(x)\right) \right].$$

The Stacking ensemble in this study adopted logistic regression as the meta-learner. Its uncertainty estimation follows the same entropy-based approach:

$$H_{\mathrm{stack}}(x) = -\left[ \hat{p}_{\mathrm{meta}}(x) \log \hat{p}_{\mathrm{meta}}(x) + \left(1 - \hat{p}_{\mathrm{meta}}(x)\right) \log \left(1 - \hat{p}_{\mathrm{meta}}(x)\right) \right],$$

where $\hat{p}_{\mathrm{meta}}(x)$ denotes the meta-learner's predicted probability. Figure 8a,b illustrate the distribution of uncertainty scores for logistic regression and Stacking, respectively.
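The entropy-based score reduces to a one-line computation; a minimal sketch (with a numerical clip to avoid log(0)) is shown below.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Predictive entropy of a positive-class probability (natural log)."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0) for fully confident predictions
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Entropy is maximal at p = 0.5 and approaches zero for confident predictions,
# e.g. binary_entropy(np.array([0.5, 0.95, 0.01])).
```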
Uncertainty histograms reveal that lower median and mean entropy values correspond to higher model confidence. Moreover, a peak density near zero indicates that most predictions are highly certain. Combining classification metrics (Table 8 and Table 9) with uncertainty analyses (Figure 6, Figure 7 and Figure 8), the Stacking ensemble demonstrated superior overall performance. It not only achieved the highest AUC, KS, and F1-score but also maintained stable and reliable uncertainty distributions, outperforming both random forest and logistic regression. Therefore, the Stacking model was selected as the final classifier for subsequent applications.
Experiment III: Model Training Efficiency. To assess the computational efficiency of the proposed domain-constrained stacking framework, the training time of all compared models was measured on both datasets. The results, summarized in Table 11, reveal that simple models such as logistic regression and XGBoost exhibit very fast convergence due to their lightweight parameter structures and gradient-based optimization. Random forest requires slightly longer training time owing to iterative tree construction.
The Stacking Ensemble represents the total training time of the proposed two-layer pipeline, including the base learners (LR, RF, and XGBoost) as well as the meta-level logistic regression. On the larger GMSC dataset, the entire pipeline completes within 2.55 s, whereas on the smaller German Credit dataset, the total time is approximately 1.23 s, reflecting the smaller data scale and lower model complexity. Overall, the results indicate that the proposed framework maintains high computational efficiency and demonstrates promising applicability in real-world credit risk modeling scenarios.
Experiment IV: IV Threshold Sensitivity Analysis. To further evaluate the robustness of the domain-constrained feature selection strategy, a sensitivity analysis was conducted by varying the IV threshold used for variable retention. Since IV quantifies the predictive power of individual features under monotonic Weight-of-Evidence encoding, different thresholds directly affect the number of selected variables and the final model performance. The stacking ensemble was re-trained under multiple IV cutoffs to investigate the stability of classification metrics.
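A hedged sketch of the IV computation and threshold-based retention is given below; it assumes the features have already been discretized by the preceding binning step, and the smoothing constant and default cutoff are illustrative choices.

```python
import numpy as np
import pandas as pd

def information_value(binned: pd.Series, target: pd.Series, eps=1e-6) -> float:
    """IV of one binned feature against a binary target (1 = default)."""
    tab = pd.crosstab(binned, target)                    # rows: bins, columns: 0/1
    dist_good = (tab[0] + eps) / (tab[0].sum() + eps)    # share of non-defaulters per bin
    dist_bad  = (tab[1] + eps) / (tab[1].sum() + eps)    # share of defaulters per bin
    woe = np.log(dist_good / dist_bad)
    return float(((dist_good - dist_bad) * woe).sum())

def select_by_iv(binned_df: pd.DataFrame, target: pd.Series, threshold=0.2):
    """Retain only the features whose IV meets the chosen cutoff."""
    return [c for c in binned_df.columns
            if information_value(binned_df[c], target) >= threshold]
```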
Figure 9 illustrates the effect of IV thresholds on the GMSC dataset. As the IV threshold increases from 0.1 to 0.9, the number of retained variables gradually decreases from five to one. The model achieves its best overall discrimination performance around the threshold of 0.2, with both AUC and KS reaching peak values (0.8725 and 0.5879, respectively). Beyond this point, the removal of moderately informative variables leads to a decline in predictive capability, indicating that overly aggressive filtering may reduce model diversity.
A similar experiment was performed on the German Credit dataset (Figure 10). The results exhibit a consistent pattern, where moderate thresholds (0.09–0.11) yield the most balanced trade-off between AUC (0.8951) and KS (0.6012). When the threshold is too low, irrelevant features introduce noise; when it is too high, useful but weak predictors are discarded. These findings validate that the IV-based filtering mechanism is robust and that the model performance remains stable within a reasonable threshold range.
Overall, the sensitivity results confirm that the proposed domain-constrained stacking framework maintains stable classification capability under varying IV selection rules. This robustness demonstrates that the model is not overly sensitive to specific threshold choices, ensuring its applicability across heterogeneous credit datasets.
Experiment V: Ablation Study on Feature Encoding Strategies. To further quantify the contribution of domain-informed feature transformations, an ablation study was conducted using both the GMSC and German Credit datasets. The experiment compared three configurations representing increasing levels of domain constraint: a baseline using raw numerical and label-encoded categorical variables, a version with unconstrained WOE transformation, and the proposed monotonic WOE encoding that enforces binning consistency with credit risk domain logic. All models were trained under the same stacking ensemble framework consisting of logistic regression, random forest, and XGBoost as base learners to ensure fair comparison.
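To make the monotonic WOE configuration concrete, the following is a simplified sketch assuming quantile pre-binning and a merge-until-monotonic heuristic; the paper's actual binning routine may differ.

```python
import numpy as np
import pandas as pd

def monotonic_woe(x: pd.Series, y: pd.Series, n_bins=10, eps=1e-6):
    """Quantile-bin x, merge bins until the bad rate is monotonic, return WOE codes."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    while len(edges) > 2:
        bad_rate = y.groupby(pd.cut(x, edges, include_lowest=True), observed=True).mean()
        diffs = np.diff(bad_rate.values)
        if np.all(diffs >= 0) or np.all(diffs <= 0):      # monotonic trend reached
            break
        edges = np.delete(edges, int(np.argmin(np.abs(diffs))) + 1)   # merge flattest pair
    bins = pd.cut(x, edges, include_lowest=True)
    grp = y.groupby(bins, observed=True)
    dist_bad  = (grp.sum() + eps) / (y.sum() + eps)
    dist_good = (grp.count() - grp.sum() + eps) / ((y == 0).sum() + eps)
    woe = np.log(dist_good / dist_bad)                    # bin-level Weight of Evidence
    return bins.map(woe).astype(float), woe
```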
As shown in Table 12, the domain-constrained monotonic WOE encoding consistently achieved the best predictive performance across both datasets. For the GMSC dataset, the AUC improved from 0.8450 to 0.8725 and the KS statistic increased from 0.5594 to 0.5879, demonstrating a clear enhancement in discriminatory power. Similarly, for the German Credit dataset, the AUC increased from 0.8849 to 0.8951 and the KS statistic rose from 0.5603 to 0.6012. These consistent improvements confirm that incorporating domain knowledge through monotonic binning not only strengthens interpretability but also enhances model generalization and stability. Overall, the ablation results validate that the domain-constrained feature transformation serves as a critical component of the proposed framework, bridging traditional credit scoring principles with modern ensemble learning.