Mathematical and Computational Applications (MCA)
  • Article
  • Open Access

10 March 2026

Predictive Modelling of Credit Default Risk Using Machine Learning and Ensemble Techniques

1 African Institute for Mathematical Sciences, Cape Town 7950, South Africa
2 Department of Statistics and Operations Research, University of Limpopo, Polokwane 0727, South Africa
* Author to whom correspondence should be addressed.

Abstract

This study develops a hybrid framework integrating ensemble learning with explainable artificial intelligence to address the methodological challenge of balancing predictive accuracy and interpretability in credit risk model comparison. Using the German Credit Dataset, we implemented a comprehensive preprocessing pipeline, including feature encoding, scaling, and SMOTE for class-imbalance handling. Four base models (logistic regression, Random Forest, XGBoost, and Multilayer Perceptron) were combined through a Stacked Ensemble with a logistic regression meta-learner. The ensemble demonstrated strong performance, achieving an AUC of 0.761, precision of 0.783, recall of 0.806, and an F1 score of 0.794, the highest scores among all models tested. Notably, Random Forest (AUC = 0.749) surpassed XGBoost (AUC = 0.733), challenging conventional algorithmic hierarchies. SHAP analysis provided transparent global and local interpretability, identifying Current Account status (SHAP = 0.153), Loan Duration (0.064), and Savings Account (0.063) as the dominant predictor variables. Class-imbalance handling and threshold optimisation enhanced practical utility by reducing false positives from 39 to 16, aligning the models with financial risk priorities. The framework provides a reproducible methodological pipeline for systematically comparing credit scoring approaches, demonstrating how predictive performance can be evaluated alongside interpretability considerations within a benchmark dataset context.

1. Introduction

Credit risk refers to the likelihood that a borrower will fail to meet contractual repayment obligations, and it represents one of the most significant threats to the financial services industry [1]. Incorrect credit risk assessments in the lending process result in substantial lender losses, an increased rate of non-performing loans, and, in severe cases, systemic failure within the banking industry. Bad loans remain a significant challenge in global credit risk management. Although credit scoring methodologies must address model transparency, this study focuses on methodological development and comparison using established benchmark data.
Traditional methods for assessing credit risk have been grounded in statistical analysis, such as logistic regression and linear discriminant analysis. These methods remain popular because they pass regulatory scrutiny and are accepted as transparent and interpretable [2]. However, these models assume linearity and independence of predictors, assumptions that are seldom valid for borrower behaviour shaped by non-linear social and economic trends [3].
Credit risk modelling has evolved due to the availability of new computing technologies, large financial datasets, and increased speed of machine learning algorithms. Newer algorithms that can discover complex relationships in borrower data include Random Forests, Gradient Boosting Machines, Extreme Gradient Boosting, and Multilayer Perceptron Neural Networks [4]. Although they are more precise in their predictions compared to traditional models, they raise concerns about interpretability, explainability, and trust [5].
These challenges necessitate modelling platforms that are both understandable and capable of handling real-world complexity. This study offers a hybrid architecture based on the stacking ensemble approach to integrate simple and complex machine learning models and systematically test them. The Shapley Additive Explanations model is employed as a solution to understand and control machine learning black-boxes while upholding ethical guidelines [6,7].

1.1. Rationale

Financial institutions require credit-scoring models that deliver fair, transparent, and accurate predictions across diverse types of data. This need is driven by increasing pressure on the financial sector to adopt automated decisions as ML solutions become more prevalent [4]. A technically sound model will not pass regulatory suitability tests or gain public trust unless it can justify its predictions.
Recent studies have shown that explainable AI can preserve transparency while maintaining accuracy [5,8]. Additionally, ensemble learning has been shown to significantly improve predictive performance, particularly when data are imbalanced and noisy [9]. However, only a few studies, such as Begenau et al. [5], Farboodi and Veldkamp [8], and Ribeiro et al. [6], have systematically examined Stacked Ensemble models that combine multiple learners with post-hoc interpretability methods. This gap is the focus of the current study.

1.2. Review of Literature

Credit risk modelling has historically relied on traditional statistical methods such as logistic regression and linear discriminant analysis. These methods remain popular because they are transparent and easy to justify to regulators, but they rely on restrictive linearity and independence assumptions that are often violated in real borrower behaviour [2,10]. As financial datasets have grown in size and complexity, these classical models have shown limitations in capturing non-linear relationships and interactions [11].
Machine learning approaches have therefore gained prominence, with studies consistently demonstrating that algorithms such as Random Forests, Gradient Boosting Machines and deep neural networks outperform traditional models in predictive accuracy [4,12]. However, these models introduce a major drawback: they are considerably less interpretable, creating challenges for transparency, risk governance and regulatory compliance [13]. The improvement in accuracy often comes at the expense of explainability, which remains a critical requirement in credit decision-making.
Recent methodological surveys indicate a paradigm shift toward hybrid approaches that balance predictive accuracy with regulatory compliance. Tang et al. [14] systematically reviewed fairness-aware machine learning in credit scoring, finding that interpretability techniques like SHAP and LIME are increasingly deployed not merely for explanation but also for bias detection and mitigation. Similarly, Barocas et al. [13] documented the growing regulatory expectation that financial institutions implement “explainability by design” rather than treating interpretability as an afterthought. These developments create both opportunities and challenges for credit risk modelling, necessitating frameworks that integrate state-of-the-art predictive performance with robust, auditable decision processes.

1.2.1. Recent Advances in Tabular Machine Learning for Finance

The application of machine learning to structured, tabular financial data has seen significant methodological evolution in recent years. While deep learning has dominated image and text domains, tabular data characteristics of credit applications present distinct challenges, including mixed data types, missing values, and heterogeneous feature spaces [15]. Recent comparative studies reveal that tree-based ensemble methods, particularly gradient boosting variants, consistently outperform deep learning approaches on tabular datasets, challenging assumptions about neural network superiority across all data modalities [12]. In credit scoring specifically, Yang et al. [12] demonstrated that convolutional neural networks offer no significant advantage over carefully tuned gradient boosting machines, reinforcing the importance of algorithm selection grounded in data characteristics rather than technical trends.

1.2.2. Ensemble Learning Developments in Credit Risk

Ensemble methods have evolved beyond traditional bagging and boosting to include sophisticated stacking and blending techniques that leverage model diversity. Recent work by Kou et al. [4] provides a comprehensive meta-analysis of ensemble techniques in credit scoring, finding that Stacked Ensembles consistently outperform individual models when base learners exhibit complementary error patterns. However, they note increasing complexity–accuracy trade-offs that necessitate careful validation. The emergence of automated machine learning frameworks has further democratised ensemble construction, though concerns remain about interpretability and regulatory compliance [16]. Notably, Lessmann et al. [11] found that simpler ensembles often match or exceed the performance of more complex counterparts when applied to moderate-sized credit datasets, suggesting diminishing returns to complexity.
Ensemble learning, including bagging, boosting and stacking, has emerged as a strong alternative because it combines the strengths of multiple models to enhance predictive performance [11,17]. Stacking in particular has shown strong results in credit scoring but tends to increase model complexity, making the decision-making process even harder to interpret. Very few studies provide clear interpretability frameworks for Stacked Ensembles, despite their growing popularity.

1.2.3. Interpretability in Financial Machine Learning

The interpretability imperative in financial services has driven the rapid development of post-hoc explainable AI (XAI) techniques for complex models. SHAP (Shapley Additive Explanations) has emerged as a dominant framework due to its game-theoretic foundations and compatibility with diverse model architectures [7]. Recent applications in credit scoring demonstrate SHAP’s utility for both global feature importance analysis and individual decision explanations, addressing regulatory requirements under frameworks like the EU’s General Data Protection Regulation (GDPR) Right to Explanation [18]. However, Demirgüç-Kunt et al. [19] argue persuasively for inherently interpretable models in high-stakes domains like finance, creating an ongoing tension between accuracy and transparency. Hybrid approaches that combine interpretable models with local explanations represent a promising middle ground, gaining traction in financial applications [16]. Techniques such as LIME [6] provide complementary local approximations but lack SHAP’s theoretical coherence for global feature importance.

1.2.4. Fairness and Ethical Considerations

Fairness and ethical considerations further complicate credit scoring. Machine learning models can reproduce or amplify historical inequalities through proxy variables or biased data distributions [14]. Although interpretability can support fairness analysis, the combination of fairness, class-imbalance handling and ensemble explainability remains underexplored.

1.2.5. Gaps in Current Literature

Despite these advances, several gaps persist in credit risk modelling research. First, few studies systematically combine class-imbalance handling, threshold optimisation, and ensemble interpretability within a single framework. Second, comparative studies often neglect proper statistical testing of performance differences, reporting numerical advantages without significance assessment [20]. Third, implementation transparency and reproducibility remain under-documented in many publications, limiting practical adoption. Finally, there is limited research examining the stability of feature importance across different modelling paradigms in credit scoring contexts. This study aims to address these gaps by developing a comprehensive, statistically validated, and interpretable ensemble framework with full methodological transparency.

1.2.6. Positioning of Current Study

Overall, the literature demonstrates a clear gap. There is limited research that develops a class-imbalance-aware Stacked Ensemble model for credit risk, interprets the combined system using SHAP, and evaluates performance, explainability and fairness. This study aims to fill this gap by designing an interpretable ensemble framework that balances predictive strength with transparency and regulatory suitability.

1.3. Contribution of the Study

This study builds and evaluates a class-imbalance-aware Stacked Ensemble framework for credit default prediction, incorporating SHAP-based explanations for model transparency. Using the German Credit Dataset as a methodological testbed, we demonstrate how predictive performance, interpretability, and methodological rigour can be systematically compared within a unified framework for credit scoring methodology evaluation.
The study introduces a stacking framework that incorporates class-imbalance handling to reflect the unequal consequences of misclassification in credit risk. It provides one of the few applications of SHAP to a full stacked ensemble, offering interpretability at both the base-learner and meta-learner levels. The work further contributes by implementing a complete, reproducible pipeline for comparing ensemble credit scoring approaches with interpretability, providing a methodological benchmark for similar evaluations.

2. Methods

2.1. Research Design

This study employs a comparative experimental design to systematically evaluate machine learning models for credit default prediction. The design framework comprises three sequential phases. The first phase involves model development, including individual training of four base classifiers (logistic regression, Random Forest, XGBoost, and Multilayer Perceptron) alongside one Stacked Ensemble, each with optimised hyperparameters. The second phase consists of comparative evaluation through rigorous performance assessment using multiple metrics such as accuracy, precision, recall, and AUC-ROC, complemented by statistical significance testing via McNemar’s and Friedman’s tests. The third phase focuses on interpretation, employing SHAP (Shapley Additive Explanations) values to analyse model explainability at both global feature importance and local individual prediction levels.
The design ensures controlled comparison by applying identical preprocessing pipelines, validation strategies, and evaluation criteria across all models. This methodological consistency isolates algorithmic differences as the primary variable under investigation while minimising confounding factors related to data preparation or evaluation procedures. The comparative approach enables direct benchmarking of ensemble methods against individual classifiers and facilitates analysis of trade-offs between predictive performance, computational complexity, and model interpretability within the credit scoring domain.

2.2. Data Source and Preprocessing

The German Credit Dataset [11] was used for this study, containing 1000 credit instances with 20 predictive variables, including demographic, financial, and credit history variables. The binary target variable indicates default (1) or non-default (0). The dataset was divided using a single stratified 70–30 train–test split, ensuring both sets maintained the original 30:70 default/non-default ratio. This approach, while common in credit scoring literature, provides a single performance estimate that may have higher variance than repeated splits or cross-validation.
The dataset comprises 1000 credit applications with a 30% default rate (300 defaults, 700 non-defaults). Key predictor variables include borrower demographics (age, employment status), financial indicators (current account status, savings account, credit amount), and credit history information. All categorical variables were appropriately encoded (binary variables as 0/1, ordinal variables maintaining natural ordering) prior to analysis. Missing values were minimal (<2%) and handled via median/mode imputation.
A structured preprocessing pipeline was implemented to ensure data quality and model readiness:
  • Data Cleaning: Missing values were handled via median imputation for numerical predictor variables and mode imputation for categorical features. Outliers in numerical features were treated using the interquartile range (IQR) method.
  • Feature Encoding: Categorical variables were transformed using one-hot encoding for nominal features:
    $X_{\text{encoded}} = \text{OneHotEncode}(X_{\text{categorical}})$
    This creates binary columns for each category while avoiding artificial ordinal relationships.
  • Feature Scaling: Numerical features were standardised using Z-score normalisation to ensure consistent scaling across features:
    $Z = \frac{X - \mu}{\sigma}$
    where $\mu$ is the mean, and $\sigma$ is the standard deviation. This transformation centres data around zero with unit variance, improving convergence for gradient-based algorithms.
  • Class Balancing: To address the inherent class imbalance in credit default data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) [21]:
    $x_{\text{new}} = x_i + \lambda \cdot (x_{zi} - x_i)$
    where $\lambda \sim U(0, 1)$ is a random number between 0 and 1, and $x_{zi}$ is a randomly selected nearest neighbour from the minority class. This generates synthetic minority samples along line segments in feature space.
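The interpolation step above can be sketched directly. The following is a minimal illustration of the SMOTE formula with made-up sample values (the study itself used the imbalanced-learn implementation; the function name and data here are purely illustrative):

```python
import numpy as np

def smote_sample(x_i, neighbours, rng):
    """Generate one synthetic minority sample on the line segment between
    x_i and a randomly chosen minority-class neighbour, following
    x_new = x_i + lambda * (x_zi - x_i) with lambda ~ U(0, 1)."""
    x_zi = neighbours[rng.integers(len(neighbours))]
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_zi - x_i)

rng = np.random.default_rng(42)
x_i = np.array([1.0, 2.0])                       # a minority-class instance
neighbours = np.array([[2.0, 4.0], [0.0, 1.0]])  # its nearest minority neighbours
x_new = smote_sample(x_i, neighbours, rng)
# x_new lies on the segment between x_i and the chosen neighbour.
```

Because $\lambda \in (0, 1)$, every synthetic point stays inside the convex hull of the minority sample and its neighbour, which is why SMOTE does not extrapolate beyond observed minority behaviour.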

2.3. Predictive Models

We evaluated four distinct classifier types to capture a range of modelling approaches.

2.3.1. Logistic Regression (LR)

Logistic regression serves as an interpretable baseline model widely adopted in credit scoring due to its probabilistic framework and regulatory acceptance [2]. The model estimates the probability of default using the sigmoid function, which maps a linear combination of predictor variables to the [0,1] interval:
$P(Y = 1 \mid X) = \dfrac{1}{1 + e^{-(\beta_0 + \sum_{i=1}^{p} \beta_i X_i)}}$
where $\beta_0$ is the intercept term, $\beta_i$ are the feature coefficients, and $X_i$ are the predictor variables. In credit risk applications, the coefficients provide directly interpretable measures of how each borrower characteristic influences default likelihood. This transparency is particularly valuable for regulatory compliance and for explaining decisions to applicants. However, the model assumes linearity in the log-odds, which may not fully capture complex interactions present in borrower behaviour.
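The sigmoid mapping can be stated in a few lines of code. The coefficients below are illustrative placeholders, not values fitted to the German Credit data:

```python
import numpy as np

def default_probability(x, beta0, beta):
    """P(Y=1|X) as the sigmoid of the linear predictor beta0 + beta . x."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two standardised features:
beta0, beta = -1.0, np.array([0.8, -0.5])
x = np.array([1.2, 0.4])
p = default_probability(x, beta0, beta)  # a probability in (0, 1)
```

A zero linear predictor maps to exactly 0.5, which is why the conventional classification threshold coincides with the sign of $\beta_0 + \sum_i \beta_i X_i$.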

2.3.2. Random Forest (RF)

Random Forest constitutes an ensemble learning method that constructs multiple decision trees through bootstrap aggregation (bagging) to mitigate overfitting—a common concern in credit scoring with limited historical data [15]. Each tree is grown using a random subset of both observations (via bootstrapping) and predictor variables, introducing diversity that enhances generalisation. The final prediction for a given applicant is determined by majority voting across all trees:
$\hat{y} = \text{mode}\{h_1(X), h_2(X), \ldots, h_T(X)\}$
where $h_t(X)$ represents the prediction of the $t$-th decision tree, and $T$ is the total number of trees in the forest. Each tree is trained on a random subset of features and data samples. For credit risk assessment, Random Forest offers several advantages: it handles mixed data types (categorical and numerical) naturally, requires minimal preprocessing, and provides built-in measures of variable importance [22]. The method is particularly effective at capturing non-linear relationships and interactions between borrower characteristics without explicit specification, though it sacrifices some interpretability compared to simpler models.
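The majority-vote aggregation reduces to taking the mode of the per-tree labels. A minimal sketch (the study used scikit-learn's RandomForestClassifier, which performs this internally via probability averaging):

```python
import numpy as np

def majority_vote(tree_preds):
    """Aggregate per-tree class labels by majority vote (the mode);
    ties resolve to the smallest class label via argmax."""
    return int(np.argmax(np.bincount(np.asarray(tree_preds))))

# Five hypothetical trees voting on one applicant (1 = default):
vote = majority_vote([1, 0, 1, 1, 0])
```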

2.3.3. Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) represents a high-performance implementation of gradient boosting that sequentially builds decision trees to correct errors from previous iterations [23]. This approach has demonstrated particular effectiveness in modern credit risk prediction due to its sophisticated handling of class imbalance and complex predictor interactions characteristic of financial data. The objective function combines a differentiable loss term with regularisation to control model complexity:
$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$
where $l$ is the differentiable loss function, $f_t$ is the tree built at iteration $t$, and the regularisation term $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$ controls model complexity through $\gamma$ (minimum loss reduction), $T$ (number of leaves), and $\lambda$ (L2 regularisation). In credit scoring contexts, XGBoost’s early stopping mechanism helps prevent overfitting, while its handling of missing values aligns with real-world credit application data, where some information may be incomplete. The algorithm also provides measures of predictor importance, though these are less immediately interpretable than logistic regression coefficients.
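The regularisation term $\Omega(f_t)$ can be computed directly from a tree's leaf weights, which makes its complexity penalty concrete. The leaf weights and hyperparameter values below are arbitrary examples:

```python
import numpy as np

def xgb_regulariser(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree
    with T leaves and leaf-weight vector w."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

# A hypothetical 3-leaf tree with gamma = 1 and lambda = 2:
omega = xgb_regulariser([0.5, -0.2, 0.1], gamma=1.0, lam=2.0)
# = 1*3 + 0.5*2*(0.25 + 0.04 + 0.01) = 3.3
```

Larger $\gamma$ penalises additional leaves (tree depth), while larger $\lambda$ shrinks leaf scores, both of which trade training fit for generalisation.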

2.3.4. Multilayer Perceptron (MLP)

The Multilayer Perceptron (MLP) represents a class of feedforward artificial neural networks particularly well-suited to credit risk prediction owing to its capacity to capture complex, non-linear relationships between borrower characteristics and default probability [24]. In credit scoring contexts, MLPs can model intricate interactions between financial indicators that may be overlooked by linear approaches.
The architecture comprises an input layer, which receives the p predictor variables, one or more hidden layers that transform inputs through weighted connections, and an output layer that generates the estimated default probability. The activation of neuron j in layer l is computed as:
$a_j^{(l)} = \sigma\left(\sum_i w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}\right)$
where $a_j^{(l)}$ is the activation of neuron $j$ in layer $l$, $w_{ji}^{(l)}$ are the connection weights, $b_j^{(l)}$ are bias terms, and $\sigma$ is the ReLU activation function $\sigma(z) = \max(0, z)$.
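The layer equation above corresponds to the following forward pass for a single-hidden-layer network with a sigmoid output. The weights are untrained placeholders, chosen only to show the computation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer: ReLU hidden activations, sigmoid output,
    mirroring a_j^(l) = sigma(sum_i w_ji^(l) a_i^(l-1) + b_j^(l))."""
    h = relu(W1 @ x + b1)            # hidden-layer activations
    z = w2 @ h + b2                  # output pre-activation
    return 1.0 / (1.0 + np.exp(-z))  # estimated default probability

# Illustrative (untrained) parameters for two inputs and two hidden units:
x = np.array([0.5, -1.0])
W1 = np.eye(2)
b1 = np.zeros(2)
w2 = np.array([1.0, 1.0])
p = mlp_forward(x, W1, b1, w2, 0.0)
```

With these values the second hidden unit is clipped to zero by the ReLU, so only the first input contributes to the output, illustrating how the non-linearity gates information flow.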

2.4. Hyperparameter Tuning and Validation Strategy

Table 1 details the specific hyperparameter configurations selected through systematic optimisation for each model. Logistic regression was finalised with an L2 penalty and a regularisation strength (C) of 1.0, indicating a standard level of constraint was optimal for this dataset. Random Forest performed best with 100 estimators and a maximum tree depth of 10, suggesting a balance between model complexity and the prevention of overfitting. XGBoost was tuned to a learning rate of 0.1, a maximum depth of 6, and used 90 percent of the data for each boosting round (subsample = 0.9), parameters which facilitate strong performance while maintaining generalisation. The Multilayer Perceptron’s optimal architecture comprised a single hidden layer of 100 neurons with an L2 regularisation (alpha) of 0.001, a structure capable of capturing non-linearities without excessive complexity.
Table 1. Hyperparameter tuning: search ranges and final selected values.
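The tuning procedure can be illustrated with scikit-learn's GridSearchCV. The synthetic data and search grid below are placeholders for exposition only; the study's actual search ranges and selected values appear in Table 1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the credit features (70/30 class mix):
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# Illustrative grid; Table 1 reports the grids actually searched.
grid = {"penalty": ["l2"], "C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]   # selected regularisation strength
```

The same pattern applies to the other base models by swapping the estimator and grid (e.g., `n_estimators` and `max_depth` for Random Forest).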

2.5. Implementation and Computational Environment

All analyses were conducted in Python 3.9, utilising a suite of specialised libraries for machine learning, statistical analysis, and data visualisation. The core machine learning framework was built using scikit-learn 1.2.2, which provided the implementations for logistic regression, Random Forest, and the Multilayer Perceptron. The Extreme Gradient Boosting algorithm was implemented using XGBoost version 1.7.4. Model interpretability was achieved through SHAP version 0.42.1 for generating both global and local explanations.
Data manipulation and preprocessing were performed with pandas 2.0.3 and NumPy 1.24.3 for efficient array operations. Scientific computing functions relied on SciPy 1.10.1. For handling class imbalance, we employed the Synthetic Minority Over-sampling Technique (SMOTE) from the imbalanced-learn library version 0.10.1. Statistical comparisons beyond standard metrics were facilitated by scikit-posthocs 0.7.0, which provided the Friedman test with Nemenyi post-hoc analysis. Visualisation of results was created using Matplotlib 3.7.1 and Seaborn 0.12.2 for enhanced aesthetic presentation.
The computational environment consisted of a standard workstation with an Intel Core i3 processor and 32 GB of RAM, running Ubuntu 22.04 LTS. To ensure reproducibility, we maintained consistent algorithmic settings across all experiments and documented all preprocessing transformations. The complete analytical pipeline, including data preprocessing, model training, evaluation, and visualisation scripts, has been structured to facilitate replication and extension of this study.

2.6. Statistical Analysis

A rigorous statistical analysis was conducted to ensure robust evaluation of model performance and to determine the significance of observed differences between classifiers. This analysis comprised two principal components: uncertainty quantification for performance metrics and formal hypothesis testing for classification outcomes.

2.6.1. Uncertainty Quantification via Bootstrapping

Confidence intervals for all reported performance metrics (Accuracy, Precision, Recall, and AUC) were estimated using non-parametric bootstrapping with 1000 resamples drawn from the held-out test set. This approach makes no distributional assumptions and provides a reliable measure of sampling variability. For a given metric $\hat{\theta}$ calculated from the original test set of size $n$, the bootstrap procedure generates $B = 1000$ replicate datasets by sampling with replacement. The distribution of the metric $\hat{\theta}^*_b$ across these replicates approximates the sampling distribution of $\hat{\theta}$. The 95% confidence interval was then derived using the percentile method:
$\text{CI}_{95\%} = \left[\hat{\theta}^*_{(0.025)},\; \hat{\theta}^*_{(0.975)}\right]$
where $\hat{\theta}^*_{(0.025)}$ and $\hat{\theta}^*_{(0.975)}$ are the 2.5th and 97.5th percentiles of the bootstrap distribution. This method offers a clear indication of the stability and reliability of each performance estimate, thereby facilitating more informed comparisons between models.
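The percentile bootstrap described above is straightforward to implement. A minimal sketch with toy predictions (the function name and data are illustrative):

```python
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, n_boot=1000, seed=0):
    """Percentile-method 95% CI for a test-set metric: resample the
    (y_true, y_pred) pairs with replacement n_boot times."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample with replacement
        stats[b] = metric_fn(y_true[idx], y_pred[idx])
    return np.percentile(stats, 2.5), np.percentile(stats, 97.5)

accuracy = lambda yt, yp: np.mean(yt == yp)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])
lo, hi = bootstrap_ci(accuracy, y_true, y_pred)
# The point estimate (0.8 here) falls inside [lo, hi].
```

The same helper applies unchanged to precision, recall, or AUC by substituting `metric_fn`.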

2.6.2. Comparative Hypothesis Testing via McNemar’s Test

To assess whether observed differences in classification outcomes between model pairs were statistically significant, McNemar’s test was applied; its contingency table is structured as in Table 2. This test is particularly suitable for paired nominal data and is widely used when comparing two classifiers on the same test set [25]. The test focuses on discordant pairs, that is, instances where the classifiers disagree.
Table 2. Contingency table for McNemar’s test.
In Table 2, $n_{01}$ represents the count of instances where Model A correctly classified a sample but Model B did not. Conversely, $n_{10}$ represents the count where Model B was correct while Model A was incorrect. These two cells contain the discordant pairs that are relevant for the test, as they directly capture the disagreement between the two classifiers. Under the null hypothesis that both classifiers have the same error rate, the test statistic follows a chi-squared distribution with one degree of freedom:
$\chi^2 = \dfrac{\left(\lvert n_{01} - n_{10} \rvert - 1\right)^2}{n_{01} + n_{10}}$
The continuity correction is applied to improve the approximation. A significant p-value (typically p < 0.05 ) leads to rejection of the null hypothesis, indicating a statistically significant difference in classification performance between the two models. This test was applied to key model comparisons, such as the Stacked Ensemble versus the best-performing base model (Random Forest), to objectively determine whether the ensemble’s improvement exceeded what could be expected by chance. Together, the use of bootstrap confidence intervals and McNemar’s test provides a comprehensive statistical framework that addresses both the uncertainty in performance estimates and the significance of comparative differences, thereby strengthening the validity of the conclusions drawn in this study.
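The continuity-corrected statistic depends only on the two discordant counts, so it can be computed in one line. The counts below are hypothetical:

```python
def mcnemar_statistic(n01, n10):
    """Continuity-corrected McNemar chi-squared statistic from the
    discordant counts n01 (A correct, B wrong) and n10 (B correct, A wrong)."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical discordant counts for two classifiers:
chi2 = mcnemar_statistic(25, 10)
# (|25 - 10| - 1)^2 / 35 = 196/35 = 5.6, exceeding the 3.841 critical
# value of chi-squared with 1 d.f., so p < 0.05.
```

In practice the study used the statsmodels/scipy ecosystem for the full test; the sketch isolates the statistic itself.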

2.6.3. Multiple Classifier Comparison via Friedman’s Test with Nemenyi Post-Hoc Analysis

To complement the pairwise comparisons provided by McNemar’s test, we employed Friedman’s non-parametric test followed by Nemenyi’s post-hoc analysis for a comprehensive comparison across all five competing models. This approach addresses the multiple comparisons problem inherent in evaluating several classifiers simultaneously, while making no assumptions about the underlying distributions of performance metrics [26].
Friedman’s test is particularly suitable for comparing multiple models on the same dataset, as it accounts for the dependency introduced by evaluating all classifiers on identical data partitions [20]. The procedure ranks the models across multiple performance metrics (Accuracy, Precision, Recall, F1 Score, and AUC) based on their relative performance (with rank 1 assigned to the best performer and rank 5 to the worst for each metric). Let $r_{ij}$ denote the rank of the $j$-th model on the $i$-th performance metric, where $j = 1, \ldots, k$ with $k = 5$ models and $i = 1, \ldots, m$ with $m = 5$ performance metrics. The average rank for model $j$ across all performance metrics is calculated as:
$R_j = \dfrac{1}{m} \sum_{i=1}^{m} r_{ij}$
The Friedman test statistic is then computed as:
$\chi_F^2 = \dfrac{12m}{k(k+1)} \left[\sum_{j=1}^{k} R_j^2 - \dfrac{k(k+1)^2}{4}\right]$
Under the null hypothesis that all models perform equivalently (i.e., their average ranks are equal), $\chi_F^2$ follows a chi-squared distribution with $k - 1$ degrees of freedom. A significant p-value ($p < 0.05$) indicates that at least one model differs substantially from the others.
When the Friedman test rejects the null hypothesis, we proceed with the Nemenyi post-hoc test to identify which specific pairwise differences are statistically significant. The critical difference (CD) between two average ranks for significance at level α is:
$\text{CD}_\alpha = q_\alpha \sqrt{\dfrac{k(k+1)}{6m}}$
where $q_\alpha$ is the critical value from the Studentised range distribution with $k$ models and infinite degrees of freedom. For $\alpha = 0.05$ and $k = 5$, $q_{0.05} \approx 2.728$. Two models are considered to perform significantly differently if the absolute difference between their average ranks exceeds $\text{CD}_{0.05}$.
This combined Friedman–Nemenyi approach provides a rigorous statistical framework that complements the McNemar pairwise tests. While McNemar’s test examines specific model pairs with high sensitivity to disagreement patterns, Friedman’s test offers a holistic view of performance rankings across all models, controlling for family-wise error rates in multiple comparisons [27]. The results of this comprehensive analysis are presented in Section 3.
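The Friedman statistic can be computed directly from an $m \times k$ rank matrix. A minimal sketch (the study used scikit-posthocs for the full Friedman–Nemenyi analysis; the rank matrix below is illustrative):

```python
import numpy as np

def friedman_statistic(ranks):
    """Friedman chi-squared from an (m metrics x k models) rank matrix:
    chi2_F = 12m / (k(k+1)) * [sum_j R_j^2 - k(k+1)^2 / 4]."""
    ranks = np.asarray(ranks, dtype=float)
    m, k = ranks.shape
    R = ranks.mean(axis=0)  # average rank R_j of each model
    return 12.0 * m / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)

# Two metrics ranking three models identically (strong agreement):
stat = friedman_statistic([[1, 2, 3],
                           [1, 2, 3]])
```

When every metric ranks the models the same way the statistic is maximal for that $m$ and $k$; when the rankings cancel out (e.g., reversed across metrics) it drops to zero.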

2.7. Ensemble Framework

2.7.1. Stacked Ensemble

To leverage complementary strengths of individual models, we constructed a two-level stacking architecture following recent advances in heterogeneous ensemble learning [16]. Base model predictions were combined using a logistic regression meta-learner, an approach shown to effectively capture non-linear decision boundaries while maintaining interpretability [28]:
$P_{\text{final}} = \sigma\left(\alpha_0 + \sum_{m=1}^{M} \alpha_m P_m\right)$
where $P_m$ represents the predicted probability from base model $m$, $\alpha_m$ are the meta-learner coefficients estimated via maximum likelihood on the validation set, and $M = 4$ is the number of base models.
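The meta-learner combination is itself a logistic regression over the base-model probabilities. The coefficients below are illustrative, not the fitted values from the study:

```python
import numpy as np

def stacked_probability(base_probs, alpha0, alphas):
    """Meta-learner output P_final = sigmoid(alpha0 + sum_m alpha_m * P_m)."""
    z = alpha0 + np.dot(alphas, base_probs)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical base-model probabilities for one applicant
# (order: LR, RF, XGBoost, MLP) and made-up meta coefficients:
base_probs = np.array([0.62, 0.71, 0.58, 0.66])
p_final = stacked_probability(base_probs, alpha0=-2.0,
                              alphas=np.array([1.0, 1.2, 0.8, 1.0]))
```

In practice this corresponds to scikit-learn's `StackingClassifier` with `final_estimator=LogisticRegression()`, which also handles the out-of-fold generation of the base predictions.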

2.7.2. Class-Imbalance Handling via Weighting

Recognising the asymmetric financial impact of misclassification in credit risk (where false negatives are more costly than false positives), we developed a class-imbalance-aware variant. Class weights were incorporated into the loss functions:
$w_j = \dfrac{N}{2 \cdot N_j}$
where $N$ is the total number of training samples, and $N_j$ is the number of samples in class $j$. This weighting scheme increases the influence of minority-class (default) instances during training [28].
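For the class proportions in this dataset the weights work out as follows; with two classes the formula matches scikit-learn's `class_weight='balanced'` heuristic:

```python
def class_weights(counts):
    """w_j = N / (2 * N_j) for a binary problem, given per-class counts."""
    N = sum(counts.values())
    return {j: N / (2 * n_j) for j, n_j in counts.items()}

# 700 non-defaults (class 0) vs 300 defaults (class 1):
w = class_weights({0: 700, 1: 300})
# Defaults receive roughly 2.3x the weight of non-defaults during training.
```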

2.8. Model Evaluation and Threshold Selection

2.8.1. Performance Metrics

Models were evaluated on a hold-out test set using comprehensive metrics:
  • Accuracy $= \dfrac{TP + TN}{TP + TN + FP + FN}$;
  • Precision $= \dfrac{TP}{TP + FP}$ (higher precision implies fewer false positives);
  • Recall $= \dfrac{TP}{TP + FN}$ (higher recall implies fewer false negatives);
  • F1-Score $= \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (harmonic mean of precision and recall);
  • ROC-AUC: area under the Receiver Operating Characteristic curve, measuring overall discriminative ability.
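All of the threshold-dependent metrics above derive from the four confusion-matrix counts; a compact sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical test-set counts (not the study's results):
m = classification_metrics(tp=80, tn=60, fp=20, fn=40)
# accuracy 0.70, precision 0.80, recall ~0.667, F1 ~0.727
```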

2.8.2. Threshold Optimisation

The default 0.5 classification threshold was optimised for each model using Youden’s J statistic [29]:
J = Sensitivity + Specificity − 1
The threshold that maximises J was selected, representing the optimal trade-off between identifying true defaults (sensitivity) and correctly classifying non-defaults (specificity) for credit risk applications.
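A sketch of this threshold search over the ROC curve (toy scores, not the study's predictions); since J = Sensitivity + Specificity − 1 = TPR − FPR, the optimal threshold is the one maximising TPR − FPR:

```python
# Selecting the decision threshold that maximises Youden's J statistic.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1, 0.65, 0.35])

# drop_intermediate=False keeps every candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob, drop_intermediate=False)
j = tpr - fpr  # Youden's J = sensitivity + specificity - 1 = TPR - FPR
best = np.argmax(j)
print(f"Optimal threshold: {thresholds[best]:.2f} (J = {j[best]:.2f})")
```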

2.9. Model Interpretability

SHAP Analysis

For model transparency and explanation, we employed Shapley Additive Explanations (SHAP) [7]. SHAP values provide a unified approach to predictor variable importance based on cooperative game theory:
φ_i = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] · [ f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) ]
where F is the complete set of predictor variables, S is a subset of predictor variables excluding i, and the difference f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) represents the marginal contribution of predictor variable i when added to subset S. This provides both global predictor variable importance and local explanations for individual predictions.
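The formula above can be evaluated exactly for small feature sets. The brute-force sketch below (an illustration, not the optimised shap library used in the study) approximates f_S by averaging the model output over a background sample for the features outside S, and checks the result against the known closed form for a linear model, φ_i = β_i (x_i − E[x_i]):

```python
# Exact Shapley values for a toy 3-feature model, computed directly
# from the combinatorial formula above.
import itertools
import math
import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values phi_i for one instance x."""
    n = len(x)
    phi = np.zeros(n)

    def f(S):
        # Expected prediction with features in S fixed to x,
        # remaining features drawn from the background sample.
        Xb = background.copy()
        Xb[:, list(S)] = x[list(S)]
        return predict(Xb).mean()

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                phi[i] += w * (f(S + (i,)) - f(S))
    return phi

# Toy linear "credit score": for a linear model, Shapley values reduce
# to coef * (x - background mean), which gives a sanity check.
coef = np.array([0.153, 0.064, 0.063])  # illustrative weights only
predict = lambda X: X @ coef
rng = np.random.default_rng(0)
background = rng.normal(size=(100, 3))
x = np.array([1.0, -0.5, 2.0])

phi = shapley_values(predict, x, background)
print(phi)  # equals coef * (x - background.mean(axis=0)) up to float error
```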

3. Results

3.1. Base Model Performance

The comparative performance of all models under baseline conditions is summarised in Table 3. The Stacked Ensemble demonstrated superior overall discriminative capability, achieving the highest area under the curve (AUC: 0.761), which serves as our primary performance metric. It also attained the highest recall (0.806), indicating the strongest ability to identify actual defaulters, while maintaining robust precision (0.783). The Random Forest model achieved the highest accuracy (0.736), making it a reliable classifier under standard conditions, though with a marginally lower AUC (0.749). These baseline results assume equal costs for Type I and Type II errors.
Table 3. Performance comparison of base models.

3.2. Cost Sensitive Model Performance

To reflect the asymmetric financial reality of credit risk assessment, where the cost of an undetected defaulter (false negative) significantly exceeds that of a missed lending opportunity (false positive), we applied cost-sensitive learning. The results in Table 4 demonstrate a strategic recalibration of model behaviour. The Random Forest model became the most sensitive detector of defaulters, achieving the highest recall (0.823). Importantly, the Stacked Ensemble maintained its position as the best performing model based on the primary AUC metric (0.761), while also providing an excellent balance between precision (0.783) and recall (0.806). The logistic regression model evolved into a highly conservative classifier with the highest precision (0.825), making it suitable for scenarios requiring high confidence in positive predictions, albeit with reduced coverage of defaulters.
Table 4. Cost-sensitive model performance comparison.

3.3. Integrated Model Performance

For scenarios demanding the highest possible confidence in risk predictions, such as regulatory compliance or resource-intensive collection processes, we integrated cost-sensitive learning with threshold optimisation. The results in Table 5 show that this approach yielded models with exceptional precision. The Random Forest model achieved the highest precision (0.877), indicating that approximately 88 % of its high-risk predictions were correct. Notably, the Stacked Ensemble once again demonstrated superior overall discriminative ability, maintaining the highest AUC score (0.761) among all models, while also achieving very high precision (0.872). This consistent performance across different evaluation frameworks confirms the Stacked Ensemble as the most robust model, capable of optimal adaptation to varying risk management priorities.
Table 5. Integrated cost-sensitive and threshold-optimised model performance.

3.4. Base Model Error Analysis

Confusion matrix analysis for base models (Table 6) revealed that the Stacked Ensemble achieved balanced performance with 141 true positives and 36 true negatives. Logistic regression demonstrated the most conservative approach with the fewest false positives (24), while Random Forest achieved the highest true positive count (144).
Table 6. Confusion matrices for base models.

3.5. Cost-Sensitive Model Error Analysis

After cost-sensitive learning and threshold optimisation (Table 7), significant redistribution of errors occurred. The Stacked Ensemble reduced false positives to 16 while increasing false negatives to 66. Similarly, Random Forest decreased false positives to 15 while increasing false negatives to 68, indicating a strategic shift toward risk-averse classification.
Table 7. Confusion matrices for cost-sensitive models.

3.6. ROC Curve Analysis

ROC analysis (Figure 1) confirmed the Stacked Ensemble’s superior and consistent discriminatory power across all experimental conditions (AUC = 0.761). Random Forest (AUC = 0.749) and logistic regression/XGBoost (AUC = 0.733) demonstrated competitive but inferior performance, with consistent rankings maintained throughout optimisation procedures.
Figure 1. ROC curves demonstrating consistent discriminatory capability across models.

3.7. Statistical Significance via Friedman’s Test

Table 8 presents average ranks from Friedman’s test across performance metrics. The Stacked Ensemble achieved the best average rank (1.4), followed by Random Forest (2.8). The Friedman test yielded χ² = 11.68 (p = 0.0199), indicating significant performance differences. Post-hoc analysis revealed that the Stacked Ensemble significantly outperformed the MLP but not Random Forest.
Table 8. Friedman test average ranks.

3.8. Global SHAP Interpretability

SHAP analysis (Figure 2) identified Current Account status (mean |SHAP| = 0.153) as the predominant predictor across all models, followed by Loan Duration (0.064) and Savings Account (0.063). Credit History (0.046) and Credit Amount (0.038) demonstrated moderate influence, with consistent importance patterns observed throughout model variations.
Figure 2. Global SHAP feature importance plot for the cost-sensitive stacked ensemble.

3.9. Local SHAP Interpretability

Individual case analysis (Figure 3) for a high-risk applicant revealed that negative Current Account status, limited savings, high instalment rate, large credit amount, and poor credit history collectively contributed to the default prediction. This interpretability remained consistent across all model configurations, providing transparent decision justification.
Figure 3. Local SHAP explanation for a high-risk borrower.

4. Discussion

This study demonstrates that a Stacked Ensemble approach provides a more effective framework for predicting credit default than individual baseline models. This supports existing research, which shows that ensemble methods improve generalisation by reducing the variance and bias found in single models [2,4]. The ensemble’s strong AUC (0.761) and recall (0.806) indicate that combining different modelling approaches, from the interpretable linear logic of logistic regression to the complex pattern recognition of tree-based methods and neural networks, creates a more robust and sensitive predictive system.
The statistical significance analysis via Friedman’s test provides important nuance to these performance comparisons. While the Stacked Ensemble achieved the highest average rank, its advantage over Random Forest did not reach conventional statistical significance ( p > 0.05 ). This finding has practical implications for credit risk modelling: financial institutions may opt for the simpler, more interpretable Random Forest model if the marginal predictive gain from stacking does not justify the additional complexity and reduced transparency. The result challenges the assumption that increasingly sophisticated ensembles invariably yield substantially better performance in credit scoring applications.
The SHAP interpretability analysis offered clear, practical insight into how the model makes decisions. It showed that immediate liquidity indicators, specifically Current Account status and Savings Account balance, were the strongest predictors of default risk, followed by the Loan Duration. This order of importance fits with core principles of credit analysis, where current financial health often matters more than historical factors or loan size. The relatively small role played by Credit Amount differs from some previous studies, highlighting how the importance of specific predictor variables can vary between datasets and lending environments.
A key contribution of this work is its direct engagement with the necessary trade-offs in credit risk assessment. As noted by [14], choosing a model is not just a technical task but a strategic one that balances financial risk against social and commercial goals. Our results make this concrete: logistic regression, with its high precision and low false positive rate, acts as a conservative tool suitable for situations where the cost of a bad loan is severe. In contrast, the Stacked Ensemble and Random Forest both achieve higher recall, prioritising the identification of potential defaulters, a strategy that may be preferable for institutions focused on risk control, even if it means some creditworthy applicants are initially declined. Method selection in credit scoring thus involves trade-offs between risk sensitivity and specificity, with different institutional priorities favouring different model characteristics.
Our implementation details, while technical, have important reproducibility implications for credit scoring research. By specifying exact software versions, we facilitate exact replication of our results, a crucial consideration given the regulatory sensitivity of credit models. Future studies should similarly prioritise reproducibility to enable proper benchmarking and regulatory scrutiny of proposed methodologies.
The study also reveals that algorithmic performance is highly situational. The result that Random Forest (AUC = 0.749) outperformed the often favoured XGBoost (AUC = 0.733) on this particular dataset questions the assumption of a fixed hierarchy among algorithms. It suggests that factors like dataset size, noise, and the nature of the relationships between features can make simpler, more robust ensemble methods more effective than complex boosting techniques. This reinforces the need for empirical, detailed model selection rather than relying on theoretical preferences.
The single stratified split used in our validation, while computationally efficient, represents a methodological limitation that future work should address. Credit applications inherently possess temporal dependencies, and a more rigorous validation scheme incorporating temporal splits or walk-forward validation would better simulate real-world deployment conditions. Nevertheless, our consistent findings across multiple performance metrics provide confidence in the relative rankings of the evaluated approaches.

5. Conclusions

5.1. Overview of the Study

This research addressed the methodological challenge of balancing predictive accuracy with interpretability in credit risk model evaluation. It developed and tested a hybrid framework that combines a Stacked Ensemble model with SHAP-based explainable artificial intelligence. The framework incorporated class imbalance handling through weighting and threshold optimisation to examine how different evaluation priorities affect model performance. Using the German Credit Dataset as a standard benchmark, this study provides a methodological framework for systematically comparing credit scoring approaches, with the approach potentially adaptable to other datasets with appropriate validation.

5.2. Synthesis of Key Findings

The analysis yielded several important findings. Most importantly, the Stacked Ensemble achieved the best average performance across multiple evaluation measures, suggesting that combining different algorithms can reduce variance and bias through complementary learning. Surprisingly, and contrary to much of the published literature, where gradient boosting techniques are typically identified as the strongest performers, Random Forest outperformed XGBoost on our dataset. This gap highlights the situation-dependent nature of algorithmic performance: dataset size, feature noise, and class imbalance may be decisive determinants of which method prevails.
Statistical analysis using the Friedman test with Nemenyi post-hoc comparisons indicated that while the Stacked Ensemble achieved the highest average performance, its advantage over Random Forest did not reach statistical significance at the 0.05 level. These findings were obtained under a single stratified holdout evaluation, the limitations of which are discussed in Section 5.3.
The statistical rigour introduced through Friedman’s test and comprehensive performance metrics provides robust evidence for these conclusions. Notably, the lack of statistical significance between the top two performers (Stacked Ensemble and Random Forest) suggests that practitioners might prioritise model interpretability and computational efficiency when selecting credit scoring models, particularly given the marginal predictive gains offered by more complex ensembles. This statistical perspective complements the practical insights from SHAP analysis, together offering a balanced view of model selection considerations in regulated financial environments.
When the additional class-imbalance handling and threshold optimisation were applied, an important observation emerged: the model with the highest ROC-AUC was not the model with the lowest expected financial cost. This divergence highlights that predictive accuracy alone does not capture the operational impact of misclassification and that cost-aware evaluation provides a more realistic foundation for decision-making in credit lending.
The interpretability analysis shed further light on model behaviour. SHAP values demonstrated that liquidity-related factors, namely current and savings account status, repayment duration, and credit history, were the main predictors of default. However, credit amount, reported as a significant predictor in other studies, was not influential in the present data. These findings emphasise the necessity of fitting credit risk models to the context of their own data, since a predictor found to be decisive in one setting may not be predictive in another. At the local level, SHAP explanations showed how the characteristics of individual borrowers shaped specific predictions, thereby enhancing transparency, supporting compliance with regulatory standards, and strengthening trust in AI-assisted decision-making systems.
The classification error analysis complements these findings. The models had to balance the cost of misclassifying defaulters, which exposes lenders to financial loss, against the cost of declining creditworthy applicants, which carries social and commercial costs. Logistic regression was conservative, producing few false positives at the cost of more false negatives, whereas the ensemble model offered a more balanced trade-off, making it better suited to institutions that prioritise overall risk control. These findings illustrate methodological trade-offs in credit scoring model development, highlighting how different evaluation priorities can lead to different model selections.
The threshold optimisation results further supported these trade-offs by showing that the default 0.50 decision boundary was rarely optimal. The optimised thresholds improved sensitivity or specificity depending on model structure, thereby demonstrating that thoughtful threshold selection is essential for aligning model outputs with institutional risk tolerance.

5.3. Limitations

This study has several limitations that should be acknowledged. First, the German Credit Dataset, while a standard benchmark, is historical and limited in size (1000 instances), which may affect the generalisability of findings to larger, contemporary datasets. Second, while SHAP provides model interpretability, it does not constitute formal fairness auditing; quantitative fairness assessment using metrics like demographic parity or equalised odds was not performed. Third, the class weighting approach represents a simplified form of cost-aware modelling; a more rigorous approach would involve institution-specific cost matrices. Finally, the confidence intervals, while informative, are based on bootstrap resampling of a single test set; repeated cross-validation would provide more robust uncertainty estimates.
Our employment of a single stratified 70–30 train–test split, while common in credit scoring literature, provides only one performance estimate that may have higher variance than repeated splits or cross-validation. This approach also assumes data are independent and identically distributed, which may not hold for temporal credit application data.
With 1000 instances, our study has adequate power to detect medium to large effects but may be underpowered for detecting small performance differences between top-performing models (e.g., Stacked Ensemble vs. Random Forest). This limitation is reflected in the non-significant Friedman test result for this specific comparison.
Our implementation used specific library versions, which may affect exact reproducibility if newer versions introduce algorithmic changes. We provide frozen environment specifications to mitigate this issue.

5.4. Contributions of the Study

This study makes contributions across theoretical, methodological, and practical domains. Theoretically, it provides a comparative evaluation of major modelling approaches within an interpretable ensemble framework, thereby challenging the assumption of a fixed hierarchy among credit risk algorithms. Methodologically, it introduces a comprehensive and reproducible Python framework for credit scoring. This framework integrates advanced predictive modelling via Stacked Ensembles with robust statistical validation, including Friedman testing and single stratified holdout evaluation. It further combines this with SHAP-based explainability and threshold optimisation for cost-sensitive decision making. Full implementation documentation is provided to ensure transparency and facilitate replication. Practically, the work delivers a structured approach that financial institutions can adopt to develop robust, transparent, and strategically aligned credit assessment systems. This supports not only regulatory compliance and internal model validation but also offers the potential to provide meaningful explanations to applicants, thereby fostering greater trust and accountability in automated lending decisions.

5.5. Future Research Directions

This work points to several useful paths for future research. First, building fairness constraints directly into the ensemble learning process could help ensure strong predictions do not come with unfair outcomes for protected groups. Second, modelling how borrower risk changes over time using survival analysis or recurrent neural networks could give a more dynamic view than a single snapshot. Third, using actual, institution-specific cost structures in the class imbalance handling would make the model training more financially realistic. Finally, testing the proposed framework on a wide range of large-scale datasets from different economic and regional contexts would thoroughly assess its general usefulness and strength.
Future studies should employ more robust validation schemes, such as nested cross-validation or temporal walk-forward validation, to better estimate model performance variance and simulate real deployment scenarios. Incorporating institution-specific cost matrices would also enhance the practical relevance of model evaluations.

5.6. Final Reflections

In conclusion, this study provides evidence that predictive accuracy, interpretability, and fairness in credit risk modelling are not a zero-sum game. Substantial progress is being made at the intersection of advanced machine learning and explainable modelling. The methodological framework offers a structured approach for evaluating credit scoring systems, though application to specific operational contexts would require appropriate domain data and validation. The framework proposed in this paper is therefore a step towards uniting technical excellence with responsibility, and can serve as a foundation for future research and practice.
The additional methodological elements introduced in this work, particularly class-imbalance-handling modelling and threshold optimisation, further reinforce the idea that machine learning systems can be shaped to reflect institutional priorities while remaining transparent and fair. These additions show that technical innovation can deepen both the practical value and ethical alignment of credit-scoring models.

Author Contributions

Conceptualisation, M.R.M. and D.M.; methodology, M.R.M.; writing—original draft preparation, M.R.M.; writing—review and editing, D.M.; visualisation, M.R.M.; supervision, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study can be found at https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data (accessed on 24 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ala’raj, M.; Abbod, M.F. Classifiers consensus system approach for credit scoring. Knowl.-Based Syst. 2016, 104, 89–105. [Google Scholar]
  2. Tripathi, D.; Shukla, A.K.; Reddy, B.R.; Bopche, G.S.; Chandramohan, D. Credit scoring models using ensemble learning and classification approaches: A comprehensive survey. Wirel. Pers. Commun. 2022, 123, 785–812. [Google Scholar] [CrossRef]
  3. Mokheleli, T.; Museba, T. Machine learning approach for credit score predictions. J. Inf. Syst. Inform. 2023, 5, 497–517. [Google Scholar]
  4. Kou, G.; Xu, Y.; Peng, Y.; Shen, F.; Chen, Y.; Chang, K.; Kou, S. Bankruptcy prediction for SMEs using transactional data and two-stage multiobjective feature selection. Decis. Support Syst. 2021, 140, 113429. [Google Scholar] [CrossRef]
  5. Begenau, J.; Farboodi, M.; Veldkamp, L. Big data in finance and the growth of large firms. J. Monet. Econ. 2018, 97, 71–87. [Google Scholar] [CrossRef]
  6. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  7. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. [Google Scholar]
  8. Farboodi, M.; Veldkamp, L. A Model of the Data Economy; Technical Report; National Bureau of Economic Research: Cambridge, MA, USA, 2021. [Google Scholar]
  9. Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453. [Google Scholar] [CrossRef]
  10. Bellotti, T.; Crook, J. Forecasting and stress testing credit card default using dynamic models. Int. J. Forecast. 2013, 29, 563–574. [Google Scholar] [CrossRef]
  11. Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136. [Google Scholar] [CrossRef]
  12. Yang, Y.; Zhang, Y.; Wang, D.; Li, X. Deep learning for credit scoring: Do convolutional neural networks really perform better? J. Bank. Financ. 2023, 146, 106727. [Google Scholar]
  13. Barocas, S.; Hardt, M.; Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities; MIT Press: Cambridge, MA, USA, 2023. [Google Scholar]
  14. Tang, Z.; Zhang, J.; Zhang, K. What-is and how-to for fairness in machine learning: A survey, reflection, and perspective. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  15. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Adv. Neural Inf. Process. Syst. 2022, 35, 507–520. [Google Scholar]
  16. Raimundo, B.; Bravo, J.M. Credit risk scoring: A stacking generalization approach. In Proceedings of the World Conference on Information Systems and Technologies, Pisa, Italy, 4–6 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 382–396. [Google Scholar]
  17. Tsai, C.F.; Hsu, Y.F. A meta-learning framework for bankruptcy prediction. J. Forecast. 2013, 32, 167–179. [Google Scholar] [CrossRef]
  18. Mittelstadt, B.; Russell, C.; Wachter, S. Explaining explanations in AI. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; ACM: New York, NY, USA, 2020; pp. 279–288. [Google Scholar]
  19. Demirgüç-Kunt, A.; Klapper, L.; Singer, D.; Ansar, S. The Global Findex Database 2021: Financial Inclusion, Digital Payments, and Resilience in the Age of COVID-19; World Bank Publications: Washington, DC, USA, 2022. [Google Scholar]
  20. Abram, G.; Brown, K.N.; Smeaton, A.F. Friedman test for detecting concept drift in machine learning models. Pattern Recognit. Lett. 2021, 150, 41–48. [Google Scholar]
  21. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; Volume 10. [Google Scholar]
  22. Biecek, P.; Burzykowski, T. Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021. [Google Scholar]
  23. Li, J.; Zhu, Y.; He, K. XGBoost-based credit risk assessment with unbalanced data: A comparative study. Appl. Soft Comput. 2022, 130, 109675. [Google Scholar]
  24. Jaroszewicz, S.; Rzepakowski, P. Neural network interpretability for credit scoring: A case study. Mach. Learn. Appl. 2022, 8, 100299. [Google Scholar]
  25. Raschka, S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv 2021, arXiv:1811.12808. [Google Scholar]
  26. Benavoli, A.; Corani, G.; Mangili, F.; Zaffalon, M. A Bayesian extension of the signed-rank test for comparing multiple algorithms. Mach. Learn. 2020, 109, 2213–2244. [Google Scholar]
  27. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Statistical Methods in Machine Learning; Wiley: Hoboken, NJ, USA, 2020. [Google Scholar]
  28. Wang, S.; Chi, G. Cost-sensitive stacking ensemble learning for company financial distress prediction. Expert Syst. Appl. 2024, 255, 124525. [Google Scholar] [CrossRef]
  29. Liu, Y.; Zhou, Z.H.; Chen, K. Threshold optimisation for classification with imbalanced data: A multi-objective approach. IEEE Trans. Knowl. Data Eng. 2022, 35, 4001–4014. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
