1. Introduction
Credit risk assessment is a critical component of the financial services sector, as it allows lenders to estimate the probability that a borrower will default on their loan obligations (
Bhattacharya et al., 2023;
Chang et al., 2024;
Y. Zhao, 2024). Accurate credit risk prediction is essential for financial institutions to mitigate losses, allocate resources more effectively, and set appropriate interest rates. Traditional credit scoring models, such as decision trees and logistic regression, have been employed extensively due to their interpretability and ease of implementation (
Ziemba et al., 2023). However, as the volume and complexity of financial data have increased, there has been a shift towards machine learning (ML) and deep learning (DL) techniques that can better capture nonlinear relationships and leverage complex patterns within data (
Shi et al., 2024;
Zhang et al., 2024).
Despite the advances in ML and DL, credit risk prediction presents several challenges, one of the most prominent being class imbalance (
Bhatore et al., 2020;
Noriega et al., 2023). In real-world credit datasets, such as those from Taiwan and Germany, there is a natural skew in the distribution of classes, with non-default (low-risk) cases significantly outnumbering default (high-risk) cases (
Jiang et al., 2023;
Z. Zhao et al., 2024). This imbalance poses difficulties for ML algorithms, which tend to be biased toward the majority class, frequently resulting in subpar performance in identifying minority class (default) cases. Addressing this imbalance is therefore crucial for developing robust and reliable credit risk models.
Current research in credit risk prediction has explored numerous techniques to handle class imbalance, including resampling methods and algorithmic adjustments. Resampling methods, such as Synthetic Minority Over-sampling Technique (SMOTE), generate synthetic samples of the minority class to balance the data distribution, while undersampling techniques, like Edited Nearest Neighbors (ENN), reduce the size of the majority class by eliminating redundant samples (
Elreedy et al., 2024;
Vairetti et al., 2024). Despite their effectiveness, these methods are not without limitations. Oversampling can lead to overfitting, especially in complex models, as it duplicates minority instances, while undersampling can discard potentially useful data from the majority class, reducing the overall model performance (
Altalhan et al., 2025). A gap remains in the literature regarding the optimal combination of resampling techniques with various ML and DL models for credit risk prediction.
This study aims to address this gap by conducting a comprehensive comparative analysis of ML and DL models with different resampling strategies on imbalanced credit datasets. Specifically, this study employs the Taiwan and German credit datasets, which are commonly used benchmarks in credit risk research and exhibit significant class imbalance. Furthermore, this study seeks to identify model–resampling pairs that offer the best balance of predictive accuracy and robustness for imbalanced credit datasets. Such insights can inform the development of more reliable credit scoring systems, ultimately contributing to better credit risk management and decision-making in the financial sector.
The main contributions of this study are summarized as follows:
A comprehensive analysis of the impact of class imbalance on the effectiveness of various ML and DL models for credit risk prediction.
Evaluation of the effectiveness of resampling techniques, in combination with both ML and DL models, on imbalanced credit datasets.
Comparative analysis across two widely used imbalanced credit datasets, the Taiwan and German datasets, providing insights into model performance in diverse credit risk scenarios.
Identification of optimal model–resampling combinations that balance accuracy and robustness.
The rest of this paper is organized in the following manner:
Section 2 presents the background and related works, highlighting existing ML and DL approaches in credit risk prediction.
Section 3 details the datasets, model architectures, resampling strategies, and experimental setup employed in this study.
Section 4 presents the results of the various models under different resampling scenarios, followed by a discussion of the findings. Finally, the study is concluded in
Section 5.
2. Related Works
2.1. Machine Learning in Credit Risk Prediction
Credit risk prediction has traditionally relied on ML techniques due to their capacity for predictive accuracy and interpretability. Among the foundational models, logistic regression (LR) remains widely used for binary classification tasks, as it provides a straightforward probabilistic framework for assessing creditworthiness (
Alam et al., 2020;
Shilpa et al., 2023;
Suhadolnik et al., 2023). In credit risk applications, logistic regression’s interpretability allows financial analysts to easily understand the impact of individual predictors, such as income level and debt-to-income ratio, on the likelihood of default. As demonstrated by
Montevechi et al. (
2024), logistic regression can effectively model linear relationships but struggles with more complex, nonlinear interactions that are common in financial datasets.
Decision trees have been adopted in credit risk assessment to address the limitations of logistic regression. Decision trees are known for their ability to capture nonlinear interactions between variables, making them suitable for more complex datasets (
Mienye & Jere, 2024). Additionally, decision trees are intuitive, as they resemble human decision-making processes, and their hierarchical structure allows for transparent credit scoring. For instance, a study by
Y. Wang and Ni (
2023) showed that decision trees could accurately classify high-risk and low-risk applicants by effectively capturing interactions among credit-related features. However, decision trees can be prone to overfitting, particularly when dealing with high-dimensional data, which can reduce generalizability on unseen data.
To enhance robustness and predictive accuracy, ensemble methods such as random forests and gradient boosting machines (GBMs) have gained popularity in recent years. Random forests leverage multiple decision trees by averaging their predictions, which mitigates overfitting and offers a more reliable model for credit risk prediction. As demonstrated by Mienye and Jere (
Mienye & Jere, 2024), random forests perform well in identifying complex interactions and capturing a wide range of credit risk factors, yielding improved accuracy over standalone decision trees. On the other hand, gradient boosting machines build sequential trees, each correcting the errors of the previous one. Techniques like XGBoost and LightGBM—both optimized implementations of gradient boosting—have shown high performance in credit risk applications because of their capacity to manage large datasets and deliver superior predictive performance, as shown by
Yu et al. (
2024).
Recently, ensemble stacking has been employed to combine the strengths of multiple ML models in credit risk prediction. Stacking involves training various base models, such as decision trees, logistic regression and SVM, and then using their predictions as inputs for a meta-model that generates the final prediction. This approach has been shown to yield high predictive accuracy and robust performance across various credit datasets. A recent study by
Yin et al. (
2023) highlights the advantages of stacking for credit risk prediction, where the technique achieved superior performance by utilizing the diverse strengths of multiple base models. Despite the computational complexity of stacking, it enables practitioners to maximize the predictive power of simpler models, making it a viable approach for high-stakes financial applications.
2.2. Deep Learning Approaches
In recent years, DL techniques have become more popular in credit risk prediction by virtue of their ability to capture complex, nonlinear relationships in large datasets. Unlike traditional ML models, DL architectures can automatically learn intricate patterns in data, making them suitable for financial applications where complex interactions between variables are prevalent. One commonly used DL model is the multilayer perceptron (MLP), a fully connected neural network that has shown significant promise in predicting credit risk. Studies such as those by
Y. Liu et al. (
2024) have demonstrated that MLPs can outperform traditional ML models by capturing hidden patterns within borrower behavior and financial features.
Beyond the MLP, more sophisticated architectures like long short-term memory (LSTM) networks have been explored in credit risk prediction, particularly when temporal data is available. LSTM networks are a type of recurrent neural network (RNN) that performs well in modeling sequential data by retaining long-term dependencies. This makes them useful in credit risk scenarios where borrowers’ historical payment behavior can inform future default risk. For instance,
B. Liu et al. (
2020) utilized LSTM networks to predict defaults by analyzing past payment sequences, achieving higher predictive accuracy than static ML models. The study demonstrates the advantage of LSTM in leveraging temporal dependencies for credit risk prediction.
Another DL model that has been applied in credit risk applications is the gated recurrent unit (GRU) network, which is similar to the LSTM but has a simpler structure and fewer parameters. GRUs have been found to be computationally efficient while still retaining the capacity to model sequential data, making them an attractive alternative to LSTMs for credit risk prediction. A study by
Thor and Postek (
2024) demonstrated the robustness of the GRU for credit risk prediction, where it outperformed standard machine learning models.
Additionally, convolutional neural networks (CNNs), typically used for image processing, have been adapted for credit risk prediction. By capturing local patterns in feature space, CNNs can identify clusters of behaviors or attributes that may indicate credit risk.
Meng et al. (
2024) applied CNNs to financial data and found that they could efficiently identify risk-related patterns, particularly in large datasets with numerous features, outperforming traditional ML models.
2.3. Data Resampling for Imbalanced Data
Handling class imbalance is crucial in credit risk prediction, as most credit datasets contain a significantly lower proportion of default cases (high-risk) compared to non-default cases (low-risk). This imbalance can result in biased predictions, where models tend to predict the majority class more accurately, resulting in poor performance in identifying minority class instances. To address this, various data resampling techniques have been developed to balance the class distribution, improving model performance in detecting defaults. One widely adopted method is SMOTE, which generates synthetic samples for the minority class by interpolating between existing minority instances. As demonstrated in the work by
Dablain et al. (
2022), SMOTE effectively reduces bias towards the majority class and improves model performance in highly imbalanced datasets.
SMOTE has been extensively used in credit risk prediction, often in conjunction with machine learning and deep learning models. In a recent study,
Onasoga and Hwidi (
2024) applied SMOTE to a credit risk dataset and observed significant improvements in the recall of default predictions, indicating that the model was better able to identify high-risk applicants. However, while SMOTE is effective in enhancing recall, it can also lead to overfitting, particularly when combined with complex models like deep neural networks (
Alex, 2025). This overfitting arises because synthetic samples are generated based on existing minority instances, which may lead to an artificial concentration of minority samples in feature space, reducing the model’s generalizability.
While SMOTE and its variants remain widely used, recent studies have also introduced alternatives to address the risk of overfitting. For instance, DeepSMOTE leverages deep neural networks to generate more realistic minority samples (
Dablain et al., 2022), while cost-sensitive learning adjusts model training to penalize misclassification of the minority class more heavily (
Mienye et al., 2024). These approaches represent complementary strategies to traditional resampling and have shown promise in improving generalizability in highly imbalanced credit datasets.
Another approach to managing class imbalance is undersampling, where the majority class is reduced to balance the dataset. One popular undersampling technique is ENN, which removes majority class samples that are misclassified by their nearest neighbors (
Xu et al., 2020). This approach has been shown to reduce redundancy in the majority class, allowing models to better differentiate between high-risk and low-risk cases. The study by
Nizam-Ozogur and Orman (
2024) highlighted the effectiveness of ENN in enhancing model performance in imbalanced datasets by focusing on cleaner and more relevant samples. In credit risk prediction, ENN has been found to enhance precision and reduce false positives by eliminating noisy samples in the majority class, as demonstrated by
Xing et al. (
2024).
While oversampling and undersampling techniques are effective individually, recent research has explored combining them to maximize their benefits. For example, SMOTE-ENN combines SMOTE’s synthetic oversampling with ENN’s undersampling, creating a more balanced dataset with reduced noise. This hybrid approach has been applied in credit risk prediction, where it has shown improved recall and precision in identifying defaults compared to standalone resampling methods.
Zhu et al. (
2023) demonstrated that SMOTE-ENN effectively enhanced model robustness in an imbalanced credit risk dataset.
As seen from the reviewed papers in this section, credit risk prediction is a crucial area where both traditional ML and DL approaches have been extensively applied. While traditional ML models such as logistic regression and decision trees are valued for their interpretability and simplicity, they often underperform when modeling complex, nonlinear interactions inherent in financial datasets. Advances in ensemble techniques like random forests and gradient boosting machines have addressed these limitations to some extent, yet challenges remain in optimizing performance on highly imbalanced datasets. Conversely, DL models, including LSTMs, GRUs, and CNNs, have demonstrated strong capability in capturing intricate patterns and temporal dependencies in different applications. However, the practical effectiveness of these models is often hindered by class imbalance, as the minority default cases are typically overshadowed by the majority non-default cases. Resampling strategies such as SMOTE and ENN have been employed to mitigate this imbalance, but questions about the optimal combination of model architectures and resampling strategies persist.
Therefore, this study bridges these gaps by conducting a systematic evaluation of ML and DL models for credit risk prediction, focusing on their performance under varying resampling techniques. The contributions of this work are threefold: firstly, it provides a comparative analysis of traditional ML models, advanced ensemble methods, and state-of-the-art DL architectures under real-world conditions of class imbalance; secondly, it integrates and evaluates resampling methods such as SMOTE, ENN, and SMOTE-ENN to identify their impact on model performance; and thirdly, it utilizes SHAP-based interpretability techniques to explain the contribution of key features in model predictions, ensuring transparency in high-stakes financial decision-making. The findings of this study can guide practitioners in selecting effective model–resampling combinations, advancing the field of credit risk prediction in both research and practical applications.
3. Methodology
3.1. Credit Risk Datasets
This study utilizes two publicly accessible credit datasets: the German Credit dataset and the Taiwan Credit dataset. Each dataset provides valuable insights for credit risk prediction and presents unique characteristics that enhance the robustness of the analysis.
3.1.1. German Credit Dataset
The German Credit dataset, shown in
Table 1, contains 1000 instances with 20 attributes encompassing personal and financial information such as age, credit amount, loan duration, and loan purpose (
Boughaci et al., 2021). The target variable classifies applicants as good credit risk (1) or bad credit risk (0). This dataset exhibits class imbalance, with about 70% labeled as good credit and 30% as bad credit. It is widely used as a benchmark in credit risk modeling to assess the performance of different algorithms.
3.1.2. Taiwan Credit Dataset
The Taiwan Credit dataset, also known as the UCI Credit Card Default dataset, includes 30,000 instances with 23 features related to demographic information, bill statements, and payment history. The target variable indicates whether the client will default (1) or not (0) on their next payment. This dataset is significantly imbalanced, with approximately 22.1% defaults and 77.9% non-defaults. Its large size and comprehensive features make it particularly useful for developing and testing complex models such as DL architectures. The attributes in the dataset are described in
Table 2.
As summarized in
Table 3, both the German and Taiwan credit datasets are widely adopted benchmarks in credit risk modeling research. Each dataset reflects specific socio-economic and institutional contexts: the German dataset is relatively small and dated, representing lending behavior in a European setting, while the Taiwan dataset is larger and more recent but restricted to credit card clients from a single Asian economy. As such, conclusions drawn from these datasets may not fully capture the heterogeneity of global credit markets, where borrower demographics, regulatory environments, and financial products differ substantially. Therefore, the reported findings are interpreted as indicative rather than universally representative.
The decision to employ these two datasets was motivated by their prominence as benchmarking tools in the literature and their complementary characteristics. The German dataset provides a compact but highly imbalanced sample that is frequently used to test baseline algorithms, whereas the Taiwan dataset offers scale and complexity suitable for training more expressive deep learning models. Together, they allow for a comparative analysis across dataset sizes, levels of feature richness, and class imbalance ratios.
3.2. Deep Learning Model Architectures
This section describes the model architectures used in this study, focusing on several DL models that are well-suited for credit risk prediction.
3.2.1. Multilayer Perceptron
The MLP architecture is a fully connected neural network with an input layer, multiple hidden layers, and an output layer. In credit risk prediction, MLPs are advantageous for capturing complex, nonlinear relationships among features (
Shang et al., 2023). Given an input vector $\mathbf{x}$, the MLP computes each layer’s output through an affine transformation followed by a nonlinear activation function. The output of the $j$-th neuron in layer $l$ can be represented as:

$$a_j^{(l)} = \phi\left( \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right)$$

where $\phi(\cdot)$ is a nonlinear activation function (e.g., ReLU), $w_{ij}^{(l)}$ represents the weight between the $i$-th neuron in layer $l-1$ and the $j$-th neuron in layer $l$, and $b_j^{(l)}$ is the bias term (
Li et al., 2025). The MLP’s fully connected layers enable it to learn general patterns in credit data, making it effective for tasks like credit scoring when applied to structured data.
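To make the layer equation concrete, the forward pass can be sketched in NumPy. This is an illustrative toy example with random placeholder weights, not the study’s implementation:

```python
import numpy as np

def dense_layer(a_prev, W, b, activation):
    """Compute a_j = phi(sum_i w_ij * a_i + b_j) for a whole layer at once."""
    return activation(a_prev @ W + b)

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # 4 applicant features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

hidden = dense_layer(x, W1, b1, relu)         # hidden representation
p_default = dense_layer(hidden, W2, b2, sigmoid)[0]  # probability-like output
```

The sigmoid output layer maps the hidden representation to a value in (0, 1), mirroring how an MLP would score default probability on structured credit data.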
3.2.2. Convolutional Neural Network
CNNs, originally developed for image processing, have been adapted for credit risk prediction by treating financial data as structured “grids” where patterns are extracted using convolutional layers (
Meer et al., 2024). A CNN applies a series of convolutional filters to capture local patterns in data, which can be useful for identifying clusters of behaviors indicative of credit risk (
Krichen, 2023;
Salehi et al., 2023). The 2-D discrete convolution of an input
with a filter
is written compactly as
where
denotes the output at location
,
is the index set covering the finite support of the kernel
, and ∗ denotes discrete convolution (with standard padding/stride as specified in the implementation). By stacking multiple convolutional and pooling layers, CNNs can automatically learn hierarchical representations, making them suitable for large credit datasets with numerous interrelated features.
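The convolution sum can be verified with a small NumPy sketch. This naive loop implementation is for illustration only; real CNN libraries use optimized kernels:

```python
import numpy as np

def conv2d_valid(X, K):
    """Discrete 2-D convolution (valid mode): flip the kernel, then slide it."""
    Kf = K[::-1, ::-1]                        # kernel flip distinguishes convolution
    kh, kw = Kf.shape                         # from cross-correlation
    H, W = X.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * Kf)
    return out

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])       # a simple diagonal-difference filter
result = conv2d_valid(X, K)
```

Here the filter responds to diagonal differences in the input grid, the same mechanism by which convolutional layers pick out local patterns across neighboring features.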
3.2.3. LSTM Networks
LSTM networks, a type of RNN, are particularly effective for sequential data where long-term dependencies are relevant. In credit risk prediction, LSTMs can capture patterns in borrowers’ payment history over time, helping to assess default risk based on temporal behavior (
Alvi et al., 2024). The LSTM cell updates its hidden state by maintaining a memory cell $c_t$ at each time step $t$, controlled by three gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. These gates are defined as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state, and $\sigma(\cdot)$ represents the sigmoid activation function. These equations follow the original formulation of LSTM introduced by
Hochreiter and Schmidhuber (
1997). This structure allows LSTMs to capture dependencies across different time steps, which is beneficial for time-series data in credit risk analysis.
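A single LSTM step following these gate equations can be sketched in NumPy. The weight matrices here are random placeholders for illustration; trained implementations use learned parameters and optimized kernels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of the LSTM gate equations."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    h_t = o_t * np.tanh(c_t)                                   # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {name: rng.normal(scale=0.1, size=(n_hid, n_in if name[0] == "W" else n_hid))
     for name in ("Wi", "Wf", "Wo", "Wc", "Ui", "Uf", "Uo", "Uc")}
p.update({name: np.zeros(n_hid) for name in ("bi", "bf", "bo", "bc")})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(3):                            # feed a 3-step payment sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, p)
```

The memory cell `c` carries information across steps, which is what lets the model relate earlier payment behavior to later default risk.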
While the German dataset lacks explicit temporal attributes, we include LSTM for benchmarking, treating static features as a single-time-step sequence. This allows evaluation of its generalization on non-sequential data, revealing limitations (e.g., potential overfitting to assumed dependencies) that inform model selection for similar tabular datasets.
3.2.4. GRU Networks
The GRU is a condensed form of the LSTM that retains the capacity to express sequential relationships while reducing computing complexity by combining the input and forget gates into a single update gate (
Niu et al., 2023). In GRUs, the hidden state $h_t$ at time $t$ is updated based on the reset gate $r_t$ and the update gate $z_t$, defined as:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\tilde{h}_t$ is the candidate activation (
Niu et al., 2023). GRUs have been shown to perform comparably to LSTMs in different applications with sequential data while being less computationally demanding, as highlighted in various recent studies.
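The GRU update can be sketched analogously in NumPy (random placeholder weights, for illustration only). Note that it needs only two gates and no separate memory cell, which is the source of its lower computational cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One time step of the GRU update equations."""
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # blended hidden state

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
p = {name: rng.normal(scale=0.1, size=(n_hid, n_in if name[0] == "W" else n_hid))
     for name in ("Wr", "Wz", "Wh", "Ur", "Uz", "Uh")}
p.update({name: np.zeros(n_hid) for name in ("br", "bz", "bh")})

h = np.zeros(n_hid)
for t in range(3):                            # a short payment-status sequence
    h = gru_step(rng.normal(size=n_in), h, p)
```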
Although both the German and Taiwan datasets are primarily tabular, their feature composition provides a rationale for exploring sequential architectures such as LSTMs and GRUs. In particular, the Taiwan dataset includes six consecutive months of repayment status (PAY_0 to PAY_6), bill amounts (BILL_AMT1–6), and payment amounts (PAY_AMT1–6), which naturally form temporal sequences reflecting borrowers’ financial behavior over time. Modeling these attributes as ordered sequences allows recurrent models to capture dynamic payment patterns that may not be fully exploited by feed-forward networks. For the German dataset, where explicitly temporal attributes are absent, the inclusion of LSTM and GRU serves a benchmarking role, enabling comparison of their generalization ability on static features against architectures like MLP and CNN.
3.3. Traditional Machine Learning Models
This section describes the traditional machine learning models used in this study. Each model type, such as logistic regression, decision tree, support vector machine, random forest, AdaBoost and XGBoost, offers unique strengths and is widely applied in the financial domain.
3.3.1. Logistic Regression
Logistic regression is a widely used linear model for binary classification tasks, including credit risk prediction. It uses the sigmoid function to model the likelihood that a given input belongs to a specific class, with outputs constrained between 0 and 1. Given an input vector $\mathbf{x}$, the probability of the positive class is computed as:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, and $\sigma(\cdot)$ is the sigmoid activation (
Gutiérrez et al., 2010). Logistic regression is valued for its interpretability, as it allows practitioners to understand the impact of each feature on default probability. However, it may struggle with capturing complex, nonlinear relationships in credit risk data.
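As a quick illustration of the interpretability point, scikit-learn’s `LogisticRegression` exposes one fitted weight per feature. The data here is a synthetic stand-in for a credit dataset, not the study’s data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 70% "good" vs 30% "bad" credit,
# mirroring the class ratio of the German dataset
X, y = make_classification(n_samples=500, n_features=6, weights=[0.7, 0.3],
                           random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # sigma(w.x + b) per applicant
coefs = clf.coef_[0]                          # one interpretable weight per feature
```

Each entry of `coefs` shifts the log-odds of default, which is what lets analysts attribute risk to individual predictors such as income or debt-to-income ratio.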
3.3.2. Support Vector Machine
SVMs are robust classifiers that seek to identify the optimal hyperplane to maximize the margin between data points of various classes. For linearly separable data, support vector machines maximize the distance between the nearest data points (support vectors) of each class and the hyperplane (
M.-W. Huang et al., 2017). The decision boundary is defined as:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term. For cases where the data is not linearly separable, support vector machines use kernel functions (e.g., radial basis function) to project data into a higher-dimensional space where a linear separator may exist.
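As an illustrative sketch (synthetic data, not the study’s setup), scikit-learn’s `SVC` with an RBF kernel handles the nonlinearly separable case and exposes the support vectors that define the margin:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, weights=[0.7, 0.3],
                           random_state=3)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)      # RBF kernel for nonlinear boundaries
n_sv = svm.support_vectors_.shape[0]          # points that define the margin
```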
3.3.3. Decision Tree
Decision trees are non-parametric models that create a tree-like structure of decisions by recursively partitioning data according to feature values. At each node, the model chooses a feature and a threshold that best splits the data to maximize homogeneity within each partition (
Sun et al., 2024). The Gini impurity or information gain criteria are typically used to select the best splits. The prediction for an instance $\mathbf{x}$ is made by traversing the tree from the root to a leaf node, where the class label is determined. Although decision trees are intuitive and easy to interpret, they have a tendency to overfit, especially on small datasets.
3.3.4. Random Forest
Random forest is an ensemble method that combines multiple decision trees to improve classification accuracy and reduce overfitting (
Biau & Scornet, 2016). In a random forest, each tree is trained on a different bootstrap sample of the data, and at each split, a random subset of features is considered. The final prediction is determined by aggregating the predictions of all individual trees, usually by majority voting for classification tasks:

$$\hat{y} = \operatorname{mode}\{h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_T(\mathbf{x})\}$$

where $h_i(\mathbf{x})$ is the prediction of the $i$-th tree in the forest. Random forests are widely used in credit risk prediction due to their robustness, ability to capture complex patterns, and reduced overfitting compared to single decision trees (
Sun et al., 2024).
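The majority-vote idea can be inspected directly in scikit-learn, whose `RandomForestClassifier` exposes the individual trees via `estimators_` (synthetic data standing in for a credit set; note that scikit-learn itself averages class probabilities rather than counting hard votes, so this is an approximation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=10, weights=[0.78, 0.22],
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Hard votes of the individual trees on the first five applicants
votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote per applicant
```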
3.3.5. XGBoost
XGBoost is a scalable and effective gradient boosting method that makes use of a group of weak learners, usually decision trees. The purpose of each new tree is to reduce the residual errors of the ones which came prior, making XGBoost effective at handling complex datasets (
Rao et al., 2022). Given a model $\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i)$, where $f_k$ are individual decision trees, XGBoost optimizes the following objective function:

$$\mathcal{L} = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\ell$ is a differentiable loss function measuring the difference between predicted and actual values, and $\Omega$ is a regularization term that penalizes the complexity of the trees to prevent overfitting. XGBoost has shown strong performance in credit risk prediction due to its robustness and ability to handle imbalanced data effectively.
3.3.6. Adaptive Boosting
AdaBoost is a boosting technique that builds a strong classifier by combining several weak classifiers. It iteratively trains models on weighted versions of the dataset, focusing more on incorrectly classified instances in each iteration (
X. Huang et al., 2022). The final model prediction is a weighted majority vote of all classifiers, where each classifier’s weight depends on its accuracy. For an instance $\mathbf{x}$, the prediction is given by:

$$H(\mathbf{x}) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(\mathbf{x}) \right)$$

where $h_m(\mathbf{x})$ denotes the prediction of the $m$-th weak classifier for input $\mathbf{x}$, and $\alpha_m$ is the weight assigned to that classifier based on its accuracy (
Mienye & Jere, 2024). AdaBoost is known for improving model performance on difficult-to-classify instances and is often used for credit risk tasks where correctly identifying high-risk cases is critical.
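The weighted vote can be examined in scikit-learn’s `AdaBoostClassifier`, whose `estimator_weights_` attribute holds the per-classifier weights (synthetic data for illustration, not the study’s datasets):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=600, n_features=8, weights=[0.7, 0.3],
                           random_state=1)
ada = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X, y)
alphas = ada.estimator_weights_               # weight alpha_m of each weak classifier
```

More accurate weak learners receive larger entries in `alphas`, so they dominate the final vote on hard-to-classify, high-risk cases.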
3.4. Resampling Techniques
To address the class imbalance in credit risk prediction, this study explores several resampling techniques. These techniques either oversample the minority class or undersample the majority class in an effort to balance the dataset, thereby improving the model’s capacity to identify high-risk cases. The following resampling methods are employed: SMOTE, ENN, and a hybrid approach, SMOTE-ENN.
3.4.1. SMOTE
SMOTE is a popular oversampling technique that creates synthetic samples for the minority class to address class imbalance. Rather than duplicating existing samples, SMOTE increases the representation of the minority class by interpolating between existing minority samples and their nearest neighbors to create new instances (
Nizam-Ozogur & Orman, 2024;
L. Wang, 2022). For a given minority instance $\mathbf{x}_i$, a synthetic instance is generated as:

$$\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda (\mathbf{x}_{nn} - \mathbf{x}_i)$$

where $\mathbf{x}_{nn}$ is a randomly selected nearest neighbor of $\mathbf{x}_i$, and $\lambda$ is a random value in the range [0, 1]. By creating synthetic samples that lie along the line segment between $\mathbf{x}_i$ and $\mathbf{x}_{nn}$, SMOTE reduces the risk of overfitting and helps the model better learn patterns associated with the minority class.
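The interpolation formula can be sketched directly in NumPy. This is a minimal illustration of generating one synthetic sample; in practice a library implementation such as imbalanced-learn’s SMOTE would be used:

```python
import numpy as np

def smote_one(X_min, k, rng):
    """Create one synthetic minority sample: x_new = x_i + lam * (x_nn - x_i)."""
    i = rng.integers(len(X_min))
    x_i = X_min[i]
    dists = np.linalg.norm(X_min - x_i, axis=1)
    nn_idx = np.argsort(dists)[1:k + 1]       # skip index 0: the point itself
    x_nn = X_min[rng.choice(nn_idx)]
    lam = rng.random()                        # lam drawn uniformly from [0, 1)
    return x_i + lam * (x_nn - x_i)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(7)
x_new = smote_one(X_min, k=2, rng=rng)        # lies on a segment between two points
```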
3.4.2. ENN
ENN is an undersampling method that curbs the size of the majority class by removing potentially noisy or misclassified samples. For each majority class instance, ENN examines its nearest neighbors, and if it is misclassified by the majority of its neighbors, the instance is removed from the dataset (
Xu et al., 2020). This technique helps clean the majority class by eliminating overlapping or borderline cases that could lead to misclassification, enhancing the model’s focus on identifying true default cases. ENN can improve the precision of credit risk models by reducing false positives, which is beneficial in high-stakes financial applications.
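A simplified sketch of the ENN rule using scikit-learn’s k-NN follows; library implementations such as imbalanced-learn’s `EditedNearestNeighbours` are more general, so this is illustrative only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_filter(X, y, majority_label, k=3):
    """Drop majority samples whose k nearest neighbours vote against their label."""
    keep = np.ones(len(y), dtype=bool)
    for idx in np.where(y == majority_label)[0]:
        others = np.arange(len(y)) != idx              # exclude the point itself
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[others], y[others])
        if knn.predict(X[idx:idx + 1])[0] != majority_label:
            keep[idx] = False                          # neighbours disagree: remove
    return X[keep], y[keep]

# One majority (0) point sits inside the minority (1) cluster and is removed
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [5.05]])
y = np.array([0, 0, 0, 1, 1, 1, 0])
X_clean, y_clean = enn_filter(X, y, majority_label=0, k=3)
```

Only the borderline majority point inside the minority cluster is discarded, which is how ENN reduces overlap between high-risk and low-risk regions.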
3.4.3. SMOTE-ENN
SMOTE-ENN is a hybrid resampling approach that combines the strengths of SMOTE and ENN. First, SMOTE is applied to generate synthetic samples of the minority class, increasing its representation in the dataset. Next, ENN is applied to the combined dataset to remove noisy or misclassified majority class instances, effectively creating a more balanced and cleaner dataset (
Lin et al., 2021). This approach allows the model to benefit from the additional minority samples while minimizing noise from the majority class. SMOTE-ENN has shown effectiveness in credit risk prediction, as it balances recall and precision, aiding in the identification of high-risk cases while maintaining overall model robustness.
3.5. SHapley Additive exPlanations (SHAP)
SHAP is a game-theoretic approach to model interpretability that assigns importance values to features based on their contribution to individual predictions (
Lundberg & Lee, 2017). SHAP values are derived from Shapley values in cooperative game theory, where each feature’s impact is computed by averaging its marginal contribution across all possible coalitions of features. In this study, SHAP is applied post-training to compute global feature importance on the validation sets, visualized in summary plots that rank features by their average absolute SHAP value and show directional impacts (positive or negative) on predictions.
3.6. Experimental Setup
This study uses K-fold cross-validation to evaluate the models, where the dataset is split into K subsets. Every subset is used once as the validation set, while training is done using the remaining subsets, iterating through each subset to obtain an average performance metric. In this work, K was set to 10, which provides a reasonable trade-off between bias and variance, particularly given the relatively small size of the German dataset. To maintain class distribution across folds, stratified sampling was employed, and a fixed random seed was set to ensure reproducibility. Importantly, all resampling techniques (SMOTE, ENN, and SMOTE-ENN) were applied strictly within the training folds to avoid information leakage from the validation data. In addition, nested cross-validation was used for hyperparameter tuning, where an inner 5-fold stratified CV was applied within each outer training fold. This design ensures that hyperparameters are selected independently of the outer validation data, yielding a more reliable estimate of generalization performance. The outer loop evaluates overall model performance, while the inner loop optimizes hyperparameters for each outer training set. Furthermore, the following Algorithm 1 describes the proposed methodology used in this study.
The proposed approach comprises systematic steps to ensure robust model performance and interpretability of credit risk prediction. K-fold cross-validation is first employed to split the dataset into training and validation sets, ensuring a reliable and unbiased evaluation of model performance. Resampling is then conducted only on the training data of each fold, so that the validation set remains untouched and representative of the original distribution. The model is trained iteratively within each fold, and performance metrics are calculated on the corresponding validation set. Finally, SHAP analysis is carried out after training to interpret model predictions, offering information on the significance of features and how they affect model outputs. SHAP was used to compute feature contributions, following the interpretability framework introduced in (
Lundberg & Lee, 2017). The combination of these steps ensures a comprehensive evaluation framework, balancing predictive accuracy with interpretability for practical credit risk applications.
| Algorithm 1 Proposed Approach for Credit Risk Prediction |
- 1: Input: Dataset D, resampling technique R, model M, number of outer folds K
- 2: Output: Average performance metrics (i.e., accuracy, precision, recall, specificity, F1 score) and feature importance insights
- 3: Step 1: Initialize 10-fold stratified cross-validation with a fixed random seed
- 4: for k = 1 to K do
- 5: Split D into training set D_train(k) and validation set D_val(k) for fold k
- 6: Step 2: Within D_train(k), perform inner 5-fold stratified cross-validation for hyperparameter tuning
- 7: Apply resampling technique R on each inner training split only
- 8: Select optimal hyperparameters based on inner validation performance
- 9: Step 3: Retrain model M with the selected hyperparameters on the full (resampled) D_train(k)
- 10: Step 4: Evaluate M on D_val(k) to obtain performance metrics
- 11: end for
- 12: Step 5: Compute average performance metrics across all K folds
- 13: Step 6: Perform SHAP analysis to interpret the model:
- 14: Use SHAP to compute feature importance on the validation or test set
- 15: Generate SHAP summary plots to analyze feature impacts
- 16: Return: Average performance metrics and SHAP insights
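The leakage-safe fold logic of Algorithm 1 can be sketched in plain NumPy. This is an illustrative simplification, not the study's implementation: `fit`, `score`, and `resample` are hypothetical callbacks standing in for the model, metric, and SMOTE-ENN steps.

```python
import numpy as np

def stratified_folds(y, k, seed=42):
    """Stratified K-fold indices: shuffle within each class, then deal
    samples round-robin so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        for j, i in enumerate(rng.permutation(np.where(y == cls)[0])):
            folds[j % k].append(i)
    return [np.array(sorted(f)) for f in folds]

def nested_cv(X, y, fit, score, resample, grid, k_outer=10, k_inner=5):
    """Nested CV as in Algorithm 1: the resampler and the tuner only ever
    see training data; each outer validation fold stays untouched."""
    outer_scores = []
    for val in stratified_folds(y, k_outer):
        train = np.setdiff1d(np.arange(len(y)), val)

        def inner_score(h):                        # mean score over inner folds
            scores = []
            for iv in stratified_folds(y[train], k_inner, seed=0):
                iv = train[iv]                     # map local -> global indices
                itr = np.setdiff1d(train, iv)
                Xr, yr = resample(X[itr], y[itr])  # resample inner train only
                scores.append(score(fit(Xr, yr, h), X[iv], y[iv]))
            return float(np.mean(scores))

        best = max(grid, key=inner_score)          # Steps 2-3: tune hyperparams
        Xr, yr = resample(X[train], y[train])      # resample full outer train
        outer_scores.append(score(fit(Xr, yr, best), X[val], y[val]))
    return float(np.mean(outer_scores)), outer_scores
```

The key design point is visible in the indexing: `resample` is never called on indices in `val`, so the validation folds keep the original class distribution.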
Hyperparameters were optimized using random search within the inner CV loop. For illustration, in one outer fold of the German dataset, the optimal hyperparameters for MLP included hidden layers [128, 64, 32], ReLU activation, Adam optimizer with lr = 0.001, and batch size 32, selected based on the highest F1-score (0.912) on the inner validation sets. The full set of hyperparameters used across models is summarized in
Table 4.
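The random search itself is straightforward to sketch: sample configurations uniformly from the search space and keep the one with the best inner-CV score. The ranges below are hypothetical placeholders — the actual ranges are those in Table 4 — and `evaluate` stands in for the inner-CV scoring loop.

```python
import random

# Hypothetical search space -- the actual ranges are those in Table 4.
SPACE = {
    "hidden_layers": [[64, 32], [128, 64], [128, 64, 32]],
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "batch_size": [16, 32, 64],
}

def random_search(evaluate, n_iter=20, seed=42):
    """Random search: draw n_iter configurations uniformly from SPACE and
    keep the one with the best score (e.g. mean F1 over inner CV folds)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        s = evaluate(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score
```

With a fixed seed the search is reproducible, which matches the fixed-seed policy used throughout the experimental setup.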
3.7. Evaluation Metrics
In order to assess the model’s performance in credit risk prediction, especially with imbalanced data, a comprehensive set of metrics is essential. This study utilizes accuracy, recall, precision, specificity, and F1 score, as each metric provides unique insights into model effectiveness. Accuracy offers a general measure of performance, but it can be misleading for imbalanced datasets, as it may reflect high performance merely by correctly predicting the majority class. Therefore, additional metrics such as precision, recall, specificity, and F1 score are incorporated to assess the model’s ability to handle both classes accurately, particularly focusing on correctly identifying high-risk (default) cases.
Precision and recall are valuable in understanding the model’s performance in identifying defaults. Precision reflects the proportion of predicted default cases that are actual defaults, reducing the occurrence of false positives (
Obaido et al., 2022). Conversely, recall measures the model’s effectiveness in capturing actual defaults. Specificity complements recall by indicating the model’s accuracy on non-default cases, while the F1 score, defined as the harmonic mean of precision and recall, balances the two and provides a single measure of performance on the minority class. The following equations define each metric:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
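These definitions map directly onto confusion-matrix counts; a minimal helper (illustrative, not part of the study's code) makes the relationships explicit:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five study metrics from confusion-matrix counts.
    tp/fn refer to the default (positive) class, tn/fp to non-defaults."""
    precision = tp / (tp + fp)        # predicted defaults that are real
    recall = tp / (tp + fn)           # real defaults that are caught
    specificity = tn / (tn + fp)      # non-defaults correctly cleared
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

On an imbalanced sample (say 900 non-defaults, 100 defaults) accuracy alone can look strong while recall on defaults stays low, which is exactly why the additional metrics are reported.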
4. Results and Discussion
The performance of several machine learning and deep learning models applied to the German and Taiwan credit datasets is thoroughly examined in this section. All experiments were conducted using Python 3.10. The machine learning models and pre-processing methods were implemented with scikit-learn, while the MLP was developed using TensorFlow/Keras. Data manipulation and analysis were performed using NumPy and pandas. The evaluation involves multiple experimental setups, incorporating different resampling techniques, including SMOTE, ENN, and SMOTE-ENN, to address the inherent class imbalance in the datasets. The experimental setup included a diverse range of models: Logistic Regression, SVM, Decision Trees, Random Forest, XGBoost, AdaBoost, MLP, CNN, LSTM, and GRU. The hyperparameters for each model are summarized in
Table 4. These hyperparameters were optimized using a random search strategy within nested cross-validation.
Representative ranges included: max depth for decision trees, number of estimators for random forest and boosting models, learning rate for XGBoost, hidden layer sizes for MLP, and recurrent units for LSTM/GRU.
Model evaluation was performed using 10-fold stratified cross-validation to maintain balanced class distribution in each fold. No separate hold-out test set was used; all performance metrics were derived from the cross-validation results. Ten-fold cross-validation is generally more reliable than a single train–test split because it reduces variance and provides a more stable estimate of generalization. This approach is especially suitable for small datasets such as the German Credit dataset (1000 records), where a hold-out split could lead to unstable or misleading performance estimates.
4.1. Performance of the Models on the German Dataset
The performance of several machine learning and deep learning models applied to the German dataset using different resampling approaches is compiled in
Table 5, including
No Resampling, SMOTE, ENN, and SMOTE-ENN. Among the models evaluated, SMOTE-ENN combined with MLP achieved the highest overall performance, with an accuracy of 0.954, recall of 0.929, specificity of 0.968, precision of 0.945, and F1 Score of 0.937. This indicates that the MLP model, when combined with SMOTE-ENN, is particularly effective at balancing sensitivity and specificity while maintaining high predictive accuracy.
The CNN model under SMOTE-ENN also demonstrated strong performance, achieving an accuracy of 0.934, recall of 0.893, specificity of 0.958, precision of 0.926, and F1 Score of 0.909. Compared to MLP, CNN slightly lagged in recall and precision but still achieved robust results. Similarly, GRU and LSTM models trained with SMOTE-ENN achieved impressive results, with GRU achieving an F1 Score of 0.879 and LSTM slightly outperforming GRU with an F1 Score of 0.903. These results underscore the effectiveness of SMOTE-ENN in improving model performance, particularly for deep learning models.
For models without resampling, CNN emerged as the top performer, achieving an F1 Score of 0.850, followed by MLP with an F1 Score of 0.842. This suggests that CNN and MLP possess inherent robustness when applied to imbalanced datasets, even without resampling techniques. However, the recall values for these models were generally lower, indicating challenges in identifying minority class instances without resampling.
The SHAP summary plot in
Figure 1 provides insights into feature importance for the German dataset. The plot highlights that “CheckingAccount” and “Duration” are the most influential features in predicting credit outcomes, with their SHAP values indicating strong contributions to the model’s decisions. Features such as “SavingAccounts” and “Age” also played significant roles, albeit to a lesser extent. The SHAP analysis underscores the importance of financial stability indicators like “CheckingAccount” and “SavingAccounts” in credit risk modeling, aligning with domain knowledge.
4.2. Performance of the Models on the Taiwan Dataset
Table 6 presents the performance metrics for models applied to the Taiwan dataset under the same resampling techniques. The XGBoost model without resampling achieved the highest accuracy among the
No Resampling configurations, with an accuracy of 0.817, recall of 0.376, specificity of 0.940, precision of 0.637, and F1 Score of 0.473. However, the recall value remained relatively low, indicating limitations in detecting minority class instances.
SMOTE-ENN combined with random forest was the best-performing configuration, achieving an accuracy of 0.821, recall of 0.745, specificity of 0.842, precision of 0.512, and F1 Score of 0.610. This configuration demonstrated improved recall compared to other models, indicating better sensitivity to minority class predictions. Additionally, SMOTE-ENN combined with LSTM achieved comparable results, with an F1 Score of 0.487, highlighting the effectiveness of deep learning models under resampling techniques.
The SHAP summary plot in
Figure 2 illustrates the feature importance for the Taiwan dataset using the random forest predictions. “PAY_0” and “LIMIT_BAL” were identified as the most critical features influencing model predictions. These features reflect past payment behavior and credit limits, which are intuitive and highly relevant indicators for predicting credit default. Other features, such as “BILL_AMT1” and “PAY_AMT1,” also contributed significantly, emphasizing the role of financial transaction history in credit risk prediction.
5. Discussion and Conclusions
The results demonstrate that deep learning models, particularly MLP and CNN, achieve superior performance on both datasets when paired with the hybrid SMOTE-ENN resampling technique. This superiority stems from their ability to model complex non-linear interactions in credit features, which is significantly enhanced by the cleaner class boundaries produced by SMOTE-ENN compared with SMOTE alone or no resampling.
The superior performance observed for the MLP combined with the SMOTE-ENN pipeline can be attributed to two complementary effects. First, the MLP is capable of modeling complex, non-linear relationships commonly present in tabular credit-scoring datasets. Second, the SMOTE-ENN procedure addresses class imbalance by generating synthetic minority-class examples while removing noisy majority-class samples; the ENN step in particular helps eliminate ambiguous majority instances and retain representative minority prototypes. This process results in sharper and more discriminative decision boundaries, ultimately improving classification accuracy and robustness.
The substantial decline in SVM performance following SMOTE-ENN on the Taiwan dataset can be explained by the SVM’s sensitivity to class overlap and noise near the decision margin. While ENN removes some ambiguous samples, the SMOTE step may still generate synthetic minority instances close to the true boundary—an issue exacerbated by the noisier, more overlapping structure of the Taiwan dataset. These synthetic samples distort the margin optimization process essential to SVMs, resulting in a notable reduction in generalization performance.
The SHAP analysis further validated these models by identifying critical features like PAY_0 and LIMIT_BAL for the Taiwan dataset and CheckingAccount and Duration for the German dataset, indicating their relevance in credit scoring. These findings align closely with financial intuition and offer actionable insights for lenders, emphasizing the importance of current account status, repayment behavior, loan duration, and available credit limit in default prediction.
Although deep learning models exhibited greater robustness and higher F1-scores, their increased computational requirements may favor simpler models (e.g., Random Forest or XGBoost) in real-time scoring environments where inference speed is prioritized over marginal gains in predictive performance.
In summary, the combination of SMOTE-ENN with MLP (German dataset) or Random Forest (Taiwan dataset), coupled with SHAP explanations, represents a practical, high-performing, and transparent solution for credit risk assessment. Future research may explore cost-sensitive learning, ensemble-of-ensembles strategies, and the integration of SHAP into live credit-scoring pipelines to further improve fairness, performance, and regulatory compliance.
Author Contributions
Conceptualization, I.M. and T.S.; methodology, I.M.; software, I.M.; validation, T.S.; formal analysis, I.M. and T.S.; investigation, I.M.; resources, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, I.M. and T.S.; visualization, I.M. and T.S.; supervision, T.S.; project administration, T.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Alam, T. M., Shaukat, K., Hameed, I. A., Luo, S., Sarwar, M. U., Shabbir, S., Li, J., & Khushi, M. (2020). An investigation of credit card default prediction in the imbalanced datasets. IEEE Access, 8, 201173–201198. [Google Scholar] [CrossRef]
- Alex, S. A. (2025). Imbalanced data learning using SMOTE and deep learning architecture with optimized features. Neural Computing and Applications, 37(2), 967–984. [Google Scholar] [CrossRef]
- Altalhan, M., Algarni, A., & Alouane, M. T.-H. (2025). Imbalanced data problem in machine learning: A review. IEEE Access, 13, 13686–13699. [Google Scholar] [CrossRef]
- Alvi, J., Arif, I., & Nizam, K. (2024). Advancing financial resilience: A systematic review of default prediction models and future directions in credit risk management. Heliyon, 10(21), e39770. [Google Scholar] [CrossRef]
- Bhatore, S., Mohan, L., & Reddy, Y. R. (2020). Machine learning techniques for credit risk evaluation: A systematic literature review. Journal of Banking and Financial Technology, 4(1), 111–138. [Google Scholar] [CrossRef]
- Bhattacharya, A., Biswas, S. K., & Mandal, A. (2023). Credit risk evaluation: A comprehensive study. Multimedia Tools and Applications, 82(12), 18217–18267. [Google Scholar] [CrossRef]
- Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197–227. [Google Scholar] [CrossRef]
- Boughaci, D., Alkhawaldeh, A. A., Jaber, J. J., & Hamadneh, N. (2021). Classification with segmentation for credit scoring and bankruptcy prediction. Empirical Economics, 61, 1281–1309. [Google Scholar] [CrossRef]
- Chang, V., Xu, Q. A., Akinloye, S. H., Benson, V., & Hall, K. (2024). Prediction of bank credit worthiness through credit risk analysis: An explainable machine learning study. Annals of Operations Research, 354, 247–271. [Google Scholar] [CrossRef]
- Dablain, D., Krawczyk, B., & Chawla, N. V. (2022). DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 34(9), 6390–6404. [Google Scholar] [CrossRef]
- Elreedy, D., Atiya, A. F., & Kamalov, F. (2024). A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Machine Learning, 113(7), 4903–4923. [Google Scholar] [CrossRef]
- Gutiérrez, P. A., Hervás-Martínez, C., & Martínez-Estudillo, F. J. (2010). Logistic regression by means of evolutionary radial basis function neural networks. IEEE Transactions on Neural Networks, 22(2), 246–263. [Google Scholar] [CrossRef]
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. [Google Scholar] [CrossRef]
- Huang, M.-W., Chen, C.-W., Lin, W.-C., Ke, S.-W., & Tsai, C.-F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501. [Google Scholar] [CrossRef]
- Huang, X., Li, Z., Jin, Y., & Zhang, W. (2022). Fair-AdaBoost: Extending AdaBoost method to achieve fair classification. Expert Systems with Applications, 202, 117240. [Google Scholar] [CrossRef]
- Jiang, C., Lu, W., Wang, Z., & Ding, Y. (2023). Benchmarking state-of-the-art imbalanced data learning approaches for credit scoring. Expert Systems with Applications, 213, 118878. [Google Scholar] [CrossRef]
- Krichen, M. (2023). Convolutional neural networks: A survey. Computers, 12(8), 151. [Google Scholar] [CrossRef]
- Li, J., Yang, R., Cao, X., Zeng, B., Shi, Z., Ren, W., & Cao, X. (2025). Inception MLP: A vision MLP backbone for multi-scale feature extraction. Information Sciences, 701, 121865. [Google Scholar] [CrossRef]
- Lin, M., Zhu, X., Hua, T., Tang, X., Tu, G., & Chen, X. (2021). Detection of ionospheric scintillation based on XGBoost model improved by SMOTE-ENN technique. Remote Sensing, 13(13), 2577. [Google Scholar] [CrossRef]
- Liu, B., Zhang, Z., Yan, J., Zhang, N., Zha, H., Li, G., Li, Y., & Yu, Q. (2020). A deep learning approach with feature derivation and selection for overdue repayment forecasting. Applied Sciences, 10(23), 8491. [Google Scholar] [CrossRef]
- Liu, Y., Baals, L. J., Osterrieder, J., & Hadji-Misheva, B. (2024). Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning. Expert Systems with Applications, 252, 124100. [Google Scholar] [CrossRef]
- Lundberg, S. M., & Lee, S.-I. (2017, December 4–9). A unified approach to interpreting model predictions. 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA. [Google Scholar]
- Meer, M., Khan, M. A., Jabeen, K., Alzahrani, A. I., Alalwan, N., Shabaz, M., & Khan, F. (2024). Deep convolutional neural networks information fusion and improved whale optimization algorithm based smart oral squamous cell carcinoma classification framework using histopathological images. Expert Systems, 42(1), e13536. [Google Scholar] [CrossRef]
- Meng, B., Sun, J., & Shi, B. (2024). A novel URP-CNN model for bond credit risk evaluation of Chinese listed companies. Expert Systems with Applications, 255, 124861. [Google Scholar] [CrossRef]
- Mienye, I. D., & Jere, N. (2024). A survey of decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727. [Google Scholar] [CrossRef]
- Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Scalable XAI: Towards explainable machine learning models in distributed systems. In Pan-African artificial intelligence and smart systems conference (pp. 3–16). Springer. [Google Scholar]
- Montevechi, A. A., Miranda, R. C., Medeiros, A. L., & Montevechi, J. A. B. (2024). Advancing credit risk modelling with machine learning: A comprehensive review of the state-of-the-art. Engineering Applications of Artificial Intelligence, 137, 109082. [Google Scholar] [CrossRef]
- Niu, Z., Zhong, G., Yue, G., Wang, L.-N., Yu, H., Ling, X., & Dong, J. (2023). Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data. Neurocomputing, 517, 1–9. [Google Scholar] [CrossRef]
- Nizam-Ozogur, H., & Orman, Z. (2024). A heuristic-based hybrid sampling method using a combination of SMOTE and ENN for imbalanced health data. Expert Systems, 41(8), e13596. [Google Scholar] [CrossRef]
- Noriega, J. P., Rivera, L. A., & Herrera, J. A. (2023). Machine learning for credit risk prediction: A systematic literature review. Data, 8(11), 169. [Google Scholar] [CrossRef]
- Obaido, G., Ogbuokiri, B., Mienye, I. D., & Kasongo, S. M. (2022). A voting classifier for mortality prediction post-thoracic surgery. In International conference on intelligent systems design and applications (pp. 263–272). Springer. [Google Scholar]
- Onasoga, B., & Hwidi, J. (2024). Enhancing credit card default prediction: Prioritizing recall over accuracy. In International conference on innovative computing and communication (pp. 441–459). Springer. [Google Scholar]
- Rao, C., Liu, Y., & Goh, M. (2022). Credit risk assessment mechanism of personal auto loan based on PSO-XGBoost model. Complex & Intelligent Systems, 9, 1391–1414. [Google Scholar] [CrossRef]
- Salehi, A. W., Khan, S., Gupta, G., Alabduallah, B. I., Almjally, A., Alsolai, H., Siddiqui, T., & Mellit, A. (2023). A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability, 15(7), 5930. [Google Scholar] [CrossRef]
- Shang, H., Shang, L., Wu, J., Xu, Z., Zhou, S., Wang, Z., Wang, H., & Yin, J. (2023). NIR spectroscopy combined with 1D-convolutional neural network for breast cancerization analysis and diagnosis. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 287, 121990. [Google Scholar] [CrossRef]
- Shi, Y., Qu, Y., Chen, Z., Mi, Y., & Wang, Y. (2024). Improved credit risk prediction based on an integrated graph representation learning approach with graph transformation. European Journal of Operational Research, 315(2), 786–801. [Google Scholar] [CrossRef]
- Shilpa, N. A., Shaha, P., Hajek, P., & Abedin, M. Z. (2023). Default risk prediction based on support vector machine and logit support vector machine. In Novel financial applications of machine learning and deep learning: Algorithms, product modeling, and applications (pp. 93–106). Springer. [Google Scholar]
- Suhadolnik, N., Ueyama, J., & Da Silva, S. (2023). Machine learning for enhanced credit risk assessment: An empirical approach. Journal of Risk and Financial Management, 16(12), 496. [Google Scholar] [CrossRef]
- Sun, Z., Wang, G., Li, P., Wang, H., Zhang, M., & Liang, X. (2024). An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237, 121549. [Google Scholar] [CrossRef]
- Thor, M., & Postek, Ł. (2024). Gated recurrent unit network: A promising approach to corporate default prediction. Journal of Forecasting, 43(5), 1131–1152. [Google Scholar] [CrossRef]
- Vairetti, C., Assadi, J. L., & Maldonado, S. (2024). Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification. Expert Systems with Applications, 246, 123149. [Google Scholar] [CrossRef]
- Wang, L. (2022). Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Applied Soft Computing, 114, 108153. [Google Scholar] [CrossRef]
- Wang, Y., & Ni, X. S. (2023). A survey of machine learning methodologies for loan evaluation in peer-to-peer (P2P) lending. In Data analytics for management, banking and finance: Theories and application (pp. 1–49). Springer. [Google Scholar]
- Xing, Q., Yu, C., Huang, S., Zheng, Q., Mu, X., & Sun, M. (2024). Enhanced credit score prediction using ensemble deep learning model. arXiv, arXiv:2410.00256. [Google Scholar] [CrossRef]
- Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. Journal of Biomedical Informatics, 107, 103465. [Google Scholar] [CrossRef]
- Yin, W., Kirkulak-Uludag, B., Zhu, D., & Zhou, Z. (2023). Stacking ensemble method for personal credit risk assessment in peer-to-peer lending. Applied Soft Computing, 142, 110302. [Google Scholar] [CrossRef]
- Yu, C., Jin, Y., Xing, Q., Zhang, Y., Guo, S., & Meng, S. (2024). Advanced user credit risk prediction model using LightGBM, XGBoost and TabNet with SMOTEENN. arXiv, arXiv:2408.03497. [Google Scholar] [CrossRef]
- Zhang, X., Ma, Y., & Wang, M. (2024). An attention-based LogisticCNN-BiLSTM hybrid neural network for credit risk prediction of listed real estate enterprises. Expert Systems, 41(2), e13299. [Google Scholar] [CrossRef]
- Zhao, Y. (2024). Investigation of the application of machine learning algorithms in credit risk assessment of medium and micro enterprises. IEEE Access, 12, 152945–152958. [Google Scholar] [CrossRef]
- Zhao, Z., Cui, T., Ding, S., Li, J., & Bellotti, A. G. (2024). Resampling techniques study on class imbalance problem in credit risk prediction. Mathematics, 12(5), 701. [Google Scholar] [CrossRef]
- Zhu, Y., Hu, Y., Liu, Q., Liu, H., Ma, C., & Yin, J. (2023). A hybrid approach for predicting corporate financial risk: Integrating SMOTE-ENN and NGBoost. IEEE Access, 11, 111106–111125. [Google Scholar] [CrossRef]
- Ziemba, P., Becker, J., Becker, A., & Radomska-Zalas, A. (2023). Framework for multi-criteria assessment of classification models for the purposes of credit scoring. Journal of Big Data, 10(1), 94. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.