Article

Performance Evaluation of Machine Learning and Deep Learning Models for Credit Risk Prediction

Department of Electrical and Electronic Engineering Science, University of Johannesburg, Johannesburg 2006, South Africa
*
Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2026, 19(3), 210; https://doi.org/10.3390/jrfm19030210
Submission received: 9 November 2025 / Revised: 3 December 2025 / Accepted: 6 December 2025 / Published: 11 March 2026
(This article belongs to the Section Financial Technology and Innovation)

Abstract

Credit risk prediction is essential for financial institutions to effectively assess the likelihood of borrower defaults and manage associated risks. This study presents a comparative analysis of deep learning architectures and traditional machine learning models on imbalanced credit risk datasets. To address class imbalance, we employ three resampling techniques: Synthetic Minority Over-sampling Technique (SMOTE), Edited Nearest Neighbors (ENN), and the hybrid SMOTE-ENN. We evaluate the performance of various models, including multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM), gated recurrent unit (GRU), logistic regression, decision tree, support vector machine (SVM), random forest, adaptive boosting, and extreme gradient boosting. The analysis reveals that SMOTE-ENN combined with MLP achieves the highest F1-score of 0.928 (accuracy 95.4%) on the German dataset, while SMOTE-ENN with random forest attains the best F1-score of 0.789 (accuracy 82.1%) on the Taiwanese dataset. SHapley Additive exPlanations (SHAP) are employed to enhance model interpretability, identifying key drivers of credit default. These findings provide actionable guidance for developing transparent, high-performing, and robust credit risk assessment systems.

1. Introduction

Credit risk assessment is a critical component of the financial services sector, as it allows lenders to assess the probability that a borrower defaults on their loan obligations (Bhattacharya et al., 2023; Chang et al., 2024; Y. Zhao, 2024). Accurate credit risk prediction is essential for financial institutions to mitigate losses, allocate resources more effectively, and set appropriate interest rates. Traditional credit scoring models, such as decision trees and logistic regression, have been employed extensively due to their interpretability and ease of implementation (Ziemba et al., 2023). However, as the volume and complexity of financial data have increased, there has been a shift towards machine learning (ML) and deep learning (DL) techniques that can better capture nonlinear relationships and leverage complex patterns within data (Shi et al., 2024; Zhang et al., 2024).
Despite the advances in ML and DL, credit risk prediction presents several challenges, one of the most prominent being class imbalance (Bhatore et al., 2020; Noriega et al., 2023). In real-world credit datasets, such as those from Taiwan and Germany, there is a natural skew in the distribution of classes, with non-default (low-risk) cases significantly outnumbering default (high-risk) cases (Jiang et al., 2023; Z. Zhao et al., 2024). This imbalance poses difficulties for ML algorithms, which tend to be biased toward the majority class, frequently resulting in subpar performance in identifying minority class (default) cases. Addressing this imbalance is therefore crucial for developing robust and reliable credit risk models.
Current research in credit risk prediction has explored numerous techniques to handle class imbalance, including resampling methods and algorithmic adjustments. Resampling methods, such as Synthetic Minority Over-sampling Technique (SMOTE), generate synthetic samples of the minority class to balance the data distribution, while undersampling techniques, like Edited Nearest Neighbors (ENN), reduce the size of the majority class by eliminating redundant samples (Elreedy et al., 2024; Vairetti et al., 2024). Despite their effectiveness, these methods are not without limitations. Oversampling can lead to overfitting, especially in complex models, as it duplicates minority instances, while undersampling can discard potentially useful data from the majority class, reducing the overall model performance (Altalhan et al., 2025). There is still a void in the literature on the optimal combination of resampling techniques with various ML and DL models for the prediction of credit risk.
This study aims to address this gap by conducting a comprehensive comparative analysis of ML and DL models with different resampling strategies on imbalanced credit datasets. Specifically, this study employs the Taiwan and German credit datasets, which are commonly used benchmarks in credit risk research and exhibit significant class imbalance. Furthermore, this study seeks to identify model–resampling pairs that offer the best balance of predictive accuracy and robustness for imbalanced credit datasets. Such insights can inform the development of more reliable credit scoring systems, ultimately contributing to better credit risk management and decision-making in the financial sector.
The main contributions of this study are summarized as follows:
  • A comprehensive analysis of the impact of class imbalance on the effectiveness of various ML and DL models for credit risk prediction.
  • Evaluation of the effectiveness of resampling techniques, in combination with both ML and DL models, on imbalanced credit datasets.
  • Comparative analysis across two widely used imbalanced credit datasets, the Taiwan and German datasets, providing insights into model performance in diverse credit risk scenarios.
  • Identification of optimal model–resampling combinations that balance accuracy and robustness.
The rest of this paper is organized as follows: Section 2 presents the background and related works, highlighting existing ML and DL approaches in credit risk prediction. Section 3 details the datasets, model architectures, resampling strategies, and experimental setup employed in this study. Section 4 presents the results of the various models under different resampling scenarios, followed by a discussion of the findings. Finally, Section 5 concludes the study.

2. Related Works

2.1. Machine Learning in Credit Risk Prediction

Credit risk prediction has traditionally relied on ML techniques due to their capacity for predictive accuracy and interpretability. Among the foundational models, logistic regression (LR) remains widely used for binary classification tasks, as it provides a straightforward probabilistic framework for assessing creditworthiness (Alam et al., 2020; Shilpa et al., 2023; Suhadolnik et al., 2023). In credit risk applications, logistic regression’s interpretability allows financial analysts to easily understand the impact of individual predictors, such as income level and debt-to-income ratio, on the likelihood of default. As demonstrated by Montevechi et al. (2024), logistic regression can effectively model linear relationships but struggles with more complex, nonlinear interactions that are common in financial datasets.
Decision trees have been adopted in credit risk assessment to address the limitations of logistic regression. Decision trees are known for their ability to capture nonlinear interactions between variables, making them suitable for more complex datasets (Mienye & Jere, 2024). Additionally, decision trees are intuitive, as they resemble human decision-making processes, and their hierarchical structure allows for transparent credit scoring. For instance, a study by Y. Wang and Ni (2023) showed that decision trees could accurately classify high-risk and low-risk applicants by effectively capturing interactions among credit-related features. However, decision trees can be prone to overfitting, particularly when dealing with high-dimensional data, which can reduce generalizability on unseen data.
To enhance robustness and predictive accuracy, ensemble methods such as random forests and gradient boosting machines (GBMs) have gained popularity in recent years. Random forests leverage multiple decision trees by averaging their predictions, which mitigates overfitting and offers a more reliable model for credit risk prediction. As demonstrated by Mienye and Jere (2024), random forests perform well in identifying complex interactions and capturing a wide range of credit risk factors, yielding improved accuracy over standalone decision trees. On the other hand, gradient boosting machines build sequential trees, each correcting the errors of the previous one. Techniques like XGBoost and LightGBM—both optimized implementations of gradient boosting—have shown high performance in credit risk applications because of their capacity to manage large datasets and deliver superior predictive performance, as shown by Yu et al. (2024).
Recently, ensemble stacking has been employed to combine the strengths of multiple ML models in credit risk prediction. Stacking involves training various base models, such as decision trees, logistic regression and SVM, and then using their predictions as inputs for a meta-model that generates the final prediction. This approach has been shown to yield high predictive accuracy and robust performance across various credit datasets. A recent study by Yin et al. (2023) highlights the advantages of stacking for credit risk prediction, where the technique achieved superior performance by utilizing the diverse strengths of multiple base models. Despite the computational complexity of stacking, it enables practitioners to maximize the predictive power of simpler models, making it a viable approach for high-stakes financial applications.

2.2. Deep Learning Approaches

In recent years, DL techniques have become more popular in credit risk prediction by virtue of their ability to capture complex, nonlinear relationships in large datasets. Unlike traditional ML models, DL architectures can automatically learn intricate patterns in data, making them suitable for financial applications where complex interactions between variables are prevalent. One commonly used DL model is the MLP, a fully connected neural network that has shown significant promise in predicting credit risk. Studies such as those by Y. Liu et al. (2024) have demonstrated that MLPs can outperform traditional ML models by capturing hidden patterns within borrower behavior and financial features.
Beyond the MLP, more sophisticated architectures such as LSTM networks have been explored in credit risk prediction, particularly when temporal data is available. LSTM networks are a type of recurrent neural network (RNN) that performs well in modeling sequential data by retaining long-term dependencies. This makes them useful in credit risk scenarios where borrowers’ historical payment behavior can inform future default risk. For instance, B. Liu et al. (2020) utilized LSTM networks to predict defaults by analyzing past payment sequences, achieving higher predictive accuracy than static ML models. The study demonstrates the advantage of LSTM in leveraging temporal dependencies for credit risk prediction.
Another DL model that has been applied in credit risk applications is the GRU network, which is similar to the LSTM but has a simpler structure and fewer parameters. GRUs have been found to be computationally efficient while retaining the capacity to model sequential data, making them an attractive alternative to LSTMs for credit risk prediction. A study by Thor and Postek (2024) demonstrated the robustness of GRUs for credit risk prediction, where the GRU outperformed standard machine learning models.
Additionally, CNNs, typically used for image processing, have been adapted for credit risk prediction. By capturing local patterns in feature space, CNNs can identify clusters of behaviors or attributes that may indicate credit risk. Meng et al. (2024) applied CNNs to financial data and found that they could efficiently identify risk-related patterns, particularly in large datasets with numerous features, outperforming traditional ML models.

2.3. Data Resampling for Imbalanced Data

Handling class imbalance is crucial in credit risk prediction, as most credit datasets contain a significantly lower proportion of default cases (high-risk) compared to non-default cases (low-risk). This imbalance can result in biased predictions, where models tend to predict the majority class more accurately, resulting in poor performance in identifying minority class instances. To address this, various data resampling techniques have been developed to balance the class distribution, improving model performance in detecting defaults. One widely adopted method is SMOTE, which generates synthetic samples for the minority class by interpolating between existing minority instances. As demonstrated in the work by Dablain et al. (2022), SMOTE effectively reduces bias towards the majority class and improves model performance in highly imbalanced datasets.
SMOTE has been extensively used in credit risk prediction, often in conjunction with machine learning and deep learning models. In a recent study, Onasoga and Hwidi (2024) applied SMOTE to a credit risk dataset and observed significant improvements in the recall of default predictions, indicating that the model was better able to identify high-risk applicants. However, while SMOTE is effective in enhancing recall, it can also lead to overfitting, particularly when combined with complex models like deep neural networks (Alex, 2025). This overfitting arises because synthetic samples are generated based on existing minority instances, which may lead to an artificial concentration of minority samples in feature space, reducing the model’s generalizability.
While SMOTE and its variants remain widely used, recent studies have also introduced alternatives to address the risk of overfitting. For instance, DeepSMOTE leverages deep neural networks to generate more realistic minority samples (Dablain et al., 2022), while cost-sensitive learning adjusts model training to penalize misclassification of the minority class more heavily (Mienye et al., 2024). These approaches represent complementary strategies to traditional resampling and have shown promise in improving generalizability in highly imbalanced credit datasets.
Another approach to managing class imbalance is undersampling, where the majority class is reduced to balance the dataset. One popular undersampling technique is ENN, which removes majority class samples that are misclassified by their nearest neighbors (Xu et al., 2020). This approach has been shown to reduce redundancy in the majority class, allowing models to better differentiate between high-risk and low-risk cases. The study by Nizam-Ozogur and Orman (2024) highlighted the effectiveness of ENN in enhancing model performance in imbalanced datasets by focusing on cleaner and more relevant samples. In credit risk prediction, ENN has been found to enhance precision and reduce false positives by eliminating noisy samples in the majority class, as demonstrated by Xing et al. (2024).
While oversampling and undersampling techniques are effective individually, recent research has explored combining them to maximize their benefits. For example, SMOTE-ENN combines SMOTE’s synthetic oversampling with ENN’s undersampling, creating a more balanced dataset with reduced noise. This hybrid approach has been applied in credit risk prediction, where it has shown improved recall and precision in identifying defaults compared to standalone resampling methods. Zhu et al. (2023) demonstrated that SMOTE-ENN effectively enhanced model robustness in an imbalanced credit risk dataset.
As seen from the reviewed papers in this section, credit risk prediction is a crucial area where both traditional ML and DL approaches have been extensively applied. While traditional ML models such as logistic regression and decision trees are valued for their interpretability and simplicity, they often underperform when modeling complex, nonlinear interactions inherent in financial datasets. Advances in ensemble techniques like random forests and gradient boosting machines have addressed these limitations to some extent, yet challenges remain in optimizing performance on highly imbalanced datasets. Conversely, DL models, including LSTMs, GRUs, and CNNs, have demonstrated strong capability in capturing intricate patterns and temporal dependencies in different applications. However, the practical effectiveness of these models is often hindered by class imbalance, as the minority default cases are typically overshadowed by the majority non-default cases. Resampling strategies such as SMOTE and ENN have been employed to mitigate this imbalance, but questions about the optimal combination of model architectures and resampling strategies persist.
Therefore, this study bridges these gaps by conducting a systematic evaluation of ML and DL models for credit risk prediction, focusing on their performance under varying resampling techniques. The contributions of this work are threefold: firstly, it provides a comparative analysis of traditional ML models, advanced ensemble methods, and state-of-the-art DL architectures under real-world conditions of class imbalance; secondly, it integrates and evaluates resampling methods such as SMOTE, ENN, and SMOTE-ENN to identify their impact on model performance; and thirdly, it utilizes SHAP-based interpretability techniques to explain the contribution of key features in model predictions, ensuring transparency in high-stakes financial decision-making. The findings of this study can guide practitioners in selecting effective model–resampling combinations, advancing the field of credit risk prediction in both research and practical applications.

3. Methodology

3.1. Credit Risk Datasets

This study utilizes two publicly accessible credit datasets: the German Credit dataset and the Taiwan Credit dataset. Each dataset provides valuable insights for credit risk prediction and presents unique characteristics that enhance the robustness of the analysis.

3.1.1. German Credit Dataset

The German Credit dataset, shown in Table 1, contains 1000 instances with 20 attributes encompassing personal and financial information such as age, credit amount, loan duration, and loan purpose (Boughaci et al., 2021). The target variable classifies applicants as good credit risk (1) or bad credit risk (0). This dataset exhibits class imbalance, with about 70% labeled as good credit and 30% as bad credit. It is widely used as a benchmark in credit risk modeling to assess the performance of different algorithms.

3.1.2. Taiwan Credit Dataset

The Taiwan Credit dataset, also known as the UCI Credit Card Default dataset, includes 30,000 instances with 23 features related to demographic information, bill statements, and payment history. The target variable indicates whether the client will default (1) or not (0) on their next payment. This dataset is significantly imbalanced, with approximately 22.1% defaults and 77.9% non-defaults. Its large size and comprehensive features make it particularly useful for developing and testing complex models such as DL architectures. The attributes in the dataset are described in Table 2.
As summarized in Table 3, both the German and Taiwan credit datasets are widely adopted benchmarks in credit risk modeling research. Each dataset reflects specific socio-economic and institutional contexts: the German dataset is relatively small and dated, representing lending behavior in a European setting, while the Taiwan dataset is larger and more recent but restricted to credit card clients from a single Asian economy. As such, conclusions drawn from these datasets may not fully capture the heterogeneity of global credit markets, where borrower demographics, regulatory environments, and financial products differ substantially. Therefore, the reported findings are interpreted as indicative rather than universally representative.
The decision to employ these two datasets was motivated by their prominence as benchmarking tools in the literature and their complementary characteristics. The German dataset provides a compact but highly imbalanced sample that is frequently used to test baseline algorithms, whereas the Taiwan dataset offers scale and complexity suitable for training more expressive deep learning models. Together, they allow for a comparative analysis across dataset sizes, levels of feature richness, and class imbalance ratios.

3.2. Deep Learning Model Architectures

This section describes the model architectures used in this study, focusing on several DL models that are well-suited for credit risk prediction.

3.2.1. Multilayer Perceptron

The MLP architecture is a fully connected neural network with an input layer, multiple hidden layers, and an output layer. In credit risk prediction, MLPs are advantageous for capturing complex, nonlinear relationships among features (Shang et al., 2023). Given an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, the MLP computes each layer's output through an affine transformation followed by a nonlinear activation function. The output of the $j$-th neuron in layer $l$ can be represented as:

$$h_j^{(l)} = f\left( \sum_{i=1}^{n} w_{ij}^{(l)} h_i^{(l-1)} + b_j^{(l)} \right)$$

where $f(\cdot)$ is a nonlinear activation function (e.g., ReLU), $w_{ij}^{(l)}$ represents the weight between the $i$-th neuron in layer $(l-1)$ and the $j$-th neuron in layer $l$, and $b_j^{(l)}$ is the bias term (Li et al., 2025). The MLP's fully connected layers enable it to learn general patterns in credit data, making it effective for tasks like credit scoring when applied to structured data.
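To make the layer computation concrete, the following NumPy sketch passes one applicant's feature vector through a single hidden layer and a sigmoid output. The weights are random placeholders for illustration, not a fitted model; in practice they would be learned by backpropagation.

```python
import numpy as np

def relu(z):
    # Nonlinear activation f(.) from the layer equation
    return np.maximum(0.0, z)

def dense_layer(h_prev, W, b):
    # h^(l) = f(h^(l-1) W + b): affine transformation, then nonlinearity
    return relu(h_prev @ W + b)

# Toy forward pass: 4 input features -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                   # one applicant's feature vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

hidden = dense_layer(x, W1, b1)
logit = hidden @ W2 + b2                      # output layer, no activation here
prob_default = 1.0 / (1.0 + np.exp(-logit))   # sigmoid for binary classification
print(prob_default.shape)  # (1, 1)
```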

3.2.2. Convolutional Neural Network

CNNs, originally developed for image processing, have been adapted for credit risk prediction by treating financial data as structured “grids” where patterns are extracted using convolutional layers (Meer et al., 2024). A CNN applies a series of convolutional filters to capture local patterns in data, which can be useful for identifying clusters of behaviors indicative of credit risk (Krichen, 2023; Salehi et al., 2023). The 2-D discrete convolution of an input $\mathbf{X}$ with a filter $\mathbf{K}$ is written compactly as

$$(\mathbf{X} * \mathbf{K})_{i,j} = \sum_{(u,v) \in S_K} X_{i+u,\,j+v}\, K_{u,v}$$

where $(\mathbf{X} * \mathbf{K})_{i,j}$ denotes the output at location $(i, j)$, $S_K$ is the index set covering the finite support of the kernel $\mathbf{K}$, and $*$ denotes discrete convolution (with standard padding/stride as specified in the implementation). By stacking multiple convolutional and pooling layers, CNNs can automatically learn hierarchical representations, making them suitable for large credit datasets with numerous interrelated features.

3.2.3. LSTM Networks

LSTM networks, a type of RNN, are particularly effective for sequential data where long-term dependencies are relevant. In credit risk prediction, LSTMs can capture patterns in borrowers’ payment history over time, helping to assess default risk based on temporal behavior (Alvi et al., 2024). The LSTM cell updates its hidden state by maintaining a memory cell $c_t$ at each time step $t$, controlled by three gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. These gates are defined as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $x_t$ is the input at time $t$, $h_t$ is the hidden state, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication. These equations follow the original formulation of LSTM introduced by Hochreiter and Schmidhuber (1997). This structure allows LSTMs to capture dependencies across different time steps, which is beneficial for time-series data in credit risk analysis.
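The gate equations translate directly into code. The NumPy sketch below runs a single LSTM cell over a short sequence; the parameters are randomly initialized and untrained, and the dimensions are chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the parameters (W_i, W_f, W_o, W_c) and biases,
    # each acting on the concatenation [h_{t-1}, x_t]
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ concat + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ concat + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ concat + b["o"])   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ concat + b["c"])  # memory cell
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4                            # e.g. 3 monthly payment features
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "ifoc"}
b = {k: np.zeros(n_hid) for k in "ifoc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):        # six time steps, e.g. six months
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)  # (4,)
```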
While the German dataset lacks explicit temporal attributes, we include LSTM for benchmarking, treating static features as a single-time-step sequence. This allows evaluation of its generalization on non-sequential data, revealing limitations (e.g., potential overfitting to assumed dependencies) that inform model selection for similar tabular datasets.

3.2.4. GRU Networks

The GRU is a condensed form of the LSTM that retains the capacity to express sequential relationships while reducing computational complexity by combining the input and forget gates into a single update gate (Niu et al., 2023). In GRUs, the hidden state $h_t$ at time $t$ is updated based on the reset gate $r_t$ and the update gate $z_t$, defined as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

where $\tilde{h}_t$ is the candidate activation and $\odot$ denotes element-wise multiplication (Niu et al., 2023). GRUs have been shown to perform comparably to LSTMs across applications with sequential data while being less computationally demanding, as highlighted in various recent studies.
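For comparison with the LSTM cell, a minimal NumPy sketch of a single GRU step implementing the four equations above; parameters are again random, untrained placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    # One GRU update: compute gates z_t and r_t, the candidate state
    # h~_t, then blend old and candidate states via the update gate
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W["z"] @ concat + b["z"])   # update gate
    r_t = sigmoid(W["r"] @ concat + b["r"])   # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r_t * h_prev, x_t]) + b["h"])
    return z_t * h_prev + (1.0 - z_t) * h_tilde   # new hidden state h_t

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}
h = np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):        # e.g. six months of payment features
    h = gru_step(x_t, h, W, b)
print(h.shape)  # (4,)
```

Note the GRU needs three parameter groups where the LSTM needs four, which is the source of its lower computational cost.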
Although both the German and Taiwan datasets are primarily tabular, their feature composition provides a rationale for exploring sequential architectures such as LSTMs and GRUs. In particular, the Taiwan dataset includes six consecutive months of repayment status (PAY_0 to PAY_6), bill amounts (BILL_AMT1–6), and payment amounts (PAY_AMT1–6), which naturally form temporal sequences reflecting borrowers’ financial behavior over time. Modeling these attributes as ordered sequences allows recurrent models to capture dynamic payment patterns that may not be fully exploited by feed-forward networks. For the German dataset, where explicitly temporal attributes are absent, the inclusion of LSTM and GRU serves a benchmarking role, enabling comparison of their generalization ability on static features against architectures like MLP and CNN.

3.3. Traditional Machine Learning Models

This section describes the traditional machine learning models used in this study. Each model type, such as logistic regression, decision tree, support vector machine, random forest, AdaBoost and XGBoost, offers unique strengths and is widely applied in the financial domain.

3.3.1. Logistic Regression

Logistic regression is a widely used linear model for binary classification tasks, including credit risk prediction. It uses the sigmoid function to model the likelihood that a given input belongs to a specific class, with outputs constrained between 0 and 1. Given an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, the probability of the positive class is computed as:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, and $\sigma(\cdot)$ is the sigmoid activation (Gutiérrez et al., 2010). Logistic regression is valued for its interpretability, as it allows practitioners to understand the impact of each feature on default probability. However, it may struggle with capturing complex, nonlinear relationships in credit risk data.
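As a worked example of the sigmoid formulation, the snippet below scores one hypothetical applicant. The coefficients are illustrative stand-ins (a negative weight on scaled income, a positive weight on debt-to-income ratio), not fitted values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_default_prob(x, w, b):
    # P(y = 1 | x) = sigma(w^T x + b)
    return sigmoid(w @ x + b)

w = np.array([-0.8, 2.5])   # weights for [income (scaled), debt-to-income]
b = -0.5                    # bias term
x = np.array([1.2, 0.4])    # one hypothetical applicant
p = predict_default_prob(x, w, b)
print(round(float(p), 3))   # 0.387
```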

3.3.2. Support Vector Machine

SVMs are robust classifiers that seek the optimal hyperplane maximizing the margin between data points of different classes. For linearly separable data, SVMs maximize the distance between the nearest data points (support vectors) of each class and the hyperplane (M.-W. Huang et al., 2017). The decision boundary is defined as:

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term. For cases where the data is not linearly separable, SVMs use kernel functions (e.g., the radial basis function) to project data into a higher-dimensional space where a linear separator may exist.
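A minimal scikit-learn sketch of an RBF-kernel SVM, assuming scikit-learn is available and using synthetic data as a stand-in for credit features:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy imbalanced binary problem (70% "non-default", 30% "default")
X, y = make_classification(n_samples=200, n_features=6,
                           weights=[0.7, 0.3], random_state=42)

# RBF kernel projects the data implicitly into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```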

3.3.3. Decision Tree

Decision trees are non-parametric models that create a tree-like structure of decisions by recursively partitioning data according to feature values. At each node, the model chooses a feature and a threshold that best splits the data to maximize homogeneity within each partition (Sun et al., 2024). The Gini impurity or information gain criteria are typically used to select the best splits. The prediction for an instance x is made by traversing the tree from the root to a leaf node, where the class label is determined. Although decision trees are intuitive and easy to interpret, they have a tendency to overfit, especially on small datasets.
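The Gini criterion mentioned above can be computed directly. A small NumPy sketch scoring a candidate split on toy labels (the labels and split are invented for illustration):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, where p_k is the class proportion
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, left_mask):
    # Impurity decrease from splitting a node into left/right partitions,
    # weighted by partition size; the tree picks the split maximizing this
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

y = np.array([0, 0, 0, 1, 1, 1])
perfect = np.array([True, True, True, False, False, False])
print(gini(y), split_gain(y, perfect))  # 0.5, 0.5: a perfect split removes all impurity
```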

3.3.4. Random Forest

Random forest is an ensemble method that combines multiple decision trees to improve classification accuracy and reduce overfitting (Biau & Scornet, 2016). In a random forest, each tree is trained on a different bootstrap sample of the data, and at each split, a random subset of features is considered. The final prediction is determined by aggregating the predictions of all individual trees, usually by majority voting for classification tasks:

$$\hat{y} = \operatorname{mode}(\{y_1, y_2, \ldots, y_m\})$$

where $y_i$ is the prediction of the $i$-th tree in the forest. Random forests are widely used in credit risk prediction due to their robustness, ability to capture complex patterns, and reduced overfitting compared to single decision trees (Sun et al., 2024).
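The majority-voting rule can be made explicit by polling the individual trees of a fitted forest. The sketch below assumes scikit-learn and uses synthetic data; note that scikit-learn's own `predict()` averages class probabilities rather than taking a hard vote, so the two can differ on ties:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.7, 0.3], random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Hard majority vote across the individual trees, as in the mode equation
votes = np.array([tree.predict(X) for tree in forest.estimators_])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print((majority == y).mean())  # training accuracy of the majority vote
```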

3.3.5. XGBoost

XGBoost is a scalable and effective gradient boosting method that combines an ensemble of weak learners, usually decision trees. Each new tree is trained to reduce the residual errors of those that came before it, making XGBoost effective at handling complex datasets (Rao et al., 2022). Given a model $F(\mathbf{x}) = \sum_{k=1}^{K} f_k(\mathbf{x})$, where the $f_k$ are individual decision trees, XGBoost optimizes the following objective function:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l(y_i, \hat{y}_i)$ is a differentiable loss function measuring the difference between predicted and actual values, and $\Omega(f_k)$ is a regularization term that penalizes the complexity of the trees to prevent overfitting. XGBoost has shown strong performance in credit risk prediction due to its robustness and ability to handle imbalanced data effectively.
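To illustrate the objective, the NumPy sketch below evaluates it for a toy ensemble, using binary cross-entropy as $l$ and the commonly used tree penalty $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2$ over $T$ leaves. The predictions, leaf weights, and $\gamma = \lambda = 1$ are invented for illustration:

```python
import numpy as np

def logloss(y, p):
    # Differentiable loss l(y_i, yhat_i): binary cross-entropy per sample
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def omega(leaf_weights, gamma=1.0, lam=1.0):
    # Regularization Omega(f) = gamma * T + 0.5 * lambda * sum(w_j^2)
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * np.sum(np.square(leaf_weights))

y = np.array([1, 0, 1, 1])                    # true labels
y_hat = np.array([0.8, 0.3, 0.6, 0.9])        # current ensemble probabilities
trees = [np.array([0.5, -0.2]),               # leaf weights of tree 1 (T = 2)
         np.array([0.1, 0.3, -0.1])]          # leaf weights of tree 2 (T = 3)

obj = logloss(y, y_hat).sum() + sum(omega(w) for w in trees)
print(round(float(obj), 3))  # 6.396: loss 1.196 + penalties 2.145 + 3.055
```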

3.3.6. Adaptive Boosting

AdaBoost is a boosting technique that builds a strong classifier by combining several weak classifiers. It iteratively trains models on weighted versions of the dataset, focusing more on incorrectly classified instances in each iteration (X. Huang et al., 2022). The final model prediction is a weighted majority vote of all classifiers, where each classifier's weight depends on its accuracy. For an instance $\mathbf{x}$, the prediction is given by:

$$\hat{y} = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(\mathbf{x}) \right)$$

where $h_m(\mathbf{x}) \in \{-1, +1\}$ denotes the prediction of the $m$-th weak classifier for input $\mathbf{x}$, and $\alpha_m$ is the weight assigned to that classifier based on its accuracy (Mienye & Jere, 2024). AdaBoost is known for improving model performance on difficult-to-classify instances and is often used for credit risk tasks where correctly identifying high-risk cases is critical.
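The weighted vote can be sketched with a few hypothetical threshold rules standing in for trained weak classifiers; the rules and the weights $\alpha_m$ below are invented for illustration:

```python
import numpy as np

def adaboost_predict(x, classifiers, alphas):
    # yhat = sign(sum_m alpha_m * h_m(x)), with each h_m(x) in {-1, +1}
    votes = np.array([h(x) for h in classifiers])
    return int(np.sign(np.dot(alphas, votes)))

# Three hypothetical weak classifiers: simple one-feature threshold rules
h1 = lambda x: 1 if x[0] > 0.5 else -1
h2 = lambda x: 1 if x[1] > 0.2 else -1
h3 = lambda x: -1 if x[0] > 0.9 else 1
alphas = np.array([0.9, 0.4, 0.3])   # weights derived from each learner's accuracy

x = np.array([0.7, 0.1])
# votes are [+1, -1, +1]; weighted sum 0.9 - 0.4 + 0.3 = 0.8 > 0
print(adaboost_predict(x, [h1, h2, h3], alphas))  # 1
```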

3.4. Resampling Techniques

To address the class imbalance in credit risk prediction, this study explores several resampling techniques. These techniques either oversample the minority class or undersample the majority class in an effort to balance the dataset, thereby improving the model’s capacity to identify high-risk cases. The following resampling methods are employed: SMOTE, ENN, and a hybrid approach, SMOTE-ENN.

3.4.1. SMOTE

A popular oversampling technique called SMOTE creates synthetic samples for the minority class in order to address class imbalance. SMOTE successfully increases the representation of the minority class without duplicating samples by interpolating between current minority samples and their closest neighbors to create new instances (Nizam-Ozogur & Orman, 2024; L. Wang, 2022). For a given minority instance x , a synthetic instance is generated as:
x_new = x + δ · (x_neighbor − x)
where x_neighbor is a randomly selected nearest neighbor of x, and δ is a random value in the range [0, 1]. By creating synthetic samples that lie along the line segment between x and x_neighbor, SMOTE reduces the risk of overfitting and helps the model better learn patterns associated with the minority class.
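The interpolation step can be sketched in a few lines of numpy; the instance and neighbor coordinates below are hypothetical, and a full implementation would first identify the k nearest minority-class neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(x, neighbors, rng):
    """Generate one synthetic minority sample via SMOTE's interpolation:
    x_new = x + delta * (x_neighbor - x), with delta drawn from U[0, 1]."""
    x_neighbor = neighbors[rng.integers(len(neighbors))]  # pick a random neighbor
    delta = rng.random()
    return x + delta * (x_neighbor - x)

x = np.array([1.0, 2.0])                          # a minority instance
neighbors = np.array([[1.5, 2.5], [0.5, 1.5]])    # hypothetical nearest neighbors
x_new = smote_sample(x, neighbors, rng)
# x_new lies on the segment between x and the chosen neighbor, so each
# coordinate stays within the bounding box of the original points.
```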

3.4.2. ENN

ENN is an undersampling method that reduces the size of the majority class by removing potentially noisy or misclassified samples. For each majority-class instance, ENN examines its nearest neighbors; if the instance is misclassified by the majority of its neighbors, it is removed from the dataset (Xu et al., 2020). This technique cleans the majority class by eliminating overlapping or borderline cases that could lead to misclassification, sharpening the model's focus on identifying true default cases. ENN can improve the precision of credit risk models by reducing false positives, which is beneficial in high-stakes financial applications.
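The editing rule can be illustrated with a minimal numpy sketch; this is a brute-force 3-nearest-neighbour version on a toy dataset, not imbalanced-learn's implementation.

```python
import numpy as np

def enn_clean(X, y, majority_label, k=3):
    """Edited Nearest Neighbours: drop each majority-class sample whose
    k nearest neighbours mostly carry a different label."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for i in np.where(y == majority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                               # exclude the point itself
        nn = np.argsort(d)[:k]                      # indices of k nearest neighbours
        if np.sum(y[nn] != majority_label) > k // 2:
            keep[i] = False                         # misclassified by its neighbours
    return X[keep], y[keep]

# A majority point stranded inside the minority cluster gets removed.
X = [[0, 0], [0.1, 0], [0, 0.1], [0.05, 0.05], [5, 5], [5.1, 5], [5, 5.1]]
y = [1, 1, 1, 0, 0, 0, 0]   # 0 = majority; the fourth point overlaps the minority
X_c, y_c = enn_clean(X, y, majority_label=0)
# Only the stranded point [0.05, 0.05] is dropped; the compact majority
# cluster at (5, 5) survives intact.
```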

3.4.3. SMOTE-ENN

SMOTE-ENN is a hybrid resampling approach that combines the strengths of SMOTE and ENN. First, SMOTE is applied to generate synthetic samples of the minority class, increasing its representation in the dataset. Next, ENN is applied to the combined dataset to remove noisy or misclassified majority class instances, effectively creating a more balanced and cleaner dataset (Lin et al., 2021). This approach allows the model to benefit from the additional minority samples while minimizing noise from the majority class. SMOTE-ENN has shown effectiveness in credit risk prediction, as it balances recall and precision, aiding in the identification of high-risk cases while maintaining overall model robustness.

3.5. SHapley Additive exPlanations (SHAP)

SHAP is a game-theoretic approach to model interpretability that assigns importance values to features based on their contribution to individual predictions (Lundberg & Lee, 2017). SHAP values are derived from Shapley values in cooperative game theory, where each feature’s impact is computed by averaging its marginal contribution across all possible coalitions of features. In this study, SHAP is applied post-training to compute global feature importance on the validation sets, visualized in summary plots that rank features by their average absolute SHAP value and show directional impacts (positive or negative) on predictions.
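The coalition averaging can be made concrete with a toy exact computation. The sketch below enumerates every coalition for a hypothetical three-feature linear model (the study itself relies on the shap library); for a linear model with an independent baseline, the Shapley value of feature i reduces to w_i · (x_i − baseline_i).

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(f, x, background):
    """Exact Shapley values by enumerating all feature coalitions.
    Features outside a coalition S are replaced by their background mean."""
    n = len(x)
    base = background.mean(axis=0)
    def v(S):
        z = base.copy()
        z[list(S)] = x[list(S)]   # coalition features take their observed values
        return f(z)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)             # hypothetical linear model
x = np.array([1.0, 3.0, -2.0])
background = np.zeros((1, 3))          # all-zero baseline
phi = exact_shapley(f, x, background)
# For this linear model, phi_i = w_i * x_i, i.e. [2.0, -3.0, -1.0]
```

This brute-force enumeration is exponential in the number of features; libraries such as shap use model-specific approximations to scale to real datasets.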

3.6. Experimental Setup

This study uses K-fold cross-validation to evaluate the models, where the dataset is split into K subsets. Every subset is used once as the validation set, while training is done using the remaining K − 1 subsets, iterating through each subset to obtain an average performance metric. In this work, K was set to 10, which provides a reasonable trade-off between bias and variance, particularly given the relatively small size of the German dataset. To maintain class distribution across folds, stratified sampling was employed, and a fixed random seed was set to ensure reproducibility. Importantly, all resampling techniques (SMOTE, ENN, and SMOTE-ENN) were applied strictly within the training folds to avoid information leakage from the validation data. In addition, nested cross-validation was used for hyperparameter tuning, where an inner 5-fold stratified CV was applied within each outer training fold. This design ensures that hyperparameters are selected independently of the outer validation data, yielding a more reliable estimate of generalization performance. The outer loop evaluates overall model performance, while the inner loop optimizes hyperparameters for each outer training set. Furthermore, the following Algorithm 1 describes the proposed methodology used in this study.
The proposed approach comprises systematic steps to ensure robust model performance and interpretability of credit risk prediction. K-fold cross-validation is employed to split the dataset into training and validation sets, ensuring a reliable and unbiased evaluation of model performance. Within each fold, resampling is conducted only on the training data, so the validation set remains untouched and representative of the original distribution. The model is trained iteratively within each fold, and performance metrics are calculated on the validation set. Furthermore, SHAP analysis is carried out after training to interpret model predictions, offering insight into the significance of features and how they affect model outputs, following the interpretability framework introduced in (Lundberg & Lee, 2017). The combination of these steps ensures a comprehensive evaluation framework, balancing predictive accuracy with interpretability for practical credit risk applications.
Algorithm 1 Proposed Approach for Credit Risk Prediction
1: Input: Dataset D, Resampling technique R, Model M, Number of outer folds K
2: Output: Average performance metrics (i.e., accuracy, precision, recall, specificity, F1 score) and feature importance insights
3: Step 1: Initialize 10-fold stratified cross-validation with fixed random seed
4: for k = 1 to K do
5:     Split D into training set D_train and validation set D_val for fold k
6:     Step 2: Within D_train, perform inner 5-fold stratified cross-validation for hyperparameter tuning
7:         Apply resampling technique R on each inner training split only
8:         Select optimal hyperparameters based on inner validation performance
9:     Step 3: Retrain model M with selected hyperparameters on full D_train
10:    Step 4: Evaluate M on D_val to obtain performance metrics
11: end for
12: Step 5: Compute average performance metrics across all K folds
13: Step 6: Perform SHAP analysis to interpret the model:
14:     Use SHAP to compute feature importance on the validation or test set
15:     Generate SHAP summary plots to analyze feature impacts
16: Return: Average performance metrics and SHAP insights
Hyperparameters were optimized using random search within the inner CV loop. For illustration, in one outer fold of the German dataset, the optimal hyperparameters for MLP included hidden layers [128, 64, 32], ReLU activation, Adam optimizer with lr = 0.001, and batch size 32, selected based on the highest F1-score (0.912) on the inner validation sets. The full set of hyperparameters used across models is summarized in Table 4.
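A minimal sketch of the outer/inner loop is given below, assuming scikit-learn. For brevity, random minority duplication stands in for SMOTE, the model and grid are illustrative rather than the tuned configurations in Table 4, and resampling is applied once per outer training fold rather than inside each inner split as in Algorithm 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def oversample(X, y, rng):
    """Naive stand-in for SMOTE: duplicate minority samples until balanced."""
    counts = np.bincount(y)
    idx = np.where(y == np.argmin(counts))[0]
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # outer CV
scores = []
for train_idx, val_idx in outer.split(X, y):
    # Resample the training fold ONLY, so the validation fold stays untouched.
    X_tr, y_tr = oversample(X[train_idx], y[train_idx], rng)
    inner = GridSearchCV(                        # inner CV selects hyperparameters
        RandomForestClassifier(n_estimators=50, random_state=0),
        param_grid={"max_depth": [3, None]},
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        scoring="f1",
    )
    inner.fit(X_tr, y_tr)
    scores.append(f1_score(y[val_idx], inner.predict(X[val_idx])))
print(f"mean F1 across outer folds: {np.mean(scores):.3f}")
```

Because the inner search never sees the outer validation fold, the averaged outer scores are an honest estimate of generalization for the tuned model.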

3.7. Evaluation Metrics

In order to assess the model’s performance in credit risk prediction, especially with imbalanced data, a comprehensive set of metrics is essential. This study utilizes accuracy, recall, precision, specificity, and F1 score, as each metric provides unique insights into model effectiveness. Accuracy offers a general measure of performance, but it can be misleading for imbalanced datasets, as it may reflect high performance merely by correctly predicting the majority class. Therefore, additional metrics such as precision, recall, specificity, and F1 score are incorporated to assess the model’s ability to handle both classes accurately, particularly focusing on correctly identifying high-risk (default) cases.
Precision and recall are valuable in understanding the model's performance in identifying defaults. Precision reflects the proportion of predicted default cases that are actual defaults, reducing the occurrence of false positives (Obaido et al., 2022). Recall, in turn, measures the model's effectiveness in capturing actual defaults. Specificity complements recall by indicating the model's accuracy on non-default cases, while the F1 score, defined as the harmonic mean of precision and recall, balances the two and provides a single measure of performance on the minority class. The following equations define each of the metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 Score = (2 · Precision · Recall) / (Precision + Recall)
where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
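For concreteness, the five metrics can be computed directly from confusion-matrix counts; the counts in this sketch are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Hypothetical counts: 70 defaults caught, 20 missed, 10 false alarms.
acc, prec, rec, spec, f1 = classification_metrics(tp=70, tn=100, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(spec, 3), round(f1, 3))
# prints: 0.85 0.875 0.778 0.909 0.824
```

Note how accuracy (0.85) looks strong while recall (0.778) reveals that a fifth of actual defaults are still missed, which is why the study reports all five metrics.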

4. Results and Discussion

The performance of several machine learning and deep learning models applied to the German and Taiwan credit datasets is thoroughly examined in this section. All experiments were conducted using Python 3.10. The machine learning models and pre-processing methods were implemented with scikit-learn, while the MLP was developed using TensorFlow/Keras. Data manipulation and analysis were performed using NumPy and pandas. The evaluation involves multiple experimental setups, incorporating different resampling techniques, including SMOTE, ENN, and SMOTE-ENN, to address the inherent class imbalance in the datasets. The experimental setup included a diverse range of models: Logistic Regression, SVM, Decision Trees, Random Forest, XGBoost, AdaBoost, MLP, CNN, LSTM, and GRU. The hyperparameters for each model are summarized in Table 4. These hyperparameters were optimized using a random search strategy within nested cross-validation.
Representative ranges included: max depth ∈ {3, 5, 7, None} for decision trees, number of estimators ∈ {100, 200, 500} for random forest and boosting models, learning rate ∈ [0.01, 0.3] for XGBoost, hidden layer sizes ∈ {[32], [64, 32]} for MLP, and recurrent units ∈ {32, 50, 100} for LSTM/GRU.
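A random search over ranges of this kind can be sketched with scikit-learn's RandomizedSearchCV; the dataset and estimator below are illustrative stand-ins, not the study's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Sample 5 of the 12 possible combinations from the representative ranges.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [100, 200, 500],
        "max_depth": [3, 5, 7, None],
    },
    n_iter=5,
    scoring="f1",                       # select on F1, as in the study
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Passing lists as `param_distributions` makes RandomizedSearchCV sample combinations uniformly without replacement; continuous ranges such as the XGBoost learning rate would instead use a distribution object (e.g. `scipy.stats.loguniform(0.01, 0.3)`).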
Model evaluation was performed using 10-fold stratified cross-validation to maintain balanced class distribution in each fold. No separate hold-out test set was used; all performance metrics were derived from the cross-validation results. Ten-fold cross-validation is generally more reliable than a single train–test split because it reduces variance and provides a more stable estimate of generalization. This approach is especially suitable for small datasets such as the German Credit dataset (1000 records), where a hold-out split could lead to unstable or misleading performance estimates.

4.1. Performance of the Models on the German Dataset

The performance of several machine learning and deep learning models applied to the German dataset using different resampling approaches is compiled in Table 5, including No Resampling, SMOTE, ENN, and SMOTE-ENN. Among the models evaluated, SMOTE-ENN combined with MLP achieved the highest overall performance, with an accuracy of 0.954, recall of 0.929, specificity of 0.968, precision of 0.945, and F1 Score of 0.937. This indicates that the MLP model, when combined with SMOTE-ENN, is particularly effective at balancing sensitivity and specificity while maintaining high predictive accuracy.
The CNN model under SMOTE-ENN also demonstrated strong performance, achieving an accuracy of 0.934, recall of 0.893, specificity of 0.958, precision of 0.926, and F1 Score of 0.909. Compared to MLP, CNN slightly lagged in recall and precision but still achieved robust results. Similarly, GRU and LSTM models trained with SMOTE-ENN achieved impressive results, with GRU achieving an F1 Score of 0.879 and LSTM slightly outperforming GRU with an F1 Score of 0.903. These results underscore the effectiveness of SMOTE-ENN in improving model performance, particularly for deep learning models.
For models without resampling, CNN emerged as the top performer, achieving an F1 Score of 0.850, followed by MLP with an F1 Score of 0.842. This suggests that CNN and MLP possess inherent robustness when applied to imbalanced datasets, even without resampling techniques. However, the recall values for these models were generally lower, indicating challenges in identifying minority class instances without resampling.
The SHAP summary plot in Figure 1 provides insights into feature importance for the German dataset. The plot highlights that “CheckingAccount” and “Duration” are the most influential features in predicting credit outcomes, with their SHAP values indicating strong contributions to the model’s decisions. Features such as “SavingAccounts” and “Age” also played significant roles, albeit to a lesser extent. The SHAP analysis underscores the importance of financial stability indicators like “CheckingAccount” and “SavingAccounts” in credit risk modeling, aligning with domain knowledge.

4.2. Performance of the Models on the Taiwan Dataset

Table 6 presents the performance metrics for models applied to the Taiwan dataset under the same resampling techniques. The XGBoost model without resampling achieved the highest accuracy among the No Resampling configurations, with an accuracy of 0.817, recall of 0.376, specificity of 0.940, precision of 0.637, and F1 Score of 0.473. However, the recall value remained relatively low, indicating limitations in detecting minority class instances.
SMOTE-ENN combined with random forest yielded the best-performing configuration, achieving an accuracy of 0.821, recall of 0.745, specificity of 0.842, precision of 0.512, and F1 Score of 0.610. This configuration demonstrated improved recall compared to other models, indicating better sensitivity to minority class predictions. Among the deep learning models, SMOTE-ENN combined with LSTM achieved an F1 Score of 0.487, illustrating that resampling also benefits these architectures on this dataset.
The SHAP summary plot in Figure 2 illustrates the feature importance for the Taiwan dataset using the random forest predictions. “PAY_0” and “LIMIT_BAL” were identified as the most critical features influencing model predictions. These features reflect past payment behavior and credit limits, which are intuitive and highly relevant indicators for predicting credit default. Other features, such as “BILL_AMT1” and “PAY_AMT1,” also contributed significantly, emphasizing the role of financial transaction history in credit risk prediction.

5. Discussion and Conclusions

The results demonstrate that deep learning models, particularly MLP and CNN, achieve superior performance on both datasets when paired with the hybrid SMOTE-ENN resampling technique. This superiority stems from their ability to model complex non-linear interactions in credit features, which is significantly enhanced by the cleaner class boundaries produced by SMOTE-ENN compared with SMOTE alone or no resampling.
The superior performance observed for the MLP combined with the SMOTE-ENN pipeline can be attributed to two complementary effects. First, the MLP is capable of modeling complex, non-linear relationships commonly present in tabular credit-scoring datasets. Second, the SMOTE-ENN procedure addresses class imbalance by generating synthetic minority-class examples while removing noisy majority-class samples; the ENN step in particular helps eliminate ambiguous majority instances and retain representative minority prototypes. This process results in sharper and more discriminative decision boundaries, ultimately improving classification accuracy and robustness.
The substantial decline in SVM performance following SMOTE-ENN on the Taiwan dataset can be explained by the SVM’s sensitivity to class overlap and noise near the decision margin. While ENN removes some ambiguous samples, the SMOTE step may still generate synthetic minority instances close to the true boundary—an issue exacerbated by the noisier, more overlapping structure of the Taiwan dataset. These synthetic samples distort the margin optimization process essential to SVMs, resulting in a notable reduction in generalization performance.
The SHAP analysis further validated these models by identifying critical features like PAY_0 and LIMIT_BAL for the Taiwan dataset and CheckingAccount and Duration for the German dataset, indicating their relevance in credit scoring. These findings align closely with financial intuition and offer actionable insights for lenders, emphasizing the importance of current account status, repayment behavior, loan duration, and available credit limit in default prediction.
Although deep learning models exhibited greater robustness and higher F1-scores, their increased computational requirements may favor simpler models (e.g., Random Forest or XGBoost) in real-time scoring environments where inference speed is prioritized over marginal gains in predictive performance.
In summary, the combination of SMOTE-ENN with MLP (German dataset) or Random Forest (Taiwan dataset), coupled with SHAP explanations, represents a practical, high-performing, and transparent solution for credit risk assessment. Future research may explore cost-sensitive learning, ensemble-of-ensembles strategies, and the integration of SHAP into live credit-scoring pipelines to further improve fairness, performance, and regulatory compliance.

Author Contributions

Conceptualization, I.M. and T.S.; methodology, I.M.; software, I.M.; validation, T.S.; formal analysis, I.M. and T.S.; investigation, I.M.; resources, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, I.M. and T.S.; visualization, I.M. and T.S.; supervision, T.S.; project administration, T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The credit data sets used in the study are openly available online at https://www.kaggle.com/datasets/uciml/german-credit and https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset (accessed on 3 December 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Alam, T. M., Shaukat, K., Hameed, I. A., Luo, S., Sarwar, M. U., Shabbir, S., Li, J., & Khushi, M. (2020). An investigation of credit card default prediction in the imbalanced datasets. IEEE Access, 8, 201173–201198.
2. Alex, S. A. (2025). Imbalanced data learning using SMOTE and deep learning architecture with optimized features. Neural Computing and Applications, 37(2), 967–984.
3. Altalhan, M., Algarni, A., & Alouane, M. T.-H. (2025). Imbalanced data problem in machine learning: A review. IEEE Access, 13, 13686–13699.
4. Alvi, J., Arif, I., & Nizam, K. (2024). Advancing financial resilience: A systematic review of default prediction models and future directions in credit risk management. Heliyon, 10(21), e39770.
5. Bhatore, S., Mohan, L., & Reddy, Y. R. (2020). Machine learning techniques for credit risk evaluation: A systematic literature review. Journal of Banking and Financial Technology, 4(1), 111–138.
6. Bhattacharya, A., Biswas, S. K., & Mandal, A. (2023). Credit risk evaluation: A comprehensive study. Multimedia Tools and Applications, 82(12), 18217–18267.
7. Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197–227.
8. Boughaci, D., Alkhawaldeh, A. A., Jaber, J. J., & Hamadneh, N. (2021). Classification with segmentation for credit scoring and bankruptcy prediction. Empirical Economics, 61, 1281–1309.
9. Chang, V., Xu, Q. A., Akinloye, S. H., Benson, V., & Hall, K. (2024). Prediction of bank credit worthiness through credit risk analysis: An explainable machine learning study. Annals of Operations Research, 354, 247–271.
10. Dablain, D., Krawczyk, B., & Chawla, N. V. (2022). DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 34(9), 6390–6404.
11. Elreedy, D., Atiya, A. F., & Kamalov, F. (2024). A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Machine Learning, 113(7), 4903–4923.
12. Gutiérrez, P. A., Hervás-Martínez, C., & Martínez-Estudillo, F. J. (2010). Logistic regression by means of evolutionary radial basis function neural networks. IEEE Transactions on Neural Networks, 22(2), 246–263.
13. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
14. Huang, M.-W., Chen, C.-W., Lin, W.-C., Ke, S.-W., & Tsai, C.-F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501.
15. Huang, X., Li, Z., Jin, Y., & Zhang, W. (2022). Fair-AdaBoost: Extending AdaBoost method to achieve fair classification. Expert Systems with Applications, 202, 117240.
16. Jiang, C., Lu, W., Wang, Z., & Ding, Y. (2023). Benchmarking state-of-the-art imbalanced data learning approaches for credit scoring. Expert Systems with Applications, 213, 118878.
17. Krichen, M. (2023). Convolutional neural networks: A survey. Computers, 12(8), 151.
18. Li, J., Yang, R., Cao, X., Zeng, B., Shi, Z., Ren, W., & Cao, X. (2025). Inception MLP: A vision MLP backbone for multi-scale feature extraction. Information Sciences, 701, 121865.
19. Lin, M., Zhu, X., Hua, T., Tang, X., Tu, G., & Chen, X. (2021). Detection of ionospheric scintillation based on XGBoost model improved by SMOTE-ENN technique. Remote Sensing, 13(13), 2577.
20. Liu, B., Zhang, Z., Yan, J., Zhang, N., Zha, H., Li, G., Li, Y., & Yu, Q. (2020). A deep learning approach with feature derivation and selection for overdue repayment forecasting. Applied Sciences, 10(23), 8491.
21. Liu, Y., Baals, L. J., Osterrieder, J., & Hadji-Misheva, B. (2024). Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning. Expert Systems with Applications, 252, 124100.
22. Lundberg, S. M., & Lee, S.-I. (2017, December 4–9). A unified approach to interpreting model predictions. 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA.
23. Meer, M., Khan, M. A., Jabeen, K., Alzahrani, A. I., Alalwan, N., Shabaz, M., & Khan, F. (2024). Deep convolutional neural networks information fusion and improved whale optimization algorithm based smart oral squamous cell carcinoma classification framework using histopathological images. Expert Systems, 42(1), e13536.
24. Meng, B., Sun, J., & Shi, B. (2024). A novel URP-CNN model for bond credit risk evaluation of Chinese listed companies. Expert Systems with Applications, 255, 124861.
25. Mienye, I. D., & Jere, N. (2024). A survey of decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727.
26. Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Scalable XAI: Towards explainable machine learning models in distributed systems. In Pan-African artificial intelligence and smart systems conference (pp. 3–16). Springer.
27. Montevechi, A. A., Miranda, R. C., Medeiros, A. L., & Montevechi, J. A. B. (2024). Advancing credit risk modelling with machine learning: A comprehensive review of the state-of-the-art. Engineering Applications of Artificial Intelligence, 137, 109082.
28. Niu, Z., Zhong, G., Yue, G., Wang, L.-N., Yu, H., Ling, X., & Dong, J. (2023). Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data. Neurocomputing, 517, 1–9.
29. Nizam-Ozogur, H., & Orman, Z. (2024). A heuristic-based hybrid sampling method using a combination of SMOTE and ENN for imbalanced health data. Expert Systems, 41(8), e13596.
30. Noriega, J. P., Rivera, L. A., & Herrera, J. A. (2023). Machine learning for credit risk prediction: A systematic literature review. Data, 8(11), 169.
31. Obaido, G., Ogbuokiri, B., Mienye, I. D., & Kasongo, S. M. (2022). A voting classifier for mortality prediction post-thoracic surgery. In International conference on intelligent systems design and applications (pp. 263–272). Springer.
32. Onasoga, B., & Hwidi, J. (2024). Enhancing credit card default prediction: Prioritizing recall over accuracy. In International conference on innovative computing and communication (pp. 441–459). Springer.
33. Rao, C., Liu, Y., & Goh, M. (2022). Credit risk assessment mechanism of personal auto loan based on PSO-XGBoost model. Complex & Intelligent Systems, 9, 1391–1414.
34. Salehi, A. W., Khan, S., Gupta, G., Alabduallah, B. I., Almjally, A., Alsolai, H., Siddiqui, T., & Mellit, A. (2023). A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability, 15(7), 5930.
35. Shang, H., Shang, L., Wu, J., Xu, Z., Zhou, S., Wang, Z., Wang, H., & Yin, J. (2023). NIR spectroscopy combined with 1D-convolutional neural network for breast cancerization analysis and diagnosis. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 287, 121990.
36. Shi, Y., Qu, Y., Chen, Z., Mi, Y., & Wang, Y. (2024). Improved credit risk prediction based on an integrated graph representation learning approach with graph transformation. European Journal of Operational Research, 315(2), 786–801.
37. Shilpa, N. A., Shaha, P., Hajek, P., & Abedin, M. Z. (2023). Default risk prediction based on support vector machine and logit support vector machine. In Novel financial applications of machine learning and deep learning: Algorithms, product modeling, and applications (pp. 93–106). Springer.
38. Suhadolnik, N., Ueyama, J., & Da Silva, S. (2023). Machine learning for enhanced credit risk assessment: An empirical approach. Journal of Risk and Financial Management, 16(12), 496.
39. Sun, Z., Wang, G., Li, P., Wang, H., Zhang, M., & Liang, X. (2024). An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237, 121549.
40. Thor, M., & Postek, Ł. (2024). Gated recurrent unit network: A promising approach to corporate default prediction. Journal of Forecasting, 43(5), 1131–1152.
41. Vairetti, C., Assadi, J. L., & Maldonado, S. (2024). Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification. Expert Systems with Applications, 246, 123149.
42. Wang, L. (2022). Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Applied Soft Computing, 114, 108153.
43. Wang, Y., & Ni, X. S. (2023). A survey of machine learning methodologies for loan evaluation in peer-to-peer (P2P) lending. In Data analytics for management, banking and finance: Theories and application (pp. 1–49). Springer.
44. Xing, Q., Yu, C., Huang, S., Zheng, Q., Mu, X., & Sun, M. (2024). Enhanced credit score prediction using ensemble deep learning model. arXiv, arXiv:2410.00256.
45. Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. Journal of Biomedical Informatics, 107, 103465.
46. Yin, W., Kirkulak-Uludag, B., Zhu, D., & Zhou, Z. (2023). Stacking ensemble method for personal credit risk assessment in peer-to-peer lending. Applied Soft Computing, 142, 110302.
47. Yu, C., Jin, Y., Xing, Q., Zhang, Y., Guo, S., & Meng, S. (2024). Advanced user credit risk prediction model using LightGBM, XGBoost and TabNet with SMOTEENN. arXiv, arXiv:2408.03497.
48. Zhang, X., Ma, Y., & Wang, M. (2024). An attention-based LogisticCNN-BiLSTM hybrid neural network for credit risk prediction of listed real estate enterprises. Expert Systems, 41(2), e13299.
49. Zhao, Y. (2024). Investigation of the application of machine learning algorithms in credit risk assessment of medium and micro enterprises. IEEE Access, 12, 152945–152958.
50. Zhao, Z., Cui, T., Ding, S., Li, J., & Bellotti, A. G. (2024). Resampling techniques study on class imbalance problem in credit risk prediction. Mathematics, 12(5), 701.
51. Zhu, Y., Hu, Y., Liu, Q., Liu, H., Ma, C., & Yin, J. (2023). A hybrid approach for predicting corporate financial risk: Integrating SMOTE-ENN and NGBoost. IEEE Access, 11, 111106–111125.
52. Ziemba, P., Becker, J., Becker, A., & Radomska-Zalas, A. (2023). Framework for multi-criteria assessment of classification models for the purposes of credit scoring. Journal of Big Data, 10(1), 94.
Figure 1. SHAP summary plot showing feature importance and their impact on MLP model predictions for the German Credit Dataset. Each point represents a single borrower, with color indicating the feature value (red = higher, blue = lower). The x-axis shows the SHAP value (unitless), which reflects the contribution of each feature to the prediction. Reported SHAP values are normalized by the method, allowing direct comparison of relative feature importance.
Figure 2. SHAP summary plot showing feature importance and their impact on model predictions for the Taiwan Credit Dataset. Colour scale indicates feature values (red = higher, blue = lower). The x-axis represents the SHAP value (unitless), quantifying each feature’s contribution to the prediction. SHAP values are normalized by the algorithm, enabling relative importance comparison across features.
Table 1. Attributes of the German Credit Dataset.
S/N | Variable | Description | Category
1 | StatusOfCheckingAccount | Status of existing checking account | Categorical
2 | DurationInMonths | Duration of credit in months | Continuous
3 | CreditHistory | Credit history record | Categorical
4 | LoanPurpose | Purpose of the loan | Categorical
5 | CreditAmount | Amount of credit requested | Continuous
6 | SavingsAccountBonds | Status of savings accounts/bonds | Categorical
7 | EmploymentDuration | Length of current employment | Categorical
8 | InstallmentRatePercentage | Installment rate as % of disposable income | Continuous
9 | PersonalStatusSex | Personal status combined with gender | Categorical
10 | OtherDebtorsGuarantors | Presence of other debtors or guarantors | Categorical
11 | ResidenceDuration | Duration at current residence | Continuous
12 | PropertyType | Type of property ownership | Categorical
13 | AgeInYears | Age of applicant in years | Continuous
14 | OtherInstallmentPlans | Other installment plans (bank, stores) | Categorical
15 | HousingType | Housing situation (rent, own, free) | Categorical
16 | ExistingCreditsCount | Number of existing credits at this bank | Continuous
17 | JobType | Type of job (skilled, unskilled, management) | Categorical
18 | Dependents | Number of dependents | Continuous
19 | Telephone | Telephone availability (yes/no) | Categorical
20 | ForeignWorker | Is the applicant a foreign worker | Categorical
Table 2. Attributes of the Taiwan Credit Database.
| S/N | Variable | Description | Category |
|---|---|---|---|
| 1 | LIMIT_BAL | Amount of given credit in NT dollars | Continuous |
| 2 | SEX | Gender (1 = male; 2 = female) | Categorical |
| 3 | EDUCATION | Education level (1 = graduate; 2 = university; 3 = high school; 4 = others) | Categorical |
| 4 | MARRIAGE | Marital status (1 = married; 2 = single; 3 = others) | Categorical |
| 5 | AGE | Age of client in years | Continuous |
| 6 | PAY_0 | Repayment status in September 2005 | Categorical |
| 7 | PAY_2 | Repayment status in August 2005 | Categorical |
| 8 | PAY_3 | Repayment status in July 2005 | Categorical |
| 9 | PAY_4 | Repayment status in June 2005 | Categorical |
| 10 | PAY_5 | Repayment status in May 2005 | Categorical |
| 11 | PAY_6 | Repayment status in April 2005 | Categorical |
| 12 | BILL_AMT1 | Amount of bill statement in September 2005 | Continuous |
| 13 | BILL_AMT2 | Amount of bill statement in August 2005 | Continuous |
| 14 | BILL_AMT3 | Amount of bill statement in July 2005 | Continuous |
| 15 | BILL_AMT4 | Amount of bill statement in June 2005 | Continuous |
| 16 | BILL_AMT5 | Amount of bill statement in May 2005 | Continuous |
| 17 | BILL_AMT6 | Amount of bill statement in April 2005 | Continuous |
| 18 | PAY_AMT1 | Amount paid in September 2005 | Continuous |
| 19 | PAY_AMT2 | Amount paid in August 2005 | Continuous |
| 20 | PAY_AMT3 | Amount paid in July 2005 | Continuous |
| 21 | PAY_AMT4 | Amount paid in June 2005 | Continuous |
| 22 | PAY_AMT5 | Amount paid in May 2005 | Continuous |
| 23 | PAY_AMT6 | Amount paid in April 2005 | Continuous |
Table 3. Summary of Credit Risk Datasets.
| Dataset | Instances | Attributes | Minority Class (%) |
|---|---|---|---|
| German Credit | 1000 | 20 | 30% (Bad Credit) |
| Taiwan Credit | 30,000 | 23 | 22.1% (Default) |
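The minority-class percentages in Table 3 are what the resampling techniques target. At its core, SMOTE creates a synthetic minority sample by interpolating between a minority point and one of its minority-class neighbours; ENN then removes samples whose label disagrees with the majority of their neighbours. A minimal sketch of the SMOTE interpolation step on toy data (not the imbalanced-learn implementation used in practice):

```python
import random

def smote_sample(x, neighbour, rng):
    """Create one synthetic minority point on the segment between two
    minority-class samples, at a random fraction of the gap between them."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(42)
minority = [[1.0, 2.0], [2.0, 3.0]]  # two minority-class points (toy data)
synthetic = smote_sample(minority[0], minority[1], rng)

# The synthetic point lies on the line segment between the two originals,
# so each coordinate is bounded by the corresponding original coordinates.
assert all(min(a, b) <= s <= max(a, b)
           for s, a, b in zip(synthetic, minority[0], minority[1]))
```

The hybrid SMOTE-ENN first oversamples with this interpolation and then applies ENN cleaning, which removes both noisy originals and noisy synthetic points near the class boundary.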
Table 4. Model Hyperparameters Used in the Study.
| Model | Parameter | Value |
|---|---|---|
| Logistic Regression | Solver | liblinear |
| SVM | Kernel | RBF |
| Decision Tree | Max Depth | None |
| Random Forest | Number of Estimators | 100 |
| XGBoost | Learning Rate | 0.1 |
| | Max Depth | 6 |
| | Number of Estimators | 100 |
| AdaBoost | Number of Estimators | 50 |
| MLP | Hidden Layers | [128, 64, 32] |
| | Activation Function | ReLU |
| | Output Activation | Softmax |
| | Optimizer | Adam (lr = 0.001) |
| | Batch Size | 32 |
| | Epochs | 80 (with early stopping) |
| CNN | Number of Filters | 64, 128 |
| | Kernel Size | 3 |
| | Activation Function | ReLU |
| | Output Activation | Softmax |
| | Optimizer | Adam (lr = 0.0005) |
| | Batch Size | 64 |
| | Epochs | 40 (with early stopping) |
| LSTM | Number of Units | 100 |
| | Activation Function | Tanh |
| | Output Activation | Sigmoid |
| | Optimizer | Adam (lr = 0.0005) |
| | Batch Size | 64 |
| | Epochs | 60 (with dropout = 0.3) |
| GRU | Number of Units | 80 |
| | Activation Function | Tanh |
| | Output Activation | Sigmoid |
| | Optimizer | RMSprop (lr = 0.001) |
| | Batch Size | 128 |
| | Epochs | 50 (with dropout = 0.2) |
Table 5. Performance on the German credit dataset across all models and resampling techniques.
| Resampling | Model | Accuracy | Recall | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|---|
| No Resampling | Logistic Regression | 0.705 | 0.943 | 0.136 | 0.723 | 0.818 |
| No Resampling | SVM | 0.750 | 1.000 | 0.153 | 0.738 | 0.849 |
| No Resampling | Decision Tree | 0.645 | 0.738 | 0.424 | 0.754 | 0.746 |
| No Resampling | Random Forest | 0.750 | 0.894 | 0.407 | 0.783 | 0.834 |
| No Resampling | XGBoost | 0.750 | 0.894 | 0.407 | 0.783 | 0.834 |
| No Resampling | AdaBoost | 0.760 | 0.915 | 0.390 | 0.782 | 0.843 |
| No Resampling | MLP | 0.770 | 0.872 | 0.525 | 0.815 | 0.842 |
| No Resampling | CNN | 0.775 | 0.908 | 0.458 | 0.800 | 0.850 |
| No Resampling | LSTM | 0.720 | 0.894 | 0.305 | 0.754 | 0.818 |
| No Resampling | GRU | 0.735 | 0.794 | 0.593 | 0.824 | 0.809 |
| SMOTE | Logistic Regression | 0.635 | 0.709 | 0.458 | 0.758 | 0.733 |
| SMOTE | SVM | 0.695 | 0.745 | 0.576 | 0.808 | 0.775 |
| SMOTE | Decision Tree | 0.655 | 0.723 | 0.492 | 0.773 | 0.747 |
| SMOTE | Random Forest | 0.735 | 0.801 | 0.576 | 0.819 | 0.810 |
| SMOTE | XGBoost | 0.760 | 0.851 | 0.542 | 0.816 | 0.833 |
| SMOTE | AdaBoost | 0.705 | 0.773 | 0.542 | 0.801 | 0.787 |
| SMOTE | MLP | 0.771 | 0.732 | 0.817 | 0.820 | 0.773 |
| SMOTE | CNN | 0.714 | 0.624 | 0.817 | 0.795 | 0.699 |
| SMOTE | LSTM | 0.739 | 0.644 | 0.847 | 0.828 | 0.725 |
| SMOTE | GRU | 0.746 | 0.698 | 0.802 | 0.800 | 0.746 |
| ENN | Logistic Regression | 0.575 | 0.553 | 0.627 | 0.780 | 0.647 |
| ENN | SVM | 0.610 | 0.553 | 0.746 | 0.839 | 0.667 |
| ENN | Decision Tree | 0.660 | 0.610 | 0.780 | 0.869 | 0.717 |
| ENN | Random Forest | 0.675 | 0.638 | 0.763 | 0.865 | 0.735 |
| ENN | XGBoost | 0.630 | 0.596 | 0.712 | 0.832 | 0.694 |
| ENN | AdaBoost | 0.670 | 0.624 | 0.780 | 0.871 | 0.727 |
| ENN | MLP | 0.844 | 0.821 | 0.861 | 0.821 | 0.821 |
| ENN | CNN | 0.797 | 0.714 | 0.861 | 0.800 | 0.755 |
| ENN | LSTM | 0.750 | 0.643 | 0.833 | 0.750 | 0.692 |
| ENN | GRU | 0.750 | 0.625 | 0.847 | 0.761 | 0.686 |
| SMOTE-ENN | Logistic Regression | 0.570 | 0.582 | 0.542 | 0.752 | 0.656 |
| SMOTE-ENN | SVM | 0.615 | 0.624 | 0.593 | 0.786 | 0.696 |
| SMOTE-ENN | Decision Tree | 0.630 | 0.667 | 0.542 | 0.777 | 0.718 |
| SMOTE-ENN | Random Forest | 0.680 | 0.738 | 0.542 | 0.794 | 0.765 |
| SMOTE-ENN | XGBoost | 0.670 | 0.702 | 0.593 | 0.805 | 0.750 |
| SMOTE-ENN | AdaBoost | 0.615 | 0.652 | 0.525 | 0.767 | 0.705 |
| SMOTE-ENN | MLP | 0.954 | 0.929 | 0.968 | 0.945 | 0.937 |
| SMOTE-ENN | CNN | 0.934 | 0.893 | 0.958 | 0.926 | 0.909 |
| SMOTE-ENN | LSTM | 0.927 | 0.911 | 0.937 | 0.895 | 0.903 |
| SMOTE-ENN | GRU | 0.914 | 0.839 | 0.958 | 0.922 | 0.879 |
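All five metric columns in Tables 5 and 6 derive from the binary confusion matrix (true positives, false negatives, false positives, true negatives). A small helper makes the definitions explicit; the counts below are illustrative only, not taken from the paper's experiments:

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall, specificity, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # sensitivity / true-positive rate
    specificity = tn / (tn + fp)     # true-negative rate
    precision = tp / (tp + fp)       # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, specificity, precision, f1

# Illustrative counts only (not the paper's confusion matrices).
acc, rec, spec, prec, f1 = classification_metrics(tp=80, fn=20, fp=10, tn=90)
assert round(acc, 3) == 0.850    # (80 + 90) / 200
assert round(rec, 3) == 0.800    # 80 / 100
assert round(spec, 3) == 0.900   # 90 / 100
assert round(prec, 3) == 0.889   # 80 / 90
assert round(f1, 3) == 0.842     # harmonic mean of precision and recall
```

Because F1 is the harmonic mean of precision and recall, it penalizes models that trade one for the other, which is why it is the headline metric on these imbalanced datasets rather than accuracy.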
Table 6. Performance on the Taiwanese credit dataset across all models and resampling techniques.
| Resampling | Model | Accuracy | Recall | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|---|
| No Resampling | Logistic Regression | 0.810 | 0.235 | 0.971 | 0.691 | 0.351 |
| No Resampling | SVM | 0.820 | 0.331 | 0.957 | 0.682 | 0.446 |
| No Resampling | Decision Tree | 0.724 | 0.409 | 0.813 | 0.380 | 0.394 |
| No Resampling | Random Forest | 0.817 | 0.361 | 0.944 | 0.644 | 0.463 |
| No Resampling | XGBoost | 0.817 | 0.376 | 0.940 | 0.637 | 0.473 |
| No Resampling | AdaBoost | 0.816 | 0.308 | 0.958 | 0.674 | 0.423 |
| No Resampling | MLP | 0.810 | 0.332 | 0.943 | 0.622 | 0.433 |
| No Resampling | CNN | 0.796 | 0.179 | 0.969 | 0.618 | 0.278 |
| No Resampling | LSTM | 0.807 | 0.349 | 0.935 | 0.600 | 0.441 |
| No Resampling | GRU | 0.812 | 0.331 | 0.947 | 0.637 | 0.436 |
| SMOTE | Logistic Regression | 0.688 | 0.582 | 0.718 | 0.366 | 0.449 |
| SMOTE | SVM | 0.749 | 0.583 | 0.795 | 0.444 | 0.504 |
| SMOTE | Decision Tree | 0.694 | 0.488 | 0.752 | 0.355 | 0.411 |
| SMOTE | Random Forest | 0.783 | 0.487 | 0.866 | 0.504 | 0.495 |
| SMOTE | XGBoost | 0.763 | 0.464 | 0.846 | 0.458 | 0.461 |
| SMOTE | AdaBoost | 0.733 | 0.586 | 0.773 | 0.420 | 0.490 |
| SMOTE | MLP | 0.726 | 0.538 | 0.779 | 0.405 | 0.462 |
| SMOTE | CNN | 0.738 | 0.483 | 0.809 | 0.415 | 0.446 |
| SMOTE | LSTM | 0.756 | 0.590 | 0.802 | 0.455 | 0.514 |
| SMOTE | GRU | 0.703 | 0.697 | 0.705 | 0.398 | 0.507 |
| ENN | Logistic Regression | 0.803 | 0.447 | 0.902 | 0.562 | 0.498 |
| ENN | SVM | 0.803 | 0.490 | 0.891 | 0.557 | 0.521 |
| ENN | Decision Tree | 0.669 | 0.593 | 0.690 | 0.349 | 0.439 |
| ENN | Random Forest | 0.778 | 0.553 | 0.841 | 0.493 | 0.521 |
| ENN | XGBoost | 0.767 | 0.580 | 0.820 | 0.474 | 0.521 |
| ENN | AdaBoost | 0.786 | 0.549 | 0.852 | 0.510 | 0.529 |
| ENN | MLP | 0.772 | 0.556 | 0.832 | 0.482 | 0.516 |
| ENN | CNN | 0.787 | 0.302 | 0.922 | 0.521 | 0.383 |
| ENN | LSTM | 0.760 | 0.621 | 0.799 | 0.463 | 0.531 |
| ENN | GRU | 0.789 | 0.524 | 0.864 | 0.518 | 0.521 |
| SMOTE-ENN | Logistic Regression | 0.593 | 0.746 | 0.550 | 0.317 | 0.445 |
| SMOTE-ENN | SVM | 0.647 | 0.741 | 0.621 | 0.354 | 0.479 |
| SMOTE-ENN | Decision Tree | 0.660 | 0.601 | 0.677 | 0.343 | 0.436 |
| SMOTE-ENN | Random Forest | 0.821 | 0.745 | 0.842 | 0.512 | 0.610 |
| SMOTE-ENN | XGBoost | 0.708 | 0.632 | 0.729 | 0.395 | 0.486 |
| SMOTE-ENN | AdaBoost | 0.652 | 0.733 | 0.629 | 0.356 | 0.479 |
| SMOTE-ENN | MLP | 0.651 | 0.714 | 0.633 | 0.353 | 0.472 |
| SMOTE-ENN | CNN | 0.606 | 0.701 | 0.580 | 0.318 | 0.438 |
| SMOTE-ENN | LSTM | 0.647 | 0.767 | 0.613 | 0.357 | 0.487 |
| SMOTE-ENN | GRU | 0.679 | 0.720 | 0.667 | 0.377 | 0.495 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mapfumo, I.; Shongwe, T. Performance Evaluation of Machine Learning and Deep Learning Models for Credit Risk Prediction. J. Risk Financial Manag. 2026, 19, 210. https://doi.org/10.3390/jrfm19030210

