1. Introduction
Credit risk assessment is a critical component of the financial services sector, as it allows lenders to estimate the probability that a borrower will default on their loan obligations (
Bhattacharya et al., 2023;
Chang et al., 2024;
Y. Zhao, 2024). Accurate credit risk prediction is essential for financial institutions to mitigate losses, allocate resources more effectively, and set appropriate interest rates. Traditional credit scoring models, such as decision trees and logistic regression, have been employed extensively due to their interpretability and ease of implementation (
Ziemba et al., 2023). However, as the volume and complexity of financial data have increased, there has been a shift towards machine learning (ML) and deep learning (DL) techniques that can better capture nonlinear relationships and leverage complex patterns within data (
Shi et al., 2024;
Zhang et al., 2024).
Despite the advances in ML and DL, credit risk prediction presents several challenges, one of the most prominent being class imbalance (
Bhatore et al., 2020;
Noriega et al., 2023). In real-world credit datasets, such as those from Taiwan and Germany, there is a natural skew in the distribution of classes, with non-default (low-risk) cases significantly outnumbering default (high-risk) cases (
Jiang et al., 2023;
Z. Zhao et al., 2024). This imbalance poses difficulties for ML algorithms, which tend to be biased toward the majority class, frequently resulting in subpar performance in identifying minority class (default) cases. Addressing this imbalance is therefore crucial for developing robust and reliable credit risk models.
Current research in credit risk prediction has explored numerous techniques to handle class imbalance, including resampling methods and algorithmic adjustments. Resampling methods, such as Synthetic Minority Over-sampling Technique (SMOTE), generate synthetic samples of the minority class to balance the data distribution, while undersampling techniques, like Edited Nearest Neighbors (ENN), reduce the size of the majority class by eliminating redundant samples (
Elreedy et al., 2024;
Vairetti et al., 2024). Despite their effectiveness, these methods are not without limitations. Oversampling can lead to overfitting, especially in complex models, as it duplicates minority instances, while undersampling can discard potentially useful data from the majority class, reducing the overall model performance (
Altalhan et al., 2025). A gap remains in the literature regarding the optimal combination of resampling techniques with various ML and DL models for credit risk prediction.
This study aims to address this gap by conducting a comprehensive comparative analysis of ML and DL models with different resampling strategies on imbalanced credit datasets. Specifically, this study employs the Taiwan and German credit datasets, which are commonly used benchmarks in credit risk research and exhibit significant class imbalance. Furthermore, this study seeks to identify model–resampling pairs that offer the best balance of predictive accuracy and robustness for imbalanced credit datasets. Such insights can inform the development of more reliable credit scoring systems, ultimately contributing to better credit risk management and decision-making in the financial sector.
The main contributions of this study are summarized as follows:
A comprehensive analysis of the impact of class imbalance on the effectiveness of various ML and DL models for credit risk prediction.
Evaluation of the effectiveness of resampling techniques, in combination with both ML and DL models, on imbalanced credit datasets.
Comparative analysis across two widely used imbalanced credit datasets, the Taiwan and German datasets, providing insights into model performance in diverse credit risk scenarios.
Identification of optimal model–resampling combinations that balance accuracy and robustness.
The rest of this paper is organized in the following manner:
Section 2 presents the background and related works, highlighting existing ML and DL approaches in credit risk prediction.
Section 3 details the datasets, model architectures, resampling strategies, and experimental setup employed in this study.
Section 4 presents the results of the various models under different resampling scenarios, followed by a discussion of the findings. Finally, the study is concluded in
Section 5.
2. Related Works
2.1. Machine Learning in Credit Risk Prediction
Credit risk prediction has traditionally relied on ML techniques due to their capacity for predictive accuracy and interpretability. Among the foundational models, logistic regression (LR) remains widely used for binary classification tasks, as it provides a straightforward probabilistic framework for assessing creditworthiness (
Alam et al., 2020;
Shilpa et al., 2023;
Suhadolnik et al., 2023). In credit risk applications, logistic regression’s interpretability allows financial analysts to easily understand the impact of individual predictors, such as income level and debt-to-income ratio, on the likelihood of default. As demonstrated by
Montevechi et al. (
2024), logistic regression can effectively model linear relationships but struggles with more complex, nonlinear interactions that are common in financial datasets.
Decision trees have been adopted in credit risk assessment to address the limitations of logistic regression. Decision trees are known for their ability to capture nonlinear interactions between variables, making them suitable for more complex datasets (
Mienye & Jere, 2024). Additionally, decision trees are intuitive, as they resemble human decision-making processes, and their hierarchical structure allows for transparent credit scoring. For instance, a study by
Y. Wang and Ni (
2023) showed that decision trees could accurately classify high-risk and low-risk applicants by effectively capturing interactions among credit-related features. However, decision trees can be prone to overfitting, particularly when dealing with high-dimensional data, which can reduce generalizability on unseen data.
To enhance robustness and predictive accuracy, ensemble methods such as random forests and gradient boosting machines (GBMs) have gained popularity in recent years. Random forests leverage multiple decision trees by averaging their predictions, which mitigates overfitting and offers a more reliable model for credit risk prediction. As demonstrated by Mienye and Jere (
Mienye & Jere, 2024), random forests perform well in identifying complex interactions and capturing a wide range of credit risk factors, yielding improved accuracy over standalone decision trees. On the other hand, gradient boosting machines build sequential trees, each correcting the errors of the previous one. Techniques like XGBoost and LightGBM—both optimized implementations of gradient boosting—have shown high performance in credit risk applications because of their capacity to manage large datasets and deliver superior predictive performance, as shown by
Yu et al. (
2024).
Recently, ensemble stacking has been employed to combine the strengths of multiple ML models in credit risk prediction. Stacking involves training various base models, such as decision trees, logistic regression and SVM, and then using their predictions as inputs for a meta-model that generates the final prediction. This approach has been shown to yield high predictive accuracy and robust performance across various credit datasets. A recent study by
Yin et al. (
2023) highlights the advantages of stacking for credit risk prediction, where the technique achieved superior performance by utilizing the diverse strengths of multiple base models. Despite the computational complexity of stacking, it enables practitioners to maximize the predictive power of simpler models, making it a viable approach for high-stakes financial applications.
2.2. Deep Learning Approaches
In recent years, DL techniques have become more popular in credit risk prediction by virtue of their ability to capture complex, nonlinear relationships in large datasets. Unlike traditional ML models, DL architectures can automatically learn intricate patterns in data, making them suitable for financial applications where complex interactions between variables are prevalent. One commonly used DL model is the multilayer perceptron (MLP), a fully connected neural network that has shown significant promise in predicting credit risk. Studies such as those by
Y. Liu et al. (
2024) have demonstrated that MLPs can outperform traditional ML models by capturing hidden patterns within borrower behavior and financial features.
Beyond the MLP, more sophisticated architectures like long short-term memory (LSTM) networks have been explored in credit risk prediction, particularly when temporal data is available. LSTM networks are a type of recurrent neural network (RNN) that performs well in modeling sequential data by retaining long-term dependencies. This makes them useful in credit risk scenarios where borrowers’ historical payment behavior can inform future default risk. For instance,
B. Liu et al. (
2020) utilized LSTM networks to predict defaults by analyzing past payment sequences, achieving higher predictive accuracy than static ML models. The study demonstrates the advantage of LSTM in leveraging temporal dependencies for credit risk prediction.
Another DL model that has been applied in credit risk applications is the gated recurrent unit (GRU) network, which is similar to the LSTM but has a simpler structure and fewer parameters. GRUs have been found to be computationally efficient while still retaining the capacity to model sequential data, making them an attractive alternative to LSTMs for credit risk prediction. A study by
Thor and Postek (
2024) demonstrated the robustness of the GRU for credit risk prediction, where it outperformed standard machine learning models.
Additionally, convolutional neural networks (CNNs), typically used for image processing, have been adapted for credit risk prediction. By capturing local patterns in feature space, CNNs can identify clusters of behaviors or attributes that may indicate credit risk.
Meng et al. (
2024) applied CNNs to financial data and found that they could efficiently identify risk-related patterns, particularly in large datasets with numerous features, outperforming traditional ML models.
2.3. Data Resampling for Imbalanced Data
Handling class imbalance is crucial in credit risk prediction, as most credit datasets contain a significantly lower proportion of default cases (high-risk) compared to non-default cases (low-risk). This imbalance can result in biased predictions, where models tend to predict the majority class more accurately, resulting in poor performance in identifying minority class instances. To address this, various data resampling techniques have been developed to balance the class distribution, improving model performance in detecting defaults. One widely adopted method is SMOTE, which generates synthetic samples for the minority class by interpolating between existing minority instances. As demonstrated in the work by
Dablain et al. (
2022), SMOTE effectively reduces bias towards the majority class and improves model performance in highly imbalanced datasets.
SMOTE has been extensively used in credit risk prediction, often in conjunction with machine learning and deep learning models. In a recent study,
Onasoga and Hwidi (
2024) applied SMOTE to a credit risk dataset and observed significant improvements in the recall of default predictions, indicating that the model was better able to identify high-risk applicants. However, while SMOTE is effective in enhancing recall, it can also lead to overfitting, particularly when combined with complex models like deep neural networks (
Alex, 2025). This overfitting arises because synthetic samples are generated based on existing minority instances, which may lead to an artificial concentration of minority samples in feature space, reducing the model’s generalizability.
While SMOTE and its variants remain widely used, recent studies have also introduced alternatives to address the risk of overfitting. For instance, DeepSMOTE leverages deep neural networks to generate more realistic minority samples (
Dablain et al., 2022), while cost-sensitive learning adjusts model training to penalize misclassification of the minority class more heavily (
Mienye et al., 2024). These approaches represent complementary strategies to traditional resampling and have shown promise in improving generalizability in highly imbalanced credit datasets.
Another approach to managing class imbalance is undersampling, where the majority class is reduced to balance the dataset. One popular undersampling technique is ENN, which removes majority class samples that are misclassified by their nearest neighbors (
Xu et al., 2020). This approach has been shown to reduce redundancy in the majority class, allowing models to better differentiate between high-risk and low-risk cases. The study by
Nizam-Ozogur and Orman (
2024) highlighted the effectiveness of ENN in enhancing model performance in imbalanced datasets by focusing on cleaner and more relevant samples. In credit risk prediction, ENN has been found to enhance precision and reduce false positives by eliminating noisy samples in the majority class, as demonstrated by
Xing et al. (
2024).
While oversampling and undersampling techniques are effective individually, recent research has explored combining them to maximize their benefits. For example, SMOTE-ENN combines SMOTE’s synthetic oversampling with ENN’s undersampling, creating a more balanced dataset with reduced noise. This hybrid approach has been applied in credit risk prediction, where it has shown improved recall and precision in identifying defaults compared to standalone resampling methods.
Zhu et al. (
2023) demonstrated that SMOTE-ENN effectively enhanced model robustness in an imbalanced credit risk dataset.
As seen from the reviewed papers in this section, credit risk prediction is a crucial area where both traditional ML and DL approaches have been extensively applied. While traditional ML models such as logistic regression and decision trees are valued for their interpretability and simplicity, they often underperform when modeling complex, nonlinear interactions inherent in financial datasets. Advances in ensemble techniques like random forests and gradient boosting machines have addressed these limitations to some extent, yet challenges remain in optimizing performance on highly imbalanced datasets. Conversely, DL models, including LSTMs, GRUs, and CNNs, have demonstrated strong capability in capturing intricate patterns and temporal dependencies in different applications. However, the practical effectiveness of these models is often hindered by class imbalance, as the minority default cases are typically overshadowed by the majority non-default cases. Resampling strategies such as SMOTE and ENN have been employed to mitigate this imbalance, but questions about the optimal combination of model architectures and resampling strategies persist.
Therefore, this study bridges these gaps by conducting a systematic evaluation of ML and DL models for credit risk prediction, focusing on their performance under varying resampling techniques. The contributions of this work are threefold: firstly, it provides a comparative analysis of traditional ML models, advanced ensemble methods, and state-of-the-art DL architectures under real-world conditions of class imbalance; secondly, it integrates and evaluates resampling methods such as SMOTE, ENN, and SMOTE-ENN to identify their impact on model performance; and thirdly, it utilizes SHAP-based interpretability techniques to explain the contribution of key features in model predictions, ensuring transparency in high-stakes financial decision-making. The findings of this study can guide practitioners in selecting effective model–resampling combinations, advancing the field of credit risk prediction in both research and practical applications.
3. Methodology
3.1. Credit Risk Datasets
This study utilizes two publicly accessible credit datasets: the German Credit dataset and the Taiwan Credit dataset. Each dataset provides valuable insights for credit risk prediction and presents unique characteristics that enhance the robustness of the analysis.
3.1.1. German Credit Dataset
The German Credit dataset, shown in
Table 1, contains 1000 instances with 20 attributes encompassing personal and financial information such as age, credit amount, loan duration, and loan purpose (
Boughaci et al., 2021). The target variable classifies applicants as good credit risk (1) or bad credit risk (0). This dataset exhibits class imbalance, with about 70% labeled as good credit and 30% as bad credit. It is widely used as a benchmark in credit risk modeling to assess the performance of different algorithms.
3.1.2. Taiwan Credit Dataset
The Taiwan Credit dataset, also known as the UCI Credit Card Default dataset, includes 30,000 instances with 23 features related to demographic information, bill statements, and payment history. The target variable indicates whether the client will default (1) or not (0) on their next payment. This dataset is significantly imbalanced, with approximately 22.1% defaults and 77.9% non-defaults. Its large size and comprehensive features make it particularly useful for developing and testing complex models such as DL architectures. The attributes in the dataset are described in
Table 2.
As summarized in
Table 3, both the German and Taiwan credit datasets are widely adopted benchmarks in credit risk modeling research. Each dataset reflects specific socio-economic and institutional contexts: the German dataset is relatively small and dated, representing lending behavior in a European setting, while the Taiwan dataset is larger and more recent but restricted to credit card clients from a single Asian economy. As such, conclusions drawn from these datasets may not fully capture the heterogeneity of global credit markets, where borrower demographics, regulatory environments, and financial products differ substantially. Therefore, the reported findings are interpreted as indicative rather than universally representative.
The decision to employ these two datasets was motivated by their prominence as benchmarking tools in the literature and their complementary characteristics. The German dataset provides a compact but highly imbalanced sample that is frequently used to test baseline algorithms, whereas the Taiwan dataset offers scale and complexity suitable for training more expressive deep learning models. Together, they allow for a comparative analysis across dataset sizes, levels of feature richness, and class imbalance ratios.
3.2. Deep Learning Model Architectures
This section describes the model architectures used in this study, focusing on several DL models that are well-suited for credit risk prediction.
3.2.1. Multilayer Perceptron
The MLP architecture is a fully connected neural network with an input layer, multiple hidden layers, and an output layer. In credit risk prediction, MLPs are advantageous for capturing complex, nonlinear relationships among features (
Shang et al., 2023). Given an input vector $\mathbf{x}$, the MLP computes each layer’s output through an affine transformation followed by a nonlinear activation function. The output of the $j$-th neuron in layer $l$ can be represented as:

$$a_j^{(l)} = \phi\left( \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right)$$

where $\phi(\cdot)$ is a nonlinear activation function (e.g., ReLU), $w_{ij}^{(l)}$ represents the weight between the $i$-th neuron in layer $l-1$ and the $j$-th neuron in layer $l$, and $b_j^{(l)}$ is the bias term (
Li et al., 2025). The MLP’s fully connected layers enable it to learn general patterns in credit data, making it effective for tasks like credit scoring when applied to structured data.
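To make the layer equation concrete, the forward pass can be sketched in NumPy. This is an illustrative toy example with random placeholder weights, not the study’s implementation:

```python
import numpy as np

def dense_layer(a_prev, W, b, activation):
    """Compute a_j = phi(sum_i w_ij * a_i + b_j) for a whole layer at once."""
    return activation(a_prev @ W + b)

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # 4 applicant features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

hidden = dense_layer(x, W1, b1, relu)         # hidden representation
p_default = dense_layer(hidden, W2, b2, sigmoid)[0]  # probability-like output
```

The sigmoid output layer maps the hidden representation to a value in (0, 1), mirroring how an MLP would score default probability on structured credit data.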
3.2.2. Convolutional Neural Network
CNNs, originally developed for image processing, have been adapted for credit risk prediction by treating financial data as structured “grids” where patterns are extracted using convolutional layers (
Meer et al., 2024). A CNN applies a series of convolutional filters to capture local patterns in data, which can be useful for identifying clusters of behaviors indicative of credit risk (
Krichen, 2023;
Salehi et al., 2023). The 2-D discrete convolution of an input
with a filter
is written compactly as
where
denotes the output at location
,
is the index set covering the finite support of the kernel
, and ∗ denotes discrete convolution (with standard padding/stride as specified in the implementation). By stacking multiple convolutional and pooling layers, CNNs can automatically learn hierarchical representations, making them suitable for large credit datasets with numerous interrelated features.
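The convolution sum can be verified with a small NumPy sketch. This naive loop implementation is for illustration only; real CNN libraries use optimized kernels:

```python
import numpy as np

def conv2d_valid(X, K):
    """Discrete 2-D convolution (valid mode): flip the kernel, then slide it."""
    Kf = K[::-1, ::-1]                        # kernel flip distinguishes convolution
    kh, kw = Kf.shape                         # from cross-correlation
    H, W = X.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * Kf)
    return out

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])       # a simple diagonal-difference filter
result = conv2d_valid(X, K)
```

Here the filter responds to diagonal differences in the input grid, the same mechanism by which convolutional layers pick out local patterns across neighboring features.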
3.2.3. LSTM Networks
LSTM networks, a type of RNN, are particularly effective for sequential data where long-term dependencies are relevant. In credit risk prediction, LSTMs can capture patterns in borrowers’ payment history over time, helping to assess default risk based on temporal behavior (
Alvi et al., 2024). The LSTM cell updates its hidden state by maintaining a memory cell $c_t$ at each time step $t$, controlled by three gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. These gates are defined as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state, and $\sigma(\cdot)$ represents the sigmoid activation function. These equations follow the original formulation of LSTM introduced by
Hochreiter and Schmidhuber (
1997). This structure allows LSTMs to capture dependencies across different time steps, which is beneficial for time-series data in credit risk analysis.
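A single LSTM step following these gate equations can be sketched in NumPy. The weight matrices here are random placeholders for illustration; trained implementations use learned parameters and optimized kernels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of the LSTM gate equations."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    h_t = o_t * np.tanh(c_t)                                   # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {name: rng.normal(scale=0.1, size=(n_hid, n_in if name[0] == "W" else n_hid))
     for name in ("Wi", "Wf", "Wo", "Wc", "Ui", "Uf", "Uo", "Uc")}
p.update({name: np.zeros(n_hid) for name in ("bi", "bf", "bo", "bc")})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(3):                            # feed a 3-step payment sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, p)
```

The memory cell `c` carries information across steps, which is what lets the model relate earlier payment behavior to later default risk.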
While the German dataset lacks explicit temporal attributes, we include LSTM for benchmarking, treating static features as a single-time-step sequence. This allows evaluation of its generalization on non-sequential data, revealing limitations (e.g., potential overfitting to assumed dependencies) that inform model selection for similar tabular datasets.
3.2.4. GRU Networks
The GRU is a condensed form of the LSTM that retains the capacity to express sequential relationships while reducing computing complexity by combining the input and forget gates into a single update gate (
Niu et al., 2023). In GRUs, the hidden state $h_t$ at time $t$ is updated based on the reset gate $r_t$ and the update gate $z_t$, defined as:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\tilde{h}_t$ is the candidate activation (
Niu et al., 2023). GRUs have been shown to perform comparably to LSTMs in different applications with sequential data while being less computationally demanding, as highlighted in various recent studies.
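The GRU update can be sketched analogously in NumPy (random placeholder weights, for illustration only). Note that it needs only two gates and no separate memory cell, which is the source of its lower computational cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One time step of the GRU update equations."""
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # blended hidden state

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
p = {name: rng.normal(scale=0.1, size=(n_hid, n_in if name[0] == "W" else n_hid))
     for name in ("Wr", "Wz", "Wh", "Ur", "Uz", "Uh")}
p.update({name: np.zeros(n_hid) for name in ("br", "bz", "bh")})

h = np.zeros(n_hid)
for t in range(3):                            # a short payment-status sequence
    h = gru_step(rng.normal(size=n_in), h, p)
```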
Although both the German and Taiwan datasets are primarily tabular, their feature composition provides a rationale for exploring sequential architectures such as LSTMs and GRUs. In particular, the Taiwan dataset includes six consecutive months of repayment status (PAY_0 to PAY_6), bill amounts (BILL_AMT1–6), and payment amounts (PAY_AMT1–6), which naturally form temporal sequences reflecting borrowers’ financial behavior over time. Modeling these attributes as ordered sequences allows recurrent models to capture dynamic payment patterns that may not be fully exploited by feed-forward networks. For the German dataset, where explicitly temporal attributes are absent, the inclusion of LSTM and GRU serves a benchmarking role, enabling comparison of their generalization ability on static features against architectures like MLP and CNN.
3.3. Traditional Machine Learning Models
This section describes the traditional machine learning models used in this study. Each model type, such as logistic regression, decision tree, support vector machine, random forest, AdaBoost and XGBoost, offers unique strengths and is widely applied in the financial domain.
3.3.1. Logistic Regression
Logistic regression is a widely used linear model for binary classification tasks, including credit risk prediction. It uses the sigmoid function to model the likelihood that a given input belongs to a specific class, with outputs constrained between 0 and 1. Given an input vector $\mathbf{x}$, the probability of the positive class is computed as:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, and $\sigma(\cdot)$ is the sigmoid activation (
Gutiérrez et al., 2010). Logistic regression is valued for its interpretability, as it allows practitioners to understand the impact of each feature on default probability. However, it may struggle with capturing complex, nonlinear relationships in credit risk data.
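As a quick illustration of the interpretability point, scikit-learn’s `LogisticRegression` exposes one fitted weight per feature. The data here is a synthetic stand-in for a credit dataset, not the study’s data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 70% "good" vs 30% "bad" credit,
# mirroring the class ratio of the German dataset
X, y = make_classification(n_samples=500, n_features=6, weights=[0.7, 0.3],
                           random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # sigma(w.x + b) per applicant
coefs = clf.coef_[0]                          # one interpretable weight per feature
```

Each entry of `coefs` shifts the log-odds of default, which is what lets analysts attribute risk to individual predictors such as income or debt-to-income ratio.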
3.3.2. Support Vector Machine
SVMs are robust classifiers that seek to identify the optimal hyperplane to maximize the margin between data points of various classes. For linearly separable data, support vector machines maximize the distance between the nearest data points (support vectors) of each class and the hyperplane (
M.-W. Huang et al., 2017). The decision boundary is defined as:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term. For cases where the data is not linearly separable, support vector machines use kernel functions (e.g., radial basis function) to project data into a higher-dimensional space where a linear separator may exist.
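As an illustrative sketch (synthetic data, not the study’s setup), scikit-learn’s `SVC` with an RBF kernel handles the nonlinearly separable case and exposes the support vectors that define the margin:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, weights=[0.7, 0.3],
                           random_state=3)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)      # RBF kernel for nonlinear boundaries
n_sv = svm.support_vectors_.shape[0]          # points that define the margin
```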
3.3.3. Decision Tree
Decision trees are non-parametric models that create a tree-like structure of decisions by recursively partitioning data according to feature values. At each node, the model chooses a feature and a threshold that best splits the data to maximize homogeneity within each partition (
Sun et al., 2024). The Gini impurity or information gain criteria are typically used to select the best splits. The prediction for an instance $\mathbf{x}$ is made by traversing the tree from the root to a leaf node, where the class label is determined. Although decision trees are intuitive and easy to interpret, they have a tendency to overfit, especially on small datasets.
3.3.4. Random Forest
Random forest is an ensemble method that combines multiple decision trees to improve classification accuracy and reduce overfitting (
Biau & Scornet, 2016). In a random forest, each tree is trained on a different bootstrap sample of the data, and at each split, a random subset of features is considered. The final prediction is determined by aggregating the predictions of all individual trees, usually by majority voting for classification tasks:

$$\hat{y} = \operatorname{mode}\{h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_T(\mathbf{x})\}$$

where $h_i(\mathbf{x})$ is the prediction of the $i$-th tree in the forest. Random forests are widely used in credit risk prediction due to their robustness, ability to capture complex patterns, and reduced overfitting compared to single decision trees (
Sun et al., 2024).
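The majority-vote idea can be inspected directly in scikit-learn, whose `RandomForestClassifier` exposes the individual trees via `estimators_` (synthetic data standing in for a credit set; note that scikit-learn itself averages class probabilities rather than counting hard votes, so this is an approximation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=10, weights=[0.78, 0.22],
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Hard votes of the individual trees on the first five applicants
votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote per applicant
```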
3.3.5. XGBoost
XGBoost is a scalable and effective gradient boosting method that makes use of a group of weak learners, usually decision trees. The purpose of each new tree is to reduce the residual errors of the ones which came prior, making XGBoost effective at handling complex datasets (
Rao et al., 2022). Given a model $\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i)$, where $f_k$ are individual decision trees, XGBoost optimizes the following objective function:

$$\mathcal{L} = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\ell$ is a differentiable loss function measuring the difference between predicted and actual values, and $\Omega$ is a regularization term that penalizes the complexity of the trees to prevent overfitting. XGBoost has shown strong performance in credit risk prediction due to its robustness and ability to handle imbalanced data effectively.
3.3.6. Adaptive Boosting
AdaBoost is a boosting technique that builds a strong classifier by combining several weak classifiers. It iteratively trains models on weighted versions of the dataset, focusing more on incorrectly classified instances in each iteration (
X. Huang et al., 2022). The final model prediction is a weighted majority vote of all classifiers, where each classifier’s weight depends on its accuracy. For an instance $\mathbf{x}$, the prediction is given by:

$$H(\mathbf{x}) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(\mathbf{x}) \right)$$

where $h_m(\mathbf{x})$ denotes the prediction of the $m$-th weak classifier for input $\mathbf{x}$, and $\alpha_m$ is the weight assigned to that classifier based on its accuracy (
Mienye & Jere, 2024). AdaBoost is known for improving model performance on difficult-to-classify instances and is often used for credit risk tasks where correctly identifying high-risk cases is critical.
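The weighted vote can be examined in scikit-learn’s `AdaBoostClassifier`, whose `estimator_weights_` attribute holds the per-classifier weights (synthetic data for illustration, not the study’s datasets):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=600, n_features=8, weights=[0.7, 0.3],
                           random_state=1)
ada = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X, y)
alphas = ada.estimator_weights_               # weight alpha_m of each weak classifier
```

More accurate weak learners receive larger entries in `alphas`, so they dominate the final vote on hard-to-classify, high-risk cases.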
3.4. Resampling Techniques
To address the class imbalance in credit risk prediction, this study explores several resampling techniques. These techniques either oversample the minority class or undersample the majority class in an effort to balance the dataset, thereby improving the model’s capacity to identify high-risk cases. The following resampling methods are employed: SMOTE, ENN, and a hybrid approach, SMOTE-ENN.
3.4.1. SMOTE
SMOTE is a popular oversampling technique that creates synthetic samples for the minority class to address class imbalance. Rather than duplicating existing samples, SMOTE increases the representation of the minority class by interpolating between existing minority samples and their nearest neighbors to create new instances (
Nizam-Ozogur & Orman, 2024;
L. Wang, 2022). For a given minority instance $\mathbf{x}_i$, a synthetic instance is generated as:

$$\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda (\mathbf{x}_{nn} - \mathbf{x}_i)$$

where $\mathbf{x}_{nn}$ is a randomly selected nearest neighbor of $\mathbf{x}_i$, and $\lambda$ is a random value in the range [0, 1]. By creating synthetic samples that lie along the line segment between $\mathbf{x}_i$ and $\mathbf{x}_{nn}$, SMOTE reduces the risk of overfitting and helps the model better learn patterns associated with the minority class.
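The interpolation formula can be sketched directly in NumPy. This is a minimal illustration of generating one synthetic sample; in practice a library implementation such as imbalanced-learn’s SMOTE would be used:

```python
import numpy as np

def smote_one(X_min, k, rng):
    """Create one synthetic minority sample: x_new = x_i + lam * (x_nn - x_i)."""
    i = rng.integers(len(X_min))
    x_i = X_min[i]
    dists = np.linalg.norm(X_min - x_i, axis=1)
    nn_idx = np.argsort(dists)[1:k + 1]       # skip index 0: the point itself
    x_nn = X_min[rng.choice(nn_idx)]
    lam = rng.random()                        # lam drawn uniformly from [0, 1)
    return x_i + lam * (x_nn - x_i)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(7)
x_new = smote_one(X_min, k=2, rng=rng)        # lies on a segment between two points
```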
3.4.2. ENN
ENN is an undersampling method that curbs the size of the majority class by removing potentially noisy or misclassified samples. For each majority class instance, ENN examines its nearest neighbors, and if it is misclassified by the majority of its neighbors, the instance is removed from the dataset (
Xu et al., 2020). This technique helps clean the majority class by eliminating overlapping or borderline cases that could lead to misclassification, enhancing the model’s focus on identifying true default cases. ENN can improve the precision of credit risk models by reducing false positives, which is beneficial in high-stakes financial applications.
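A simplified sketch of the ENN rule using scikit-learn’s k-NN follows; library implementations such as imbalanced-learn’s `EditedNearestNeighbours` are more general, so this is illustrative only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_filter(X, y, majority_label, k=3):
    """Drop majority samples whose k nearest neighbours vote against their label."""
    keep = np.ones(len(y), dtype=bool)
    for idx in np.where(y == majority_label)[0]:
        others = np.arange(len(y)) != idx              # exclude the point itself
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[others], y[others])
        if knn.predict(X[idx:idx + 1])[0] != majority_label:
            keep[idx] = False                          # neighbours disagree: remove
    return X[keep], y[keep]

# One majority (0) point sits inside the minority (1) cluster and is removed
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [5.05]])
y = np.array([0, 0, 0, 1, 1, 1, 0])
X_clean, y_clean = enn_filter(X, y, majority_label=0, k=3)
```

Only the borderline majority point inside the minority cluster is discarded, which is how ENN reduces overlap between high-risk and low-risk regions.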
3.4.3. SMOTE-ENN
SMOTE-ENN is a hybrid resampling approach that combines the strengths of SMOTE and ENN. First, SMOTE is applied to generate synthetic samples of the minority class, increasing its representation in the dataset. Next, ENN is applied to the combined dataset to remove noisy or misclassified majority class instances, effectively creating a more balanced and cleaner dataset (
Lin et al., 2021). This approach allows the model to benefit from the additional minority samples while minimizing noise from the majority class. SMOTE-ENN has shown effectiveness in credit risk prediction, as it balances recall and precision, aiding in the identification of high-risk cases while maintaining overall model robustness.
3.5. SHapley Additive exPlanations (SHAP)
SHAP is a game-theoretic approach to model interpretability that assigns importance values to features based on their contribution to individual predictions (
Lundberg & Lee, 2017). SHAP values are derived from Shapley values in cooperative game theory, where each feature’s impact is computed by averaging its marginal contribution across all possible coalitions of features. In this study, SHAP is applied post-training to compute global feature importance on the validation sets, visualized in summary plots that rank features by their average absolute SHAP value and show directional impacts (positive or negative) on predictions.
3.6. Experimental Setup
This study uses K-fold cross-validation to evaluate the models, where the dataset is split into K subsets. Every subset is used once as the validation set, while training is done using the remaining subsets, iterating through each subset to obtain an average performance metric. In this work, K was set to 10, which provides a reasonable trade-off between bias and variance, particularly given the relatively small size of the German dataset. To maintain class distribution across folds, stratified sampling was employed, and a fixed random seed was set to ensure reproducibility. Importantly, all resampling techniques (SMOTE, ENN, and SMOTE-ENN) were applied strictly within the training folds to avoid information leakage from the validation data. In addition, nested cross-validation was used for hyperparameter tuning, where an inner 5-fold stratified CV was applied within each outer training fold. This design ensures that hyperparameters are selected independently of the outer validation data, yielding a more reliable estimate of generalization performance. The outer loop evaluates overall model performance, while the inner loop optimizes hyperparameters for each outer training set. Furthermore, the following Algorithm 1 describes the proposed methodology used in this study.
The proposed approach comprises systematic steps to ensure robust model performance and interpretability of credit risk prediction. K-fold cross-validation is first employed to split the dataset into training and validation sets, ensuring a reliable and unbiased evaluation of model performance. Resampling is then conducted only on the training data of each fold, so that the validation set remains untouched and representative of the original distribution. The model is trained iteratively within each fold, and performance metrics are calculated on the corresponding validation set. Finally, SHAP analysis is carried out after training to interpret model predictions, offering information on the significance of features and how they affect model outputs. SHAP was used to compute feature contributions, following the interpretability framework introduced in (
Lundberg & Lee, 2017). The combination of these steps ensures a comprehensive evaluation framework, balancing predictive accuracy with interpretability for practical credit risk applications.
| Algorithm 1 Proposed Approach for Credit Risk Prediction |
- 1: Input: Dataset D, resampling technique R, model M, number of outer folds K
- 2: Output: Average performance metrics (i.e., accuracy, precision, recall, specificity, F1 score) and feature importance insights
- 3: Step 1: Initialize 10-fold stratified cross-validation with a fixed random seed
- 4: for k = 1 to K do
- 5: Split D into training set D_train(k) and validation set D_val(k) for fold k
- 6: Step 2: Within D_train(k), perform inner 5-fold stratified cross-validation for hyperparameter tuning
- 7: Apply resampling technique R on each inner training split only
- 8: Select optimal hyperparameters based on inner validation performance
- 9: Step 3: Retrain model M with the selected hyperparameters on the full (resampled) D_train(k)
- 10: Step 4: Evaluate M on D_val(k) to obtain performance metrics
- 11: end for
- 12: Step 5: Compute average performance metrics across all K folds
- 13: Step 6: Perform SHAP analysis to interpret the model:
- 14: Use SHAP to compute feature importance on the validation or test set
- 15: Generate SHAP summary plots to analyze feature impacts
- 16: Return: Average performance metrics and SHAP insights
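The leakage-safe fold logic of Algorithm 1 can be sketched in plain NumPy. This is an illustrative simplification, not the study's implementation: `fit`, `score`, and `resample` are hypothetical callbacks standing in for the model, metric, and SMOTE-ENN steps.

```python
import numpy as np

def stratified_folds(y, k, seed=42):
    """Stratified K-fold indices: shuffle within each class, then deal
    samples round-robin so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        for j, i in enumerate(rng.permutation(np.where(y == cls)[0])):
            folds[j % k].append(i)
    return [np.array(sorted(f)) for f in folds]

def nested_cv(X, y, fit, score, resample, grid, k_outer=10, k_inner=5):
    """Nested CV as in Algorithm 1: the resampler and the tuner only ever
    see training data; each outer validation fold stays untouched."""
    outer_scores = []
    for val in stratified_folds(y, k_outer):
        train = np.setdiff1d(np.arange(len(y)), val)

        def inner_score(h):                        # mean score over inner folds
            scores = []
            for iv in stratified_folds(y[train], k_inner, seed=0):
                iv = train[iv]                     # map local -> global indices
                itr = np.setdiff1d(train, iv)
                Xr, yr = resample(X[itr], y[itr])  # resample inner train only
                scores.append(score(fit(Xr, yr, h), X[iv], y[iv]))
            return float(np.mean(scores))

        best = max(grid, key=inner_score)          # Steps 2-3: tune hyperparams
        Xr, yr = resample(X[train], y[train])      # resample full outer train
        outer_scores.append(score(fit(Xr, yr, best), X[val], y[val]))
    return float(np.mean(outer_scores)), outer_scores
```

The key design point is visible in the indexing: `resample` is never called on indices in `val`, so the validation folds keep the original class distribution.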
Hyperparameters were optimized using random search within the inner CV loop. For illustration, in one outer fold of the German dataset, the optimal hyperparameters for MLP included hidden layers [128, 64, 32], ReLU activation, Adam optimizer with lr = 0.001, and batch size 32, selected based on the highest F1-score (0.912) on the inner validation sets. The full set of hyperparameters used across models is summarized in
Table 4.
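The random search itself is straightforward to sketch: sample configurations uniformly from the search space and keep the one with the best inner-CV score. The ranges below are hypothetical placeholders — the actual ranges are those in Table 4 — and `evaluate` stands in for the inner-CV scoring loop.

```python
import random

# Hypothetical search space -- the actual ranges are those in Table 4.
SPACE = {
    "hidden_layers": [[64, 32], [128, 64], [128, 64, 32]],
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "batch_size": [16, 32, 64],
}

def random_search(evaluate, n_iter=20, seed=42):
    """Random search: draw n_iter configurations uniformly from SPACE and
    keep the one with the best score (e.g. mean F1 over inner CV folds)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        s = evaluate(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score
```

With a fixed seed the search is reproducible, which matches the fixed-seed policy used throughout the experimental setup.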
3.7. Evaluation Metrics
In order to assess the model’s performance in credit risk prediction, especially with imbalanced data, a comprehensive set of metrics is essential. This study utilizes accuracy, recall, precision, specificity, and F1 score, as each metric provides unique insights into model effectiveness. Accuracy offers a general measure of performance, but it can be misleading for imbalanced datasets, as it may reflect high performance merely by correctly predicting the majority class. Therefore, additional metrics such as precision, recall, specificity, and F1 score are incorporated to assess the model’s ability to handle both classes accurately, particularly focusing on correctly identifying high-risk (default) cases.
Precision and recall are valuable in understanding the model’s performance in identifying defaults. Precision reflects the proportion of predicted default cases that are actual defaults, reducing the occurrence of false positives (
Obaido et al., 2022). Conversely, recall measures the model’s effectiveness in capturing actual defaults. Specificity complements recall by indicating the model’s accuracy on non-default cases, while the F1 score, defined as the harmonic mean of precision and recall, balances the two and provides a single measure of performance on the minority class. The following equations define each metric:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
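These definitions map directly onto confusion-matrix counts; a minimal helper (illustrative, not part of the study's code) makes the relationships explicit:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five study metrics from confusion-matrix counts.
    tp/fn refer to the default (positive) class, tn/fp to non-defaults."""
    precision = tp / (tp + fp)        # predicted defaults that are real
    recall = tp / (tp + fn)           # real defaults that are caught
    specificity = tn / (tn + fp)      # non-defaults correctly cleared
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

On an imbalanced sample (say 900 non-defaults, 100 defaults) accuracy alone can look strong while recall on defaults stays low, which is exactly why the additional metrics are reported.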
4. Results and Discussion
The performance of several machine learning and deep learning models applied to the German and Taiwan credit datasets is thoroughly examined in this section. All experiments were conducted using Python 3.10. The machine learning models and pre-processing methods were implemented with scikit-learn, while the MLP was developed using TensorFlow/Keras. Data manipulation and analysis were performed using NumPy and pandas. The evaluation involves multiple experimental setups, incorporating different resampling techniques, including SMOTE, ENN, and SMOTE-ENN, to address the inherent class imbalance in the datasets. The experimental setup included a diverse range of models: Logistic Regression, SVM, Decision Trees, Random Forest, XGBoost, AdaBoost, MLP, CNN, LSTM, and GRU. The hyperparameters for each model are summarized in
Table 4. These hyperparameters were optimized using a random search strategy within nested cross-validation.
Representative ranges included: max depth for decision trees, number of estimators for random forest and boosting models, learning rate for XGBoost, hidden layer sizes for MLP, and recurrent units for LSTM/GRU.
Model evaluation was performed using 10-fold stratified cross-validation to maintain balanced class distribution in each fold. No separate hold-out test set was used; all performance metrics were derived from the cross-validation results. Ten-fold cross-validation is generally more reliable than a single train–test split because it reduces variance and provides a more stable estimate of generalization. This approach is especially suitable for small datasets such as the German Credit dataset (1000 records), where a hold-out split could lead to unstable or misleading performance estimates.
4.1. Performance of the Models on the German Dataset
The performance of several machine learning and deep learning models applied to the German dataset using different resampling approaches is compiled in
Table 5, including
No Resampling, SMOTE, ENN, and SMOTE-ENN. Among the models evaluated, SMOTE-ENN combined with MLP achieved the highest overall performance, with an accuracy of 0.954, recall of 0.929, specificity of 0.968, precision of 0.945, and F1 Score of 0.937. This indicates that the MLP model, when combined with SMOTE-ENN, is particularly effective at balancing sensitivity and specificity while maintaining high predictive accuracy.
The CNN model under SMOTE-ENN also demonstrated strong performance, achieving an accuracy of 0.934, recall of 0.893, specificity of 0.958, precision of 0.926, and F1 Score of 0.909. Compared to MLP, CNN slightly lagged in recall and precision but still achieved robust results. Similarly, GRU and LSTM models trained with SMOTE-ENN achieved impressive results, with GRU achieving an F1 Score of 0.879 and LSTM slightly outperforming GRU with an F1 Score of 0.903. These results underscore the effectiveness of SMOTE-ENN in improving model performance, particularly for deep learning models.
For models without resampling, CNN emerged as the top performer, achieving an F1 Score of 0.850, followed by MLP with an F1 Score of 0.842. This suggests that CNN and MLP possess inherent robustness when applied to imbalanced datasets, even without resampling techniques. However, the recall values for these models were generally lower, indicating challenges in identifying minority class instances without resampling.
The SHAP summary plot in
Figure 1 provides insights into feature importance for the German dataset. The plot highlights that “CheckingAccount” and “Duration” are the most influential features in predicting credit outcomes, with their SHAP values indicating strong contributions to the model’s decisions. Features such as “SavingAccounts” and “Age” also played significant roles, albeit to a lesser extent. The SHAP analysis underscores the importance of financial stability indicators like “CheckingAccount” and “SavingAccounts” in credit risk modeling, aligning with domain knowledge.
4.2. Performance of the Models on the Taiwan Dataset
Table 6 presents the performance metrics for models applied to the Taiwan dataset under the same resampling techniques. The XGBoost model without resampling achieved the highest accuracy among the
No Resampling configurations, with an accuracy of 0.817, recall of 0.376, specificity of 0.940, precision of 0.637, and F1 Score of 0.473. However, the recall value remained relatively low, indicating limitations in detecting minority class instances.
SMOTE-ENN combined with random forest was the best-performing configuration, achieving an accuracy of 0.821, recall of 0.745, specificity of 0.842, precision of 0.512, and F1 Score of 0.610. This configuration demonstrated improved recall compared to other models, indicating better sensitivity to minority class predictions. Additionally, SMOTE-ENN combined with LSTM achieved comparable results, with an F1 Score of 0.487, highlighting the effectiveness of deep learning models under resampling techniques.
The SHAP summary plot in
Figure 2 illustrates the feature importance for the Taiwan dataset using the random forest predictions. “PAY_0” and “LIMIT_BAL” were identified as the most critical features influencing model predictions. These features reflect past payment behavior and credit limits, which are intuitive and highly relevant indicators for predicting credit default. Other features, such as “BILL_AMT1” and “PAY_AMT1,” also contributed significantly, emphasizing the role of financial transaction history in credit risk prediction.
5. Discussion and Conclusions
The results demonstrate that deep learning models, particularly MLP and CNN, achieve superior performance on both datasets when paired with the hybrid SMOTE-ENN resampling technique. This superiority stems from their ability to model complex non-linear interactions in credit features, which is significantly enhanced by the cleaner class boundaries produced by SMOTE-ENN compared with SMOTE alone or no resampling.
The superior performance observed for the MLP combined with the SMOTE-ENN pipeline can be attributed to two complementary effects. First, the MLP is capable of modeling complex, non-linear relationships commonly present in tabular credit-scoring datasets. Second, the SMOTE-ENN procedure addresses class imbalance by generating synthetic minority-class examples while removing noisy majority-class samples; the ENN step in particular helps eliminate ambiguous majority instances and retain representative minority prototypes. This process results in sharper and more discriminative decision boundaries, ultimately improving classification accuracy and robustness.
The substantial decline in SVM performance following SMOTE-ENN on the Taiwan dataset can be explained by the SVM’s sensitivity to class overlap and noise near the decision margin. While ENN removes some ambiguous samples, the SMOTE step may still generate synthetic minority instances close to the true boundary—an issue exacerbated by the noisier, more overlapping structure of the Taiwan dataset. These synthetic samples distort the margin optimization process essential to SVMs, resulting in a notable reduction in generalization performance.
The SHAP analysis further validated these models by identifying critical features like PAY_0 and LIMIT_BAL for the Taiwan dataset and CheckingAccount and Duration for the German dataset, indicating their relevance in credit scoring. These findings align closely with financial intuition and offer actionable insights for lenders, emphasizing the importance of current account status, repayment behavior, loan duration, and available credit limit in default prediction.
Although deep learning models exhibited greater robustness and higher F1-scores, their increased computational requirements may favor simpler models (e.g., Random Forest or XGBoost) in real-time scoring environments where inference speed is prioritized over marginal gains in predictive performance.
In summary, the combination of SMOTE-ENN with MLP (German dataset) or Random Forest (Taiwan dataset), coupled with SHAP explanations, represents a practical, high-performing, and transparent solution for credit risk assessment. Future research may explore cost-sensitive learning, ensemble-of-ensembles strategies, and the integration of SHAP into live credit-scoring pipelines to further improve fairness, performance, and regulatory compliance.
Author Contributions
Conceptualization, I.M. and T.S.; methodology, I.M.; software, I.M.; validation, T.S.; formal analysis, I.M. and T.S.; investigation, I.M.; resources, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, I.M. and T.S.; visualization, I.M. and T.S.; supervision, T.S.; project administration, T.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Alam, T. M., Shaukat, K., Hameed, I. A., Luo, S., Sarwar, M. U., Shabbir, S., Li, J., & Khushi, M. (2020). An investigation of credit card default prediction in the imbalanced datasets. IEEE Access, 8, 201173–201198. [Google Scholar] [CrossRef]
- Alex, S. A. (2025). Imbalanced data learning using SMOTE and deep learning architecture with optimized features. Neural Computing and Applications, 37(2), 967–984. [Google Scholar] [CrossRef]
- Altalhan, M., Algarni, A., & Alouane, M. T.-H. (2025). Imbalanced data problem in machine learning: A review. IEEE Access, 13, 13686–13699. [Google Scholar] [CrossRef]
- Alvi, J., Arif, I., & Nizam, K. (2024). Advancing financial resilience: A systematic review of default prediction models and future directions in credit risk management. Heliyon, 10(21), e39770. [Google Scholar] [CrossRef]
- Bhatore, S., Mohan, L., & Reddy, Y. R. (2020). Machine learning techniques for credit risk evaluation: A systematic literature review. Journal of Banking and Financial Technology, 4(1), 111–138. [Google Scholar] [CrossRef]
- Bhattacharya, A., Biswas, S. K., & Mandal, A. (2023). Credit risk evaluation: A comprehensive study. Multimedia Tools and Applications, 82(12), 18217–18267. [Google Scholar] [CrossRef]
- Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197–227. [Google Scholar] [CrossRef]
- Boughaci, D., Alkhawaldeh, A. A., Jaber, J. J., & Hamadneh, N. (2021). Classification with segmentation for credit scoring and bankruptcy prediction. Empirical Economics, 61, 1281–1309. [Google Scholar] [CrossRef]
- Chang, V., Xu, Q. A., Akinloye, S. H., Benson, V., & Hall, K. (2024). Prediction of bank credit worthiness through credit risk analysis: An explainable machine learning study. Annals of Operations Research, 354, 247–271. [Google Scholar] [CrossRef]
- Dablain, D., Krawczyk, B., & Chawla, N. V. (2022). DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 34(9), 6390–6404. [Google Scholar] [CrossRef]
- Elreedy, D., Atiya, A. F., & Kamalov, F. (2024). A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Machine Learning, 113(7), 4903–4923. [Google Scholar] [CrossRef]
- Gutiérrez, P. A., Hervás-Martínez, C., & Martínez-Estudillo, F. J. (2010). Logistic regression by means of evolutionary radial basis function neural networks. IEEE Transactions on Neural Networks, 22(2), 246–263. [Google Scholar] [CrossRef]
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. [Google Scholar] [CrossRef]
- Huang, M.-W., Chen, C.-W., Lin, W.-C., Ke, S.-W., & Tsai, C.-F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501. [Google Scholar] [CrossRef]
- Huang, X., Li, Z., Jin, Y., & Zhang, W. (2022). Fair-AdaBoost: Extending AdaBoost method to achieve fair classification. Expert Systems with Applications, 202, 117240. [Google Scholar] [CrossRef]
- Jiang, C., Lu, W., Wang, Z., & Ding, Y. (2023). Benchmarking state-of-the-art imbalanced data learning approaches for credit scoring. Expert Systems with Applications, 213, 118878. [Google Scholar] [CrossRef]
- Krichen, M. (2023). Convolutional neural networks: A survey. Computers, 12(8), 151. [Google Scholar] [CrossRef]
- Li, J., Yang, R., Cao, X., Zeng, B., Shi, Z., Ren, W., & Cao, X. (2025). Inception MLP: A vision MLP backbone for multi-scale feature extraction. Information Sciences, 701, 121865. [Google Scholar] [CrossRef]
- Lin, M., Zhu, X., Hua, T., Tang, X., Tu, G., & Chen, X. (2021). Detection of ionospheric scintillation based on XGBoost model improved by SMOTE-ENN technique. Remote Sensing, 13(13), 2577. [Google Scholar] [CrossRef]
- Liu, B., Zhang, Z., Yan, J., Zhang, N., Zha, H., Li, G., Li, Y., & Yu, Q. (2020). A deep learning approach with feature derivation and selection for overdue repayment forecasting. Applied Sciences, 10(23), 8491. [Google Scholar] [CrossRef]
- Liu, Y., Baals, L. J., Osterrieder, J., & Hadji-Misheva, B. (2024). Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning. Expert Systems with Applications, 252, 124100. [Google Scholar] [CrossRef]
- Lundberg, S. M., & Lee, S.-I. (2017, December 4–9). A unified approach to interpreting model predictions. 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA. [Google Scholar]
- Meer, M., Khan, M. A., Jabeen, K., Alzahrani, A. I., Alalwan, N., Shabaz, M., & Khan, F. (2024). Deep convolutional neural networks information fusion and improved whale optimization algorithm based smart oral squamous cell carcinoma classification framework using histopathological images. Expert Systems, 42(1), e13536. [Google Scholar] [CrossRef]
- Meng, B., Sun, J., & Shi, B. (2024). A novel URP-CNN model for bond credit risk evaluation of Chinese listed companies. Expert Systems with Applications, 255, 124861. [Google Scholar] [CrossRef]
- Mienye, I. D., & Jere, N. (2024). A survey of decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727. [Google Scholar] [CrossRef]
- Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Scalable XAI: Towards explainable machine learning models in distributed systems. In Pan-African artificial intelligence and smart systems conference (pp. 3–16). Springer. [Google Scholar]
- Montevechi, A. A., Miranda, R. C., Medeiros, A. L., & Montevechi, J. A. B. (2024). Advancing credit risk modelling with machine learning: A comprehensive review of the state-of-the-art. Engineering Applications of Artificial Intelligence, 137, 109082. [Google Scholar] [CrossRef]
- Niu, Z., Zhong, G., Yue, G., Wang, L.-N., Yu, H., Ling, X., & Dong, J. (2023). Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data. Neurocomputing, 517, 1–9. [Google Scholar] [CrossRef]
- Nizam-Ozogur, H., & Orman, Z. (2024). A heuristic-based hybrid sampling method using a combination of SMOTE and ENN for imbalanced health data. Expert Systems, 41(8), e13596. [Google Scholar] [CrossRef]
- Noriega, J. P., Rivera, L. A., & Herrera, J. A. (2023). Machine learning for credit risk prediction: A systematic literature review. Data, 8(11), 169. [Google Scholar] [CrossRef]
- Obaido, G., Ogbuokiri, B., Mienye, I. D., & Kasongo, S. M. (2022). A voting classifier for mortality prediction post-thoracic surgery. In International conference on intelligent systems design and applications (pp. 263–272). Springer. [Google Scholar]
- Onasoga, B., & Hwidi, J. (2024). Enhancing credit card default prediction: Prioritizing recall over accuracy. In International conference on innovative computing and communication (pp. 441–459). Springer. [Google Scholar]
- Rao, C., Liu, Y., & Goh, M. (2022). Credit risk assessment mechanism of personal auto loan based on PSO-XGBoost model. Complex & Intelligent Systems, 9, 1391–1414. [Google Scholar] [CrossRef]
- Salehi, A. W., Khan, S., Gupta, G., Alabduallah, B. I., Almjally, A., Alsolai, H., Siddiqui, T., & Mellit, A. (2023). A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability, 15(7), 5930. [Google Scholar] [CrossRef]
- Shang, H., Shang, L., Wu, J., Xu, Z., Zhou, S., Wang, Z., Wang, H., & Yin, J. (2023). NIR spectroscopy combined with 1D-convolutional neural network for breast cancerization analysis and diagnosis. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 287, 121990. [Google Scholar] [CrossRef]
- Shi, Y., Qu, Y., Chen, Z., Mi, Y., & Wang, Y. (2024). Improved credit risk prediction based on an integrated graph representation learning approach with graph transformation. European Journal of Operational Research, 315(2), 786–801. [Google Scholar] [CrossRef]
- Shilpa, N. A., Shaha, P., Hajek, P., & Abedin, M. Z. (2023). Default risk prediction based on support vector machine and logit support vector machine. In Novel financial applications of machine learning and deep learning: Algorithms, product modeling, and applications (pp. 93–106). Springer. [Google Scholar]
- Suhadolnik, N., Ueyama, J., & Da Silva, S. (2023). Machine learning for enhanced credit risk assessment: An empirical approach. Journal of Risk and Financial Management, 16(12), 496. [Google Scholar] [CrossRef]
- Sun, Z., Wang, G., Li, P., Wang, H., Zhang, M., & Liang, X. (2024). An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237, 121549. [Google Scholar] [CrossRef]
- Thor, M., & Postek, Ł. (2024). Gated recurrent unit network: A promising approach to corporate default prediction. Journal of Forecasting, 43(5), 1131–1152. [Google Scholar] [CrossRef]
- Vairetti, C., Assadi, J. L., & Maldonado, S. (2024). Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification. Expert Systems with Applications, 246, 123149. [Google Scholar] [CrossRef]
- Wang, L. (2022). Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Applied Soft Computing, 114, 108153. [Google Scholar] [CrossRef]
- Wang, Y., & Ni, X. S. (2023). A survey of machine learning methodologies for loan evaluation in peer-to-peer (P2P) lending. In Data analytics for management, banking and finance: Theories and application (pp. 1–49). Springer. [Google Scholar]
- Xing, Q., Yu, C., Huang, S., Zheng, Q., Mu, X., & Sun, M. (2024). Enhanced credit score prediction using ensemble deep learning model. arXiv, arXiv:2410.00256. [Google Scholar] [CrossRef]
- Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. Journal of Biomedical Informatics, 107, 103465. [Google Scholar] [CrossRef]
- Yin, W., Kirkulak-Uludag, B., Zhu, D., & Zhou, Z. (2023). Stacking ensemble method for personal credit risk assessment in peer-to-peer lending. Applied Soft Computing, 142, 110302. [Google Scholar] [CrossRef]
- Yu, C., Jin, Y., Xing, Q., Zhang, Y., Guo, S., & Meng, S. (2024). Advanced user credit risk prediction model using LightGBM, XGBoost and TabNet with SMOTEENN. arXiv, arXiv:2408.03497. [Google Scholar] [CrossRef]
- Zhang, X., Ma, Y., & Wang, M. (2024). An attention-based LogisticCNN-BiLSTM hybrid neural network for credit risk prediction of listed real estate enterprises. Expert Systems, 41(2), e13299. [Google Scholar] [CrossRef]
- Zhao, Y. (2024). Investigation of the application of machine learning algorithms in credit risk assessment of medium and micro enterprises. IEEE Access, 12, 152945–152958. [Google Scholar] [CrossRef]
- Zhao, Z., Cui, T., Ding, S., Li, J., & Bellotti, A. G. (2024). Resampling techniques study on class imbalance problem in credit risk prediction. Mathematics, 12(5), 701. [Google Scholar] [CrossRef]
- Zhu, Y., Hu, Y., Liu, Q., Liu, H., Ma, C., & Yin, J. (2023). A hybrid approach for predicting corporate financial risk: Integrating SMOTE-ENN and NGBoost. IEEE Access, 11, 111106–111125. [Google Scholar] [CrossRef]
- Ziemba, P., Becker, J., Becker, A., & Radomska-Zalas, A. (2023). Framework for multi-criteria assessment of classification models for the purposes of credit scoring. Journal of Big Data, 10(1), 94. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.