Article

Transforming Credit Risk Analysis: A Time-Series-Driven ResE-BiLSTM Framework for Post-Loan Default Detection

1 School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China
2 Department of Mathematical Sciences, University of Nottingham Ningbo China, Ningbo 315100, China
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 5; https://doi.org/10.3390/info17010005
Submission received: 15 October 2025 / Revised: 12 December 2025 / Accepted: 18 December 2025 / Published: 21 December 2025

Abstract

Credit risk refers to the possibility that a borrower fails to meet contractual repayment obligations, posing potential losses to lenders. This study aims to enhance post-loan default prediction in credit risk management by constructing a time-series modeling framework based on repayment behavior data, enabling the capture of repayment risks that emerge after loan issuance. To achieve this objective, a Residual Enhanced Encoder Bidirectional Long Short-Term Memory (ResE-BiLSTM) model is proposed, in which the attention mechanism is responsible for discovering long-range correlations, while the residual connections ensure the preservation of distant information. This design mitigates the tendency of conventional recurrent architectures to overemphasize recent inputs while underrepresenting distant temporal information in long-term dependency modeling. Using the real-world large-scale Freddie Mac Single-Family Loan-Level Dataset, the model is evaluated on 44 independent cohorts and compared with five baseline models, namely Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), across multiple evaluation metrics. The experimental results demonstrate that ResE-BiLSTM achieves superior performance on key indicators such as F1 and AUC, with average values of 0.92 and 0.97, respectively, and maintains robust performance across different feature window lengths and resampling settings. Ablation experiments and SHapley Additive exPlanations (SHAP)-based interpretability analyses further reveal that the model captures non-monotonic temporal importance patterns across key financial features. This study advances time-series–based anomaly detection for credit risk prediction by integrating global and local temporal learning. The findings offer practical value for financial institutions and risk management practitioners, while also providing methodological insights and a transferable modeling paradigm for future research on credit risk assessment.

1. Introduction

The significance of anomaly detection lies in its role in identifying unusual patterns in complex data, thus mitigating potential risks in various fields such as machine failure, financial fraud, web error logs, and medical health diagnosis [1]. In financial fraud detection, fraudulent activities usually fall into four categories [2]: banking, corporate, insurance, and cryptocurrency fraud. Banking fraud includes credit card, loan, and money laundering fraud. Corporate fraud consists of financial statement fraud and securities and commodities fraud, while insurance fraud involves life and auto insurance fraud [3].
This study focuses on anomaly detection in the loan domain, a context closely related to everyday life and directly affecting the financial security of individuals and households. Moreover, given the extreme scarcity of large-scale financial panel data available in the real world, the loan sector provides authentic long-term data suitable for modeling and validation, making it an ideal and practically representative application scenario for this research. The detection of financial loan anomalies consists of two primary phases: pre-loan fraud detection, known as the “application model”, and post-loan default prediction, identified as the “behavioral model” [4]. The purpose of pre-loan fraud detection is to intercept fraudulent activities during the loan application process [5], usually based on upfront audits [6]. These activities can involve falsifying financial or identity information and misrepresenting intentions. In contrast, using data analytics and machine learning, post-loan default prediction assesses risk by analyzing historical financial data of the borrowers [7]. This prediction assists financial institutions in implementing preventative strategies and modifying loan conditions to reduce non-performing loans, thereby safeguarding asset quality and stability. Despite its practical importance, research on post-loan default prediction remains limited compared with pre-loan fraud detection, partly due to the scarcity of publicly available large-scale panel datasets that capture borrowers’ repayment dynamics over time. Financial institutions typically withhold such data for privacy reasons, which restricts academic exploration of temporal credit risk modeling.
Financial loan data, typically documented on a monthly basis as time series, exhibits strong temporal dependencies that must be considered during anomaly detection. This complicates modeling compared to static data [8]. While LSTM-based models are widely used for capturing temporal dependencies [9,10], their unidirectional structure prevents effective modeling of bidirectional information. BiLSTM alleviates this limitation by incorporating forward and backward layers [11,12]. However, both LSTM and BiLSTM remain constrained in their ability to model global temporal patterns, and relatively few studies have leveraged post-loan repayment behavior to design default prediction systems. Furthermore, many deep learning–based credit risk models suffer from limited interpretability, which restricts their adoption in real financial decision-making [13]. Explainable AI (XAI) provides promising techniques to address this issue by enhancing the interpretability of machine learning (ML) models, particularly deep learning–based approaches, and helping users understand the contribution of different features to prediction outcomes [14].
Motivated by these challenges, this study focuses on the following core research question: How can the classification performance of imbalanced financial time series data in post-loan credit risk prediction be effectively improved?
To address this question, this study proposes ResE-BiLSTM, a BiLSTM-based model integrated with a residual-enhanced encoder to strengthen temporal feature extraction and improve out-of-sample (OOS) prediction accuracy. The model is applied to the real-world large-scale Freddie Mac Single-Family Loan-Level Dataset [15], and its performance is compared with five mainstream baseline models under different feature window lengths and resampling strategies. SHAP [16], a widely used post-hoc interpretability method for LSTM-based models [17], is further employed to reveal how key financial features influence prediction outcomes over time.
This research contributes to both theoretical and practical domains. From a theoretical perspective, it enhances the understanding of temporal modeling in credit risk prediction by demonstrating how global and local temporal dependencies jointly shape the evolution of borrower risk. The integration of a residual-enhanced encoder with a bidirectional recurrent structure addresses the limitations of conventional sequential models that overemphasize recent inputs, offering a clearer view of how distant and recent information interact within deep temporal learning. From a practical perspective, the study provides a robust and generalizable approach to post-loan default prediction in real-world financial settings. The model’s stable performance across varying levels of class imbalance and different feature window configurations underscores its adaptability, while the interpretability analysis reveals how repayment behaviors influence risk over time, offering actionable guidance for borrower monitoring, early-warning design, and data-driven risk management.
The structure of this paper is organized as follows. Section 2 provides a review of the relevant literature. Section 3 presents the dataset, preprocessing procedures, model architecture, and evaluation metrics. Section 4 reports the empirical results, including baseline comparisons, statistical significance tests, ablation studies, and interpretability findings. Section 5 concludes the study with a summary of key insights and directions for future research.

2. Review of Literature

2.1. Benchmark Datasets and Loan Default Prediction Model

Most prior studies have validated the performance of default models using publicly accessible benchmark datasets from two primary sources: Freddie Mac [15] and Lending Club [18]. In contrast, some studies used private datasets, making reproduction or replication of their results challenging for research purposes.
Specifically, Zandi et al. [19] introduced dynamic multi-layer graph neural networks (DYMGNN) using the Freddie Mac dataset, achieving a loan default prediction F1 of 0.851. The study contributes by incorporating default correlations among borrowers through graph neural representations, offering a dynamic modeling perspective that has been largely overlooked in previous research. However, its limitation lies in high computational complexity and an extended one-year prediction horizon, which reduce its practical applicability in large-scale financial settings. Wang et al. [20] adopted a survival model combined with neural networks on the same dataset, providing an interpretable model that elucidates the risk of default with factors such as loan maturity, origination year, and environmental influences. Its main contribution lies in integrating survival analysis into neural network architectures, thereby enhancing the interpretability of default timing. However, survival models require the complete life cycle of loans as training data, making them highly dependent on data integrity. Moreover, their modeling objectives and data utilization strategies fundamentally differ from ours. Survival analysis constructs a full credit cycle model spanning from loan origination to closure, which is particularly suitable for long-term strategic evaluation. In contrast, our model learns borrower behavioral patterns from shorter loan sequences without relying on full historical records, offering greater applicability in scenarios where data are limited or newly issued products lack complete lifecycle information. Karthika and Senthilselvi [21] developed an Extreme Gradient Boosting-based Bidirectional Gated Recurrent Unit with a self-attention mechanism (XGB-BiGRU-SAN), achieving more than 98% mean precision and recall on the Freddie Mac and Lending Club datasets. The hybrid design effectively enhances accuracy through attention-guided feature extraction, yet it depends heavily on cross-sectional transformations of temporal data, limiting its ability to model long-term repayment dynamics. Kanimozhi et al. [22] reported 89% accuracy using a logistic regression model, 78% with ridge regression, and 76% with k-nearest neighbors to predict loan prepayment, a bank risk indicator for mortgage-backed securities (MBS), using the Freddie Mac dataset. Although this study provides useful benchmark comparisons among traditional classifiers, it primarily focuses on prepayment risk rather than default behavior, which is crucial for dynamic risk assessment.
However, the Lending Club dataset is originally cross-sectional, and although it can be reformatted into a time series, it does not constitute the panel data required for this study. In contrast, the Freddie Mac dataset closely approximates the panel data collected by financial institutions, most of which are not publicly available due to privacy constraints. Prior research leveraging Freddie Mac data has largely overlooked the monthly repayment records, focusing only on the static information available at the pre-loan application stage, thereby failing to effectively exploit the information inherent in the time series, leaving an important research gap that this study seeks to address.

2.2. Design of BiLSTM and Its Variants in Anomaly Detection

Recent research has used BiLSTM models to detect financial anomalies, typically integrating them with various mechanisms such as attention, convolutional neural networks (CNN), and Transformer networks.
Chen et al. [23] used an attention-based BiLSTM model to analyze data sequences and discover contract flaws, achieving an accuracy of 95.40% and an F1 of 95.38% against baseline models such as LSTM, GRU, and CNN. The study demonstrates that the attention mechanism enhances BiLSTM’s temporal feature extraction capability, but its evaluation scope is narrow: it is confined to detecting smart contract defects, and the model’s generalization ability has not been validated on diverse real-world financial datasets. Narayan and Ganapathisamy [24] introduced a Hybrid Sampling (HS)–Similarity Attention Layer (SAL)–BiLSTM method to improve classification performance in credit card fraud detection by removing redundant samples from the majority class and adding instances to the minority class. This approach effectively combines resampling and sequential modeling to enhance anomaly detection. However, it relies on synthetic oversampling and does not consider the temporal evolution of anomalies, which may weaken its practical effectiveness.
Several studies analyzed the integration of BiLSTM, attention, and CNN for financial anomaly detection. Agarwal et al. [25] introduced a CNN-BiLSTM-Attention model in which the CNN processes the data first, the BiLSTM then provides historical context, and the attention mechanism discerns transaction multicollinearity, achieving 97% recall on the IEEE-CIS Fraud Detection Dataset. Joy and R [26] presented a BiLSTM and CNN model driven by the attention mechanism, improving feature extraction and classification and outperforming CNN and BiLSTM-with-CNN on the TalkingData dataset. Prabhakar et al. [27] developed a structure that uses CNN for feature extraction and BiLSTM for sequence learning, with a word-level focus. This model improves Korean voice phishing detection with 99.32% accuracy and a 99.31% F1, outperforming the CNN, LSTM, and BiLSTM baselines. Collectively, these studies have enriched the BiLSTM research landscape by demonstrating its flexibility for combining with attention or convolutional modules to capture complex temporal or spatial relationships. Nevertheless, most of them optimize architectures for specific datasets or tasks, with limited theoretical explanation of why such hybridization improves temporal representation. Moreover, the lack of evaluation on real-world financial panel data makes it unclear how well these models generalize under dynamic and imbalanced credit environments, leaving a research gap that this study aims to address.
Several studies have proposed the integration of BiLSTM with Transformer networks, where the Transformer, an architecture based on the multi-head self-attention mechanism introduced by Vaswani et al. [28], is capable of capturing long-range contextual information across the entire sequence. Cai et al. [29] developed a hybrid model with BiLSTM and Transformer to improve sentiment classification. Initially, BiLSTM derives contextual features, which are then trained in several independent Transformer modules; the parameters of each Transformer are optimized during training to precisely determine sentiment polarity. Experiments on the SemEval dataset showed that this model outperforms traditional models such as CNN, LSTM, and BiLSTM in sentiment classification. Boussougou and Park [30] applied a similar approach, integrating a Transformer with BiLSTM for portfolio return prediction: the input data are passed through the BiLSTM and then into a three-layer Transformer encoder to produce the predicted outputs, demonstrating the effectiveness of the BiLSTM-Transformer model in this task. These studies demonstrate the effectiveness of Transformer models in other application domains, providing theoretical support for our research.
LSTM, GRU, CNN, and RNN are commonly used standard models in time-series anomaly detection studies [31]. LSTM networks, with their gated design, effectively address the vanishing gradient problem of traditional RNNs, making them useful for capturing long-term dependencies in sequence data. CNNs are adept at detecting local patterns in sequences and are often combined with recurrent models to improve spatio-temporal pattern tasks. The GRU simplifies the gating structure of the LSTM, offering similar performance with faster training, and is popular for anomaly detection and forecasting in time-series data [32].

2.3. XAI in Loan Default Prediction

Mill et al. [33] define XAI as “AI systems that can explain their reasoning to humans, indicate their strengths and weaknesses, and predict their future behavior”. Unlike traditional “black-box” models, XAI offers insight into the internals of complex models, improving credibility and helping to comply with regulatory requirements in sectors such as finance, healthcare, and law enforcement. XAI covers an array of methods designed for different objectives, offering various levels of insight. These methods are generally divided into pre-model, in-model, and post-model techniques [17]. Post-model techniques, such as SHAP, Local Interpretable Model-Agnostic Explanations (LIME) [14], and Partial Dependence Plots (PDP) [14], are frequently utilized to clarify the results of pre-trained models.
SHAP, derived from cooperative game theory [16], evaluates the impact of each input feature on the model output by assigning importance scores, highlighting the most influential features in the predictions. Conversely, LIME makes small perturbations to the data and builds an interpretable surrogate model to approximate the behavior of the black-box model. PDP shows how the values of a single input feature influence predictions on average, explaining its global effect. Each method is suitable for specific domains: Li et al. [17] cataloged anomaly explanation techniques spanning 22 years, advocating selection based on the model type. In particular, LSTM-based models often use SHAP to explain anomalies, although the study by Ji [13] indicated that LIME offers slightly better interpretability than SHAP in credit card fraud detection.
Decision trees, linear regression, and rule-based classifiers are inherently interpretable in-model techniques due to their straightforward structures, offering transparency via human-readable decision rules or coefficients that directly correlate predictions with input features. In loan default prediction, these models can elucidate the impact of borrower behavior or demographic factors on default risk. However, there is a trade-off between interpretability and predictive accuracy, as highly interpretable models often do not perform optimally [34]. Raval et al. [35] demonstrated that pre-model strategies improve transparency in data preprocessing by employing an X-LSTM model with SHAP or LIME to identify crucial training features, with results documented on a blockchain. This method streamlines input data, improves performance and interpretability, and reveals key predictive features.
The existing literature has contributed to improving model transparency and promoting regulatory compliance in finance, illustrating how interpretability enhances trust in machine learning systems. However, most XAI studies remain descriptive and post hoc, rarely linking interpretability outcomes to model architecture or the temporal behavior of features. In the context of time-series credit data, few works explore how feature importance evolves across time or how interpretability insights can guide model refinement. This study advances this line of inquiry by using SHAP not only to interpret model predictions but also to explain how the Residual-Enhanced Encoder alters temporal focus compared to a standard BiLSTM, thereby connecting interpretability with architectural design.
In general, the use of XAI in loan default prediction improves model transparency, which helps build trust with financial institutions, aligns with regulatory standards, and supports model optimization. Choosing suitable XAI tools for the specific application context effectively clarifies model decisions, guaranteeing the wide applicability of these models. For complex deep learning models, SHAP is mainly used to provide intuitive feature contribution values, aiding the understanding of model decisions. Thus, this study uses SHAP as the interpretability method for model explanation.

3. Materials and Methods

3.1. Data Preprocessing

This study uses the Freddie Mac Single-Family Loan-Level Dataset [15], which contains more than 50 million entries from 1999 onward. Due to the reduced significance of older data and the incomplete nature of recent data, the study focuses on monthly repayment data from loans originated between 1 January 2009 and 31 December 2019. Each quarter constitutes a separate dataset, with the first 1,000,000 records selected from each. The original dataset includes a large number of features, some of which have more than 50% missing values. Features with such a high proportion of missing values were excluded during the preprocessing stage, as they lack analytical value and may introduce additional noise. For the retained features, missing values are minimal. If any borrower has missing records in the selected features, the entire time series for that borrower is removed to ensure data completeness and consistency. Table 1 displays the basic statistics for the 44 cohorts, including the number of loans, average and median loan history length, and default rate. Although these datasets are chronologically ordered, they are independent and represent the repayment records for a specific quarter across all subsequent years. The selected features are shown in Table 2.
The feature selection process involves removing features that mainly exhibit missing values and those linked to categorical attributes. The Current Loan Delinquency Status (CLDS) acts as the class label, where a value of 3 or more signifies that the borrower has not repaid the loan for at least 3 months and is treated as a default, aligning with the industry-standard definition in the Basel II guidelines [36]. Discrete features are transformed using one-hot encoding. Two additional features are introduced: the month-over-month differences in Interest Bearing Unpaid Principal Balance (UPB) and in Current Actual UPB. These differences form the new features labeled Interest Bearing UPB Delta and Current Actual UPB Delta, respectively.
Interest Bearing UPB, or Interest Bearing Unpaid Principal Balance, signifies the portion of a modified mortgage’s unpaid principal balance subject to interest. This amount is the basis for interest calculations and represents the borrower’s remaining owed balance. Calculating Interest Bearing UPB Delta is valuable in loan default prediction and financial modeling, as it offers insights into repayment patterns: a negative value suggests principal repayment, indicating normal behavior, whereas a zero value might signal missed payments, indicating risk. An increase in principal may result from loan restructuring, deferred capitalization, or new debt, which warrants further investigation.
The Current Actual UPB, combining both interest-bearing and non-interest-bearing UPB, offers a complete view of the borrower’s debt. This metric is important for risk management and thorough loan evaluation. The feature Current Actual UPB Delta, indicating changes in deferred principal, adds further time-series insight by capturing adjustments such as additions, reductions, or re-amortizations. Together, these features improve the model’s capacity to differentiate typical repayment behavior from the distinct patterns linked to loan modifications.
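As an illustration, both delta features reduce to a grouped month-over-month difference. The following is a minimal pandas sketch; column names such as loan_id and interest_bearing_upb are illustrative placeholders, not the actual Freddie Mac field names.

```python
import pandas as pd

def add_upb_deltas(df: pd.DataFrame) -> pd.DataFrame:
    """Append month-over-month UPB changes per loan."""
    df = df.sort_values(["loan_id", "month"])
    grouped = df.groupby("loan_id")
    # The first month of each loan has no predecessor, so its delta is NaN.
    df["interest_bearing_upb_delta"] = grouped["interest_bearing_upb"].diff()
    df["current_actual_upb_delta"] = grouped["current_actual_upb"].diff()
    return df
```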
The data is organized by Loan Sequence Number (loan ID). Within each group, a sliding-window approach [37] is applied. Each time slice consists of three consecutive components: a feature window of length $L$, which serves as the input to the model; a subsequent 2-month blank gap; and a 3-month observation period used for generating labels. According to the definition of default adopted in this study, a default is identified when CLDS $\geq 3$ occurs within the observation period. Accordingly, a label of $y = 1$ is assigned if such an event occurs during the observation period; otherwise, $y = 0$. To preserve the practical relevance of the task, samples with nonzero CLDS values (e.g., CLDS = 1 or 2) within the input feature window are excluded, ensuring that the model focuses on genuine early-stage prediction rather than detecting already evident delinquency signals. Consequently, defaults (CLDS $\geq 3$) are constrained to occur no earlier than the third month following the input feature window, which is why the 2-month blank gap is placed between the feature window and the observation period. The data is then randomly divided into 70% for training and 30% for testing (out-of-sample test) while preserving the original default ratio. To further prevent data leakage, time slices from the same borrower are strictly prevented from appearing in both the training and testing sets.
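The following is a minimal NumPy sketch of this slicing and labeling scheme for a single borrower, under the conventions above (feature window of length $L$, 2-month gap, 3-month observation period); function and variable names are illustrative.

```python
import numpy as np

def make_time_slices(clds: np.ndarray, features: np.ndarray,
                     L: int, gap: int = 2, obs: int = 3):
    """Slice one borrower's monthly series into (X, y) samples.

    clds: (T,) monthly CLDS values; features: (T, F) feature matrix.
    A slice is kept only if CLDS stays at 0 throughout the feature
    window, and is labeled 1 if CLDS >= 3 occurs in the observation
    period that follows the 2-month blank gap.
    """
    X, y = [], []
    T = len(clds)
    for start in range(T - (L + gap + obs) + 1):
        fw = slice(start, start + L)                        # feature window
        ow = slice(start + L + gap, start + L + gap + obs)  # observation
        if np.any(clds[fw] != 0):   # drop already-delinquent windows
            continue
        X.append(features[fw])
        y.append(int(np.any(clds[ow] >= 3)))
    return np.array(X), np.array(y)
```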
To address class imbalance, which can substantially influence the performance and generalization of ML models, particularly in real-world scenarios such as loan default prediction, six resampling strategies were evaluated: random undersampling (RUS) [38], random oversampling (ROS) [38], the Synthetic Minority Over-sampling Technique (SMOTE) [39], tSMOTE [40], TimeGAN [41,42], and a baseline without resampling. All resampling operations were performed only on the training sets. In the RUS procedure, all default (minority-class) samples were retained, while an equal number of non-default (majority-class) samples were randomly selected in each trial. In contrast, the ROS, SMOTE, tSMOTE, and TimeGAN methods adopted a two-stage procedure: the majority class was first randomly undersampled to a default-to-non-default ratio of 1:2, followed by oversampling of the minority class until a 1:1 ratio was achieved. This design follows the findings of Chawla et al. [39], whose experiments demonstrated that combining undersampling with oversampling yields superior classification performance, whereas direct oversampling would have produced an impractically large dataset. For all resampling methods, the random selection of majority-class samples was independently repeated ten times to reduce the influence of randomness. The resulting ten training sets were used to conduct ten independent experimental trials, and the final evaluation metrics were reported as the averages across these trials.
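A minimal sketch of the two-stage procedure is given below, instantiated with imbalanced-learn’s RandomUnderSampler and SMOTE as a representative oversampler; since SMOTE operates on 2-D inputs, each time slice is flattened before resampling and reshaped afterwards. The arrays X_train and y_train refer to the slicing sketch above.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def two_stage_resample(X: np.ndarray, y: np.ndarray, seed: int):
    """Undersample the majority class to 1:2, then oversample to 1:1."""
    n, L, F = X.shape
    X_flat = X.reshape(n, L * F)          # SMOTE expects 2-D input
    n_min = int((y == 1).sum())
    # Stage 1: keep all defaults, randomly draw twice as many non-defaults.
    rus = RandomUnderSampler(sampling_strategy={0: 2 * n_min, 1: n_min},
                             random_state=seed)
    X_flat, y = rus.fit_resample(X_flat, y)
    # Stage 2: synthesize minority samples until the classes are balanced.
    X_flat, y = SMOTE(random_state=seed).fit_resample(X_flat, y)
    return X_flat.reshape(-1, L, F), y

# Ten independent draws of the majority class, as in the experiments.
resampled_sets = [two_stage_resample(X_train, y_train, seed=s)
                  for s in range(10)]
```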

3.2. Proposed ResE-BiLSTM Model

The design of the ResE-BiLSTM model is grounded in the complementary characteristics of the Residual-enhanced Encoder (ResE) and BiLSTM in temporal representation learning. Conventional recurrent architectures (such as LSTM, BiLSTM, and GRU) can model sequential dependencies through gated mechanisms; however, due to the compression of information into hidden states, they struggle to explicitly preserve multiple long-range dependencies across time steps and features. As a result, the global contextual information that reflects the gradual accumulation of risk may become weakened or lost within the latent representations. To overcome this limitation, the residual-enhanced encoder introduces a global feature extraction mechanism based on multi-head self-attention. This structure allows each time step to directly attend to all others, thereby enabling the model to learn many-to-many temporal interactions and cross-feature dependencies without relying solely on recursive propagation [43]. The residual connections further stabilize gradient flow and preserve the original temporal signals, allowing the model to construct stable and expressive global semantics in deeper layers [44]. However, relying solely on attention mechanisms, while effective for capturing global structures, may be insufficient to characterize the local temporal continuity and direction-sensitive variations inherent in borrower behavior. Therefore, the BiLSTM is employed as a local dependency learner to refine the bidirectional sequential relationships on top of the globally contextualized embeddings, thereby enhancing fine-grained temporal order features. This global–local hybrid structure achieves hierarchical complementarity: the residual-enhanced encoder captures macro-level contextual dependencies, whereas the BiLSTM strengthens micro-level sequential dynamics and directionality.
Figure 1 illustrates the structure of the proposed ResE-BiLSTM model. As shown, the model first applies a multi-head attention mechanism, which focuses on the most relevant features within the time-series data, followed by a Feedforward Neural Network (FNN); together, these form the encoder, enabling the model to learn richer representations. The output of the encoder is then passed into the BiLSTM layer, which captures both forward and backward dependencies in the time sequence. In addition, the ResE-BiLSTM architecture incorporates residual connections, which help mitigate the vanishing gradient problem and enhance the flow of information across layers, improving model stability and convergence. The model handles input data of dimensions $(T, F)$, where $T$ is the sequence length and $F$ the number of features. The pseudocode for this model is presented in Algorithm 1, which outlines the detailed process of feature extraction, temporal dependency modeling, and prediction.
Algorithm 1: Pseudocode of the proposed ResE-BiLSTM
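Algorithm 1 is published as an image in the original article; as an approximate companion, the following is a minimal Keras sketch of the architecture described in this section. The 256-unit ReLU feed-forward layer and the $10^{-6}$ normalization epsilon follow the text; the number of attention heads, the LSTM units, the 64-unit dense layer, and the per-month feature count are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_rese_bilstm(T: int, F: int, num_heads: int = 4,
                      ff_dim: int = 256, lstm_units: int = 64) -> models.Model:
    """Hedged sketch of the ResE-BiLSTM architecture in Section 3.2."""
    inputs = layers.Input(shape=(T, F))

    # Residual-enhanced encoder: multi-head self-attention with a
    # residual connection and layer normalization (epsilon per Eq. (1)).
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=F)(inputs, inputs)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([inputs, attn]))

    # Feed-forward network: 256-unit ReLU layer, then a projection back
    # to F so the output stays compatible with the BiLSTM input.
    ffn = layers.Dense(ff_dim, activation="relu")(x)
    ffn = layers.Dense(F)(ffn)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, ffn]))

    # BiLSTM refines local, direction-sensitive dependencies on top of
    # the globally contextualized embeddings.
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)

    # Flatten, then a two-layer fully connected head with sigmoid output.
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

# Example: a 14-month feature window; F = 17 is an assumed feature count.
model = build_rese_bilstm(T=14, F=17)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```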

3.2.1. Residual-Enhanced Encoder (ResE) Layer

  • Multi-Head Attention
    Multi-head attention [28] is a sophisticated attention mechanism integrating several attention processes in one model. It functions by projecting the input into multiple subspaces via linear transformations with learned weight matrices. Each head processes its own transformed input independently, allowing the model to concentrate on different aspects of the data and grasp richer contextual details. This model utilizes a self-attention mechanism that computes attention using only the input, without external data, efficiently capturing relationships and dependencies within the input sequence. Importantly, the attention output retains the same dimensionality as the input, facilitating integration with subsequent layers.
    The query matrix $Q$, key matrix $K$, and value matrix $V$ are initially derived from linear transformations, with $Q = XW_Q$, $K = XW_K$, and $V = XW_V$, where $X$ represents the input data and $W_Q$, $W_K$, $W_V$ are randomly initialized weight matrices. The algorithm utilizes $h$ attention heads, deriving the attention matrix $A_i$ for head $i$ via the scaled dot product $\frac{Q_i K_i^T}{\sqrt{d_k}}$, where $d_k$ denotes the key vector dimension, and subsequently employs the softmax function to produce a probability distribution. The output $Z_i = A_i V_i$ of each attention head is derived by applying the attention weights $A_i$ to the value matrix $V_i$. Finally, combining the outputs from all attention heads and projecting back into the input space with $W_O$ completes the multi-head attention layer (a NumPy sketch of this computation follows this list).
  • Normalization Layer and Residual Connection Mechanism
    The normalization layer follows the multi-head attention and the feed-forward network to improve model stability and performance. The residual connection feeding into layer normalization balances the contributions of a layer’s input and output [45], preserving essential information from earlier layers and enabling deeper layers to learn more complex features. Moreover, the normalization layer mitigates issues such as vanishing and exploding gradients through output standardization, improving training stability. It also reduces the influence of input scale variations on parameter updates, speeding up convergence and optimizing the efficiency of the training process. The normalization layer operates as follows:
    $$\mathrm{LN}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta \tag{1}$$
    where $\mu$ represents the mean of each feature, $\sigma^2$ indicates its variance, $\epsilon$ is a small constant (set to $1 \times 10^{-6}$) to avoid division by zero, and $\gamma$ and $\beta$ are learnable parameters. As shown in Equation (1), layer normalization stabilizes the output distribution and improves training robustness.
    The residual connection mechanism incorporated within the ResE module plays a pivotal role in facilitating effective deep representation learning. Specifically, by introducing skip connections that directly add the input of a sub-layer (e.g., the multi-head attention or feed-forward layer) to its output prior to normalization, the model preserves the integrity of the original feature representations while enabling the training of deeper networks without degradation. This architectural design mitigates the vanishing gradient problem and ensures more stable and efficient gradient flow during backpropagation. Moreover, the integration of residual connections with layer normalization enhances the model’s capacity to learn complex temporal dependencies by stabilizing the output distributions across layers.
  • Feed-Forward Network
    The feed-forward network processes each time step independently, refining fine-grained features to enrich the feature representation [46]. Using the ReLU activation function, the network applies non-linear transformations to capture more intricate patterns and relationships within the data. The feed-forward network in this model has two layers: the first is fully connected, containing 256 neurons with ReLU activation; the second reshapes the feature dimension to match the original input, maintaining compatibility with the BiLSTM layer. The resulting output has shape (batch size, sequence length, feature dimension). This is followed by layer normalization applied to the combined outputs of the feed-forward network and the attention layer, improving the stability and robustness of the model.
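To make the encoder computations concrete, the following NumPy sketch implements the multi-head self-attention described above together with the layer normalization of Equation (1); the shapes and the single-matrix weight layout are illustrative assumptions.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (T, d_model); W_*: (d_model, d_model); h: number of heads."""
    T, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # Q = XW_Q, K = XW_K, V = XW_V
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)     # this head's subspace
        A_i = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))  # attention weights
        heads.append(A_i @ V[:, s])           # Z_i = A_i V_i
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, project back

def layer_norm(x, gamma, beta, eps=1e-6):
    """Equation (1): normalize each time step across its features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta
```

In the ResE module, each sub-layer is wrapped as layer_norm(x + sublayer(x), gamma, beta); that is, the residual connection adds the sub-layer input to its output before normalization.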

3.2.2. BiLSTM

BiLSTM processes time series data bidirectionally, capturing temporal relationships and contextual information [47]. Equipped with forget, input, and output gates, it selectively retains and updates information to capture dependencies, enhancing its applicability to predict loan default, where temporal patterns are crucial. Moreover, BiLSTM complements the residual-enhanced encoder by learning local time-series patterns, while the residual-enhanced encoder captures global dependencies. This combination promotes robust data representation.
The BiLSTM algorithm manages sequence processing through two states: the cell state ($c$) for long-term memory and the hidden state ($h$) for short-term context and time-step output. It operates bidirectionally over the sequence, forward from $t = 1$ to $t = T$ and backward from $t = T$ to $t = 1$. At each step, the forward and backward hidden states ($h_f$ and $h_b$) are concatenated to form the final output, integrating the dependencies of the past and future portions of the sequence. For initialization, the $c$ and $h$ of both the forward LSTM ($c_f^{(0)}, h_f^{(0)}$) and the backward LSTM ($c_b^{(T+1)}, h_b^{(T+1)}$) start as zero vectors.
  • Forward LSTM Process
    The LSTM executes these operations at every time step:
    (a)
    Forget Gate
    This mechanism determines which part of the previous cell state ($c_f^{(t-1)}$) is preserved in the cell state. The forget gate governing this mechanism is defined in Equation (2):
    $$f_t^{(f)} = \sigma\left(W_f^{(f)}\,[X_t, h_f^{(t-1)}] + b_f^{(f)}\right) \tag{2}$$
    where $\sigma$ is the sigmoid activation function mapping values to [0, 1], $W_f^{(f)}$ represents the weights for the forward forget gate, $X_t$ is the input data, $h_f^{(t-1)}$ denotes the prior hidden state, and $b_f^{(f)}$ is the forget gate bias.
    (b)
    Input Gate
    This operation determines the portion of the current input ($X_t$) to be stored in the cell state. The input gate computation is defined by Equation (3):
    $$i_t^{(f)} = \sigma\left(W_i^{(f)}\,[X_t, h_f^{(t-1)}] + b_i^{(f)}\right) \tag{3}$$
    where $W_i^{(f)}$ denotes the weights for the forward input gate, $h_f^{(t-1)}$ is the hidden state from the preceding time step, and $b_i^{(f)}$ is the bias term for the input gate.
    (c)
    Candidate Cell State
    This operation generates a candidate value ($\tilde{C}_t^{(f)}$) for potential updates to the cell state. As specified in Equation (4), this candidate value is computed by:
    $$\tilde{C}_t^{(f)} = \tanh\left(W_c^{(f)}\,[X_t, h_f^{(t-1)}] + b_c^{(f)}\right) \tag{4}$$
    where $W_c^{(f)}$ represents the weights for the candidate cell state, $h_f^{(t-1)}$ is the previous time step’s hidden state, and $b_c^{(f)}$ is the bias term for the candidate cell state.
    (d)
    Output Gate
    This operation delineates the fraction of the cell state that contributes to the hidden state ($h_f^{(t)}$). The output gate $o_t^{(f)}$ is determined by Equation (5):
    $$o_t^{(f)} = \sigma\left(W_o^{(f)}\,[X_t, h_f^{(t-1)}] + b_o^{(f)}\right) \tag{5}$$
    where $W_o^{(f)}$ denotes the forward output gate weights, $h_f^{(t-1)}$ is the previous hidden state, and $b_o^{(f)}$ stands for the output gate bias.
    (e)
    Updated Cell State
    The current cell state is updated by integrating information from the forget gate, input gate, and candidate cell state, as formalized in Equation (6):
    $$c_f^{(t)} = f_t^{(f)} \odot c_f^{(t-1)} + i_t^{(f)} \odot \tilde{C}_t^{(f)} \tag{6}$$
    where $c_f^{(t-1)}$ denotes the previous cell state, $f_t^{(f)}$ the forget gate values, $i_t^{(f)}$ the input gate values, $\tilde{C}_t^{(f)}$ the candidate cell state, and $\odot$ element-wise multiplication.
    (f)
    Updated Hidden State
    The updated hidden state $h_f^{(t)}$ is computed by applying the output gate to the newly updated cell state, as defined in Equation (7):
    $$h_f^{(t)} = o_t^{(f)} \odot \tanh\left(c_f^{(t)}\right) \tag{7}$$
    Following these six steps, the forward process refreshes the cell state ($c_f^{(t)}$) and the hidden state ($h_f^{(t)}$), producing the time-step output through the output gate (a worked sketch of this forward step appears after this list).
  • Backward LSTM Process
    The backward LSTM functions similarly, but processes in reverse, beginning from t = T to t = 1 .
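As referenced above, the following NumPy sketch implements a single forward-direction step corresponding to Equations (2)–(7); the backward pass applies the same step to the reversed sequence, and the two hidden states are concatenated at each time step. Weight and bias containers are illustrative.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward_step(x_t, h_prev, c_prev, W, b):
    """One forward-direction LSTM step implementing Equations (2)-(7).

    x_t: (F,) input; h_prev, c_prev: (H,) previous states;
    W: dict of (H, F + H) weight matrices; b: dict of (H,) biases.
    """
    z = np.concatenate([x_t, h_prev])          # [X_t, h_f(t-1)]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (2)
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate, Eq. (3)
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate state, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde         # cell update, Eq. (6)
    h_t = o_t * np.tanh(c_t)                   # hidden update, Eq. (7)
    return h_t, c_t
```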

3.2.3. Flatten and Output Layers

The flatten layer transforms the multi-dimensional tensor output from the BiLSTM into a one-dimensional form appropriate for the fully connected layer. Subsequently, a fully connected two-layer network is used for prediction. To limit the output between [0, 1], a sigmoid activation function is used in the output layer.

3.3. Evaluation Metrics

This study uses five metrics to evaluate the ResE-BiLSTM model: accuracy (ACC) [48], precision (PR) [49], recall (RC) [50], F1 [51], and the area under the ROC curve (AUC) [23]. Accuracy denotes the ratio of correctly predicted samples to the total number of samples, as defined in Equation (8). Precision indicates the proportion of true positives (TP) among all predicted positives (see Equation (9)). Recall, defined in Equation (10), represents the fraction of actual positives successfully identified by the model. High recall aids regulatory compliance by helping banks fully assess risks and enforce suitable controls. Recall (RC) and precision (PR) involve a trade-off: increasing recall tends to reduce precision [52]. To balance precision and recall, the F1 reconciles these conflicting metrics through its formulation as the harmonic mean (see Equation (11)). The AUC, ranging from 0 to 1, quantifies the model’s ability to differentiate between classes, with values near 1 indicating better performance.
The evaluation metrics are formally defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
$$\mathrm{PR} = \frac{TP}{TP + FP} \tag{9}$$
$$\mathrm{RC} = \frac{TP}{TP + FN} \tag{10}$$
$$F1 = \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RC}}{\mathrm{PR} + \mathrm{RC}} \tag{11}$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Given that different evaluation metrics might yield varying results, using a multi-metric approach ensures a comprehensive assessment of model performance. This study uses $AvgR$ [53] to evaluate overall model performance across different indicators. Models are first ranked according to their performance in ACC, PR, RC, F1, and AUC across different groups (e.g., quarterly or yearly). These rankings are then averaged to obtain the final $AvgR$ (see Section 4.3), where a lower $AvgR$ indicates better classifier performance. A minimal sketch of this ranking procedure follows.
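The sketch below, simplified to the model-comparison case of Section 4.3, assumes a pandas results table with one row per (cohort, model) pair holding the five trial-averaged metrics; the table layout is an illustrative assumption.

```python
import pandas as pd

METRICS = ["ACC", "PR", "RC", "F1", "AUC"]

def avg_rank(results: pd.DataFrame) -> pd.Series:
    """Return each model's mean AvgR across cohorts; lower is better."""
    # Rank models within each cohort, separately per metric
    # (rank 1 = best, i.e., the highest metric value).
    ranks = results.groupby("cohort")[METRICS].rank(ascending=False)
    avg_r = ranks.mean(axis=1)            # AvgR per (cohort, model) row
    return avg_r.groupby(results["model"]).mean().sort_values()
```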

4. Experiment Results Analysis

Five experiments are conducted to comprehensively evaluate the models from multiple perspectives, including the overall effects of different resampling methods across all models (Section 4.1), the influence of feature window lengths (Section 4.2), the detailed performance of the proposed ResE-BiLSTM model (Section 4.3), the ablation study (Section 4.4), and the interpretability analysis (Section 4.5).

4.1. Resampling Methods Performance Analysis

To illustrate the impact of different resampling techniques on model performance, this subsection fixes the feature window lengths at 12 and 18 months, following the convention widely adopted in previous research on time-series modeling. The average performance of the models is expressed by the ranking metric $AvgR_R$, where smaller rank values correspond to superior performance; it is defined as follows.
Let $\mathcal{D}$ denote the set of data cohorts ($|\mathcal{D}| = 44$), $\mathcal{R}$ ($|\mathcal{R}| = 6$) the set of resampling methods applied to the training set, and $\mathcal{M}$ ($|\mathcal{M}| = 6$) the set of models, comprising the proposed model and five baselines. Let $ACC_{d,r,m}$, $PR_{d,r,m}$, $RC_{d,r,m}$, $F1_{d,r,m}$, and $AUC_{d,r,m}$ represent the mean accuracy, precision, recall, F1, and AUC over 10 independent trials obtained by applying resampling method $r \in \mathcal{R}$ and model $m \in \mathcal{M}$ to data cohort $d \in \mathcal{D}$. The quantities $ACC\_R_{R,M}(d,r,m)$, $PR\_R_{R,M}(d,r,m)$, $RC\_R_{R,M}(d,r,m)$, $F1\_R_{R,M}(d,r,m)$, and $AUC\_R_{R,M}(d,r,m)$ indicate their respective global rankings among all corresponding metrics computed across all resampling methods and models within each cohort. The total average ranking for data cohort $d$ under resampling method $r$ and model $m$, denoted by $AvgR_{R,M}(d,r,m)$, is then given as [53,54]:
$$AvgR_{R,M}(d,r,m) = \frac{ACC\_R_{R,M}(d,r,m) + PR\_R_{R,M}(d,r,m) + RC\_R_{R,M}(d,r,m) + F1\_R_{R,M}(d,r,m) + AUC\_R_{R,M}(d,r,m)}{5}$$
where $d \in \mathcal{D}$, $r \in \mathcal{R}$, and $m \in \mathcal{M}$.
The obtained ranking values are subsequently grouped according to the resampling methods $\mathcal{R}$ = {RUS, SMOTE, TimeGAN, tSMOTE, ROS, No Resampling}. For each resampling method $r \in \mathcal{R}$, its overall performance is quantified by averaging the rankings across all models, thereby facilitating the identification of the most generally effective method for subsequent experiments. The resulting average ranking is denoted as $AvgR_R$ and defined as [54]:
$$AvgR_R(d,r) = \frac{1}{|\mathcal{M}(r)|} \sum_{m \in \mathcal{M}(r)} AvgR_{R,M}(d,r,m)$$
where $\mathcal{M}(r) \subseteq \mathcal{M}$ denotes the subset of models evaluated under resampling method $r$, and $|\mathcal{M}(r)|$ represents the number of models within this subset.
The comparative effects of different resampling techniques on the average predictive rankings of the six models across the 44 data cohorts are summarized in Table 3 and Table 4, corresponding to feature window lengths of 12 and 18. The results indicate that, under the experimental conditions of this study, RUS consistently achieves the lowest $AvgR_R$ values, demonstrating superior overall performance across both window configurations. tSMOTE and TimeGAN exhibit comparable yet slightly inferior outcomes. Therefore, RUS is adopted as the resampling strategy in subsequent experiments to ensure methodological efficiency and robustness in performance evaluation.
In general, employing a feature window length of 18 results in higher evaluation scores across most data cohorts compared with a length of 12. As a representative case, Figure 2 presents radar charts of the average F1 and AUC values obtained from 10 independent trials for the six models using cohort 2019Q4, based on the 18-month feature window setting. The radar charts reveal that regardless of the resampling method applied, our proposed ResE-BiLSTM consistently outperforms the baseline models across the key evaluation metrics.

4.2. Feature Window Lengths Performance Analysis

To examine how the feature window length affects model performance, the feature window lengths are set to 12, 14, 16, 18, 20, 22, and 24 months. Each feature window serves as input to predict default events occurring within a 3-month observation period that follows a 2-month blank gap after the feature window. This analysis identifies the optimal feature window configuration for the benchmark datasets used in this study and further provides a sensitivity analysis of the models under different feature window lengths.
Let $\mathcal{L}$ represent the set of feature window lengths, where $|\mathcal{L}| = 7$. The sets $\mathcal{D}$ and $\mathcal{M}$ follow the same definitions as provided in Section 4.1. For each data cohort $d \in \mathcal{D}$, feature window length $l \in \mathcal{L}$, and model $m \in \mathcal{M}$, let $ACC_{d,l,m}$, $PR_{d,l,m}$, $RC_{d,l,m}$, $F1_{d,l,m}$, and $AUC_{d,l,m}$ denote the mean accuracy, precision, recall, F1, and AUC values obtained from 10 independent trials. The corresponding global rankings, $ACC\_R_{L,M}(d,l,m)$, $PR\_R_{L,M}(d,l,m)$, $RC\_R_{L,M}(d,l,m)$, $F1\_R_{L,M}(d,l,m)$, and $AUC\_R_{L,M}(d,l,m)$, capture their relative positions among all feature window lengths and models within each data cohort. The overall average ranking for $d$, when using feature window length $l$ and model $m$, is expressed as $AvgR_{L,M}(d,l,m)$ and defined as [53,54]:
$$AvgR_{L,M}(d,l,m) = \frac{ACC\_R_{L,M}(d,l,m) + PR\_R_{L,M}(d,l,m) + RC\_R_{L,M}(d,l,m) + F1\_R_{L,M}(d,l,m) + AUC\_R_{L,M}(d,l,m)}{5}$$
where $d \in \mathcal{D}$, $m \in \mathcal{M}$, and $l \in \mathcal{L}$.
Afterward, the obtained ranking values are aggregated with respect to the feature window lengths defined in $\mathcal{L}$. The corresponding within-group averages are computed and denoted as $AvgR_L$ [54]:
$$AvgR_L(d,l) = \frac{1}{|\mathcal{M}(l)|} \sum_{m \in \mathcal{M}(l)} AvgR_{L,M}(d,l,m)$$
where $l \in \mathcal{L}$, $\mathcal{M}(l) \subseteq \mathcal{M}$ represents the subset of models applied with feature window length $l$, and $|\mathcal{M}(l)|$ represents the number of models within this subset.
Table 5 presents the $AvgR_L(d,l)$ values of the 44 data cohorts under different feature window lengths, reflecting the average performance across all models. Lower values indicate better performance. Overall, shorter to medium-length feature windows (12–18 months) generally yield higher predictive accuracy, while longer windows tend to result in performance degradation. Among these, the 14-month feature window performs best across most cohorts, achieving the lowest average $AvgR_L$ values and demonstrating an optimal balance between feature richness and noise suppression. Specifically, in 31 out of 44 data cohorts (approximately 70.5%), the 14-month feature window produces the minimum $AvgR_L(d,l)$ value. Based on this empirical finding, the subsequent experimental analyses adopt the 14-month feature window as the representative configuration for further model comparison and performance evaluation.
To provide a clearer visualization of how different feature window lengths affect model performance, Figure 3 illustrates the variation of average F1 and AUC values over 10 independent trials using the 2019Q3 dataset. Across all tested window lengths, the proposed ResE-BiLSTM consistently achieves superior performance on both metrics, exhibiting smoother trends that indicate higher robustness to temporal span variation. When the feature window extends from 12 to 14 months, all models reach their performance peaks, suggesting that this range captures sufficient behavioral and temporal information for accurate prediction. However, as the window length further increases (16–24 months), performance declines across all models, particularly for RNN and CNN, which are more sensitive to temporal noise. This degradation arises because overly long windows incorporate outdated behavioral patterns, diluting the relevance of recent credit risk signals. In contrast, the ResE-BiLSTM benefits from residual enhancement and bidirectional temporal learning, enabling it to retain long-term dependencies while mitigating the impact of historical noise.

4.3. ResE-BiLSTM Model Performance Analysis

Based on the optimal resampling method (RUS) identified in Section 4.1 and the optimal feature window length verified in Section 4.2, Table A1, Table A2, Table A3, Table A4 and Table A5 in Appendix A display the average performance of the six models on the five metrics, based on 10 independent trials per cohort. The analysis reveals that although individual model performance varied across cohorts, ResE-BiLSTM consistently outperformed the other models on all metrics.
ResE-BiLSTM achieved the highest accuracy in 38 cohorts, or 86.36% of the total, clearly outperforming the other models and highlighting its ability to capture complex features. In contrast, BiLSTM and GRU each performed best in two cohorts, and the remaining models together achieved the highest accuracy in the other two. Furthermore, ResE-BiLSTM led in precision, ranking highest in 26 cohorts (59.09% of the total), compared to LSTM, BiLSTM, GRU, CNN, and RNN, which led in 1, 3, 4, 1, and 9 cohorts, respectively.
ResE-BiLSTM achieved the highest recall in 37 cohorts, highlighting its effectiveness in reducing false negatives. Furthermore, it achieved the highest F1 in 39 out of 44 cohorts (88.64%), showing an excellent precision-recall balance. The results of the AUC demonstrated that ResE-BiLSTM maintained high true positive rates and low false positive rates in 36 cohorts.

4.3.1. AvgR Performance Analysis

The sets $\mathcal{D}$ and $\mathcal{M}$ follow the same definitions as provided in Section 4.1. For each data cohort $d \in \mathcal{D}$ and model $m \in \mathcal{M}$, let $ACC_{d,m}$, $PR_{d,m}$, $RC_{d,m}$, $F1_{d,m}$, and $AUC_{d,m}$ denote the mean accuracy, precision, recall, F1, and AUC values obtained from 10 independent trials. $ACC\_R_M(d,m)$, $PR\_R_M(d,m)$, $RC\_R_M(d,m)$, $F1\_R_M(d,m)$, and $AUC\_R_M(d,m)$ denote the rank of model $m$ among all models in $\mathcal{M}$ with respect to accuracy, precision, recall, F1, and AUC, respectively. Lower ranks indicate better performance. The overall average ranking of model $m$ in cohort $d$ is defined as [53,54]:
$$AvgR_M(d,m) = \frac{ACC\_R_M(d,m) + PR\_R_M(d,m) + RC\_R_M(d,m) + F1\_R_M(d,m) + AUC\_R_M(d,m)}{5}$$
where $d \in \mathcal{D}$ and $m \in \mathcal{M}$.
Table 6 shows the $AvgR_M(d,m)$ for each model across the five evaluation metrics for the 44 data cohorts. The ResE-BiLSTM model significantly outperforms the other models, obtaining the top ranking in 37 of 44 cohorts. In contrast, although the other models perform well on specific cohorts, their overall rankings are considerably lower. Specifically, BiLSTM, GRU, and RNN achieve the best average rank in two cohorts each, CNN in one, and LSTM in none.

4.3.2. Ranking Performance Grouped by Year

Cohorts within the same year, while independently collected, can be effectively grouped by year for model performance evaluation. This approach is valid since all four cohorts from the same year are likely to be affected by similar social and market conditions. Factors such as macroeconomic trends, policy shifts, and industry-specific cycles may similarly influence data across these cohorts. By analyzing data from one year collectively, this study gains a more complete assessment of model performance stability throughout the entire year, rather than examining each quarter separately.
Using the year as a grouping unit helps mitigate the effects of seasonal variations, unexpected events, and short-term economic changes that could impact the independence of cohorts, thus improving the robustness and generalizability of the model analysis. In finance, model stability and adaptability across years are crucial due to the significant fluctuations in financial markets and economic activities. The aggregation of yearly data avoids overemphasis on fluctuations in a single quarter, offering a more suitable evaluation of the performance of the model. This approach supports a more comprehensive performance assessment, reducing the influence of individual quarter volatility.
For each data cohort $d \in \mathcal{D}$ and model $m \in \mathcal{M}$, let $ACC_{d,m}$, $PR_{d,m}$, $RC_{d,m}$, $F1_{d,m}$, and $AUC_{d,m}$ denote the mean accuracy, precision, recall, F1, and AUC values obtained from 10 independent trials. The average ranking method enables a hierarchical evaluation process, in which cohort-level results are first grouped annually and then ranked within each group. Specifically, the 44 quarterly cohorts are derived from 11 consecutive years, with four cohorts per year and six models evaluated under five metrics (accuracy, precision, recall, F1, and AUC). For a given year $Y$ and a given metric, all 24 values (4 cohorts × 6 models) are ranked, such that the rank value represents the relative position of a model within these 24 results for that year and metric, where a lower rank indicates better performance. Subsequently, for each model $m$, the average of its four quarterly rankings within year $Y$ is computed to obtain the annual ranking under each metric, denoted as $ACC\_R_M(Y,m)$, $PR\_R_M(Y,m)$, $RC\_R_M(Y,m)$, $F1\_R_M(Y,m)$, and $AUC\_R_M(Y,m)$, respectively. Finally, the overall average ranking of model $m$ in year $Y$ is determined by averaging its annual rankings across the five metrics, as defined below [54]:
$$AvgR_M(Y,m) = \frac{ACC\_R_M(Y,m) + PR\_R_M(Y,m) + RC\_R_M(Y,m) + F1\_R_M(Y,m) + AUC\_R_M(Y,m)}{5}$$
where $m \in \mathcal{M}$ and $Y \in \mathcal{Y} = \{2009, \ldots, 2019\}$.
Table 7 presents the relative performance ranking of the six models across the years grouped from the 44 cohorts. The results reveal that ResE-BiLSTM outperformed the other models in 10 of 11 years, accounting for 90.91% of the total. In contrast, while models such as BiLSTM displayed strong performance in some individual years, they generally exhibited lower performance than ResE-BiLSTM. This highlights the consistency of the proposed ResE-BiLSTM model in delivering high performance across different annual cohorts.

4.3.3. Wilcoxon Signed-Rank Tests

To further substantiate the empirical finding that ResE-BiLSTM ranks first in most cohorts, non-parametric statistical significance tests are conducted to determine whether its superiority over the two best-performing baseline models (LSTM and BiLSTM) is statistically significant. The Wilcoxon signed-rank test [55] is employed for this comparison using the two key evaluation metrics, F1 and AUC. One-sided tests are performed, with the null hypothesis stating that the proposed ResE-BiLSTM exhibits equal or worse performance compared to the baseline models, while the alternative hypothesis posits that the ResE-BiLSTM model performs better. A small p-value indicates a statistically significant difference between the two models, with the significance level set to $\alpha = 0.005$. Each cohort was tested independently, and the proportion of significant results was reported as evidence of consistency, rather than as multiple evaluations of a single overarching hypothesis; therefore, no cross-cohort corrections for the family-wise error rate or the false discovery rate were applied.
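A minimal sketch of this one-sided test, using SciPy on the paired per-trial scores of a single cohort (e.g., the 10 F1 values of ResE-BiLSTM versus a baseline), is given below.

```python
from scipy.stats import wilcoxon

def is_significantly_better(proposed_scores, baseline_scores, alpha=0.005):
    """One-sided Wilcoxon signed-rank test on paired per-trial scores.

    H0: the proposed model performs no better than the baseline.
    H1: the proposed model performs better (alternative='greater').
    """
    _, p_value = wilcoxon(proposed_scores, baseline_scores,
                          alternative="greater")
    return p_value, p_value < alpha
```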
Table 8 reports the results of the Wilcoxon signed-rank tests comparing ResE-BiLSTM with the top two baseline models, LSTM and BiLSTM, using F1 and AUC as evaluation metrics. The results demonstrate that ResE-BiLSTM achieves statistically significant improvements ($p < 0.005$) in 32 out of 44 cohorts (72.7%) for F1 and 23 out of 44 cohorts (52.3%) for AUC. When comparing against LSTM alone, significant superiority is observed in 65.9% of cohorts for F1 and 43.2% for AUC, while against BiLSTM, the significance rates are 61.4% and 45.5%, respectively. At the 0.5% significance level, these results indicate that the null hypothesis, which states that ResE-BiLSTM performs no better than the baseline models, can be rejected with 99.5% confidence in the majority of cases. The consistently low p-values suggest that the observed performance improvements are unlikely to have occurred by chance, demonstrating the robustness and generalizability of the proposed model’s superiority across different data cohorts.

4.4. Ablation Study

An ablation study was conducted to evaluate the behavior of ResE-BiLSTM when specific components are removed. Four model variants were created: M1 omits the residual connection mechanism, M2 omits the feedforward network, M3 omits the residual-enhanced encoder, and M4 removes the bidirectional structure of the BiLSTM. For this analysis, the data were regrouped into cohorts spanning three-year periods, reducing the partition bias of the earlier quarterly splits and supporting the generalizability of the results. Table 9 presents the results of the ablation study, showing that all variants underperformed the full ResE-BiLSTM model.
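To make the variants concrete, the sketch below shows one plausible PyTorch arrangement of the components involved and the switches that produce M1–M4; the projection layer, layer sizes, single encoder block, and classification head are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResEBiLSTM(nn.Module):
    """Sketch of ResE-BiLSTM and its ablation variants (M1-M4)."""
    def __init__(self, n_features, d_model=64, n_heads=4,
                 use_residual=True,      # False -> M1 (E-BiLSTM)
                 use_ffn=True,           # False -> M2 (A-BiLSTM)
                 use_encoder=True,       # False -> M3 (plain BiLSTM)
                 bidirectional=True):    # False -> M4 (unidirectional LSTM)
        super().__init__()
        self.use_residual, self.use_ffn, self.use_encoder = \
            use_residual, use_ffn, use_encoder
        self.proj = nn.Linear(n_features, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.bilstm = nn.LSTM(d_model, d_model, batch_first=True,
                              bidirectional=bidirectional)
        self.head = nn.Linear(d_model * (2 if bidirectional else 1), 1)

    def forward(self, x):                      # x: (batch, months, features)
        h = self.proj(x)
        if self.use_encoder:
            a, _ = self.attn(h, h, h)          # multi-head self-attention
            h = self.norm1(h + a if self.use_residual else a)
            if self.use_ffn:
                f = self.ffn(h)
                h = self.norm2(h + f if self.use_residual else f)
        out, _ = self.bilstm(h)
        return torch.sigmoid(self.head(out[:, -1]))  # default probability

model = ResEBiLSTM(n_features=17)              # feature count is illustrative
scores = model(torch.randn(8, 14, 17))         # 14 monthly steps per loan
```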
The ResE-BiLSTM model consistently achieves an accuracy of over 92% across all cohorts. In contrast, E-BiLSTM (M1) shows slightly lower performance, indicating that removing the residual connections has some impact on overall performance but is not a decisive factor. A-BiLSTM (M2) exhibits the largest performance drop, suggesting that the feedforward neural network (FNN) plays a more critical role in the model's predictive capability. Although M2 retains an attention mechanism on top of the BiLSTM, without the FNN the attention output is not effectively transformed into discriminative features. Instead, it may increase the focus on noise or on the majority class, resulting in worse performance than the basic BiLSTM and LSTM models. This emphasizes the importance of the interplay between modules in this task.
Moreover, ResE-BiLSTM demonstrates excellent precision, recall, and F1 across all cohorts, validating the effectiveness of its structural design. E-BiLSTM (M1) shows a performance decline after the removal of residual connections, especially in recall, indicating that residual connections play a significant role in capturing deep temporal information and improving the recognition of the minority class. In contrast, A-BiLSTM (M2) experiences a more drastic performance drop after the removal of the feedforward neural network, with an average recall decrease of 23.48% across the four cohorts, highlighting the critical importance of the FNN in enhancing feature discriminability. Although BiLSTM and LSTM, which incorporate neither residual nor feedforward structures, show relatively stable performance, they consistently fall short of ResE-BiLSTM on all evaluation metrics.
Overall, the ablation study results indicate that each key component in the ResE-BiLSTM structure plays an irreplaceable role in model performance. Removing any of these modules leads to performance degradation across various dimensions, providing crucial insights for structural optimization in future model design.

4.5. Interpretability Performance Analysis

4.5.1. Barplot Analysis

Figure A1a–f in Appendix A show SHAP barplots for the proposed ResE-BiLSTM and the five baseline models, ranking the 238 time-stamped features. These barplots are derived from the third cohort in the ablation study, covering the years 2015 to 2017. The barplots reveal that the models prioritize different features, with varying emphasis on their temporal order. Table 10 presents the number of months each feature appears among the top 50 in the feature importance rankings. The findings indicate that the six models emphasize different features, which explains the differences in their predictions.
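Rankings of this kind can be reproduced with the SHAP library along the following lines; the stand-in model, the tensor shapes (14 months × 17 raw features = 238 time-stamped inputs), and all variable names are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np
import shap
import torch

# Stand-in for the trained network; in practice this would be the trained
# ResE-BiLSTM. Inputs: (batch, 14 months, 17 features) -> default score.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(14 * 17, 1))

background = torch.randn(100, 14, 17)   # reference samples for the explainer
X_test = torch.randn(200, 14, 17)       # cohort observations to explain

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(X_test)

# Collapse to one importance score per (month, feature) pair -- the 238
# time-stamped features -- by averaging absolute SHAP values over samples.
sv = np.asarray(shap_values).reshape(-1, 14, 17)  # drop any extra output axis
importance = np.abs(sv).mean(axis=0).ravel()
top50 = np.argsort(importance)[::-1][:50]
print([(i // 17 + 1, i % 17) for i in top50[:5]])  # (month index, feature id)
```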
For the ResE-BiLSTM model, six key features consistently rank among the top 50 over the 14 months. Interest Bearing UPB-Delta, Current Actual UPB-Delta, and Estimated Loan to Value (ELTV) were significant for 14, 14, and 12 months, respectively, making up 80% of these top features, and their importance shows no direct relation to temporal order. In contrast, the BiLSTM model identifies eight features in the top 50, with Interest Bearing UPB-Delta prominent for 14 months. Unlike ResE-BiLSTM, BiLSTM ranks feature importance chronologically, generally decreasing from recent to past months, with minor fluctuations in some months.
The difference in temporal feature importance between the ResE-BiLSTM and BiLSTM models arises from their distinct emphases in modeling temporal dependencies. In the ResE-BiLSTM, the attention mechanism is responsible for discovering long-range correlations, while the residual connections ensure the preservation of distant information, an advantage that the BiLSTM architecture lacks. Specifically, the residual-enhanced encoder employs multi-head self-attention to establish pairwise associations and dynamic weighting across the entire temporal dimension, enabling the model to recognize both distant and recent key signals simultaneously. For example, the model can concurrently focus on early-stage variations in the ELTV and recent changes in Current Actual UPB or Interest Bearing UPB, thereby linking long-term asset risk with short-term repayment pressure. The residual connections further preserve the original temporal cues, preventing early features from being completely diluted in deeper representations. This not only stabilizes gradient flow numerically but also reshapes the temporal attention distribution mechanistically. In this structure, each layer’s output is added to the original input, allowing the attention mechanism to repeatedly leverage earlier temporal signals when redistributing weights. Through this mechanism, the attention pattern along the timeline becomes less monotonic and often exhibits a flatter or even bimodal form, reinforcing the model’s responsiveness to both distant and recent time steps. In other words, the model learns to dynamically revisit distant periods, transitioning from a short-term recency bias toward a more globally balanced focus on both long- and short-term signals.
In the LSTM model, six features are most prominent, with Interest Bearing UPB-Delta, Current Actual UPB-Delta, and ELTV consistently appearing over the 14 months and making up 84% of the top 50 features. Unlike in ResE-BiLSTM, Current Actual UPB-Delta is identified as the most significant feature. The GRU model also highlights six features and, like ResE-BiLSTM, shows no linear relation between feature importance and temporal order. However, the GRU's ranking of feature importance over time is more erratic and lacks a consistent pattern. Moreover, the GRU prioritizes Current Actual UPB-Delta over Interest Bearing UPB-Delta.
The RNN model identifies the same key features as the GRU model, with Current Actual UPB-Delta as the most important. However, its ranking of feature importance across the sequence differs from the GRU's, showing even less regularity. In contrast, the CNN model concentrates on four features, highlighting Interest Bearing UPB-Delta as the most significant. Across all six models, Interest Bearing UPB-Delta, Current Actual UPB-Delta, and ELTV are the most significant features. Line charts showing how feature importance evolves across months are presented in Figure 4a–c for further analysis. The horizontal axis represents the month index, where each value corresponds to a specific month (e.g., 14 indicates the 14th month, which is the month closest to the observation period in our experimental setting), while the vertical axis shows the feature's position in the top-50 ranking, where 50 signifies the most important feature.
For Interest Bearing UPB-Delta, the ResE-BiLSTM and GRU models indicate that data from both distant and recent periods are important for predictions, whereas intermediate periods are less significant. In contrast, the BiLSTM and CNN models show a nearly linear decrease in feature importance from recent to past data, while the LSTM and RNN models exhibit a variable pattern without consistent changes in importance.
For Current Actual UPB-Delta, all six models show a double-peak pattern in feature importance over time: importance initially decreases from recent to distant points, then rises and falls again. This pattern suggests that both recent and distant data may contain meaningful signals. Interestingly, in the GRU and CNN models, feature importance starts to increase at month 8 (six months before the most recent point), while in the other models this rise begins at month 5 (nine months before the most recent point).
For ELTV, the overall importance over the 14 months is lower than for the previous two features, usually ranking in the bottom half of the top 50 with mild month-to-month variation. Except for the CNN model, the five other models show lower feature importance in the mid-periods, with increases at both ends of the timeline. Moreover, the different models evaluate the importance of this feature with similar trends over comparable time frames.

4.5.2. SHAP Summary Plot Analysis

The SHAP summary plot (Figure A2a–f in Appendix A) depicts the influence and significance of each feature on the model outcome, both positively and negatively. The distribution of positions and colors reveals how variations in feature values affect prediction results. Specifically, each dot represents a sample, with the vertical stacking of dots indicating sample density. The horizontal position corresponds to the SHAP value of the feature, which reflects the magnitude and direction of the feature’s contribution to the model prediction. The vertical axis ranks features based on the sum of SHAP values across all samples, following the same order as in the bar plot. A SHAP value positioned further to the right indicates a stronger positive contribution to the prediction, while values further to the left indicate a stronger negative contribution. A larger horizontal spread signifies that the range of the feature’s values has a more significant impact on the model’s predictions. The color coding represents the magnitude of the feature values, with red indicating higher values and blue indicating lower values.
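For completeness, a summary (beeswarm) plot of this kind can be rendered directly from a matrix of SHAP values; the random arrays and the feature-naming scheme below are hypothetical stand-ins, assuming one column per time-stamped feature as in the earlier sketch.

```python
import numpy as np
import shap

rng = np.random.default_rng(0)
sv = rng.normal(scale=0.05, size=(200, 238))   # SHAP values per sample/feature
X = rng.normal(size=(200, 238))                # corresponding feature values
names = [f"f{j % 17}_m{j // 17 + 1}" for j in range(238)]  # feature_month labels

# Dots are colored by feature value (red high, blue low) and positioned by
# their SHAP value, matching the reading of Figure A2 described above.
shap.summary_plot(sv, features=X, feature_names=names, max_display=20)
```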
The findings demonstrate that lower values of Interest Bearing UPB-Delta and Current Actual UPB-Delta contribute positively to the model output, while lower values of ELTV contribute negatively, as the corresponding blue dots are concentrated on the negative (left) side. Furthermore, the five models other than CNN display varying patterns, suggesting that these features may contribute positively or negatively depending on the model. The CNN model assesses the significance of these two key features differently from the other models, which may explain why its performance is inferior to theirs.
The six models also differ in how they assess the impact of ELTV on prediction outcomes. The ResE-BiLSTM model indicates that ELTV has a dual effect, sometimes exerting minimal influence. For BiLSTM and LSTM models, ELTV’s impact fluctuates, potentially due to the sensitivity of the time series data or shifts in its relationship with the prediction target over time. For features of lesser importance, such as Delinquency due to Disaster_Y, all models similarly evaluate their contribution. Likewise, Current Deferred UPB exhibited both positive and negative impacts across models. Thus, these features are not the main factors differentiating model performance.
In summary, the different models show variation in their assessments of feature importance, and the impact of features on prediction results changes over time. The differences in model responses and the assessment of feature importance reveal the models’ varying abilities to capture feature complexity and time-series characteristics.

5. Conclusions and Discussion

This study addresses loan default prediction by introducing the ResE-BiLSTM model, which consistently outperforms the baseline models in accuracy, precision, recall, F1, and AUC across most cohorts. These results demonstrate the effectiveness of combining the residual-enhanced encoder and BiLSTM components to capture complex temporal dependencies and improve prediction accuracy. The interpretability analysis examines the significance of features and their temporal variations, providing insight into the model's internal mechanisms and guiding future optimization.

5.1. Theoretical Implications

From a theoretical perspective, this study builds upon prior research on time-series-based financial anomaly detection, such as the works of Qian et al. [23], Agarwal et al. [25], and Boussougou and Park [30], by clarifying how a residual-enhanced encoder modifies the temporal focus of conventional recurrent architectures in modeling sequential financial data. Previous studies demonstrated that BiLSTM and attention mechanisms could improve predictive performance but rarely explained the underlying reason for such improvement. This study contributes new theoretical insight by showing that the residual-enhanced encoder reshapes temporal attention through residual information flow, allowing the model to revisit and preserve distant information that traditional recurrent models tend to underemphasize. The SHAP-based interpretability analysis further supports this finding by revealing a flatter or dual-peaked temporal importance pattern in ResE-BiLSTM, indicating that both distant and recent time steps contribute meaningfully to risk assessment. This explains why the proposed model achieves more balanced temporal representation and stronger predictive stability in credit risk modeling.

5.2. Practical Implications

From a practical perspective, the ResE-BiLSTM provides a data-driven framework that can be directly applied to post-loan default prediction for financial institutions. The model can dynamically identify the evolution of repayment risks within loan portfolios and reveal how the temporal importance of key repayment features such as Current Actual UPB, Interest Bearing UPB, and ELTV changes over time. This helps institutions determine when risk signals are most pronounced, enabling the optimization of early-warning thresholds and repayment adjustment strategies. The model demonstrates stable performance across different feature window lengths and resampling settings, indicating its adaptability to class imbalance and the variability of financial environments. Banks and other lending institutions can incorporate this model into their monitoring systems for long-term loans to enhance borrower risk tracking, reduce non-performing loans, and improve overall asset quality. Moreover, practitioners can extend the model by incorporating additional features that reflect product- or region-specific characteristics, thereby improving its suitability for different credit products and market contexts.

5.3. Research Limitations

Despite the positive outcomes achieved, certain limitations should be acknowledged. The analysis in this study is based on the Freddie Mac Single-Family Loan-Level dataset, which primarily contains records of U.S. residential mortgage loans. This dataset features long-term, monthly panel data that reflect loan structures secured by single-family housing assets with fixed or adjustable interest rates. Such data typically exhibit relatively stable repayment patterns, consistent borrower characteristics, and transparent reporting standards, which are representative of the U.S. mortgage market but may not adequately capture the characteristics of other forms of consumer or commercial credit.
These characteristics may constrain the model’s direct applicability to other loan products, such as small business loans or credit card portfolios. In these contexts, borrowing behavior is often influenced by short-term cash flow fluctuations, revolving credit limits, and heterogeneous financing purposes. Compared with residential mortgages, these credit products are driven more by transactional dynamics and liquidity conditions than by collateral value or long-term contractual obligations. Consequently, the temporal dependency patterns learned from mortgage data provide a transferable modeling framework for other credit domains, though practical applications may require appropriate adjustments based on product-specific characteristics. For example, the temporal window length and observation intervals could be redefined according to the repayment cycle and data availability of the target loan type. At the model level, retraining or fine-tuning on domain-specific data could help adapt the model to new temporal dependency structures. At the feature level, incorporating variables that better reflect domain-specific risk characteristics, such as cash flow indicators in small business lending or transaction frequency and credit utilization in credit card data, would further enhance the model’s ability to capture product-specific risk dynamics.
Furthermore, the Freddie Mac dataset reflects the regulatory environment, institutional framework, and macroeconomic conditions specific to the U.S. housing finance system. Differences in legal structures, credit scoring practices, and market maturity across countries suggest that applying the model to other geographical markets would require recalibration or fine-tuning based on local economic structures and risk characteristics. For example, this may involve adjusting the feature window length or redistributing feature weights using local loan data to account for differences in borrower behavior patterns, repayment cycles, interest rate policies, and default triggers. In addition, incorporating locally relevant variables, such as regional economic indices or sectoral indicators, could further enhance the model’s ability to capture market-specific risk dynamics.

5.4. Directions for Future Research

Future research could therefore extend this work by validating the model’s transferability across multiple datasets and credit products to evaluate its robustness and adaptability under diverse lending conditions. However, such cross-market validation is often constrained by limited data availability. Due to privacy and confidentiality concerns, relevant datasets in many countries are not publicly disclosed, which restricts the practical feasibility of future research. Hence, financial institutions are encouraged to share properly anonymized datasets with researchers to facilitate advances in credit risk modeling and promote broader academic–industry collaboration in this field.
In addition, this study primarily addresses class imbalance at the data level. Model-level techniques, such as cost-sensitive learning and class weighting, have not yet been incorporated and could be explored in future work to further enhance model robustness under imbalanced conditions. Building on these directions, future research will focus on developing more efficient architectures for real-time anomaly detection and mitigating the negative effects of concept drift. These efforts aim to assist financial institutions in identifying high-risk borrowers, minimizing non-performing loans, and ultimately improving asset quality and financial stability.

Author Contributions

Conceptualization: Y.Y.; Methodology: Y.Y., B.G.L., A.B.; Formal analysis: Y.Y., Y.L., Y.Z., Z.S., C.C.G., T.F.; Data curation: Y.Y., Y.L.; Writing—original draft: Y.Y.; Writing—review and editing: Y.Y., A.B., B.G.L.; Supervision: A.B., B.G.L.; Funding acquisition: A.B., B.G.L. All authors have read and agreed to the published version of the manuscript.

Funding

We are grateful to the Ningbo Municipal Government for funding this project as part of grant number 2021B-008-C.

Data Availability Statement

The dataset analyzed during the current study is available in the Freddie Mac repository, https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset (accessed on 24 December 2024).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

ACC  Accuracy
AUC  Area under the ROC curve
BiGRU  Bidirectional gated recurrent unit
BiLSTM  Bidirectional long short-term memory
CNN  Convolutional neural network
ELTV  Estimated loan to value
FN  False negative
FP  False positive
GRU  Gated recurrent unit
LIME  Local Interpretable Model-Agnostic Explanation
LSTM  Long short-term memory
ML  Machine learning
OOS  Out-of-sample
PR  Precision
RC  Recall
RNN  Recurrent neural network
SAN  Self-attention
SHAP  SHapley Additive exPlanations
TN  True negative
TP  True positive
UPB  Unpaid principal balance
XAI  Explainable artificial intelligence
XGB  Extreme gradient boosting

Appendix A

Table A1. Accuracy of 6 models on 44 cohorts of the Freddie Mac dataset based on average results from 10 trials.

Cohort | LSTM [8] | BiLSTM [12] | GRU [56] | CNN [57] | RNN [58] | ResE-BiLSTM
2009Q1 | 0.918 | 0.913 | 0.914 | 0.892 | 0.914 | 0.923
2009Q2 | 0.895 | 0.880 | 0.887 | 0.897 | 0.883 | 0.921
2009Q3 | 0.914 | 0.914 | 0.915 | 0.911 | 0.919 | 0.924
2009Q4 | 0.926 | 0.924 | 0.945 | 0.906 | 0.937 | 0.924
2010Q1 | 0.939 | 0.935 | 0.938 | 0.935 | 0.940 | 0.946
2010Q2 | 0.901 | 0.900 | 0.903 | 0.887 | 0.901 | 0.913
2010Q3 | 0.909 | 0.919 | 0.913 | 0.893 | 0.911 | 0.922
2010Q4 | 0.899 | 0.900 | 0.896 | 0.911 | 0.910 | 0.890
2011Q1 | 0.925 | 0.933 | 0.927 | 0.904 | 0.924 | 0.903
2011Q2 | 0.922 | 0.921 | 0.919 | 0.895 | 0.920 | 0.923
2011Q3 | 0.922 | 0.929 | 0.935 | 0.896 | 0.910 | 0.906
2011Q4 | 0.920 | 0.916 | 0.912 | 0.886 | 0.920 | 0.925
2012Q1 | 0.904 | 0.890 | 0.905 | 0.875 | 0.919 | 0.894
2012Q2 | 0.830 | 0.817 | 0.827 | 0.848 | 0.855 | 0.863
2012Q3 | 0.911 | 0.910 | 0.911 | 0.863 | 0.902 | 0.913
2012Q4 | 0.948 | 0.953 | 0.949 | 0.935 | 0.947 | 0.954
2013Q1 | 0.878 | 0.878 | 0.884 | 0.868 | 0.865 | 0.916
2013Q2 | 0.926 | 0.931 | 0.923 | 0.889 | 0.913 | 0.932
2013Q3 | 0.893 | 0.890 | 0.897 | 0.857 | 0.885 | 0.909
2013Q4 | 0.914 | 0.914 | 0.910 | 0.887 | 0.924 | 0.930
2014Q1 | 0.921 | 0.915 | 0.924 | 0.886 | 0.927 | 0.931
2014Q2 | 0.918 | 0.917 | 0.912 | 0.885 | 0.918 | 0.924
2014Q3 | 0.931 | 0.933 | 0.924 | 0.919 | 0.927 | 0.935
2014Q4 | 0.875 | 0.865 | 0.862 | 0.869 | 0.884 | 0.891
2015Q1 | 0.896 | 0.885 | 0.879 | 0.892 | 0.910 | 0.913
2015Q2 | 0.911 | 0.912 | 0.904 | 0.890 | 0.903 | 0.916
2015Q3 | 0.926 | 0.930 | 0.924 | 0.894 | 0.914 | 0.931
2015Q4 | 0.927 | 0.936 | 0.928 | 0.884 | 0.928 | 0.908
2016Q1 | 0.913 | 0.920 | 0.916 | 0.886 | 0.919 | 0.927
2016Q2 | 0.908 | 0.909 | 0.913 | 0.886 | 0.913 | 0.923
2016Q3 | 0.912 | 0.901 | 0.909 | 0.879 | 0.907 | 0.915
2016Q4 | 0.930 | 0.925 | 0.934 | 0.913 | 0.940 | 0.941
2017Q1 | 0.927 | 0.925 | 0.923 | 0.902 | 0.929 | 0.933
2017Q2 | 0.921 | 0.918 | 0.910 | 0.897 | 0.909 | 0.930
2017Q3 | 0.916 | 0.916 | 0.914 | 0.883 | 0.914 | 0.923
2017Q4 | 0.925 | 0.925 | 0.923 | 0.901 | 0.923 | 0.930
2018Q1 | 0.937 | 0.936 | 0.937 | 0.915 | 0.935 | 0.939
2018Q2 | 0.928 | 0.926 | 0.925 | 0.899 | 0.923 | 0.934
2018Q3 | 0.930 | 0.930 | 0.930 | 0.911 | 0.932 | 0.935
2018Q4 | 0.919 | 0.923 | 0.922 | 0.900 | 0.926 | 0.927
2019Q1 | 0.926 | 0.924 | 0.923 | 0.907 | 0.927 | 0.930
2019Q2 | 0.932 | 0.930 | 0.927 | 0.923 | 0.937 | 0.942
2019Q3 | 0.944 | 0.949 | 0.943 | 0.921 | 0.946 | 0.951
2019Q4 | 0.951 | 0.941 | 0.947 | 0.903 | 0.953 | 0.955
Note: Values in bold indicate the best performance metric within each data cohort.
Table A2. Precision of 6 models on 44 cohorts of the Freddie Mac dataset based on average results from 10 trials.

Cohort | LSTM [8] | BiLSTM [12] | GRU [56] | CNN [57] | RNN [58] | ResE-BiLSTM
2009Q1 | 0.950 | 0.949 | 0.945 | 0.950 | 0.950 | 0.951
2009Q2 | 0.897 | 0.881 | 0.892 | 0.901 | 0.885 | 0.953
2009Q3 | 0.921 | 0.919 | 0.922 | 0.925 | 0.922 | 0.927
2009Q4 | 0.964 | 0.951 | 0.965 | 0.959 | 0.964 | 0.956
2010Q1 | 0.978 | 0.980 | 0.977 | 0.974 | 0.982 | 0.983
2010Q2 | 0.895 | 0.893 | 0.900 | 0.888 | 0.904 | 0.915
2010Q3 | 0.894 | 0.919 | 0.911 | 0.896 | 0.917 | 0.920
2010Q4 | 0.923 | 0.921 | 0.915 | 0.970 | 0.954 | 0.919
2011Q1 | 0.929 | 0.928 | 0.935 | 0.915 | 0.915 | 0.862
2011Q2 | 0.969 | 0.973 | 0.976 | 0.934 | 0.985 | 0.988
2011Q3 | 0.941 | 0.935 | 0.954 | 0.949 | 0.908 | 0.902
2011Q4 | 0.940 | 0.925 | 0.919 | 0.945 | 0.956 | 0.935
2012Q1 | 0.893 | 0.863 | 0.893 | 0.859 | 0.923 | 0.863
2012Q2 | 0.785 | 0.770 | 0.783 | 0.807 | 0.816 | 0.828
2012Q3 | 0.920 | 0.919 | 0.916 | 0.851 | 0.932 | 0.933
2012Q4 | 0.985 | 0.985 | 0.983 | 0.971 | 0.979 | 0.990
2013Q1 | 0.854 | 0.852 | 0.862 | 0.879 | 0.834 | 0.911
2013Q2 | 0.981 | 0.987 | 0.978 | 0.922 | 0.981 | 0.980
2013Q3 | 0.869 | 0.865 | 0.877 | 0.829 | 0.855 | 0.903
2013Q4 | 0.919 | 0.913 | 0.918 | 0.901 | 0.937 | 0.920
2014Q1 | 0.931 | 0.917 | 0.939 | 0.883 | 0.946 | 0.948
2014Q2 | 0.930 | 0.929 | 0.925 | 0.931 | 0.944 | 0.936
2014Q3 | 0.952 | 0.950 | 0.939 | 0.945 | 0.945 | 0.949
2014Q4 | 0.850 | 0.834 | 0.827 | 0.872 | 0.870 | 0.874
2015Q1 | 0.875 | 0.854 | 0.847 | 0.892 | 0.905 | 0.901
2015Q2 | 0.925 | 0.923 | 0.919 | 0.922 | 0.913 | 0.936
2015Q3 | 0.932 | 0.938 | 0.926 | 0.890 | 0.912 | 0.936
2015Q4 | 0.931 | 0.946 | 0.935 | 0.910 | 0.940 | 0.941
2016Q1 | 0.928 | 0.935 | 0.928 | 0.905 | 0.961 | 0.941
2016Q2 | 0.916 | 0.928 | 0.927 | 0.887 | 0.921 | 0.941
2016Q3 | 0.915 | 0.893 | 0.907 | 0.881 | 0.904 | 0.929
2016Q4 | 0.950 | 0.939 | 0.954 | 0.933 | 0.971 | 0.958
2017Q1 | 0.955 | 0.941 | 0.938 | 0.934 | 0.955 | 0.957
2017Q2 | 0.926 | 0.913 | 0.898 | 0.927 | 0.894 | 0.939
2017Q3 | 0.943 | 0.944 | 0.941 | 0.925 | 0.942 | 0.947
2017Q4 | 0.956 | 0.957 | 0.950 | 0.927 | 0.959 | 0.961
2018Q1 | 0.977 | 0.975 | 0.976 | 0.943 | 0.979 | 0.963
2018Q2 | 0.958 | 0.953 | 0.957 | 0.914 | 0.955 | 0.963
2018Q3 | 0.968 | 0.963 | 0.965 | 0.939 | 0.970 | 0.964
2018Q4 | 0.915 | 0.927 | 0.923 | 0.903 | 0.938 | 0.940
2019Q1 | 0.957 | 0.956 | 0.958 | 0.932 | 0.958 | 0.947
2019Q2 | 0.911 | 0.909 | 0.904 | 0.926 | 0.922 | 0.931
2019Q3 | 0.940 | 0.953 | 0.940 | 0.930 | 0.949 | 0.967
2019Q4 | 0.940 | 0.918 | 0.929 | 0.897 | 0.949 | 0.953
Note: Values in bold indicate the best performance metric within each data cohort.
Table A3. Recall rate of 6 models on 44 cohorts of the Freddie Mac dataset based on average results from 10 trials.

Cohort | LSTM [8] | BiLSTM [12] | GRU [56] | CNN [57] | RNN [58] | ResE-BiLSTM
2009Q1 | 0.883 | 0.872 | 0.880 | 0.829 | 0.874 | 0.883
2009Q2 | 0.892 | 0.880 | 0.881 | 0.892 | 0.879 | 0.896
2009Q3 | 0.905 | 0.909 | 0.907 | 0.894 | 0.915 | 0.921
2009Q4 | 0.886 | 0.894 | 0.923 | 0.849 | 0.908 | 0.889
2010Q1 | 0.898 | 0.889 | 0.897 | 0.894 | 0.896 | 0.908
2010Q2 | 0.910 | 0.909 | 0.906 | 0.887 | 0.897 | 0.911
2010Q3 | 0.929 | 0.918 | 0.916 | 0.891 | 0.905 | 0.934
2010Q4 | 0.870 | 0.876 | 0.873 | 0.847 | 0.861 | 0.856
2011Q1 | 0.921 | 0.938 | 0.918 | 0.892 | 0.936 | 0.961
2011Q2 | 0.873 | 0.866 | 0.860 | 0.851 | 0.854 | 0.874
2011Q3 | 0.902 | 0.923 | 0.915 | 0.837 | 0.914 | 0.912
2011Q4 | 0.897 | 0.907 | 0.905 | 0.820 | 0.880 | 0.914
2012Q1 | 0.917 | 0.927 | 0.922 | 0.900 | 0.913 | 0.937
2012Q2 | 0.914 | 0.907 | 0.908 | 0.916 | 0.919 | 0.922
2012Q3 | 0.900 | 0.899 | 0.905 | 0.881 | 0.868 | 0.912
2012Q4 | 0.910 | 0.920 | 0.914 | 0.897 | 0.913 | 0.928
2013Q1 | 0.912 | 0.914 | 0.915 | 0.854 | 0.912 | 0.922
2013Q2 | 0.869 | 0.873 | 0.865 | 0.851 | 0.843 | 0.883
2013Q3 | 0.925 | 0.926 | 0.925 | 0.900 | 0.928 | 0.917
2013Q4 | 0.909 | 0.915 | 0.900 | 0.870 | 0.908 | 0.941
2014Q1 | 0.910 | 0.912 | 0.908 | 0.891 | 0.905 | 0.913
2014Q2 | 0.903 | 0.904 | 0.897 | 0.831 | 0.890 | 0.910
2014Q3 | 0.909 | 0.913 | 0.907 | 0.891 | 0.907 | 0.920
2014Q4 | 0.911 | 0.912 | 0.917 | 0.866 | 0.902 | 0.919
2015Q1 | 0.924 | 0.928 | 0.929 | 0.894 | 0.916 | 0.932
2015Q2 | 0.895 | 0.898 | 0.887 | 0.853 | 0.891 | 0.899
2015Q3 | 0.918 | 0.921 | 0.921 | 0.898 | 0.917 | 0.926
2015Q4 | 0.923 | 0.925 | 0.921 | 0.853 | 0.915 | 0.931
2016Q1 | 0.895 | 0.902 | 0.902 | 0.863 | 0.873 | 0.904
2016Q2 | 0.898 | 0.888 | 0.896 | 0.886 | 0.903 | 0.890
2016Q3 | 0.909 | 0.911 | 0.913 | 0.877 | 0.911 | 0.897
2016Q4 | 0.909 | 0.909 | 0.913 | 0.891 | 0.906 | 0.923
2017Q1 | 0.897 | 0.906 | 0.906 | 0.867 | 0.901 | 0.907
2017Q2 | 0.916 | 0.924 | 0.925 | 0.864 | 0.927 | 0.930
2017Q3 | 0.885 | 0.884 | 0.884 | 0.833 | 0.882 | 0.896
2017Q4 | 0.891 | 0.890 | 0.893 | 0.871 | 0.885 | 0.907
2018Q1 | 0.896 | 0.894 | 0.896 | 0.884 | 0.889 | 0.908
2018Q2 | 0.895 | 0.896 | 0.891 | 0.882 | 0.889 | 0.903
2018Q3 | 0.890 | 0.894 | 0.892 | 0.879 | 0.891 | 0.904
2018Q4 | 0.923 | 0.918 | 0.920 | 0.896 | 0.913 | 0.923
2019Q1 | 0.892 | 0.888 | 0.885 | 0.878 | 0.895 | 0.911
2019Q2 | 0.957 | 0.957 | 0.957 | 0.920 | 0.955 | 0.958
2019Q3 | 0.948 | 0.944 | 0.946 | 0.910 | 0.942 | 0.942
2019Q4 | 0.964 | 0.969 | 0.968 | 0.910 | 0.957 | 0.970
Note: Values in bold indicate the best performance metric within each data cohort.
Table A4. Binary F1 of 6 models on 44 cohorts of the Freddie Mac dataset based on average results from 10 trials.

Cohort | LSTM [8] | BiLSTM [12] | GRU [56] | CNN [57] | RNN [58] | ResE-BiLSTM
2009Q1 | 0.915 | 0.909 | 0.911 | 0.885 | 0.910 | 0.916
2009Q2 | 0.894 | 0.880 | 0.886 | 0.896 | 0.882 | 0.924
2009Q3 | 0.913 | 0.914 | 0.914 | 0.909 | 0.919 | 0.924
2009Q4 | 0.923 | 0.921 | 0.943 | 0.900 | 0.935 | 0.921
2010Q1 | 0.936 | 0.932 | 0.935 | 0.932 | 0.937 | 0.944
2010Q2 | 0.902 | 0.901 | 0.903 | 0.887 | 0.900 | 0.913
2010Q3 | 0.911 | 0.918 | 0.914 | 0.893 | 0.911 | 0.927
2010Q4 | 0.896 | 0.898 | 0.894 | 0.904 | 0.905 | 0.886
2011Q1 | 0.925 | 0.933 | 0.926 | 0.903 | 0.925 | 0.909
2011Q2 | 0.918 | 0.916 | 0.914 | 0.890 | 0.915 | 0.927
2011Q3 | 0.921 | 0.929 | 0.934 | 0.889 | 0.911 | 0.907
2011Q4 | 0.918 | 0.916 | 0.911 | 0.877 | 0.916 | 0.924
2012Q1 | 0.905 | 0.894 | 0.907 | 0.878 | 0.918 | 0.898
2012Q2 | 0.844 | 0.833 | 0.840 | 0.858 | 0.864 | 0.873
2012Q3 | 0.910 | 0.909 | 0.911 | 0.865 | 0.899 | 0.922
2012Q4 | 0.946 | 0.951 | 0.947 | 0.932 | 0.945 | 0.958
2013Q1 | 0.882 | 0.882 | 0.887 | 0.866 | 0.871 | 0.917
2013Q2 | 0.921 | 0.927 | 0.918 | 0.885 | 0.907 | 0.929
2013Q3 | 0.896 | 0.894 | 0.900 | 0.863 | 0.890 | 0.910
2013Q4 | 0.914 | 0.914 | 0.909 | 0.885 | 0.922 | 0.930
2014Q1 | 0.920 | 0.915 | 0.923 | 0.887 | 0.925 | 0.930
2014Q2 | 0.916 | 0.916 | 0.911 | 0.878 | 0.916 | 0.923
2014Q3 | 0.930 | 0.932 | 0.922 | 0.917 | 0.925 | 0.934
2014Q4 | 0.879 | 0.871 | 0.869 | 0.869 | 0.886 | 0.896
2015Q1 | 0.899 | 0.890 | 0.885 | 0.892 | 0.910 | 0.916
2015Q2 | 0.909 | 0.910 | 0.903 | 0.886 | 0.902 | 0.917
2015Q3 | 0.925 | 0.929 | 0.923 | 0.894 | 0.914 | 0.931
2015Q4 | 0.927 | 0.935 | 0.928 | 0.880 | 0.927 | 0.936
2016Q1 | 0.911 | 0.918 | 0.914 | 0.884 | 0.915 | 0.922
2016Q2 | 0.907 | 0.907 | 0.911 | 0.886 | 0.912 | 0.915
2016Q3 | 0.912 | 0.902 | 0.910 | 0.879 | 0.907 | 0.913
2016Q4 | 0.929 | 0.924 | 0.933 | 0.911 | 0.938 | 0.940
2017Q1 | 0.925 | 0.923 | 0.922 | 0.898 | 0.927 | 0.931
2017Q2 | 0.921 | 0.918 | 0.911 | 0.894 | 0.910 | 0.935
2017Q3 | 0.913 | 0.913 | 0.911 | 0.876 | 0.911 | 0.921
2017Q4 | 0.922 | 0.922 | 0.921 | 0.898 | 0.920 | 0.933
2018Q1 | 0.934 | 0.933 | 0.934 | 0.913 | 0.931 | 0.935
2018Q2 | 0.925 | 0.924 | 0.923 | 0.897 | 0.921 | 0.932
2018Q3 | 0.927 | 0.927 | 0.927 | 0.908 | 0.929 | 0.933
2018Q4 | 0.919 | 0.922 | 0.922 | 0.899 | 0.925 | 0.931
2019Q1 | 0.923 | 0.921 | 0.920 | 0.904 | 0.925 | 0.928
2019Q2 | 0.934 | 0.932 | 0.930 | 0.923 | 0.938 | 0.944
2019Q3 | 0.944 | 0.948 | 0.943 | 0.920 | 0.945 | 0.954
2019Q4 | 0.952 | 0.943 | 0.948 | 0.903 | 0.953 | 0.961
Note: Values in bold indicate the best performance metric within each data cohort.
Table A5. AUC values of 6 models on 44 cohorts of the Freddie Mac dataset based on average results from 10 trials.

Cohort | LSTM [8] | BiLSTM [12] | GRU [56] | CNN [57] | RNN [58] | ResE-BiLSTM
2009Q1 | 0.968 | 0.967 | 0.969 | 0.954 | 0.964 | 0.971
2009Q2 | 0.955 | 0.956 | 0.958 | 0.952 | 0.948 | 0.969
2009Q3 | 0.970 | 0.971 | 0.969 | 0.966 | 0.971 | 0.972
2009Q4 | 0.977 | 0.976 | 0.980 | 0.963 | 0.979 | 0.972
2010Q1 | 0.985 | 0.983 | 0.983 | 0.977 | 0.985 | 0.986
2010Q2 | 0.957 | 0.958 | 0.961 | 0.946 | 0.955 | 0.963
2010Q3 | 0.970 | 0.970 | 0.968 | 0.957 | 0.964 | 0.972
2010Q4 | 0.963 | 0.965 | 0.952 | 0.975 | 0.957 | 0.955
2011Q1 | 0.972 | 0.975 | 0.970 | 0.962 | 0.972 | 0.974
2011Q2 | 0.973 | 0.973 | 0.975 | 0.964 | 0.979 | 0.981
2011Q3 | 0.967 | 0.969 | 0.971 | 0.960 | 0.964 | 0.966
2011Q4 | 0.968 | 0.966 | 0.965 | 0.958 | 0.971 | 0.973
2012Q1 | 0.968 | 0.967 | 0.970 | 0.943 | 0.972 | 0.966
2012Q2 | 0.930 | 0.925 | 0.930 | 0.928 | 0.943 | 0.945
2012Q3 | 0.969 | 0.971 | 0.968 | 0.938 | 0.965 | 0.974
2012Q4 | 0.975 | 0.976 | 0.979 | 0.973 | 0.979 | 0.981
2013Q1 | 0.956 | 0.959 | 0.960 | 0.931 | 0.955 | 0.962
2013Q2 | 0.977 | 0.977 | 0.978 | 0.959 | 0.977 | 0.979
2013Q3 | 0.958 | 0.958 | 0.959 | 0.938 | 0.951 | 0.960
2013Q4 | 0.968 | 0.968 | 0.967 | 0.949 | 0.972 | 0.974
2014Q1 | 0.969 | 0.968 | 0.970 | 0.949 | 0.972 | 0.973
2014Q2 | 0.970 | 0.970 | 0.967 | 0.948 | 0.972 | 0.977
2014Q3 | 0.968 | 0.970 | 0.965 | 0.964 | 0.970 | 0.971
2014Q4 | 0.952 | 0.951 | 0.946 | 0.932 | 0.949 | 0.953
2015Q1 | 0.961 | 0.958 | 0.957 | 0.948 | 0.964 | 0.966
2015Q2 | 0.962 | 0.962 | 0.960 | 0.949 | 0.959 | 0.963
2015Q3 | 0.966 | 0.965 | 0.963 | 0.948 | 0.960 | 0.967
2015Q4 | 0.971 | 0.972 | 0.971 | 0.946 | 0.970 | 0.965
2016Q1 | 0.957 | 0.965 | 0.965 | 0.948 | 0.965 | 0.969
2016Q2 | 0.952 | 0.950 | 0.952 | 0.944 | 0.953 | 0.949
2016Q3 | 0.959 | 0.955 | 0.960 | 0.939 | 0.956 | 0.958
2016Q4 | 0.966 | 0.964 | 0.967 | 0.960 | 0.971 | 0.971
2017Q1 | 0.960 | 0.960 | 0.960 | 0.952 | 0.962 | 0.970
2017Q2 | 0.960 | 0.962 | 0.959 | 0.945 | 0.959 | 0.963
2017Q3 | 0.958 | 0.960 | 0.957 | 0.938 | 0.959 | 0.964
2017Q4 | 0.963 | 0.964 | 0.962 | 0.946 | 0.961 | 0.973
2018Q1 | 0.969 | 0.970 | 0.969 | 0.960 | 0.971 | 0.972
2018Q2 | 0.967 | 0.966 | 0.966 | 0.953 | 0.967 | 0.969
2018Q3 | 0.965 | 0.967 | 0.967 | 0.956 | 0.969 | 0.973
2018Q4 | 0.970 | 0.969 | 0.969 | 0.959 | 0.971 | 0.973
2019Q1 | 0.975 | 0.973 | 0.974 | 0.968 | 0.978 | 0.978
2019Q2 | 0.980 | 0.980 | 0.979 | 0.972 | 0.980 | 0.983
2019Q3 | 0.978 | 0.979 | 0.978 | 0.969 | 0.981 | 0.983
2019Q4 | 0.984 | 0.980 | 0.982 | 0.956 | 0.983 | 0.986
Note: Values in bold indicate the best performance metric within each data cohort.
Figure A1. Barplot of (a) ResE-BiLSTM, (b) BiLSTM, (c) LSTM, (d) GRU, (e) RNN and (f) CNN. The vertical axis shows the importance rankings of the top 100 features in each month, with higher rankings indicating greater importance.
Figure A2. Summary plot of (a) ResE-BiLSTM, (b) BiLSTM, (c) LSTM, (d) GRU, (e) RNN and (f) CNN, showing dots for each sample. The vertical axis indicates sample density, while the horizontal axis shows the SHAP value of the features. Rightward SHAP values suggest a stronger positive prediction impact, leftward, a stronger negative impact. Color coding reflects feature values, with red as high and blue as low.

References

  1. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Syst. Appl. 2021, 193, 116429.
  2. Al-Hashedi, K.G.; Magalingam, P. Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019. Comput. Sci. Rev. 2021, 40, 100402.
  3. Goh, C.C.; Yang, Y.; Bellotti, A.; Hua, X. Machine Learning for Chinese Corporate Fraud Prediction: Segmented Models Based on Optimal Training Windows. Information 2025, 16, 397.
  4. Zhang, A.; Wang, S.; Liu, B.; Liu, P. How fintech impacts pre- and post-loan risk in Chinese commercial banks. Int. J. Financ. Econ. 2020, 27, 2514–2529.
  5. Gupta, A.; Pant, V.; Kumar, S.; Bansal, P.K. Bank Loan Prediction System using Machine Learning. In Proceedings of the 2020 9th International Conference System Modeling and Advancement in Research Trends (SMART), Moradabad, India, 4–5 December 2020; pp. 423–426.
  6. Madaan, M.; Kumar, A.; Keshri, C.; Jain, R.; Nagrath, P. Loan default prediction using decision trees and random forest: A comparative study. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1022, 012042.
  7. Elmasry, M. Machine Learning Approach for Credit Score Analysis: A Case Study of Predicting Mortgage Loan Defaults. Master’s Thesis, Universidade NOVA de Lisboa, Lisbon, Portugal, 2019.
  8. Tang, Q.; Shi, R.; Fan, T.; Ma, Y.; Huang, J. Prediction of Financial Time Series Based on LSTM Using Wavelet Transform and Singular Spectrum Analysis. Math. Probl. Eng. 2021, 2021, 9942410.
  9. Pardeshi, K.; Gill, S.S.; Abdelmoniem, A.M. Stock Market Price Prediction: A Hybrid LSTM and Sequential Self-Attention based Approach. arXiv 2023.
  10. Gao, X.; Yang, X.; Zhao, Y. Rural micro-credit model design and credit risk assessment via improved LSTM algorithm. PeerJ Comput. Sci. 2023, 9, e1588.
  11. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. A Comparative Analysis of Forecasting Financial Time Series Using ARIMA, LSTM, and BiLSTM. arXiv 2019.
  12. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
  13. Ji, Y. Explainable AI Methods for Credit Card Fraud Detection: Evaluation of LIME and SHAP Through a User Study. Master’s Thesis, University of Skövde, Skövde, Sweden, 2021.
  14. Sai, C.V.; Das, D.; Elmitwally, N.; Elezaj, O.; Islam, M.B. Explainable AI-Driven Financial Transaction Fraud Detection Using Machine Learning and Deep Neural Networks. SSRN 2023, preprint.
  15. Freddie Mac. Freddie Mac Dataset. 2024. Available online: https://freddiemac.embs.com (accessed on 25 January 2024).
  16. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  17. Li, Z.; Zhu, Y.; Van Leeuwen, M. A Survey on Explainable Anomaly Detection. ACM Trans. Knowl. Discov. Data 2023, 18, 23.
  18. LendingClub. Lending-Club Dataset. 2024. Available online: https://github.com/matmcreative/Lending-Club-Loan-Analysis/ (accessed on 25 January 2024).
  19. Zandi, S.; Korangi, K.; Óskarsdóttir, M.; Mues, C.; Bravo, C. Attention-based Dynamic Multilayer Graph Neural Networks for Loan Default Prediction. arXiv 2024.
  20. Wang, H.; Bellotti, A.; Qu, R.; Bai, R. Discrete-Time Survival Models with Neural Networks for Age–Period–Cohort Analysis of Credit Risk. Risks 2024, 12, 31.
  21. Karthika, J.; Senthilselvi, A. An integration of deep learning model with Navo Minority Over-Sampling Technique to detect the frauds in credit cards. Multimed. Tools Appl. 2023, 82, 21757–21774.
  22. Kanimozhi, P.; Parkavi, S.; Kumar, T.A. Predicting Mortgage-Backed Securities Prepayment Risk Using Machine Learning Models. In Proceedings of the 2023 2nd International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 21–22 April 2023; pp. 1–8.
  23. Qian, C.; Hu, T.; Li, B. A BiLSTM-Attention Model for Detecting Smart Contract Defects More Accurately. In Proceedings of the International Conference on Software Quality, Reliability and Security, Guangzhou, China, 5–9 December 2022.
  24. Narayan, V.; Ganapathisamy, S. Hybrid Sampling and Similarity Attention Layer in Bidirectional Long Short Term Memory in Credit Card Fraud Detection. Int. J. Intell. Eng. Syst. 2022, 15, 35–44.
  25. Agarwal, A.; Iqbal, M.; Mitra, B.; Kumar, V.; Lal, N. Hybrid CNN-BILSTM-Attention Based Identification and Prevention System for Banking Transactions. Nat. Volatiles Essent. Oils 2021, 8, 2552–2560.
  26. Joy, B.; R, D. A Tensor Based Approach for Click Fraud Detection on Online Advertising Using BiLSTM and Attention based CNN. In Proceedings of the 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 18–20 October 2023.
  27. Prabhakar, K.; Giridhar, M.S.; Amrita, T.; Joshi, T.M.; Pal, S.; Aswal, U. Comparative Evaluation of Fraud Detection in Online Payments Using CNN-BiGRU-A Approach. In Proceedings of the 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 18–20 October 2023.
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010.
  29. Cai, T.; Yu, B.; Xu, W. Transformer-Based BiLSTM for Aspect-Level Sentiment Classification. In Proceedings of the 2021 4th International Conference on Robotics, Control and Automation Engineering (RCAE), Wuhan, China, 4–6 November 2021.
  30. Boussougou, M.K.M.; Park, D. Attention-Based 1D CNN-BiLSTM Hybrid Model Enhanced with FastText Word Embedding for Korean Voice Phishing Detection. Mathematics 2023, 11, 3217.
  31. Fang, W.; Jia, X.; Zhang, W.; Sheng, V.S. A New Distributed Log Anomaly Detection Method based on Message Middleware and ATT-GRU. KSII Trans. Internet Inf. Syst. 2023, 17, 486–503.
  32. ALMahadin, G.; Aoudni, Y.; Shabaz, M.; Agrawal, A.V.; Yasmin, G.; Alomari, E.S.; Al-Khafaji, H.M.R.; Dansana, D.; Maaliw, R.R. VANET Network Traffic Anomaly Detection Using GRU-Based Deep Learning Model. IEEE Trans. Consum. Electron. 2024, 70, 4548–4555.
  33. Mill, E.; Garn, W.; Ryman-Tubb, N.F.; Turner, C.J. Opportunities in Real Time Fraud Detection: An Explainable Artificial Intelligence (XAI) Research Agenda. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1172–1186.
  34. Nazir, Z.; Kaldykhanov, D.; Tolep, K.K.; Park, J.G. A Machine Learning Model Selection considering Tradeoffs between Accuracy and Interpretability. In Proceedings of the 2021 13th International Conference on Information Technology and Electrical Engineering (ICITEE), Chiang Mai, Thailand, 14–15 October 2021; pp. 63–68.
  35. Raval, J.; Bhattacharya, P.; Jadav, N.K.; Tanwar, S.; Sharma, G.; Bokoro, P.N.; Elmorsy, M.; Tolba, A.; Raboaca, M.S. RaKShA: A Trusted Explainable LSTM Model to Classify Fraud Patterns on Credit Card Transactions. Mathematics 2023, 11, 1901.
  36. Basel Committee on Banking Supervision (BCBS). Basel II: International Convergence of Capital Measurement and Capital Standards. 2006. Available online: https://www.bis.org/publ/bcbs128.htm (accessed on 12 January 2024).
  37. Menggang, L.; Zhang, Z.; Ming, L.; Jia, X.; Liu, R.; Zhou, X.; Zhang, Y. Internet Financial Credit Risk Assessment with Sliding Window and Attention Mechanism LSTM Model. Teh. Vjesn.—Tech. Gaz. 2023, 30, 1–7.
  38. Mukherjee, A.; Qazani, M.R.C.; Jewel Rana, B.; Akter, S.; Mohajerzadeh, A.; Sathi, N.J.; Ali, L.E.; Khan, M.S.; Asadi, H. SMOTE-ENN resampling technique with Bayesian optimization for multi-class classification of dry bean varieties. Appl. Soft Comput. 2025, 181, 113467.
  39. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  40. Baumgartner, A.; Molani, S.; Wei, Q.; Hadlock, J. Imputing missing observations with time sliced synthetic minority oversampling technique. arXiv 2022, arXiv:2201.05634.
  41. Öztürk, C. Enhancing Financial Time-Series Analysis with TimeGAN: A Novel Approach. In Proceedings of the 2024 9th International Conference on Computer Science and Engineering (UBMK), Antalya, Türkiye, 26–28 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 447–450.
  42. Kim, W.; Jeon, J.; Kim, S.; Jang, M.; Lee, H.; Yoo, S.; Oh, K.J. Prediction of index futures movement using TimeGAN and 3D-CNN: Empirical evidence from Korea and the United States. Appl. Soft Comput. 2025, 171, 112748.
  43. Kim, D.; Park, J.; Lee, J.; Kim, H. Are self-attentions effective for time series forecasting? Adv. Neural Inf. Process. Syst. 2024, 37, 114180–114209.
  44. Sun, W.; Liu, Z.; Yuan, C.; Zhou, X.; Pei, Y.; Wei, C. RCSAN residual enhanced channel spatial attention network for stock price forecasting. Sci. Rep. 2025, 15, 21800.
  45. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
  46. Cao, K.; Zhang, T.; Huang, J. Advanced hybrid LSTM-transformer architecture for real-time multi-task prediction in engineering systems. Sci. Rep. 2024, 14, 4890.
  47. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
  48. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 243–248.
  49. Chamseddine, E.; Mansouri, N.; Soui, M.; Abed, M. Handling class imbalance in COVID-19 chest X-ray images classification: Using SMOTE and weighted loss. Appl. Soft Comput. 2022, 129, 109588.
  50. Doan, Q.H.; Mai, S.H.; Do, Q.T.; Thai, D.K. A cluster-based data splitting method for small sample and class imbalance problems in impact damage classification. Appl. Soft Comput. 2022, 120, 108628.
  51. Zheng, Z.; Cai, Y.; Li, Y. Oversampling Method for Imbalanced Classification. Comput. Inform. 2015, 34, 1017–1037.
  52. Lei, J.Z.; Ghorbani, A.A. Improved competitive learning neural networks for network intrusion and fraud detection. Neurocomputing 2012, 75, 135–145.
  53. Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136.
  54. Yang, Y.; Fang, T.; Hu, J.; Goh, C.C.; Zhang, H.; Cai, Y.; Bellotti, A.G.; Lee, B.G.; Ming, Z. A comprehensive study on the interplay between dataset characteristics and oversampling methods. J. Oper. Res. Soc. 2025, 76, 1981–2002.
  55. Zhu, T.; Lin, Y.; Liu, Y. Improving interpolation-based oversampling for imbalanced data learning. Knowl. Based Syst. 2020, 187, 104826.
  56. Touzani, Y.; Douzi, K. An LSTM and GRU based trading strategy adapted to the Moroccan market. J. Big Data 2021, 8, 126.
  57. Kakkar, S.; B, S.; Reddy, L.S.; Pal, S.; Dimri, S.; Nishant, N. Analysis of Discovering Fraud in Master Card Based on Bidirectional GRU and CNN Based Model. In Proceedings of the 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 18–20 October 2023.
  58. Khalid, A.; Mustafa, G.; Rana, M.R.R.; Alshahrani, S.M.; Alymani, M. RNN-BiLSTM-CRF based amalgamated deep learning model for electricity theft detection to secure smart grids. PeerJ Comput. Sci. 2024, 10, e1872.
Figure 1. Architecture of the proposed ResE-BiLSTM model.
Figure 2. Example of (a) F1 and (b) AUC values for each model using different resampling methods in case of cohort 2019Q4.
Figure 3. Example of (a) F1 and (b) AUC for each model under different feature window lengths in case of cohort 2019Q3.
Figure 4. SHAP feature importance of the top three features over time: (a) Interest Bearing UPB-Delta, (b) Current Actual UPB-Delta, and (c) ELTV. Note: The horizontal axis denotes the month index (e.g., 14 = 14th month), while the vertical axis shows the feature importance in the top 50 rankings, where 50 signifies the most important feature.
Table 1. Summary of the 44 independent cohorts in the Freddie Mac Single-Family Loan-Level Dataset.

Cohort | Number of Loans | Average Loan Length (Months) | Median Loan Length (Months) | Default Rate
2009Q1 | 17,604 | 56.805 | 42 | 1.755%
2009Q2 | 16,730 | 59.773 | 42 | 1.470%
2009Q3 | 15,728 | 63.581 | 44 | 2.893%
2009Q4 | 16,080 | 62.189 | 42 | 2.674%
2010Q1 | 15,779 | 63.375 | 42 | 2.884%
2010Q2 | 16,132 | 61.989 | 40 | 3.149%
2010Q3 | 15,525 | 64.412 | 46 | 2.209%
2010Q4 | 12,957 | 77.178 | 67 | 1.443%
2011Q1 | 13,969 | 71.587 | 61 | 2.098%
2011Q2 | 16,211 | 61.687 | 47 | 2.807%
2011Q3 | 15,196 | 65.807 | 55 | 2.165%
2011Q4 | 12,631 | 79.170 | 76 | 1.362%
2012Q1 | 12,304 | 81.274 | 85 | 1.756%
2012Q2 | 11,566 | 86.460 | 92 | 1.816%
2012Q3 | 11,209 | 89.214 | 95 | 1.963%
2012Q4 | 10,939 | 91.416 | 97 | 1.901%
2013Q1 | 11,138 | 89.783 | 96 | 2.182%
2013Q2 | 11,444 | 87.382 | 92 | 2.386%
2013Q3 | 12,733 | 78.536 | 83 | 2.592%
2013Q4 | 15,045 | 66.467 | 65 | 2.951%
2014Q1 | 16,002 | 62.492 | 62 | 3.412%
2014Q2 | 16,287 | 61.399 | 61 | 2.923%
2014Q3 | 15,715 | 63.633 | 67 | 2.660%
2014Q4 | 15,778 | 63.379 | 66 | 2.903%
2015Q1 | 15,638 | 63.947 | 67 | 2.954%
2015Q2 | 14,592 | 68.531 | 68 | 2.947%
2015Q3 | 15,923 | 62.802 | 63 | 3.410%
2015Q4 | 15,860 | 63.052 | 62 | 3.140%
2016Q1 | 17,180 | 58.207 | 59 | 3.423%
2016Q2 | 16,102 | 62.104 | 59 | 3.198%
2016Q3 | 15,871 | 63.008 | 60 | 3.459%
2016Q4 | 15,961 | 62.653 | 61 | 3.390%
2017Q1 | 19,386 | 51.584 | 49 | 4.354%
2017Q2 | 20,066 | 49.836 | 45 | 4.625%
2017Q3 | 20,019 | 49.953 | 44 | 4.311%
2017Q4 | 20,194 | 49.520 | 44 | 4.600%
2018Q1 | 22,999 | 43.480 | 39 | 4.913%
2018Q2 | 26,343 | 37.961 | 31 | 4.555%
2018Q3 | 29,629 | 33.751 | 27 | 4.398%
2018Q4 | 32,095 | 31.158 | 24 | 4.518%
2019Q1 | 36,704 | 27.245 | 22 | 4.795%
2019Q2 | 34,226 | 29.218 | 22 | 4.885%
2019Q3 | 32,061 | 31.191 | 24 | 4.429%
2019Q4 | 30,083 | 33.241 | 29 | 4.168%
Table 2. Overview of the features in the Freddie Mac Single-Family Loan-Level Dataset.

No. | Feature | Description
1 | Loan Sequence Number | Unique ID allocated for every loan.
2 | Current Actual UPB | Indicates the reported final balance of the mortgage.
3 | Current Loan Delinquency Status | Days overdue relative to the due date of the most recent payment made.
4 | Defect Settlement Date | Date for resolution of Underwriting or Servicing Defects that are pending confirmation.
5 | Modification Flag | Signifies that the loan has been altered.
6 | Current Interest Rate (Current IR) | Displays the present interest rate on the mortgage note, with any modifications included.
7 | Current Deferred UPB | The current non-interest bearing UPB of the modified loan.
8 | Due Date Of Last Paid Installment (DDLPI) | Date until which the principal and interest on a loan are paid.
9 | Estimated Loan To Value (ELTV) | LTV ratio using Freddie Mac’s AVM value.
10 | Delinquency Due To Disaster | Indicator for hardship associated with disasters as reported by the Servicer.
11 | Borrower Assistance Status Code | Type of support arrangement for interim loan payment mitigation.
12 | Current Month Modification Cost | Monthly expense resulting from rate adjustment or UPB forbearance.
13 | Interest Bearing UPB | The interest-bearing UPB of the adjusted loan.
Table 3. $AvgRR(d,r)$ of the 44 data cohorts under different resampling methods when feature window length is 12 (lower values indicate better performance).

Data Cohorts | RUS | SMOTE | TimeGAN | tSMOTE | ROS | No Resampling
2009Q1 | 12.9 | 21.3 | 14.0 | 13.0 | 24.5 | 25.2
2009Q2 | 14.3 | 21.4 | 13.6 | 15.2 | 21.5 | 25.0
2009Q3 | 14.4 | 20.3 | 14.5 | 13.8 | 22.3 | 25.6
2009Q4 | 12.6 | 22.2 | 15.0 | 14.1 | 21.5 | 25.6
2010Q1 | 12.0 | 20.0 | 14.4 | 14.1 | 24.7 | 25.8
2010Q2 | 11.7 | 21.1 | 15.3 | 14.6 | 23.0 | 25.3
2010Q3 | 13.6 | 20.0 | 14.6 | 13.9 | 22.9 | 26.1
2010Q4 | 13.4 | 21.0 | 14.4 | 13.5 | 22.3 | 26.5
2011Q1 | 12.5 | 21.2 | 13.4 | 15.0 | 23.2 | 25.7
2011Q2 | 13.6 | 21.1 | 14.3 | 13.6 | 22.6 | 25.9
2011Q3 | 13.1 | 20.0 | 14.2 | 13.1 | 23.9 | 26.8
2011Q4 | 13.8 | 21.5 | 14.5 | 13.1 | 22.0 | 26.1
2012Q1 | 12.7 | 20.1 | 16.6 | 14.1 | 22.9 | 24.5
2012Q2 | 14.8 | 21.3 | 15.1 | 13.5 | 21.5 | 24.8
2012Q3 | 13.3 | 22.1 | 14.0 | 14.1 | 22.1 | 25.4
2012Q4 | 11.3 | 20.4 | 14.9 | 15.2 | 23.0 | 26.2
2013Q1 | 11.3 | 21.4 | 15.5 | 14.3 | 21.9 | 26.6
2013Q2 | 13.5 | 21.4 | 13.5 | 14.3 | 23.0 | 25.2
2013Q3 | 11.6 | 20.9 | 15.6 | 14.6 | 21.6 | 26.6
2013Q4 | 12.4 | 21.6 | 14.3 | 13.5 | 22.4 | 26.8
2014Q1 | 13.4 | 20.3 | 14.3 | 13.8 | 23.4 | 25.9
2014Q2 | 14.8 | 20.1 | 13.3 | 14.2 | 23.5 | 25.1
2014Q3 | 13.0 | 22.2 | 13.2 | 13.8 | 23.3 | 25.5
2014Q4 | 12.2 | 20.4 | 15.6 | 14.3 | 22.4 | 26.0
2015Q1 | 12.0 | 21.0 | 14.8 | 13.8 | 22.7 | 26.6
2015Q2 | 13.8 | 21.1 | 14.4 | 13.7 | 22.6 | 25.3
2015Q3 | 14.2 | 21.4 | 13.2 | 13.8 | 21.5 | 26.9
2015Q4 | 13.2 | 21.8 | 14.0 | 13.5 | 22.7 | 25.7
2016Q1 | 13.4 | 21.2 | 13.2 | 13.9 | 23.0 | 26.3
2016Q2 | 12.9 | 20.2 | 15.2 | 14.5 | 22.6 | 25.7
2016Q3 | 14.4 | 20.1 | 14.4 | 13.3 | 23.0 | 25.8
2016Q4 | 13.7 | 20.3 | 14.1 | 15.2 | 22.0 | 25.6
2017Q1 | 11.9 | 21.2 | 13.8 | 15.4 | 23.0 | 25.8
2017Q2 | 12.4 | 20.4 | 14.8 | 14.6 | 23.1 | 25.7
2017Q3 | 14.3 | 21.5 | 13.5 | 13.2 | 21.9 | 26.6
2017Q4 | 11.3 | 21.2 | 15.3 | 15.4 | 22.4 | 25.3
2018Q1 | 13.9 | 20.8 | 14.4 | 13.8 | 22.9 | 25.2
2018Q2 | 12.3 | 20.4 | 15.2 | 13.8 | 22.5 | 26.7
2018Q3 | 13.9 | 20.0 | 13.5 | 13.4 | 24.1 | 26.2
2018Q4 | 12.8 | 20.8 | 13.4 | 13.9 | 23.4 | 26.6
2019Q1 | 11.4 | 20.4 | 14.4 | 14.4 | 23.8 | 26.7
2019Q2 | 13.2 | 20.4 | 15.2 | 14.1 | 21.7 | 26.4
2019Q3 | 12.2 | 22.3 | 13.7 | 13.0 | 22.8 | 26.9
2019Q4 | 12.7 | 22.1 | 14.3 | 13.3 | 23.6 | 25.0
Average | 13.0 | 21.0 | 14.3 | 14.0 | 22.8 | 25.9
Note: Values in bold indicate the best performance metric within each data cohort.
Table 4. $AvgRR(d,r)$ of the 44 data cohorts under different resampling methods when feature window length is 18 (lower values indicate better performance).

Data Cohorts | RUS | SMOTE | TimeGAN | tSMOTE | ROS | No Resampling
2009Q1 | 8.1 | 19.9 | 12.9 | 13.6 | 25.2 | 31.3
2009Q2 | 10.6 | 18.6 | 14.2 | 13.7 | 23.7 | 30.1
2009Q3 | 6.8 | 19.7 | 12.8 | 12.5 | 26.3 | 32.8
2009Q4 | 10.0 | 18.8 | 14.2 | 14.1 | 24.1 | 29.8
2010Q1 | 6.6 | 19.7 | 12.4 | 12.4 | 26.8 | 33.1
2010Q2 | 8.5 | 19.1 | 13.4 | 13.4 | 25.1 | 31.4
2010Q3 | 9.9 | 18.8 | 13.9 | 14.3 | 24.2 | 29.8
2010Q4 | 10.1 | 18.5 | 13.7 | 13.8 | 24.1 | 30.9
2011Q1 | 10.5 | 19.1 | 14.3 | 14.1 | 23.8 | 29.2
2011Q2 | 9.2 | 19.4 | 13.7 | 13.7 | 24.4 | 30.7
2011Q3 | 11.3 | 18.7 | 14.9 | 14.5 | 22.9 | 28.7
2011Q4 | 10.5 | 19.6 | 14.3 | 13.7 | 23.8 | 29.2
2012Q1 | 11.2 | 19.1 | 14.5 | 14.7 | 22.9 | 28.5
2012Q2 | 11.1 | 19.0 | 14.7 | 14.3 | 23.2 | 28.7
2012Q3 | 10.4 | 18.9 | 14.7 | 14.6 | 23.5 | 28.8
2012Q4 | 8.4 | 19.6 | 13.1 | 12.9 | 25.3 | 31.8
2013Q1 | 10.8 | 19.4 | 14.7 | 14.6 | 23.4 | 28.1
2013Q2 | 10.5 | 18.7 | 14.3 | 14.0 | 24.0 | 29.6
2013Q3 | 11.5 | 19.1 | 14.3 | 14.3 | 23.4 | 28.4
2013Q4 | 10.3 | 18.8 | 14.2 | 14.6 | 23.9 | 29.2
2014Q1 | 10.0 | 19.3 | 13.9 | 13.9 | 23.9 | 29.9
2014Q2 | 9.8 | 19.2 | 13.4 | 14.0 | 24.7 | 29.9
2014Q3 | 7.3 | 19.1 | 13.2 | 13.0 | 25.6 | 32.7
2014Q4 | 10.5 | 19.0 | 14.1 | 14.2 | 23.7 | 29.6
2015Q1 | 11.1 | 19.0 | 14.2 | 14.4 | 23.5 | 28.7
2015Q2 | 11.4 | 18.5 | 12.5 | 12.3 | 24.8 | 31.5
2015Q3 | 9.8 | 18.8 | 14.1 | 14.3 | 24.2 | 29.8
2015Q4 | 10.1 | 19.3 | 14.1 | 14.5 | 24.0 | 29.1
2016Q1 | 10.3 | 18.8 | 14.6 | 14.3 | 23.9 | 29.0
2016Q2 | 9.5 | 19.0 | 13.7 | 13.4 | 24.4 | 31.1
2016Q3 | 9.9 | 19.0 | 14.2 | 14.3 | 23.9 | 29.7
2016Q4 | 9.4 | 18.4 | 14.3 | 13.5 | 24.3 | 31.0
2017Q1 | 9.1 | 19.3 | 13.7 | 14.0 | 24.4 | 30.5
2017Q2 | 10.4 | 18.8 | 14.3 | 14.4 | 23.8 | 29.4
2017Q3 | 12.8 | 19.4 | 12.3 | 12.1 | 24.3 | 30.1
2017Q4 | 9.3 | 19.1 | 14.3 | 14.0 | 23.9 | 30.4
2018Q1 | 7.7 | 19.6 | 13.3 | 13.6 | 25.2 | 31.6
2018Q2 | 8.9 | 19.2 | 13.8 | 13.2 | 25.2 | 30.7
2018Q3 | 11.6 | 19.7 | 11.3 | 12.8 | 25.1 | 30.5
2018Q4 | 9.4 | 18.7 | 13.9 | 13.4 | 25.0 | 30.5
2019Q1 | 8.8 | 19.3 | 13.3 | 12.7 | 25.2 | 31.7
2019Q2 | 9.5 | 19.5 | 13.2 | 12.2 | 25.3 | 31.4
2019Q3 | 9.1 | 19.0 | 13.6 | 13.9 | 24.8 | 30.6
2019Q4 | 10.6 | 19.4 | 14.2 | 14.6 | 23.8 | 28.4
Average | 9.8 | 19.1 | 13.8 | 13.7 | 24.3 | 30.2
Note: Values in bold indicate the best performance metric within each data cohort.
Table 5. $AvgRL(d,l)$ of the 44 data cohorts under different feature window lengths (lower values indicate better performance).

Data Cohorts | 12 Months | 14 Months | 16 Months | 18 Months | 20 Months | 22 Months | 24 Months
2009Q1 | 19.7 | 11.8 | 15.2 | 16.5 | 27.5 | 28.7 | 31.1
2009Q2 | 22.6 | 8.4 | 13.0 | 18.2 | 24.9 | 27.6 | 35.8
2009Q3 | 20.3 | 13.2 | 12.9 | 12.4 | 24.4 | 30.2 | 37.1
2009Q4 | 18.1 | 10.6 | 13.5 | 15.7 | 25.5 | 27.2 | 39.9
2010Q1 | 19.5 | 7.6 | 14.2 | 14.8 | 27.8 | 27.8 | 38.8
2010Q2 | 18.7 | 11.9 | 14.4 | 14.5 | 24.5 | 27.3 | 39.2
2010Q3 | 17.5 | 11.6 | 13.3 | 13.4 | 25.7 | 29.6 | 39.4
2010Q4 | 18.9 | 14.9 | 13.2 | 17.4 | 24.6 | 27.2 | 34.3
2011Q1 | 20.8 | 10.8 | 8.7 | 13.4 | 28.2 | 29.5 | 39.1
2011Q2 | 17.3 | 11.8 | 12.3 | 17.0 | 27.1 | 30.6 | 34.4
2011Q3 | 21.5 | 9.4 | 13.9 | 13.8 | 28.2 | 27.8 | 35.9
2011Q4 | 21.2 | 8.7 | 12.8 | 12.6 | 27.7 | 28.0 | 39.5
2012Q1 | 20.5 | 10.7 | 13.3 | 12.8 | 23.9 | 28.0 | 41.3
2012Q2 | 21.0 | 10.6 | 11.5 | 18.4 | 25.8 | 28.4 | 34.8
2012Q3 | 17.8 | 14.3 | 15.7 | 13.6 | 22.8 | 30.6 | 35.7
2012Q4 | 21.8 | 9.2 | 12.8 | 15.0 | 26.5 | 29.7 | 35.5
2013Q1 | 17.5 | 11.8 | 12.5 | 15.9 | 28.2 | 27.8 | 36.8
2013Q2 | 19.4 | 9.2 | 8.4 | 18.0 | 27.6 | 27.4 | 40.5
2013Q3 | 21.6 | 10.5 | 15.2 | 18.3 | 24.6 | 24.1 | 36.2
2013Q4 | 21.6 | 11.9 | 14.4 | 9.2 | 24.7 | 27.1 | 41.6
2014Q1 | 18.8 | 8.3 | 15.4 | 16.5 | 27.9 | 29.5 | 34.1
2014Q2 | 17.2 | 11.4 | 11.4 | 18.6 | 27.7 | 29.9 | 34.3
2014Q3 | 19.2 | 10.7 | 14.9 | 17.8 | 26.1 | 29.9 | 31.9
2014Q4 | 21.3 | 12.6 | 12.1 | 14.4 | 25.3 | 27.7 | 37.1
2015Q1 | 19.2 | 12.8 | 7.7 | 18.9 | 24.2 | 30.6 | 37.1
2015Q2 | 19.3 | 8.7 | 15.6 | 17.6 | 24.5 | 31.6 | 33.2
2015Q3 | 18.3 | 9.5 | 11.7 | 17.9 | 26.8 | 27.8 | 38.5
2015Q4 | 20.9 | 10.9 | 15.1 | 12.0 | 24.7 | 27.5 | 39.4
2016Q1 | 21.8 | 10.6 | 7.7 | 16.2 | 24.3 | 30.9 | 39.0
2016Q2 | 17.7 | 12.7 | 8.2 | 16.5 | 27.2 | 30.0 | 38.2
2016Q3 | 19.0 | 8.7 | 14.6 | 15.1 | 25.6 | 28.0 | 39.5
2016Q4 | 20.3 | 9.3 | 12.6 | 17.6 | 26.5 | 28.9 | 35.3
2017Q1 | 18.7 | 12.5 | 13.7 | 12.4 | 24.3 | 30.4 | 38.5
2017Q2 | 18.7 | 11.3 | 13.7 | 12.8 | 25.7 | 31.3 | 37.0
2017Q3 | 19.9 | 12.4 | 13.7 | 17.9 | 28.4 | 29.8 | 28.4
2017Q4 | 17.7 | 12.1 | 9.5 | 13.7 | 25.3 | 29.2 | 43.0
2018Q1 | 17.3 | 11.6 | 13.5 | 14.2 | 27.6 | 29.0 | 37.3
2018Q2 | 20.1 | 9.0 | 11.5 | 15.3 | 26.1 | 28.2 | 40.3
2018Q3 | 20.2 | 12.2 | 14.6 | 18.2 | 27.2 | 31.7 | 26.4
2018Q4 | 19.9 | 11.5 | 14.4 | 14.5 | 24.4 | 27.4 | 38.4
2019Q1 | 20.7 | 10.7 | 15.2 | 12.1 | 25.8 | 27.3 | 38.7
2019Q2 | 20.4 | 11.9 | 11.4 | 15.9 | 27.6 | 27.1 | 36.2
2019Q3 | 18.7 | 7.6 | 12.4 | 18.0 | 27.1 | 29.8 | 36.9
2019Q4 | 19.3 | 12.7 | 13.4 | 13.7 | 26.4 | 28.7 | 36.3
Average | 19.6 | 10.9 | 12.8 | 15.4 | 26.1 | 28.8 | 36.9
Note: Values in bold indicate the best performance metric within each data cohort.
Table 6. Summary of the overall average ranking $AvgRM(d,m)$ for model m within each cohort (lower values indicate better performance).

Quarter | LSTM | BiLSTM | GRU | CNN | RNN | ResE-BiLSTM
2009Q1 | 2.2 | 4.8 | 3.4 | 5.4 | 4.2 | 1
2009Q2 | 3 | 5.2 | 3.6 | 2.8 | 5.4 | 1
2009Q3 | 4.8 | 4 | 3.8 | 5.2 | 2.2 | 1
2009Q4 | 3.4 | 4.2 | 1 | 5.6 | 2 | 4.8
2010Q1 | 2.8 | 4.8 | 4.2 | 5.6 | 2.6 | 1
2010Q2 | 3.2 | 4 | 2.6 | 6 | 4.2 | 1
2010Q3 | 4 | 2.2 | 3.6 | 5.8 | 4.4 | 1
2010Q4 | 3.4 | 2.6 | 4.8 | 2.2 | 2.6 | 5.4
2011Q1 | 3.4 | 1.6 | 3 | 5.4 | 3.6 | 4
2011Q2 | 3.2 | 3.4 | 4 | 6 | 3.4 | 1
2011Q3 | 3.4 | 2.2 | 1.2 | 5.2 | 4.2 | 4.8
2011Q4 | 2.8 | 3.8 | 4.8 | 5.2 | 2.6 | 1.6
2012Q1 | 3 | 4 | 2.4 | 6 | 1.8 | 3.8
2012Q2 | 4 | 6 | 4.6 | 3.4 | 2 | 1
2012Q3 | 3 | 3.6 | 3 | 5.8 | 4.6 | 1
2012Q4 | 4.2 | 2.4 | 3.2 | 6 | 4.2 | 1
2013Q1 | 3.6 | 3.8 | 2.2 | 5 | 5.4 | 1
2013Q2 | 3.4 | 2.2 | 3.8 | 5.8 | 4.2 | 1.6
2013Q3 | 3 | 3.6 | 2.4 | 6 | 4.2 | 1.8
2013Q4 | 3.6 | 3.2 | 4.8 | 6 | 2.2 | 1.2
2014Q1 | 3.8 | 4.4 | 3.2 | 6 | 2.6 | 1
2014Q2 | 3 | 3.8 | 5 | 5.4 | 2.6 | 1.2
2014Q3 | 2.8 | 2.2 | 5.2 | 5.6 | 3.8 | 1.4
2014Q4 | 3.2 | 4 | 4.8 | 4.8 | 3.2 | 1
2015Q1 | 3.4 | 4.4 | 5 | 4.6 | 2.4 | 1.2
2015Q2 | 2.6 | 2.4 | 4.4 | 5.6 | 5 | 1
2015Q3 | 3 | 2 | 3.8 | 6 | 5 | 1.2
2015Q4 | 4 | 1.4 | 3 | 6 | 3.8 | 2.8
2016Q1 | 4.6 | 2.6 | 3.8 | 6 | 2.8 | 1.2
2016Q2 | 3.8 | 3.8 | 2.8 | 6 | 2 | 2.4
2016Q3 | 2.4 | 4.4 | 2.2 | 6 | 3.8 | 2.2
2016Q4 | 4 | 4.6 | 2.8 | 6 | 2.4 | 1.2
2017Q1 | 3.6 | 3.4 | 4.6 | 6 | 2.4 | 1
2017Q2 | 3 | 3.2 | 4 | 5.2 | 4.6 | 1
2017Q3 | 3 | 2.4 | 4.4 | 6 | 4.2 | 1
2017Q4 | 3.2 | 2.6 | 4 | 6 | 4.2 | 1
2018Q1 | 2.4 | 3.8 | 3.4 | 6 | 3.6 | 1.8
2018Q2 | 2.4 | 3.4 | 4 | 6 | 4.2 | 1
2018Q3 | 3.6 | 3.8 | 3.8 | 6 | 2.2 | 1.6
2018Q4 | 4 | 3.6 | 3.8 | 6 | 2.6 | 1
2019Q1 | 3 | 4.2 | 4 | 6 | 2 | 1.8
2019Q2 | 3.2 | 3.8 | 4.8 | 5.2 | 3 | 1
2019Q3 | 3.6 | 2.4 | 4.2 | 6 | 3 | 1.8
2019Q4 | 3 | 4.4 | 3.8 | 6 | 2.8 | 1
Note: Values in bold indicate the best performance metric within each data cohort.
Table 7. Summary of the annual average ranking $AvgRM(Y,m)$ for model m in year Y (lower values indicate better performance).

Year | LSTM | BiLSTM | GRU | CNN | RNN | ResE-BiLSTM
2009 | 12.10 | 14.10 | 11.65 | 17.30 | 12.40 | 7.45
2010 | 12.15 | 11.80 | 12.70 | 15.70 | 12.35 | 10.30
2011 | 10.75 | 9.55 | 10.55 | 20.20 | 12.25 | 11.65
2012 | 12.50 | 12.80 | 12.10 | 16.45 | 11.80 | 9.35
2013 | 12.15 | 11.40 | 11.80 | 19.40 | 13.20 | 7.05
2014 | 11.85 | 12.40 | 13.95 | 18.55 | 11.40 | 6.85
2015 | 11.30 | 9.95 | 12.75 | 20.65 | 12.90 | 7.45
2016 | 12.60 | 12.70 | 10.70 | 20.25 | 10.65 | 8.05
2017 | 11.05 | 10.75 | 13.70 | 21.75 | 12.60 | 5.15
2018 | 11.45 | 11.80 | 12.20 | 21.60 | 11.60 | 6.35
2019 | 11.10 | 12.45 | 12.80 | 20.85 | 10.15 | 7.65
Note: Values in bold indicate the best performance metric within each data cohort.
Table 8. Wilcoxon signed-rank test results (p-values) for comparing ResE-BiLSTM with the baseline models LSTM and BiLSTM. Values in bold indicate statistical significance at the 0.5% significance level.

| Data Cohort | F1: vs. LSTM | F1: vs. BiLSTM | AUC: vs. LSTM | AUC: vs. BiLSTM |
|---|---|---|---|---|
| 2009Q1 | 4.12 × 10⁻¹ | 2.68 × 10⁻² | 2.28 × 10⁻¹ | 5.87 × 10⁻¹ |
| 2009Q2 | 6.84 × 10⁻³ | 9.77 × 10⁻⁴ | 2.49 × 10⁻³ | 1.10 × 10⁻³ |
| 2009Q3 | 3.24 × 10⁻³ | 2.47 × 10⁻³ | 1.98 × 10⁻² | 1.14 × 10⁻² |
| 2009Q4 | 7.40 × 10⁻¹ | 3.21 × 10⁻¹ | 6.25 × 10⁻¹ | 4.85 × 10⁻¹ |
| 2010Q1 | 2.91 × 10⁻³ | 4.11 × 10⁻³ | 1.33 × 10⁻¹ | 2.88 × 10⁻¹ |
| 2010Q2 | 4.62 × 10⁻³ | 2.18 × 10⁻³ | 4.74 × 10⁻³ | 9.33 × 10⁻³ |
| 2010Q3 | 2.89 × 10⁻³ | 2.28 × 10⁻¹ | 3.60 × 10⁻² | 2.28 × 10⁻¹ |
| 2010Q4 | 5.12 × 10⁻¹ | 6.18 × 10⁻¹ | 7.40 × 10⁻¹ | 8.32 × 10⁻¹ |
| 2011Q1 | 6.25 × 10⁻¹ | 3.60 × 10⁻² | 9.83 × 10⁻² | 5.20 × 10⁻² |
| 2011Q2 | 1.68 × 10⁻³ | 9.77 × 10⁻⁴ | 9.77 × 10⁻⁴ | 1.60 × 10⁻³ |
| 2011Q3 | 3.60 × 10⁻¹ | 3.28 × 10⁻¹ | 2.28 × 10⁻¹ | 3.60 × 10⁻² |
| 2011Q4 | 2.05 × 10⁻³ | 3.29 × 10⁻³ | 2.61 × 10⁻³ | 1.42 × 10⁻³ |
| 2012Q1 | 4.85 × 10⁻¹ | 6.25 × 10⁻¹ | 4.12 × 10⁻¹ | 3.60 × 10⁻² |
| 2012Q2 | 2.14 × 10⁻³ | 4.81 × 10⁻³ | 3.41 × 10⁻³ | 9.77 × 10⁻⁴ |
| 2012Q3 | 1.86 × 10⁻³ | 4.03 × 10⁻³ | 4.82 × 10⁻³ | 1.01 × 10⁻¹ |
| 2012Q4 | 1.39 × 10⁻³ | 7.24 × 10⁻³ | 1.93 × 10⁻³ | 4.28 × 10⁻³ |
| 2013Q1 | 2.74 × 10⁻³ | 4.57 × 10⁻³ | 3.44 × 10⁻³ | 1.18 × 10⁻³ |
| 2013Q2 | 3.05 × 10⁻³ | 8.21 × 10⁻² | 3.60 × 10⁻² | 4.85 × 10⁻¹ |
| 2013Q3 | 2.63 × 10⁻³ | 1.41 × 10⁻³ | 4.91 × 10⁻¹ | 2.28 × 10⁻¹ |
| 2013Q4 | 1.58 × 10⁻³ | 1.02 × 10⁻³ | 1.24 × 10⁻³ | 1.72 × 10⁻³ |
| 2014Q1 | 3.60 × 10⁻² | 2.28 × 10⁻³ | 1.01 × 10⁻² | 4.33 × 10⁻³ |
| 2014Q2 | 5.47 × 10⁻³ | 6.41 × 10⁻³ | 3.18 × 10⁻³ | 2.24 × 10⁻³ |
| 2014Q3 | 2.24 × 10⁻² | 1.51 × 10⁻² | 2.17 × 10⁻² | 1.83 × 10⁻² |
| 2014Q4 | 2.06 × 10⁻³ | 1.42 × 10⁻³ | 7.36 × 10⁻³ | 5.57 × 10⁻³ |
| 2015Q1 | 1.13 × 10⁻³ | 4.22 × 10⁻³ | 1.21 × 10⁻³ | 1.07 × 10⁻³ |
| 2015Q2 | 3.29 × 10⁻³ | 2.87 × 10⁻³ | 2.38 × 10⁻² | 1.64 × 10⁻² |
| 2015Q3 | 1.99 × 10⁻³ | 8.91 × 10⁻² | 1.28 × 10⁻² | 1.09 × 10⁻³ |
| 2015Q4 | 1.82 × 10⁻³ | 1.06 × 10⁻² | 7.53 × 10⁻¹ | 6.76 × 10⁻¹ |
| 2016Q1 | 2.28 × 10⁻¹ | 6.25 × 10⁻¹ | 2.548 × 10⁻³ | 1.01 × 10⁻¹ |
| 2016Q2 | 3.61 × 10⁻³ | 4.12 × 10⁻³ | 1.01 × 10⁻¹ | 4.61 × 10⁻¹ |
| 2016Q3 | 3.22 × 10⁻² | 2.44 × 10⁻³ | 3.59 × 10⁻¹ | 1.95 × 10⁻¹ |
| 2016Q4 | 2.57 × 10⁻³ | 1.91 × 10⁻³ | 4.28 × 10⁻³ | 4.95 × 10⁻³ |
| 2017Q1 | 9.71 × 10⁻² | 2.18 × 10⁻³ | 2.07 × 10⁻³ | 1.43 × 10⁻³ |
| 2017Q2 | 1.92 × 10⁻³ | 1.25 × 10⁻³ | 1.86 × 10⁻² | 1.52 × 10⁻² |
| 2017Q3 | 3.41 × 10⁻² | 7.13 × 10⁻² | 2.95 × 10⁻³ | 1.88 × 10⁻³ |
| 2017Q4 | 2.49 × 10⁻³ | 1.33 × 10⁻³ | 1.62 × 10⁻³ | 1.11 × 10⁻³ |
| 2018Q1 | 5.20 × 10⁻² | 6.80 × 10⁻² | 4.91 × 10⁻² | 6.39 × 10⁻² |
| 2018Q2 | 2.71 × 10⁻³ | 1.93 × 10⁻³ | 2.12 × 10⁻² | 4.57 × 10⁻³ |
| 2018Q3 | 1.64 × 10⁻³ | 1.08 × 10⁻³ | 1.26 × 10⁻³ | 1.02 × 10⁻³ |
| 2018Q4 | 2.36 × 10⁻³ | 1.79 × 10⁻³ | 4.34 × 10⁻³ | 4.17 × 10⁻³ |
| 2019Q1 | 2.85 × 10⁻² | 2.23 × 10⁻² | 1.58 × 10⁻² | 4.13 × 10⁻³ |
| 2019Q2 | 1.42 × 10⁻³ | 1.03 × 10⁻³ | 1.12 × 10⁻² | 9.77 × 10⁻³ |
| 2019Q3 | 1.27 × 10⁻³ | 1.86 × 10⁻³ | 3.05 × 10⁻³ | 2.98 × 10⁻³ |
| 2019Q4 | 4.12 × 10⁻³ | 3.25 × 10⁻³ | 2.08 × 10⁻² | 3.44 × 10⁻³ |
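As a companion to Table 8, the following is a minimal sketch of how such a paired comparison can be run with scipy.stats.wilcoxon. It assumes paired F1 scores from the proposed model and a baseline evaluated on the same splits; the number of runs and all score values below are invented for illustration.

```python
from scipy.stats import wilcoxon

# Hypothetical paired F1 scores for one cohort: ten repeated runs of the
# proposed model and a baseline on identical data splits (values invented).
f1_rese_bilstm = [0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.92, 0.91, 0.93, 0.94]
f1_lstm        = [0.89, 0.90, 0.90, 0.91, 0.88, 0.90, 0.89, 0.90, 0.91, 0.90]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(f1_rese_bilstm, f1_lstm)
print(f"W = {stat:.1f}, p = {p_value:.3e}")
```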
Table 9. Overview of the ablation study comparing the proposed ResE-BiLSTM with E-BiLSTM (M1), A-BiLSTM (M2), BiLSTM (M3), and LSTM (M4).

| Cohorts | Metric | ResE-BiLSTM | E-BiLSTM (M1) | A-BiLSTM (M2) | BiLSTM (M3) | LSTM (M4) |
|---|---|---|---|---|---|---|
| 2009–2011 | Accuracy | 0.9283 | 0.9151 | 0.7514 | 0.9121 | 0.9040 |
| 2009–2011 | Precision | 0.9614 | 0.9493 | 0.9451 | 0.9467 | 0.9534 |
| 2009–2011 | Recall | 0.8917 | 0.8670 | 0.5347 | 0.8734 | 0.8497 |
| 2009–2011 | F1 | 0.9252 | 0.9063 | 0.6812 | 0.9085 | 0.8984 |
| 2009–2011 | AUC | 0.9709 | 0.9618 | 0.8702 | 0.9614 | 0.9594 |
| 2012–2014 | Accuracy | 0.9311 | 0.9184 | 0.7460 | 0.9086 | 0.9079 |
| 2012–2014 | Precision | 0.9317 | 0.9191 | 0.7421 | 0.8930 | 0.8957 |
| 2012–2014 | Recall | 0.9404 | 0.9267 | 0.7750 | 0.9286 | 0.9234 |
| 2012–2014 | F1 | 0.9360 | 0.9229 | 0.7535 | 0.9104 | 0.9093 |
| 2012–2014 | AUC | 0.9724 | 0.9612 | 0.8475 | 0.9577 | 0.9556 |
| 2015–2017 | Accuracy | 0.9203 | 0.9050 | 0.7047 | 0.8933 | 0.8882 |
| 2015–2017 | Precision | 0.8945 | 0.8843 | 0.6839 | 0.8811 | 0.8696 |
| 2015–2017 | Recall | 0.9312 | 0.9184 | 0.7763 | 0.9094 | 0.9132 |
| 2015–2017 | F1 | 0.9125 | 0.9010 | 0.7241 | 0.8950 | 0.8909 |
| 2015–2017 | AUC | 0.9678 | 0.9572 | 0.8101 | 0.9561 | 0.9549 |
| 2018–2020 | Accuracy | 0.9331 | 0.9196 | 0.7950 | 0.9154 | 0.9120 |
| 2018–2020 | Precision | 0.9791 | 0.9687 | 0.9599 | 0.9579 | 0.9579 |
| 2018–2020 | Recall | 0.8671 | 0.8496 | 0.6967 | 0.8325 | 0.8257 |
| 2018–2020 | F1 | 0.9197 | 0.9052 | 0.8074 | 0.8908 | 0.8869 |
| 2018–2020 | AUC | 0.9736 | 0.9619 | 0.9059 | 0.9593 | 0.9599 |
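To make the ablation variants in Table 9 concrete, the PyTorch sketch below shows one way a BiLSTM backbone can be combined with an attention encoder and a residual shortcut, with flags that switch the two components off. The layer sizes, attention placement, and classification head are assumptions made for illustration; the authors' exact architecture may differ.

```python
import torch
import torch.nn as nn

class ResEBiLSTM(nn.Module):
    """Illustrative ResE-BiLSTM-style classifier with ablation flags that
    mimic the variants in Table 9 (architecture details are assumed)."""

    def __init__(self, n_features, hidden=64, use_attention=True, use_residual=True):
        super().__init__()
        # BiLSTM captures local temporal patterns in both directions.
        self.bilstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.use_attention = use_attention
        self.use_residual = use_residual
        if use_attention:
            # Self-attention models long-range (global) dependencies.
            self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                    # x: (batch, months, features)
        h, _ = self.bilstm(x)
        if self.use_attention:
            a, _ = self.attn(h, h, h)
            # The residual shortcut preserves distant temporal information.
            h = h + a if self.use_residual else a
        return self.head(h[:, -1])           # default-probability logit

# Example: an attention-only variant without the residual shortcut.
model = ResEBiLSTM(n_features=10, use_residual=False)
logits = model(torch.randn(8, 14, 10))       # 14-month feature window
```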
Table 10. The number of months each feature appears in the top 50 feature importance rankings (up to a maximum of 14).

| Feature | ResE-BiLSTM | BiLSTM | LSTM | GRU | RNN | CNN |
|---|---|---|---|---|---|---|
| Interest Bearing UPB-Delta | 14 | 14 | 14 | 14 | 14 | 14 |
| Current Actual UPB-Delta | 14 | 14 | 14 | 14 | 14 | 14 |
| Estimated Loan to Value (ELTV) | 12 | 11 | 14 | 14 | 11 | 14 |
| Borrower Assistance Status Code_F | 3 | 3 | 4 | 3 | 3 | - |
| Delinquency Due To Disaster_Y | 4 | 3 | 3 | 2 | 3 | - |
| Current Deferred UPB | 3 | 3 | - | 3 | 4 | 8 |
| Delinquency Due To Disaster_NAN | - | 1 | - | - | 1 | - |
| Borrower Assistance Status Code_NAN | - | 1 | - | - | - | - |
| Current Interest Rate | - | - | 1 | - | - | - |
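The counting procedure behind Table 10 can be sketched as follows. It assumes a per-model matrix of mean absolute SHAP values with one row per feature and one column per month of the 14-month window, and interprets the "top 50" as the 50 largest feature-month entries overall; both the aggregation and this interpretation are assumptions, and the importance values below are randomly generated placeholders.

```python
import numpy as np

# Hypothetical mean |SHAP| values per (feature, month) pair for one model;
# the array shape and the aggregation are assumptions for illustration.
rng = np.random.default_rng(0)
n_features, n_months = 30, 14
feature_names = [f"feature_{i}" for i in range(n_features)]
shap_importance = rng.random((n_features, n_months))

# Rank all feature-month pairs by importance and keep the global top 50.
flat_top = np.argsort(shap_importance, axis=None)[::-1][:50]
feat_idx, month_idx = np.unravel_index(flat_top, shap_importance.shape)

# Count, per feature, in how many distinct months it enters the top 50
# (bounded by 14, matching the cap noted in Table 10).
counts = {}
for f, m in zip(feat_idx, month_idx):
    counts.setdefault(feature_names[f], set()).add(m)
months_in_top50 = {name: len(months) for name, months in counts.items()}
print(sorted(months_in_top50.items(), key=lambda kv: -kv[1])[:5])
```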