2. Related Works
The credit default prediction task is essentially a binary classification problem (default vs. non-default). Based on recent review studies [6,15,16], existing methods for predicting credit default can generally be grouped into the following categories:
The field of credit scoring has long relied on a variety of statistical methods, including discriminant analysis [17], logistic regression [18], and linear probability models [15]. These techniques are popular because they are easy to use and provide insight into borrower behavior [19]. Other studies apply traditional machine learning models such as support vector machines (SVM), Naive Bayes (NB), and k-nearest neighbors (KNN) to predict credit default [9,20].
Refs. [21,22] are among the early works that compare different credit scoring methods on several real-world datasets, and they suggest that neural networks could be an effective approach to the problem. Ref. [23] proposed transformer-based networks. Refs. [24,25] proposed networks that combine transformers with convolutional neural networks (CNNs). Note that in these studies, the majority of the credit risk datasets (e.g., [26,27]) are in a tabular format, where each row represents a unique borrower or a specific loan application and each column corresponds to a particular piece of information about that borrower (e.g., financial information or loan details).
Ensemble techniques in machine learning aim to improve prediction performance by combining the outputs of several base models and can generally be classified into three types: bagging (e.g., random forest), boosting (e.g., XGBoost, LightGBM), and stacking. Recent studies such as [28,29] suggest that tree-based boosting algorithms generally outperform other methods (including deep tabular models such as TabNet or SAINT) on tabular datasets, and they have therefore become dominant in credit risk prediction. For datasets that are not strictly tabular in their original form (e.g., [30,31]), many studies (e.g., [32,33,34,35]) first transform the raw data into a single wide tabular dataset and then apply tree-based boosting models.
To further improve prediction performance, recent studies have gradually shifted their focus toward non-tabular data inputs. Ref. [36] proposed an RNN model to process raw transactional data. Ref. [37] proposed an edge weight-shared graph convolutional network to process transactional data. Ref. [10] introduced attentive feature fusion for credit default prediction. Ref. [38] proposed a BERT-based NLP model to process text features. Ref. [39] proposed a transformer-based network to process user behavior events. Ref. [40] proposed dynamic graph learning to process user-merchant and user-user interactions on two real-world private datasets.
Furthermore, Ref. [41] reported that advanced representation learning and anomaly detection techniques, such as optimized enhanced stacked autoencoders, are highly effective in related financial risk domains such as banking fraud detection. Similarly, Ref. [42] highlighted the importance of capturing temporal dynamics through deep learning in these broader contexts, using transactional-behavior-based hierarchical gated networks for credit card fraud detection. Ref. [43] proposed a tabular transformer for multivariate time series and evaluated it on a synthetic credit card transaction dataset for fraud detection as well as on a real pollution dataset for predicting atmospheric pollutant concentrations.
In summary, the majority of existing studies focus on tree-based approaches for tabular credit risk datasets and hence require extensive feature engineering. This makes them less favorable for deployment in real-world settings, where time and resources are constrained. Some studies have started to explore different neural networks for processing raw transactional data or user behavior events, but these networks are often used as a supplement to heavily feature-engineered approaches or require flattening the sequential records into a 1D vector. In contrast, our proposed architecture aims to directly ingest the multivariate credit data as a 2D sequence. Furthermore, unlike many studies that rely on proprietary financial data, our framework is validated entirely on publicly available datasets to ensure full reproducibility.
4. Method
We propose a transformer-based neural network that can process multivariate time-series credit risk data without the need for elaborate feature engineering, which makes it well suited for deployment in real-world settings.
4.1. Model Training and Prediction Pipeline Overview
Figure 2 shows the steps involved in the model training pipeline. The raw data are pre-processed, split into five folds for cross-validation, and then passed into the proposed model for training.
Figure 3 shows an overview of the inference pipeline. The raw data are pre-processed in the same way as during training and then passed into the trained model to generate the prediction results. Note that the pre-processing step is dataset-specific.
4.2. Data Pre-Processing of the Amex Dataset
4.2.1. De-Noise
For the Amex dataset, an analysis of features such as “B_2” (Figure 4) shows that random uniform noise in the range [0, 0.01] is present in all features (probably introduced by the data provider).
Hence, all feature columns in the raw data are de-noised by the following equation.
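As a hedged illustration of such a de-noising step, the sketch below assumes that the clean values lie on a grid of width 0.01 and that the noise is additive and uniform in [0, 0.01), so that flooring to the grid recovers the clean value. The function name `denoise` and the flooring rule are assumptions made for illustration and are not necessarily the exact equation used in this work.

```python
import numpy as np


def denoise(x: np.ndarray, step: float = 0.01) -> np.ndarray:
    """Snap noisy values down to the nearest multiple of `step`.

    Assumption: clean values are multiples of `step` and the added noise is
    uniform in [0, step), so flooring to the grid removes it.
    NaN entries pass through unchanged.
    """
    return np.floor(x / step) * step


# Toy example: three clean values with small positive noise added.
clean = np.array([0.0, 0.25, 1.0])
noisy = clean + np.array([0.003, 0.007, 0.0099])
recovered = denoise(noisy)
```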
4.2.2. Missing Value Handling
Null values are replaced with zeros.
4.2.3. One-Hot Encoding
One-hot encoding is performed for categorical features such as “B_30” and “B_38”.
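The two steps above can be sketched as follows. This is a minimal NumPy illustration; the helper names `fill_nulls` and `one_hot`, as well as the category values in the example, are hypothetical and not taken from the Amex data dictionary.

```python
import numpy as np


def fill_nulls(x: np.ndarray) -> np.ndarray:
    """Replace NaN entries with zeros, leaving all other values untouched."""
    return np.where(np.isnan(x), 0.0, x)


def one_hot(codes, categories) -> np.ndarray:
    """Return a (len(codes), len(categories)) 0/1 matrix.

    A code that does not appear in `categories` yields an all-zero row.
    """
    codes = np.asarray(codes)
    cats = np.asarray(categories)
    return (codes[:, None] == cats[None, :]).astype(np.float32)


# Hypothetical categorical column with three levels.
encoded = one_hot([0, 2, 1], categories=[0, 1, 2])
filled = fill_nulls(np.array([1.5, np.nan, -2.0]))
```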
4.3. Data Pre-Processing of the Taiwan Dataset
4.3.1. Train-Test Split
Owing to the lack of a predefined train–test split and evaluation metric in the Taiwan dataset, prior studies such as [9,45] adopt their own evaluation protocols, making direct benchmarking challenging. To evaluate the proposed method’s generalization capability on this dataset, we employ a standard random train–test split with a 70%–30% ratio and use AUC as the evaluation metric to facilitate comparison.
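A minimal sketch of this evaluation protocol is given below. The helper names, the random seed, and the pairwise AUC formulation are illustrative assumptions; in practice a library routine would normally be used for both steps.

```python
import numpy as np


def train_test_split_indices(n: int, test_frac: float = 0.3, seed: int = 0):
    """Randomly split range(n) into train and test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]


def auc(y_true, y_score) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive is scored above a random
    negative, counting ties as one half. Memory use is O(n_pos * n_neg).
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(y_score, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


train_idx, test_idx = train_test_split_indices(1000)
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```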
4.3.2. Dataframe Reshaping
Three time series columns are generated from the following variable groups to fit the multivariate time series input format of the proposed model:
The six variables describing the status of past payments;
The six variables describing the amounts of past bill statements;
The six variables describing the amounts of previous payments.
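The reshaping described above can be sketched as follows. The column names are assumed to follow the public UCI release of the Taiwan dataset, and the function name `to_time_series` is hypothetical. The sketch keeps the months in the order in which the columns are listed.

```python
import numpy as np

# Assumed column names (public UCI release of the Taiwan dataset).
STATUS_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAID_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def to_time_series(table) -> np.ndarray:
    """Stack three groups of six monthly columns into an (N, 6, 3) array.

    `table` maps a column name to a 1D array of length N; a pandas DataFrame
    indexed by column name can be passed in the same way.
    """
    groups = []
    for cols in (STATUS_COLS, BILL_COLS, PAID_COLS):
        # (N, 6): one row per customer, one column per month.
        block = np.column_stack(
            [np.asarray(table[c], dtype=np.float32) for c in cols]
        )
        groups.append(block)
    # (N, 6 months, 3 variables)
    return np.stack(groups, axis=-1)


# Toy table with 4 customers; each column holds a distinct constant.
all_cols = STATUS_COLS + BILL_COLS + PAID_COLS
toy = {c: np.full(4, float(k)) for k, c in enumerate(all_cols)}
series = to_time_series(toy)
```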
4.4. Five-Fold Cross-Validation
We adopted 5-fold cross-validation in our model training. More details can be found in Appendix A.
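For readers unfamiliar with the procedure, a minimal sketch of a shuffled 5-fold split is shown below. It assumes plain (non-stratified) folds and a fixed seed; the exact fold construction used in this work is described in Appendix A and may differ.

```python
import numpy as np


def kfold_indices(n: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation.

    Every sample appears in exactly one validation fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


splits = list(kfold_indices(103))
```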
4.5. Proposed Transformer-Based Model
The transformer was first introduced in [46]. More details on the transformer, attention, self-attention, and multi-head attention are given in Appendix A.
Figure 5 shows the overall architecture of our proposed transformer-based model. It comprises five blocks, namely “linear layer and normalization”, “positional encoding”, “transformer encoder”, “bidirectional GRU”, and “feedforward neural network”. The model takes the processed input data from the data pre-processing module and outputs the predicted probability of default.
The linear layer projects the pre-processed time series data into a fixed hidden dimension. Layer normalization is applied to maintain a consistent distribution and mitigate the internal covariate shift that can occur during training. Positional encoding is applied to incorporate information about the relative position of each step in the time series. It is computed as follows [46]:
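For reference, the sinusoidal encoding introduced in [46] is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the time step and d_model is the hidden dimension. The NumPy sketch below implements that standard form under the assumption that d_model is even; the function name is illustrative.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) sinusoidal encoding; d_model must be even."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(seq_len=12, d_model=8)
```

The encoding is added element-wise to the projected inputs, so its second dimension must match the hidden dimension of the linear layer.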
An encoder-only transformer structure is adopted here. The transformer encoder block consists of a stack of N transformer encoder layers designed to capture the temporal dynamics and other latent features in the credit card data.
Figure 6 shows the implementation of the encoder layer comprising a multi-head self-attention mechanism.
To account for the varying lengths of the time series data among the customers, a bidirectional GRU is employed as a specialized pooling mechanism for each customer.
The feedforward neural network block takes the latent features from the previous step and generates the final predicted probability that the customer will default. The details are illustrated in Figure 7.
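To make the data flow through the five blocks concrete, a minimal PyTorch sketch is given below. It is an illustration under stated assumptions rather than the exact implementation: the class name, hidden sizes, number of heads and layers, the ReLU head, and the zero-padding convention are all hypothetical choices.

```python
import math

import torch
import torch.nn as nn


class CreditTransformer(nn.Module):
    """Sketch: linear + LayerNorm -> positional encoding -> transformer
    encoder -> bidirectional GRU pooling -> feedforward head."""

    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=2, max_len=32):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        self.norm = nn.LayerNorm(d_model)
        # Fixed sinusoidal positional encoding stored as a non-trainable buffer.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=n_layers, enable_nested_tensor=False
        )
        self.gru = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, x, lengths):
        # x: (batch, time, n_features), padded at the end of each sequence.
        # lengths: (batch,) number of valid time steps per customer (>= 1).
        t = x.size(1)
        steps = torch.arange(t, device=x.device)
        pad_mask = steps[None, :] >= lengths[:, None]  # True marks padding
        h = self.norm(self.proj(x)) + self.pe[:t]
        h = self.encoder(h, src_key_padding_mask=pad_mask)
        # Packing makes the GRU stop at each customer's true last step.
        packed = nn.utils.rnn.pack_padded_sequence(
            h, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, h_n = self.gru(packed)  # h_n: (2, batch, d_model)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward + backward states
        return torch.sigmoid(self.head(pooled)).squeeze(-1)


torch.manual_seed(0)
model = CreditTransformer(n_features=5).eval()
x = torch.randn(3, 6, 5)
lengths = torch.tensor([6, 4, 2])
probs = model(x, lengths)
```

Because padded steps are masked in the attention layers and skipped by the packed GRU, the prediction for a customer does not depend on the values stored in the padded positions.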
4.6. Optional Feature Augmentation Extension
Given that existing research and deployment for credit card default prediction rely heavily on engineered features from domain expertise and industry practice, users are often hesitant to relinquish these familiar inputs. To facilitate adoption, we designed an optional feature augmentation extension called the hidden feature block. The block is built to process high-dimensional, user-crafted features, using its internal feedforward architecture (as shown in Figure 8) to distill them into a single dense feature vector. Because the engineered features are processed in a separate, parallel pathway, they do not interfere with the transformer’s native temporal learning.
Figure 9 illustrates how this module is integrated into the complete model architecture. The hidden feature block operates in parallel with the transformer block and is designed for flexibility; it is only activated when engineered features are available. Specifically, the raw 2D time-series data are processed exclusively by the transformer block to extract temporal patterns, while the pre-calculated engineered features are routed in parallel through the hidden feature block. The resulting dense latent representations from the two pathways are then concatenated before being passed into the final FNN block for prediction.
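The parallel design can be sketched in PyTorch as follows. The class names, layer widths, and activation functions are hypothetical; the sketch only illustrates the routing described above, in which the engineered-feature pathway is optional and its output is concatenated with the temporal representation before the final feedforward block.

```python
import torch
import torch.nn as nn


class HiddenFeatureBlock(nn.Module):
    """Feedforward block that compresses engineered features into a dense vector."""

    def __init__(self, n_engineered, d_out=64, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_engineered, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out), nn.ReLU(),
        )

    def forward(self, feats):
        return self.net(feats)


class FusionHead(nn.Module):
    """Concatenate the temporal vector with the optional engineered-feature vector."""

    def __init__(self, d_temporal, n_engineered=None, d_feat=64):
        super().__init__()
        # The hidden feature block is created only when engineered features exist.
        self.feature_block = (
            HiddenFeatureBlock(n_engineered, d_feat) if n_engineered else None
        )
        d_in = d_temporal + (d_feat if n_engineered else 0)
        self.fnn = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, temporal, feats=None):
        # temporal: (batch, d_temporal) output of the transformer pathway.
        if self.feature_block is not None:
            temporal = torch.cat([temporal, self.feature_block(feats)], dim=-1)
        return torch.sigmoid(self.fnn(temporal)).squeeze(-1)


torch.manual_seed(0)
temporal = torch.randn(4, 128)
hybrid = FusionHead(d_temporal=128, n_engineered=50)
plain = FusionHead(d_temporal=128)
p_hybrid = hybrid(temporal, torch.randn(4, 50))
p_plain = plain(temporal)
```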
We engineered over 6000 features for the experiments in this study. The results are reported in the Experiment section. Feature engineering methods such as statistical transformation and knowledge distillation are applied, and details can be found in Appendix A. It should be noted that feature engineering is very time-consuming, which motivated us to propose a transformer model that does not need it. The model with feature engineering also requires significantly more computational resources, as shown in Table 1 below.
4.7. Evaluating Model Performance in an Ensemble
To evaluate the performance of our proposed transformer-based model in an ensemble, we integrate our model with the SOTA LightGBM in an ensemble framework. The proposed ensemble structure comprises the following sub-models:
The proposed transformer-based model (without feature engineering);
The proposed transformer-based model with feature augmentation extension to use engineered features;
A SOTA LightGBM using engineered features.
Figure 10 shows the overview of the proposed ensemble architecture.
For the Taiwan dataset, due to its small size, a reduced ensemble structure is adopted, incorporating only the proposed transformer-based model and the SOTA LightGBM model.
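As an illustration of how sub-model outputs can be combined, the sketch below blends the predicted default probabilities by weighted averaging. The averaging rule, the function name `blend`, and the example weights are assumptions made for illustration; the actual combination scheme is the one shown in Figure 10 and may differ.

```python
import numpy as np


def blend(predictions, weights=None) -> np.ndarray:
    """Weighted average of per-model default probabilities.

    predictions: list of 1D arrays, one per sub-model, all of length N.
    weights: optional non-negative weights; defaults to a uniform average.
    """
    p = np.stack([np.asarray(x, dtype=float) for x in predictions])  # (n_models, N)
    w = np.ones(len(p)) if weights is None else np.asarray(weights, dtype=float)
    return (w[:, None] * p).sum(axis=0) / w.sum()


# Hypothetical outputs of two sub-models for two customers.
transformer_probs = [0.2, 0.8]
lightgbm_probs = [0.4, 0.6]
uniform = blend([transformer_probs, lightgbm_probs])
weighted = blend([transformer_probs, lightgbm_probs], weights=[3, 1])
```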
6. Path to Deployment and Practical Impact
Our proposed model directly addresses a critical bottleneck in deploying credit risk models. Traditional methods, including SOTA academic ensembles such as [32,33] and commercial systems (e.g., Upstart [47] and Zest AI [48]), often rely on thousands of handcrafted features. This practice slows development, complicates maintenance, and hinders scalability in real-world financial systems.
By processing raw multivariate credit data as a 2D time series, our transformer-based model removes the need for arduous feature engineering. This approach yields immediate practical benefits, including accelerated development cycles, reduced maintenance overhead, lower inference latency (Table 1), and streamlined integration into existing credit scoring pipelines. The lightweight architecture is designed for deployment, and its effectiveness is validated on both the AMEX and Taiwan Bank datasets, showing its promise for different banking contexts.
We recognize that existing users who are accustomed to engineered features from domain expertise and industry practice will be reluctant to give them up completely. We therefore built a feature augmentation extension to the proposed model to facilitate adoption by these users. With this extension, the model supports hybrid learning over both user-crafted and model-learned features, which improves model performance and encourages buy-in from current users. The extension also helps to bridge the gap until current users are comfortable transitioning to more advanced feature-learning paradigms, without requiring them to abandon their existing highly interpretable, regulatory-compliant workflows.
This work, therefore, lays the groundwork for practical adoption in production credit systems. Future efforts to enhance interpretability will further support its adoption in regulated environments, solidifying a clear path from emerging application to deployed solution.
7. Conclusions and Future Work
This paper presents a transformer-based neural network for credit card default prediction that processes raw 2D time-series credit data without the need for labor-intensive feature engineering. The proposed model outperforms LightGBM baselines and achieves performance comparable to SOTA methods with engineered features, demonstrating strong generalization across two real-world datasets.
In addition, we have incorporated a feature augmentation extension into the proposed model to facilitate adoption by existing users who are accustomed to engineered features from domain expertise and industry practice. The model therefore supports hybrid learning over both user-crafted and model-learned features to improve model performance and facilitate deployment. We also evaluated our proposed model in an ensemble framework with LightGBM, and its performance surpasses existing ensemble approaches. These results highlight the model’s robustness, simplicity, generalizability, and suitability for scalable deployment in financial applications.
For future work, we will explore mitigating potential dataset biases and scalability limits of self-attention on long sequences, as well as improving model interpretability and integrating transformer-encoded features into tree-based frameworks for enhanced performance and compliance readiness.