1. Introduction
Credit card fraud has become an increasingly severe threat to financial institutions and consumers worldwide, with global losses exceeding $32 billion annually [1]. The rapid digitalization of payment systems and the proliferation of e-commerce platforms have created unprecedented opportunities for fraudulent activities, making automated fraud detection systems indispensable for modern financial security [2]. However, detecting fraudulent transactions remains a formidable challenge due to several inherent characteristics: extreme class imbalance (typically less than 1% of transactions are fraudulent), evolving fraud patterns as attackers adapt to detection systems, and the need for real-time decision-making with minimal false positives [3,4]. Another important challenge in fraud detection is concept drift, where fraudulent behaviors evolve over time as attackers adapt their strategies to evade detection systems. This leads to distribution shifts between historical training data and future transactions, often causing models to degrade when evaluated on out-of-time data. To better reflect this real-world scenario, our framework models transaction histories as temporal sequences and the experiments adopt a chronological train–validation–test split, where models are trained on earlier transactions and evaluated on later ones.
Traditional machine learning approaches to fraud detection, including logistic regression [4], random forests [5], and support vector machines [4], have demonstrated reasonable performance on balanced datasets. However, these methods predominantly treat each transaction as an independent event, ignoring the rich temporal context embedded in user transaction histories [6]. In reality, fraudulent behavior often manifests as anomalous sequences of transactions that deviate from a user’s established spending patterns [7]. A genuine cardholder’s transaction sequence typically exhibits temporal coherence and behavioral consistency, whereas fraudulent sequences frequently display sudden changes in transaction amounts, geographical locations, merchant categories, or transaction frequencies [8].
Recent advances in deep learning have shown promising results in fraud detection [9]. Convolutional Neural Networks (CNNs) [10] and Recurrent Neural Networks (RNNs) [6] have been applied to extract complex features from transaction data. Long Short-Term Memory (LSTM) networks [11] and Gated Recurrent Units (GRUs) [8] have been employed to model sequential dependencies in transaction streams. However, these architectures face significant limitations: RNN-based models suffer from vanishing gradients when processing long sequences and cannot effectively parallelize during training [12]. Moreover, existing deep learning approaches often struggle with the extreme class imbalance problem, where fraudulent transactions constitute only 0.1–2% of the total dataset [3].
The class imbalance problem has been addressed through various resampling techniques. The Synthetic Minority Over-sampling Technique (SMOTE) [13] and its variants generate synthetic minority samples by interpolating between existing instances. Adaptive Synthetic Sampling (ADASYN) [14] adjusts the generation of synthetic samples based on local density distributions. However, these interpolation-based methods produce simplistic synthetic samples that fail to capture the complex, multi-modal distributions characteristic of fraudulent transactions [15]. Furthermore, they may introduce noise by generating samples in overlapping regions between classes [16].
Generative Adversarial Networks (GANs) [17] have emerged as powerful tools for learning complex data distributions and generating high-quality synthetic samples. Conditional Tabular GAN (CTGAN) [18] specifically addresses the challenges of generating tabular data with mixed discrete and continuous variables, making it particularly suitable for financial transaction data. Recent work has demonstrated the effectiveness of GANs in handling imbalanced classification tasks [19,20]. However, existing GAN-based fraud detection systems typically apply generative models in isolation, without leveraging the sequential nature of transaction data [19].
Transformer architectures [12], originally designed for natural language processing, have revolutionized sequence modeling through their self-attention mechanisms. Unlike RNNs, Transformers process sequences in parallel and can capture long-range dependencies without suffering from vanishing gradients [21]. BERT [22] demonstrated the power of bidirectional Transformers for learning contextual representations. Recently, Transformers have been successfully applied to time-series forecasting [23,24] and anomaly detection [25]; however, their application to fraud detection with explicit temporal awareness remains underexplored.
In this paper, we propose an integrated framework that combines a Time-Aware Transformer encoder with CTGAN for credit card fraud detection. Our approach makes the following contributions:
1. Sequential Transaction Modeling: We formulate fraud detection as a sequence classification problem, where each user’s transaction history is modeled as a temporal sequence. We develop a Time-Aware Transformer encoder that incorporates explicit temporal encoding to capture both the chronological order and time intervals between transactions, enabling the model to learn complex temporal dependencies and detect abrupt behavioral changes indicative of fraud.
2. Generative Oversampling Strategy: We employ CTGAN to generate synthetic fraudulent transactions that capture important statistical characteristics and conditional structure of fraud cases in tabular data. Unlike interpolation-based methods, CTGAN learns a generative model of the minority class through adversarial training, producing diverse synthetic samples that effectively augment the training set in our experimental setting.
3. Unified Framework: We integrate the generative oversampling and sequential modeling components into a cohesive training pipeline. The synthetic samples generated by CTGAN are organized into realistic transaction sequences, which are then used to train the Time-Aware Transformer, creating a robust detector that benefits from both improved class balance and enhanced sequential pattern recognition.
4. Extensive Evaluation: We conduct extensive experiments on the IEEE-CIS Fraud Detection dataset and compare our approach against representative traditional machine learning baselines, standalone deep learning models, and supervised sequential architectures. Through ablation studies, we validate the individual contributions of each component and analyze the framework’s behavior under various configurations. We note that stronger contemporary directions, such as graph neural network-based fraud detection [26,27,28], self-supervised/contrastive sequence learning [29,30,31,32], and diffusion-based tabular generation [33,34], are not included as direct baselines in the present study and remain important directions for future comparison.
The remainder of this paper is organized as follows:
Section 2 reviews related work in fraud detection, sequential modeling, and generative approaches for imbalanced learning.
Section 3 provides preliminary background on Transformer architectures and GANs.
Section 4 details our proposed methodology, including the Time-Aware Transformer design and the CTGAN-based oversampling strategy.
Section 5 presents experimental results and analysis.
Section 6 concludes the paper with discussions on limitations and future research directions.
4. Methodology
In this section, we present our proposed framework that synergistically combines Time-Aware Transformer encoding with CTGAN-based generative oversampling for credit card fraud detection. The framework addresses two fundamental challenges: capturing complex temporal dependencies in transaction sequences and mitigating severe class imbalance through high-quality synthetic sample generation. We first describe the overall architecture, followed by detailed descriptions of the Time-Aware Transformer encoder and the CTGAN-based oversampling strategy, concluding with the integrated training pipeline.
4.1. Framework Overview
Our framework operates in two main stages: generative oversampling and sequential classification, as shown in Figure 1. In the first stage, CTGAN learns the complex conditional distribution of fraudulent transactions from the minority class and generates synthetic fraud samples to balance the training dataset. The motivation for this design is that traditional oversampling methods like SMOTE produce simplistic interpolations that fail to capture the multi-modal, high-dimensional distributions characteristic of real fraud patterns. By employing adversarial training, CTGAN learns the underlying data manifold and generates diverse, realistic samples that preserve critical statistical properties.
In the second stage, we construct temporal transaction sequences for each user by aggregating their historical transactions in chronological order. These sequences, now balanced through the inclusion of synthetic fraudulent transactions, are fed into a Time-Aware Transformer encoder that explicitly models both the sequential order and temporal intervals between transactions. The rationale behind this sequential formulation is that fraud detection should not treat transactions as isolated events but rather as manifestations of user behavior patterns that evolve over time. Legitimate users typically exhibit consistent spending patterns, while fraudulent activities often introduce abrupt deviations in transaction amounts, merchant categories, geographical locations, or temporal frequencies. The Time-Aware Transformer’s self-attention mechanism enables it to identify such anomalous subsequences within the broader context of user history.
The synergy between these two components is crucial: CTGAN provides the model with sufficient fraudulent examples to learn discriminative patterns, while the Time-Aware Transformer leverages sequential context to distinguish subtle behavioral anomalies that point-wise classifiers would miss. The framework outputs a binary classification indicating whether a given transaction sequence contains fraudulent activity.
4.2. Time-Aware Transformer Encoder
The Time-Aware Transformer encoder extends the standard Transformer architecture [12] with explicit temporal awareness to capture the time-dependent nature of transaction sequences. Given a user’s transaction sequence $S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n)\}$, where $x_i \in \mathbb{R}^{d}$ represents the feature vector of the $i$-th transaction and $t_i$ denotes its timestamp, our goal is to learn a representation that encodes both transactional features and temporal dynamics.
4.2.1. Temporal Feature Embedding
We first construct a comprehensive embedding for each transaction that incorporates both raw features and temporal information. Each transaction’s feature vector $x_i$ contains attributes such as transaction amount, merchant category code, device information, and geographical indicators extracted from the IEEE-CIS dataset. We apply a linear projection to map these features to a higher-dimensional space:
$$e_i^{\text{feat}} = W_f x_i + b_f,$$
where $W_f \in \mathbb{R}^{d_{\text{model}} \times d}$ and $b_f \in \mathbb{R}^{d_{\text{model}}}$ are learnable parameters, and $d_{\text{model}}$ is the model dimension.
From a modeling perspective, transaction histories in fraud detection can be viewed as irregular event sequences rather than uniformly sampled time series. The temporal spacing between transactions therefore contains important behavioral information. In particular, sudden bursts of activity or unusual temporal gaps often indicate deviations from normal spending behavior. To capture such irregular temporal dynamics, we explicitly model the time interval between consecutive transactions. The time difference $\Delta t_i = t_i - t_{i-1}$ captures the spending frequency, which is a critical indicator for fraud detection since fraudulent transactions often occur in rapid succession. We encode these intervals using a temporal embedding:
$$e_i^{\text{time}} = \phi(\Delta t_i),$$
where $\phi(\cdot)$ is a non-linear transformation function that maps time differences to a fixed-dimensional representation. Specifically, $\phi$ combines a logarithmic scaling of $\Delta t_i$ with binning to handle the wide range of possible time intervals: indicator functions $\mathbb{1}_k(\Delta t_i)$ mark membership in predefined time bins (e.g., within 1 h, 1–24 h, 1–7 days, etc.), and the logarithmic term captures fine-grained temporal variations. This design is motivated by the observation that fraud patterns exhibit different characteristics at different time scales: some fraud involves rapid consecutive transactions, while others maintain seemingly normal intervals to avoid detection. The temporal bin boundaries are chosen to reflect typical transaction time scales observed in credit card usage patterns, distinguishing rapid transaction bursts from normal daily or weekly spending intervals. Such discretization provides a simple yet effective way to capture behavioral differences across time scales while keeping the temporal representation compact.
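As a rough illustration of the log-plus-binning idea, the sketch below turns a list of timestamps into per-transaction gap features. The bin edges follow the boundaries listed in the implementation details (with a month approximated as 30 days and a year as 365 days, which is an assumption), the first gap is set to zero by assumption, and the function name `time_gap_features` is hypothetical. Ten edges produce eleven intervals here; how the paper maps them to its ten bins is not stated. The learned projection of these features to the model dimension is omitted.

```python
import math

# Bin edges in hours: 1 h, 6 h, 1 day, 3 days, 1 week, 2 weeks,
# 1 month (~30 days), 3 months, 6 months, 1 year (approximations, assumed).
BIN_EDGES_HOURS = [1, 6, 24, 72, 168, 336, 720, 2160, 4320, 8760]


def time_gap_features(timestamps_sec):
    """Return [log1p(gap_hours), one-hot bin indicators...] per transaction.

    Timestamps are assumed to be sorted and expressed in seconds.
    """
    features = []
    previous = None
    for t in timestamps_sec:
        # The first transaction has no predecessor; its gap is set to 0 (assumption).
        gap_hours = 0.0 if previous is None else (t - previous) / 3600.0
        # Index of the bin = number of edges the gap exceeds.
        bin_index = sum(gap_hours > edge for edge in BIN_EDGES_HOURS)
        one_hot = [0.0] * (len(BIN_EDGES_HOURS) + 1)
        one_hot[bin_index] = 1.0
        features.append([math.log1p(gap_hours)] + one_hot)
        previous = t
    return features


# Gaps of 10 minutes, about 25 hours, and about 9 days.
example = time_gap_features([0, 600, 90_000, 900_000])
```

The logarithm keeps second-scale and month-scale gaps on a comparable numeric range, while the one-hot part lets a downstream layer treat each time scale separately.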
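We also incorporate absolute positional encoding to preserve the sequential order of transactions within the sequence. Following the Transformer architecture, we use sinusoidal position encodings:
$$PE(i, 2k) = \sin\left(\frac{i}{10000^{2k/d_{\text{model}}}}\right), \qquad PE(i, 2k+1) = \cos\left(\frac{i}{10000^{2k/d_{\text{model}}}}\right).$$
The final embedding for each transaction combines feature, temporal, and positional information:
$$z_i^{(0)} = e_i^{\text{feat}} + e_i^{\text{time}} + PE(i),$$
where $e_i^{\text{feat}}$ is the projected feature embedding, $e_i^{\text{time}}$ is the temporal embedding of the interval preceding transaction $i$, and $PE(i)$ is the positional encoding of position $i$.
This additive composition allows the model to simultaneously reason about transaction content, temporal dynamics, and sequential order.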
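The additive composition can be sketched in a few lines. The positional part follows the standard sinusoidal formula from the original Transformer; the function names and the toy embedding values are illustrative assumptions, and the feature and temporal embeddings are taken as given vectors of equal length.

```python
import math


def sinusoidal_position(pos, d_model):
    """Standard sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for dim in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
        encoding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return encoding


def combine_embeddings(feature_emb, time_emb, pos):
    """Element-wise sum of feature, temporal, and positional vectors."""
    pe = sinusoidal_position(pos, len(feature_emb))
    return [f + t + p for f, t, p in zip(feature_emb, time_emb, pe)]


# Toy 4-dimensional embeddings for the transaction at position 0.
combined = combine_embeddings([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], pos=0)
```

Because the three parts are summed rather than concatenated, the model dimension stays fixed regardless of how many sources of information are mixed in.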
4.2.2. Time-Aware Multi-Head Self-Attention
The core of our encoder is a modified multi-head self-attention mechanism that explicitly incorporates temporal awareness. Standard self-attention computes attention weights based solely on content similarity between query and key vectors. However, in fraud detection, the temporal relationship between transactions is equally important: a large transaction immediately following another large transaction may indicate fraud, while the same pair separated by weeks may be legitimate.
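We augment the attention computation with temporal bias terms that modulate attention weights based on time intervals. For the $l$-th layer, we compute queries, keys, and values as:
$$Q = Z^{(l-1)} W_Q, \qquad K = Z^{(l-1)} W_K, \qquad V = Z^{(l-1)} W_V,$$
where $Z^{(l-1)}$ is the output of the previous layer.
The temporal bias $b_{ij}$ is computed from the time difference $|t_i - t_j|$ between transactions $i$ and $j$, using two learnable parameters. This bias encourages the model to attend more strongly to temporally proximate transactions, which is crucial for detecting fraud patterns that unfold over short time windows.
The time-aware attention for each head $h$ is computed as:
$$\text{head}_h = \text{softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + B\right) V_h,$$
where $B$ is the temporal bias matrix with entries $b_{ij}$ and $d_k$ is the per-head key dimension. The outputs from all $H$ heads are concatenated and linearly transformed:
$$\text{MultiHead}(Z^{(l-1)}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W_O.$$
Following the attention sublayer, we apply layer normalization and a position-wise feed-forward network:
$$\tilde{Z}^{(l)} = \text{LayerNorm}\left(Z^{(l-1)} + \text{MultiHead}(Z^{(l-1)})\right), \qquad Z^{(l)} = \text{LayerNorm}\left(\tilde{Z}^{(l)} + \text{FFN}(\tilde{Z}^{(l)})\right),$$
where $\text{FFN}$ is a two-layer feed-forward network with ReLU activation.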
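To make the effect of an additive temporal bias concrete, the following single-head sketch adds a bias to the scaled dot-product scores before the softmax. The bias form used here, a negative log-scaled time distance with scale `w` and offset `b`, is an assumption chosen only for illustration and is not necessarily the paper's formula; all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_bias(t_i, t_j, w=1.0, b=0.0):
    """Assumed bias: larger time gaps receive a more negative score."""
    return -w * math.log1p(abs(t_i - t_j)) + b


def time_aware_attention(Q, K, V, times, w=1.0, b=0.0):
    """Single-head attention with an additive temporal bias on the scores."""
    d_k = len(Q[0])
    outputs, weights = [], []
    for i, query in enumerate(Q):
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            + temporal_bias(times[i], times[j], w, b)
            for j, key in enumerate(K)
        ]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(attn[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


# Identical content vectors, so only the time gaps differentiate the weights.
Q = K = [[1.0, 0.0]] * 3
V = [[1.0], [2.0], [3.0]]
_, attn_weights = time_aware_attention(Q, K, V, times=[0.0, 1.0, 100.0])
```

With identical queries and keys, the first transaction attends most to itself, less to the transaction one time unit away, and least to the one a hundred units away; setting `w=0` recovers uniform weights.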
4.2.3. Sequence-Level Classification
After $L$ transformer layers, we obtain contextualized representations for all transactions in the sequence, $Z^{(L)} = [z_1^{(L)}, \ldots, z_n^{(L)}]$. To perform sequence-level classification, we aggregate these representations using a weighted pooling mechanism that emphasizes more recent transactions, as fraudulent activities are often detected through the most recent behavioral patterns. We compute attention weights $\alpha_i$ over the sequence, normalized so that $\sum_{i=1}^{n} \alpha_i = 1$, and aggregate the representations:
$$r = \sum_{i=1}^{n} \alpha_i\, z_i^{(L)}.$$
The final classification is performed through a multi-layer perceptron:
$$\hat{y} = \sigma\left(\text{MLP}(r)\right),$$
where $\sigma$ is the sigmoid function and $\hat{y}$ represents the predicted probability of fraud.
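A minimal sketch of softmax-weighted pooling followed by a sigmoid output is given below. The scoring vector, the single linear layer standing in for the MLP head, and both function names are assumptions made for illustration; the sketch also does not encode any explicit recency preference, which in the described model would come from the learned weights.

```python
import math


def attention_pool(H, score_vec):
    """Softmax-weighted sum of per-transaction vectors H (one row per transaction)."""
    scores = [sum(h_c * s_c for h_c, s_c in zip(h, score_vec)) for h in H]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    pooled = [sum(a * h[c] for a, h in zip(alphas, H)) for c in range(len(H[0]))]
    return pooled, alphas


def fraud_probability(pooled, weights, bias):
    """One linear layer plus sigmoid, standing in for the MLP classifier head."""
    logit = sum(p * w for p, w in zip(pooled, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


pooled, alphas = attention_pool([[1.0, 0.0], [0.0, 1.0]], score_vec=[0.0, 2.0])
prob = fraud_probability(pooled, weights=[1.0, 1.0], bias=0.0)
```

In this toy call the second transaction scores higher under `score_vec`, so it receives the larger pooling weight.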
4.3. CTGAN-Based Generative Oversampling
The severe class imbalance in fraud detection datasets poses significant challenges for training effective classifiers. Traditional oversampling methods generate synthetic samples through simple interpolation, which fails to capture the complex distributions of fraudulent transactions. We employ CTGAN [18] to learn the intricate conditional distribution of fraud samples and generate high-quality synthetic transactions.
From a probabilistic perspective, the generative model approximates the conditional distribution of minority-class transactions. By learning this distribution from observed fraud samples, CTGAN can generate synthetic instances that preserve joint statistical dependencies among heterogeneous tabular variables, including both numerical and categorical attributes. This distributional modeling capability is particularly important in fraud detection, where fraudulent patterns often arise from complex interactions among multiple transaction features rather than simple local variations that can be captured by interpolation-based oversampling methods.
4.3.1. Mode-Specific Normalization
Credit card transaction data exhibits multi-modal continuous distributions. For example, transaction amounts may cluster around common price points such as gas station fills, grocery purchases, or online subscription fees. Standard normalization techniques like min–max scaling or z-score normalization fail to preserve these multi-modal characteristics, leading to poor generative quality.
CTGAN addresses this through mode-specific normalization. For each continuous variable $x$, we fit a Gaussian Mixture Model with $K$ components:
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x;\, \mu_k, \sigma_k^2),$$
where $\pi_k$, $\mu_k$, and $\sigma_k$ are the mixture weight, mean, and standard deviation of the $k$-th component. In our implementation, $K$ is treated as a fixed upper bound shared across continuous features for modeling consistency. However, the effective number of active modes may vary by feature, since components with negligible mixture weights contribute little to the final representation. Thus, $K$ provides sufficient flexibility for complex multi-modal variables without forcing all features to use the same effective level of complexity. Each value $x$ is then represented by a tuple $(m, v)$, where $m$ is a one-hot vector indicating the most likely mode:
$$k^{*} = \arg\max_{k}\; \pi_k\, \mathcal{N}(x;\, \mu_k, \sigma_k^2),$$
and $v$ is the normalized value within that mode:
$$v = \frac{x - \mu_{k^{*}}}{4\sigma_{k^{*}}}.$$
This transformation preserves the multi-modal structure during generation. When sampling from the generator, we first sample a mode from the one-hot vector, then denormalize the continuous value according to that mode’s parameters.
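The encode and decode steps of mode-specific normalization can be sketched as follows, assuming the mixture parameters have already been fitted. The 4-sigma scaling follows the CTGAN paper; picking the single most likely mode matches the description above. The function names and the toy mixture (two clusters of transaction amounts) are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mode_specific_normalize(x, weights, means, stds):
    """Encode a scalar as (one-hot mode indicator, value normalized within that mode)."""
    responsibilities = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    k = max(range(len(responsibilities)), key=responsibilities.__getitem__)
    one_hot = [1 if j == k else 0 for j in range(len(responsibilities))]
    value = (x - means[k]) / (4 * stds[k])  # 4-sigma scaling as in the CTGAN paper
    return one_hot, value


def denormalize(one_hot, value, means, stds):
    """Invert the encoding using the parameters of the indicated mode."""
    k = one_hot.index(1)
    return value * 4 * stds[k] + means[k]


# Toy mixture: small purchases around 10, larger ones around 100 (assumed values).
weights, means, stds = [0.6, 0.4], [10.0, 100.0], [2.0, 20.0]
mode, v = mode_specific_normalize(12.0, weights, means, stds)
restored = denormalize(mode, v, means, stds)
```

An amount of 12 falls in the first mode and is encoded as 0.25, which decodes back to 12 exactly.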
4.3.2. Conditional Vector Sampling
Fraudulent transactions often exhibit highly imbalanced categorical attributes. For instance, certain merchant category codes or device types may be heavily associated with fraud but represent only a small fraction of the overall data. To ensure comprehensive coverage during training, CTGAN employs conditional vector sampling.
For each categorical variable with categories $\{c_1, \ldots, c_m\}$, we construct a conditional vector $\mathbf{c}$ that specifies which category should be generated. During training, we sample categories with probability proportional to their log-frequency:
$$P(c_j) = \frac{\log f_j}{\sum_{k=1}^{m} \log f_k},$$
where $f_j$ is the frequency of category $c_j$ in the training data. This logarithmic weighting ensures that rare categories receive sufficient training signal without completely ignoring frequent categories.
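The effect of log-frequency weighting is easy to see numerically. The sketch below computes sampling probabilities from raw category counts; adding 1 inside the logarithm is an assumption made here so that categories seen once still receive non-zero mass, and the function name is hypothetical.

```python
import math
from collections import Counter


def log_frequency_probs(values):
    """Sampling probability per category, proportional to log(count + 1)."""
    counts = Counter(values)
    # The +1 is an assumption: it keeps singleton categories above zero.
    log_counts = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(log_counts.values())
    return {cat: lc / total for cat, lc in log_counts.items()}


# A category that makes up 1% of the rows.
probs = log_frequency_probs(["common"] * 99 + ["rare"])
```

Here the rare category holds 1% of the rows but is selected as the training condition roughly 13% of the time, while the common category is still sampled most often.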
In our framework, this conditional sampling mechanism is used to generate individual synthetic fraudulent transactions rather than complete transaction sequences. Each synthetic sample follows the same tabular schema as the original data, including the transaction attributes later used for sequence construction, such as card-related identifiers and the transaction timestamp feature. After generation, the synthetic fraudulent transactions are merged with the real training transactions, and the augmented data are then reorganized into sequences using the same procedure applied to the original dataset: transactions are grouped by card identifiers, sorted chronologically according to TransactionDT, and the temporal gaps between consecutive transactions are recomputed. Therefore, the Time-Aware Transformer operates on reconstructed chronological sequences in the augmented training set, rather than on sequences generated directly by CTGAN. Although the generator operates on tabular transactions rather than complete sequences, temporal coherence is preserved during sequence reconstruction because the generated samples retain timestamp-related attributes and are reorganized into chronological event streams before being processed by the sequential model. While this design preserves compatibility between tabular generation and temporal sequence modeling, the generator itself does not explicitly enforce sequence-level temporal coherence, which remains an important direction for future work.
4.3.3. Generator and Discriminator Architecture
The generator $G$ takes as input a random noise vector $z$ and a conditional vector $\mathbf{c}$ specifying the target class (fraud) and categorical constraints. The generator outputs a synthetic transaction:
$$\tilde{x} = G(z, \mathbf{c}).$$
The discriminator $D$ receives both real and generated transactions along with their conditional vectors and learns to distinguish between them, producing the scores $D(x, \mathbf{c})$ and $D(\tilde{x}, \mathbf{c})$.
The training objective combines adversarial loss with a conditional consistency term.
To stabilize training, we employ gradient penalty regularization following the Wasserstein GAN framework. The discriminator loss is augmented with:
$$\mathcal{L}_{\text{GP}} = \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x}, \mathbf{c})\right\|_2 - 1\right)^2\right],$$
where $\hat{x}$ is a random interpolation between real and generated samples, and $\lambda$ controls the penalty strength.
4.4. Integrated Training Pipeline
The complete framework integrates CTGAN-based oversampling with Time-Aware Transformer classification through a two-stage training procedure. In the first stage, we train CTGAN exclusively on fraudulent transactions to learn their distribution. Once CTGAN converges, we generate synthetic fraud samples to balance the training dataset. The number of synthetic samples is determined to achieve a target fraud ratio, typically between 30 and 50% of the total training set, which we find provides sufficient signal without overwhelming the model with purely synthetic data.
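The number of synthetic samples needed for a given target ratio follows from solving (fraud + s) / (total + s) = ratio for s. The sketch below implements that arithmetic; the function name and the example counts are illustrative assumptions, not figures from the experiments.

```python
import math


def synthetic_count(n_total, n_fraud, target_ratio):
    """Synthetic fraud rows needed so that fraud / total reaches target_ratio.

    Solves (n_fraud + s) / (n_total + s) = target_ratio for s and rounds up.
    """
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must lie strictly between 0 and 1")
    needed = (target_ratio * n_total - n_fraud) / (1 - target_ratio)
    return max(0, math.ceil(needed))


# Toy example: 1,000 training rows of which 35 are fraud, target ratio 40%.
n_syn = synthetic_count(1000, 35, 0.4)
```

Because every synthetic row increases both the numerator and the denominator, the required count (609 in this toy case) is considerably larger than the naive 400 minus 35.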
In the second stage, we construct temporal transaction sequences by grouping transactions by user ID and sorting them chronologically. For users with fraudulent transactions (both real and synthetic), we create sequences that include contextual legitimate transactions preceding the fraud event, simulating realistic fraud scenarios. For legitimate users, we randomly sample transaction windows. This sequence construction is motivated by the need to provide the model with realistic behavioral context: fraudulent transactions rarely occur in isolation but are preceded by legitimate usage patterns that make the anomaly detectable.
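The Time-Aware Transformer is then trained on these sequences using binary cross-entropy loss with class weights to account for any remaining imbalance after oversampling:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1\, y_i \log \hat{y}_i + w_0\, (1 - y_i) \log (1 - \hat{y}_i) \right],$$
where $y_i$ is the label of the $i$-th sequence, $\hat{y}_i$ is its predicted fraud probability, and $w_1$ and $w_0$ are class weights inversely proportional to class frequencies in the balanced dataset.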
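A class-weighted binary cross-entropy can be written out directly. In the sketch below the weights are normalized as n / (2 * class count), which is one common convention for "inversely proportional to class frequency" and is an assumption here; the function names are hypothetical.

```python
import math


def inverse_frequency_weights(labels):
    """Return (w_pos, w_neg), each inversely proportional to its class frequency."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return n / (2 * n_pos), n / (2 * n_neg)


def weighted_bce(labels, probs, w_pos, w_neg, eps=1e-12):
    """Mean class-weighted binary cross-entropy over a batch."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 0, 0]
w_pos, w_neg = inverse_frequency_weights(labels)
loss = weighted_bce(labels, [0.3, 0.2, 0.1, 0.4, 0.2], w_pos, w_neg)
```

With one positive among five examples, an error on the positive is weighted four times as heavily as an error on a negative.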
The complete training algorithm is summarized in Algorithm 1.
Algorithm 1 Time-Aware Transformer with CTGAN Training.
1: Input: Transaction dataset $\mathcal{D}$, target fraud ratio $\rho$
2: Output: Trained Time-Aware Transformer model $f_{\theta}$
3: // Stage 1: Generative Oversampling
4: Extract fraud transactions: $\mathcal{D}_{\text{fraud}} = \{(x, y) \in \mathcal{D} : y = 1\}$
5: Train CTGAN on $\mathcal{D}_{\text{fraud}}$ until convergence
6: Generate synthetic fraud samples $\mathcal{D}_{\text{syn}}$ to achieve ratio $\rho$
7: $\mathcal{D}_{\text{bal}} = \mathcal{D} \cup \mathcal{D}_{\text{syn}}$
8: // Stage 2: Sequential Modeling
9: Group transactions by user: $\mathcal{T}_u$ for each user $u$
10: for each user $u$ with fraud transactions do
11:     Create sequence with context: $S_u$
12:     Label sequence: $y_u = 1$ if $S_u$ contains fraud, else 0
13: end for
14: for each user $u$ without fraud do
15:     Sample random transaction window: $S_u$, label $y_u = 0$
16: end for
17: Initialize Time-Aware Transformer parameters $\theta$
18: while not converged do
19:     Sample batch of sequences
20:     Compute predictions $\hat{y}$
21:     Update $\theta$ to minimize $\mathcal{L}$
22: end while
23: return $f_{\theta}$
This integrated approach leverages the complementary strengths of both components: CTGAN provides diverse, high-quality fraudulent samples that enrich the training distribution, while the Time-Aware Transformer exploits sequential context to identify subtle anomalies that distinguish fraud from legitimate behavior patterns. Because the two stages are trained separately, either component can be retrained as new fraud patterns emerge, making the framework adaptable to the evolving nature of financial fraud.
5. Experiments
In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed framework. We first describe the dataset and experimental setup, then present comparative results against baseline methods, followed by ablation studies and detailed analysis of key components.
5.1. Experiment Setup
5.1.1. Dataset Description
We evaluate our framework on the IEEE-CIS Fraud Detection dataset (https://www.kaggle.com/c/ieee-fraud-detection/data (accessed on 1 February 2026)) released through the IEEE Computational Intelligence Society (IEEE-CIS) fraud detection competition on Kaggle. The dataset was provided by Vesta Corporation (Weatogue, CT, USA), a payment services company, and contains real-world online transaction records collected from an e-commerce fraud prevention system. Although many variables have been anonymized to protect sensitive financial information and user privacy, the dataset originates from real transaction logs rather than synthetic or generative data. It represents one of the most comprehensive and realistic credit card fraud detection benchmarks available. The dataset comprises two main tables: transaction information and identity information.
The transaction table contains 590,540 records and 394 columns in the raw release, including transaction-related attributes, engineered variables, identifier fields, and the target label. Some repository descriptions report 393 transaction features depending on whether the target label is counted separately; in this work, we refer to the raw transaction table as released, which contains 394 columns including the label field. Key features include transaction amount (TransactionAmt), product code (ProductCD), transaction date and time (TransactionDT), and various anonymized categorical and numerical features labeled as C1–C14 (counting features), D1–D15 (time delta features), M1–M9 (match features), and V1–V339 (Vesta engineered features). The identity table contains 144,233 records and 41 columns, including TransactionID and identity-related attributes that provide additional information about user devices and digital signatures, such as DeviceType, DeviceInfo, and identity verification features (id_01 to id_38). Among these identity features, some variables are numerical while others are treated as categorical attributes during preprocessing.
The dataset exhibits severe class imbalance with only 20,663 fraudulent transactions out of 590,540 total transactions, resulting in a fraud ratio of approximately 3.5%. This imbalance ratio is realistic for production fraud detection systems but poses significant challenges for model training. We merge the transaction and identity tables using TransactionID as the key, resulting in a unified dataset where each transaction contains both transactional and identity features when available. For reproducibility, we note that the raw feature counts refer to the original released columns before preprocessing. After merging the two tables, identifier fields and the target label are excluded from the model input, categorical attributes are encoded using standard categorical encoding, numerical missing values are imputed with the mean, and categorical missing values are imputed with the mode. The resulting processed feature matrix is then used as the input to both the CTGAN module and the sequential classifier.
For temporal sequence construction, we group transactions by card identifiers (derived from card1–card6 features) and sort them chronologically using TransactionDT. The average sequence length is 12.4 transactions per card, with a standard deviation of 8.7, minimum of 1, and maximum of 156 transactions. We filter out cards with fewer than 3 transactions to ensure sufficient sequential context. The final dataset comprises 89,432 transaction sequences, with 8941 sequences containing at least one fraudulent transaction (10% sequence-level fraud ratio).
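The grouping, sorting, and filtering steps can be sketched as below. `TransactionDT` and `isFraud` are column names from the IEEE-CIS release; `card_id` is a hypothetical key standing in for whatever identifier is derived from card1–card6, and the function name and the zero gap for the first transaction are assumptions.

```python
from collections import defaultdict


def build_sequences(transactions, min_len=3):
    """Group rows by card, sort by time, recompute gaps, and label each sequence.

    transactions: iterable of dicts with 'card_id', 'TransactionDT', 'isFraud'.
    Cards with fewer than min_len transactions are dropped.
    """
    by_card = defaultdict(list)
    for tx in transactions:
        by_card[tx["card_id"]].append(tx)

    sequences = {}
    for card, txs in by_card.items():
        if len(txs) < min_len:
            continue
        ordered = sorted(txs, key=lambda tx: tx["TransactionDT"])
        # Gap to the previous transaction; the first gap is set to 0 (assumption).
        gaps = [0] + [
            later["TransactionDT"] - earlier["TransactionDT"]
            for earlier, later in zip(ordered, ordered[1:])
        ]
        label = int(any(tx["isFraud"] for tx in ordered))
        sequences[card] = {"transactions": ordered, "gaps": gaps, "label": label}
    return sequences


rows = [
    {"card_id": "A", "TransactionDT": 300, "isFraud": 1},
    {"card_id": "A", "TransactionDT": 100, "isFraud": 0},
    {"card_id": "A", "TransactionDT": 150, "isFraud": 0},
    {"card_id": "B", "TransactionDT": 10, "isFraud": 0},
    {"card_id": "B", "TransactionDT": 20, "isFraud": 0},
]
sequences = build_sequences(rows)
```

Card B is dropped for having only two transactions, and card A yields one fraud-labeled sequence with gaps of 0, 50, and 150 time units.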
We split the data temporally to simulate realistic deployment scenarios where models are trained on historical data and evaluated on future transactions. Specifically, we use the first 70% of transactions (chronologically) for training, 10% for validation, and the remaining 20% for testing. This temporal split ensures that the model does not have access to future information during training, reflecting real-world constraints where fraud patterns may evolve over time. While it is possible to construct synthetic fraud datasets using transaction simulators or generative models, such datasets may not fully capture the complex statistical properties of real financial systems. Therefore, we focus on evaluating the proposed framework using the publicly available IEEE-CIS dataset, which provides anonymized real-world transaction records.
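A chronological 70/10/20 split amounts to sorting by time and cutting at fixed positions, as in the sketch below. Integer percentages are used to avoid floating-point rounding at the cut points; the function name and the handling of ties at the boundaries are assumptions.

```python
def chronological_split(rows, time_key="TransactionDT", train_pct=70, val_pct=10):
    """Split rows into train/validation/test by time order, with no shuffling."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n = len(ordered)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Ten toy transactions given out of order.
toy = [{"TransactionDT": t} for t in [5, 3, 9, 0, 7, 1, 8, 2, 6, 4]]
train, val, test = chronological_split(toy)
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information leaks into training.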
5.1.2. Baseline Methods
We compare our proposed framework against three categories of baselines: traditional machine learning methods, deep learning approaches without sequential modeling, and existing sequential models.
Traditional Machine Learning: We implement Logistic Regression (LR) as a linear baseline, Random Forest (RF) [5] with 200 trees as a strong ensemble baseline, and XGBoost (XGB) [53] as a representative gradient boosting method. For these methods, we engineer aggregate features from transaction sequences including statistical summaries (mean, standard deviation, minimum, maximum of transaction amounts), temporal features (average time interval, transaction frequency), and recency features (features from the most recent transaction). We apply SMOTE [13] oversampling to handle class imbalance for fair comparison.
Deep Learning without Sequences: We implement a Multi-Layer Perceptron (MLP) with three hidden layers as a neural baseline operating on engineered aggregate features. We also implement CTGAN + MLP, which uses our CTGAN oversampling strategy combined with the MLP classifier to isolate the contribution of generative oversampling from sequential modeling.
Sequential Models: We compare against LSTM [6], a widely used recurrent architecture for sequence modeling in fraud detection, and vanilla Transformer without temporal awareness to demonstrate the importance of our time-aware modifications. We also implement GRU [8] as an alternative recurrent baseline and Temporal Convolutional Network (TCN) [54] to represent convolutional approaches to sequential modeling.
All baseline methods are carefully tuned using grid search on the validation set to ensure fair comparison. For methods requiring class imbalance handling, we experiment with both oversampling (SMOTE) and class weighting, selecting the better-performing strategy for each method. We selected these baselines to cover classical tabular classifiers, oversampling-based neural models, and supervised sequential architectures under a unified experimental pipeline. We note, however, that the present evaluation does not include several strong contemporary alternatives, including graph neural network-based fraud detection, self-supervised or contrastive sequence models, and diffusion-based tabular generators. Accordingly, we interpret our empirical findings within this evaluation scope and leave broader benchmarking against these methods to future work.
5.1.3. Evaluation Metrics
Given the severe class imbalance in fraud detection, accuracy is insufficient as a sole metric since a trivial classifier that labels all transactions as legitimate would achieve over 96% accuracy. We therefore employ multiple complementary metrics that capture different aspects of model performance.
We report Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which measures the model’s ability to rank fraudulent transactions higher than legitimate ones across all possible classification thresholds. This metric is particularly valuable for fraud detection where the operating threshold may be adjusted based on business requirements. We also compute Area Under the Precision–Recall Curve (AUC-PR), which is more informative than AUC-ROC for highly imbalanced datasets as it focuses on the positive class performance.
For threshold-dependent metrics, we report Precision (the fraction of predicted fraud cases that are actual fraud), Recall (the fraction of actual fraud cases that are detected), and F1-Score (the harmonic mean of precision and recall). We select the classification threshold on the validation set to maximize F1-score, which balances precision and recall. Additionally, we report False Positive Rate (FPR) at 90% recall, which is a critical operational metric indicating how many false alarms the system generates while maintaining high fraud detection coverage.
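The threshold selection and the FPR-at-recall metric described above can be sketched as follows. This is a simplified illustration with hypothetical helper names and toy validation scores; it scans every observed score as a candidate threshold, which is one reasonable choice rather than a prescribed procedure.

```python
def precision_recall_f1(y_true, y_score, threshold):
    """Threshold-dependent metrics; a score >= threshold is predicted fraud."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, y_score):
    """Candidate thresholds are the observed scores; pick the F1 maximizer."""
    candidates = sorted(set(y_score))
    return max(candidates,
               key=lambda t: precision_recall_f1(y_true, y_score, t)[2])


def fpr_at_recall(y_true, y_score, target_recall=0.90):
    """FPR at the highest threshold whose recall reaches the target."""
    n_neg = sum(1 for y in y_true if y == 0)
    for t in sorted(set(y_score), reverse=True):
        _, recall, _ = precision_recall_f1(y_true, y_score, t)
        if recall >= target_recall:
            fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
            return fp / n_neg
    return 1.0


# Toy validation set (hypothetical): 3 fraud cases among 10 transactions.
val_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
val_s = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.55, 0.80, 0.90, 0.15]

threshold = best_f1_threshold(val_y, val_s)
fpr90 = fpr_at_recall(val_y, val_s, 0.90)
```

In this sketch the threshold is chosen on validation data only, so the test-set precision, recall, and F1 remain unbiased by the selection step.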
All experiments are repeated with 5 different random seeds to account for stochasticity in model initialization and training. We report mean performance with standard deviations to provide statistically meaningful comparisons.
5.1.4. Implementation Details
Our framework is implemented in PyTorch 2.0. The Time-Aware Transformer encoder consists of 4 layers with model dimension , 8 attention heads, and feed-forward dimension of 1024. We use dropout with rate 0.1 after each sublayer for regularization. The temporal encoding uses 10 time bins with boundaries at 1 h, 6 h, 1 day, 3 days, 1 week, 2 weeks, 1 month, 3 months, 6 months, and 1 year. Sequences are padded or truncated to a maximum length of 50 transactions.
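As an illustration of the preprocessing described above, the sketch below maps inter-transaction gaps to the ten time bins and pads or truncates a sequence to 50 positions. Several details are assumptions of this sketch and are not specified in the text: months are taken as 30 days and a year as 365 days, gaps longer than one year are clipped into the last bin, truncation keeps the most recent transactions, padding is applied on the left, and an extra index is reserved for padding.

```python
from bisect import bisect_left

HOUR = 3600
DAY = 24 * HOUR

# Upper bin boundaries in seconds: 1 h, 6 h, 1 day, 3 days, 1 week, 2 weeks,
# 1 month, 3 months, 6 months, 1 year (30-day months are an assumption).
BOUNDARIES = [1 * HOUR, 6 * HOUR, 1 * DAY, 3 * DAY, 7 * DAY, 14 * DAY,
              30 * DAY, 90 * DAY, 180 * DAY, 365 * DAY]
MAX_LEN = 50
PAD_BIN = len(BOUNDARIES)  # hypothetical extra index reserved for padding


def time_bin(delta_seconds):
    """Index of the first boundary >= delta; gaps beyond one year are
    clipped into the last bin (assumption)."""
    return min(bisect_left(BOUNDARIES, delta_seconds), len(BOUNDARIES) - 1)


def encode_gaps(timestamps):
    """Bin the gap before each transaction, then pad/truncate to MAX_LEN."""
    gaps = [0] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    bins = [time_bin(g) for g in gaps]
    bins = bins[-MAX_LEN:]                           # keep the most recent
    return [PAD_BIN] * (MAX_LEN - len(bins)) + bins  # left-pad short sequences


short = encode_gaps([0, 1800, 1800 + 2 * DAY])
long_ = encode_gaps(list(range(0, 60 * 2 * HOUR, 2 * HOUR)))
```

The resulting bin indices could then feed an embedding table, with the padding index masked out in attention.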
For CTGAN, we use a generator with 2 residual blocks of dimension 256 and a discriminator with 2 layers of dimension 256. We train CTGAN for 300 epochs using Adam optimizer with learning rate and batch size 500. The gradient penalty coefficient is set to 10. We generate synthetic fraud samples to achieve a 40% fraud ratio in the balanced training set, which we found optimal through validation set tuning.
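The number of synthetic fraud rows needed to reach a target fraud ratio follows from simple arithmetic: with L legitimate rows, F real fraud rows, and target ratio r, solving (F + s) / (L + F + s) = r gives s = rL / (1 - r) - F. The sketch below applies this to hypothetical class counts; the counts are placeholders, not the dataset statistics.

```python
import math


def synthetic_needed(n_legit, n_fraud, target_ratio):
    """Smallest number of synthetic fraud rows such that fraud makes up
    at least target_ratio of the balanced training set."""
    required_fraud = target_ratio * n_legit / (1.0 - target_ratio)
    return max(0, math.ceil(required_fraud - n_fraud))


# Hypothetical counts, chosen only to illustrate the calculation.
n_legit, n_fraud = 96_500, 3_500
n_syn = synthetic_needed(n_legit, n_fraud, 0.40)
ratio = (n_fraud + n_syn) / (n_legit + n_fraud + n_syn)
```

Under these placeholder counts, synthetic rows would outnumber real fraud rows by a wide margin at a 40% target, which is one reason the ratio is worth tuning on validation data.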
The integrated framework is trained end-to-end using Adam optimizer with learning rate , batch size 32, and early stopping with patience of 10 epochs based on validation AUC-ROC. We use class weights inversely proportional to class frequencies in the balanced dataset. Training converges within 50–80 epochs on average. All experiments are conducted on NVIDIA A100 GPUs with 40 GB memory.
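Two of the training controls mentioned above, inverse-frequency class weights and patience-based early stopping on validation AUC-ROC, can be sketched in a framework-agnostic way. The class and function names are hypothetical, and the weight normalization (n / (k * count), so that weights average to one over samples) is an assumed convention.

```python
def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        """Record one epoch; return True when training should stop."""
        if val_auc > self.best:
            self.best = val_auc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# 40% fraud ratio in the balanced set, as in the text.
weights = inverse_frequency_weights([0] * 60 + [1] * 40)

# Hypothetical validation curve, with a short patience for illustration.
stopper = EarlyStopping(patience=3)
history = [0.90, 0.92, 0.91, 0.92, 0.919, 0.95]
stop_epoch = next(i for i, auc in enumerate(history) if stopper.step(auc))
```

In practice the checkpoint with the best validation score, not the last epoch, would be restored after stopping.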
5.2. Main Results
Table 1 presents a comprehensive comparison between our proposed framework and all baseline methods. Our Time-Aware Transformer with CTGAN (TAT-CTGAN) achieves the best performance across all metrics, demonstrating the effectiveness of combining generative oversampling with temporal sequential modeling.
Our method achieves an AUC-ROC of 0.982, significantly outperforming the vanilla Transformer (0.956) and the best RNN baseline LSTM (0.931). The improvement is even more pronounced in AUC-PR (0.911 vs 0.834), demonstrating superior precision–recall trade-offs. The F1-score of 0.883 represents a substantial improvement over XGBoost (0.775), the best traditional method, indicating that our approach better balances precision and recall.
Notably, our method achieves an FPR of only 3.8% at 90% recall, which is critically important for production deployment where excessive false alarms can overwhelm fraud investigation teams. The vanilla Transformer achieves 6.1% FPR at the same recall level, while LSTM reaches 8.9%, demonstrating the practical advantage of our time-aware design and generative oversampling strategy.
The comparison between CTGAN + MLP (0.903 AUC-ROC) and MLP + SMOTE (0.874 AUC-ROC) isolates the contribution of CTGAN oversampling, showing a 2.9 percentage point improvement. Similarly, comparing vanilla Transformer (0.956) with LSTM (0.931) demonstrates the advantage of attention mechanisms over recurrent architectures. Our full framework combines both advantages, achieving synergistic improvements beyond either component alone.
The relatively low standard deviations across all metrics (typically less than 1% of the mean) indicate that our results are robust to random initialization and training stochasticity. Traditional methods show slightly higher variance, particularly in precision and recall, suggesting greater sensitivity to seed-dependent sampling of the training data.
We further evaluate the sensitivity of the framework to key hyperparameters, including the number of temporal bins, Transformer embedding dimension, and GAN training iterations. The results indicate that the proposed framework remains stable across a reasonable range of parameter settings.
5.3. Ablation Study
To understand the individual contributions of our framework’s components, we conduct comprehensive ablation studies.
Table 2 presents results for various configurations.
Replacing CTGAN with SMOTE reduces AUC-ROC from 0.982 to 0.961, indicating that CTGAN-based oversampling is more effective than interpolation-based oversampling in our downstream fraud detection pipeline. The gap is even larger in AUC-PR (6.2 percentage points), indicating that CTGAN-generated samples particularly improve precision. Removing oversampling entirely drops performance to 0.943 AUC-ROC, demonstrating that some form of class balancing is essential for optimal performance.
Removing temporal encoding decreases AUC-ROC by 1.4 percentage points to 0.968, showing that explicit time interval modeling is valuable for capturing fraud patterns. The temporal bias in attention contributes an additional 0.9 percentage points, indicating that allowing the model to modulate attention based on temporal proximity helps focus on relevant transaction windows. Interestingly, removing positional encoding has a larger impact (2.3 percentage point drop) than removing temporal encoding, suggesting that sequential order is even more critical than absolute time intervals. This makes intuitive sense: the sequence of events matters more than their exact timing.
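To illustrate how a temporal bias can modulate attention, the sketch below adds a per-key bias term to scaled dot-product scores before the softmax. This generic additive form is an assumption made for illustration; the bias values, vectors, and function names are hypothetical and do not reproduce the trained model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, bias=None):
    """Scaled dot-product attention weights with an optional additive bias."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return softmax(scores)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical bias, e.g. looked up from the time bin of the gap between the
# query transaction and each key; the second key is temporally distant.
temporal_bias = [0.0, -2.0, 0.0]

plain = attention_weights(query, keys)
biased = attention_weights(query, keys, temporal_bias)
```

With identical content, the two matching keys receive equal weight in `plain`, whereas in `biased` the temporally distant key is down-weighted and the freed attention mass shifts to the others.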
Using standard attention without any temporal modifications reduces performance to 0.956, essentially matching the vanilla Transformer baseline. This confirms that our time-aware modifications—temporal encoding, temporal bias, and positional encoding—collectively contribute the improvement over standard Transformers.
Architecture depth analysis shows that reducing to 2 layers decreases AUC-ROC by 1.1 percentage points, indicating that deeper representations are beneficial. However, increasing to 6 layers provides only marginal additional improvement (0.2 percentage points) while significantly increasing computational cost, suggesting that 4 layers represent a good trade-off between performance and efficiency.
5.4. Impact of Synthetic Sample Ratio
We investigate how the ratio of synthetic fraud samples affects model performance.
Figure 2 shows performance curves as we vary the target fraud ratio in the balanced dataset from 10% (minimal oversampling) to 70% (aggressive oversampling).
Performance improves rapidly as we increase the fraud ratio from 10% to 40%, with AUC-ROC rising from 0.951 to 0.982 and F1-score increasing from 0.846 to 0.883. This demonstrates that the model benefits substantially from additional synthetic fraud examples up to a certain point. However, beyond 40–50% fraud ratio, performance plateaus and even slightly decreases at 70% fraud ratio (0.979 AUC-ROC, 0.878 F1-score).
This pattern suggests an optimal balance between synthetic and real samples. With too few synthetic samples, the model still suffers from class imbalance and cannot learn the full diversity of fraud patterns. With too many synthetic samples, the model may overfit to the synthetic data distribution, which inevitably contains some artifacts from the generative process. The plateau around 40–50% indicates that at this ratio, the model has sufficient fraud examples to learn robust patterns without being dominated by synthetic artifacts.
Interestingly, precision shows more sensitivity to the synthetic ratio than recall. At 70% fraud ratio, precision drops to 0.872 while recall remains at 0.881, suggesting that excessive synthetic samples may introduce false patterns that increase false positive rates. This reinforces the importance of carefully tuning the oversampling ratio based on validation set performance rather than maximizing fraud representation.
5.5. Sequence Length Analysis
Transaction sequence length varies widely across users, ranging from 3 to 156 transactions in our dataset. We analyze how sequence length affects detection performance by stratifying the test set into length buckets and evaluating performance separately for each bucket.
Figure 3 presents the results.
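The stratification step can be sketched as a simple grouping by sequence length. The bucket edges follow the ranges discussed in this section; the 31-49 bucket and the helper names are assumptions of this sketch, and the example records are hypothetical.

```python
# (low, high) inclusive bounds; None means open-ended.
BUCKETS = [(3, 5), (6, 10), (11, 20), (21, 30), (31, 49), (50, None)]


def bucket_label(length):
    """Return the bucket name for a sequence length, or None if below 3."""
    for lo, hi in BUCKETS:
        if length >= lo and (hi is None or length <= hi):
            return f"{lo}+" if hi is None else f"{lo}-{hi}"
    return None


def stratify(examples):
    """Group (seq_len, label, score) records by length bucket."""
    groups = {}
    for seq_len, label, score in examples:
        key = bucket_label(seq_len)
        if key is not None:
            groups.setdefault(key, []).append((label, score))
    return groups


# Hypothetical test records: (sequence length, true label, model score).
examples = [(4, 0, 0.1), (8, 1, 0.9), (25, 0, 0.2), (156, 1, 0.7), (2, 0, 0.3)]
groups = stratify(examples)
```

Each group can then be passed to the same metric functions used for the full test set, provided it contains both classes.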
For very short sequences (3–5 transactions), AUC-ROC is 0.941 with F1-score of 0.827, indicating that limited historical context constrains the model’s ability to identify behavioral anomalies. As sequence length increases to 6–10 transactions, performance improves to 0.968 AUC-ROC and 0.859 F1-score. The improvement continues for sequences of 11–20 transactions (0.981 AUC-ROC, 0.881 F1-score) and 21–30 transactions (0.986 AUC-ROC, 0.892 F1-score).
Beyond 30 transactions, performance stabilizes with minimal further improvement, suggesting that most relevant behavioral patterns are captured within a 30-transaction window. Very long sequences (50+ transactions) show comparable performance (0.985 AUC-ROC, 0.889 F1-score) to medium-length sequences, indicating that our attention mechanism successfully focuses on relevant segments even when processing extensive histories.
This analysis has practical implications for deployment: the model can achieve strong performance with moderate sequence lengths (20–30 transactions), which most users accumulate within a few weeks of activity. New users with limited history will have reduced but still reasonable detection capability, while long-term users benefit from richer behavioral context.
Comparing our method to LSTM across these length buckets reveals an interesting pattern: LSTM performance degrades slightly for very long sequences (50+ transactions), likely due to vanishing gradient issues, whereas our Transformer-based approach maintains stable performance. This validates our architectural choice for handling long-range dependencies.
5.6. Temporal Dynamics Visualization
To understand what temporal patterns the model learns, we visualize the attention weights for selected fraud cases.
Figure 4 shows attention heatmaps for two representative sequences.
In the first case (
Figure 4a), we observe a burst pattern where a user’s card was stolen and used for three rapid transactions within 2 h. The attention heatmap shows that these fraudulent transactions (positions 8–10) attend strongly to each other, with attention weights exceeding 0.3 for intra-fraud connections compared to less than 0.1 for connections to legitimate transactions. This demonstrates that the model has learned to identify rapid succession as a fraud indicator, consistent with known fraud patterns where stolen cards are exploited quickly before being blocked.
The second case (
Figure 4b) illustrates a more sophisticated pattern where a fraudster gradually escalates transaction amounts over several days to avoid triggering rule-based systems. The final fraudulent transaction (position 15) attends most strongly to previous transactions at positions 11 (amount 723), 13, and 14, all of which show increasing amounts. The attention weights to these transactions (0.28, 0.31, 0.35) are significantly higher than to earlier legitimate transactions (typically below 0.15), indicating that the model recognizes the escalation pattern as anomalous.
Interestingly, legitimate transactions show more diffuse attention patterns, distributing attention relatively evenly across recent history. This suggests the model interprets consistent, stable patterns as indicators of legitimate behavior, whereas focused attention on specific subsequences signals anomalies.
These visualizations provide interpretability for fraud investigators, who can inspect which historical transactions influenced the model’s decision, facilitating more efficient case review and validation.
5.7. Comparison of Oversampling Quality
To validate that CTGAN generates higher-quality synthetic samples than SMOTE, we conduct a quantitative comparison of the synthetic data distributions. We measure distribution similarity using Maximum Mean Discrepancy (MMD) between synthetic and real fraud samples, and evaluate discriminability using a held-out classifier’s ability to distinguish synthetic from real fraud (lower accuracy indicates more realistic synthetic samples).
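For reference, a minimal sketch of a squared-MMD estimate with an RBF kernel is given below. The kernel choice, the bandwidth parameter `gamma`, the use of the biased (V-statistic) estimator, and the toy point sets are all assumptions of this sketch; the scaling of the values reported in Table 3 is not reproduced here.

```python
import math


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy 2-D samples (hypothetical): "close" mimics the real set, "far" does not.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
close = [[0.1, 0.0], [0.9, 0.1], [0.0, 1.1]]
far = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

mmd_close = mmd_squared(real, close)
mmd_far = mmd_squared(real, far)
```

A lower value indicates that the synthetic sample set is closer to the real one under the chosen kernel, so the result is sensitive to the bandwidth and to feature scaling.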
Table 3 shows that CTGAN achieves significantly lower MMD (8.3) compared to SMOTE (18.4) and ADASYN (16.7), indicating that CTGAN-generated samples more closely match the distribution of real fraud cases. The discriminator accuracy for CTGAN is 54.7%, barely better than random guessing, whereas SMOTE samples are correctly identified 73.2% of the time. This demonstrates that CTGAN produces samples that are much harder to distinguish from real fraud.
Feature coverage measures the fraction of unique categorical feature combinations observed in synthetic samples compared to real fraud. CTGAN covers 89.2% of real fraud feature combinations, substantially higher than SMOTE’s 64.1%. This indicates that CTGAN better captures the diversity of fraud patterns, including rare combinations that SMOTE’s interpolation cannot generate.
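The coverage measure defined above reduces to a set intersection over categorical feature tuples. In the sketch below, the feature values and the function name are hypothetical and serve only to illustrate the calculation.

```python
def feature_coverage(real_rows, synthetic_rows):
    """Fraction of distinct categorical combinations in the real fraud data
    that also appear in the synthetic data."""
    real = {tuple(r) for r in real_rows}
    synthetic = {tuple(r) for r in synthetic_rows}
    return len(real & synthetic) / len(real)


# Hypothetical categorical tuples: (card network, card type, device type).
real_fraud = [
    ("visa", "debit", "mobile"),
    ("visa", "credit", "desktop"),
    ("mc", "credit", "mobile"),
    ("amex", "credit", "desktop"),
    ("visa", "debit", "mobile"),  # duplicate combination, counted once
]
synthetic_fraud = [
    ("visa", "debit", "mobile"),
    ("visa", "credit", "desktop"),
    ("mc", "credit", "mobile"),
    ("mc", "debit", "desktop"),   # novel combination, does not add coverage
]

coverage = feature_coverage(real_fraud, synthetic_fraud)
```

Note that combinations appearing only in the synthetic data do not affect this measure, so it should be read alongside the realism metrics rather than in isolation.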
We also visualize the distributions using t-SNE dimensionality reduction in
Figure 5, projecting real fraud, synthetic fraud from CTGAN, synthetic fraud from SMOTE, and legitimate transactions into 2D space. The visualization confirms that CTGAN samples occupy similar regions to real fraud samples, including the multi-modal structure visible in the real fraud distribution. SMOTE samples, by contrast, concentrate in a tighter cluster that represents the interpolated region between real fraud samples but misses the full distributional complexity. Legitimate transactions form a separate cluster, validating that fraud and legitimate transactions have distinct feature distributions that our model can learn to distinguish.
5.8. Computational Efficiency
While our framework achieves superior detection performance, computational efficiency is critical for practical deployment.
Table 4 compares training and inference times across methods.
Our complete framework requires 5.4 h for full training, including 1.5 h for CTGAN pre-training and 3.9 h for Transformer training. While this is longer than XGBoost (42 min) or LSTM (2.8 h), the training is a one-time offline cost. For production deployment with periodic retraining (e.g., weekly), this overhead is acceptable given the substantial performance gains.
Inference time is 0.71 s per 1000 sequences, corresponding to 1408 sequences per second or approximately 17,000 individual transactions per second (given average sequence length of 12). This throughput is sufficient for real-time fraud detection in most payment processing scenarios. The Transformer architecture’s parallelizability enables efficient batch processing, with inference time scaling sub-linearly with batch size.
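The throughput figures quoted above follow directly from the measured batch time, as the short calculation below shows. The variable names are illustrative; the inputs are the numbers stated in the text.

```python
seconds_per_1000_sequences = 0.71
average_sequence_length = 12

# Sequences scored per second.
sequences_per_second = 1000 / seconds_per_1000_sequences

# Transactions covered per second, counting every transaction in a sequence.
transactions_per_second = sequences_per_second * average_sequence_length

# Average time per sequence in milliseconds.
ms_per_sequence = seconds_per_1000_sequences
```

The per-sequence figure is an amortized batch average; the latency of a single unbatched request may differ.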
Memory requirements (9.6 GB) are moderate and well within the capacity of modern GPUs or can be accommodated on high-memory CPU servers if necessary. The model contains 5.1 million parameters, making it deployable on edge devices or cloud services without specialized infrastructure.
Compared to LSTM, our method incurs 36% longer inference time but delivers a 5.1-percentage-point higher AUC-ROC (0.982 vs. 0.931), representing a favorable accuracy–efficiency trade-off. For applications requiring maximum throughput, model distillation or quantization could further reduce inference costs while maintaining most of the performance gain.
5.9. Error Analysis
To understand the limitations of our approach, we analyze false positives and false negatives on the test set. We randomly sample 100 errors of each type and manually inspect their characteristics.
False Positives: The most common false positive pattern (38% of cases) involves legitimate users making unusual but valid purchases such as expensive one-time purchases (electronics, jewelry, travel bookings) that deviate significantly from their typical spending patterns. The model correctly identifies these as anomalous but cannot distinguish genuine lifestyle changes from fraud without additional verification signals. Another frequent pattern (27%) is shared card usage within families, where multiple individuals use the same card with different purchasing patterns. For example, a parent’s card used by a teenager for gaming purchases generates attention patterns similar to account takeover fraud. These cases highlight the limitation of transaction data alone without user identity verification. Geographic anomalies account for 19% of false positives, where users traveling to new locations generate legitimate transactions that appear suspicious due to sudden location changes. Cross-border transactions, in particular, trigger higher fraud scores even when legitimate. These observations suggest that incorporating additional contextual signals, such as long-term behavioral profiles, user identity verification features, or external risk indicators, may help reduce false positives and improve the robustness of fraud detection systems.
False Negatives: The most challenging missed fraud cases (41%) involve fraudsters who carefully mimic the victim’s normal spending patterns, making small purchases at familiar merchant categories. These attacks evade detection precisely because they avoid the behavioral anomalies that our model relies on. Technical fraud using stolen valid card details accounts for 31% of false negatives. When fraudsters obtain complete card information including CVV and billing address, their transactions appear identical to legitimate ones in our feature space. Detecting these requires additional signals like device fingerprinting or behavioral biometrics, which are outside our model’s scope. Very short-lived attacks (18%), where fraud occurs immediately after card issuance, provide minimal historical context, limiting our sequential model’s effectiveness. These cases would benefit from incorporating card-not-present indicators, IP geolocation, and other auxiliary features.
This error analysis reveals opportunities for improvement through multi-modal fusion incorporating device signals, biometric authentication logs, and external fraud intelligence, which we discuss in the conclusion as future work directions.
5.10. Further Discussions
5.10.1. Practical Deployment for Real-Time Fraud Detection
In practical fraud detection systems, models are required to process incoming transactions with low latency. In the proposed framework, the CTGAN component is used only during the training stage to generate additional fraudulent samples and mitigate class imbalance. Once training is completed, the deployed system relies solely on the time-aware Transformer model for inference. During real-time operation, transactions are processed sequentially as they arrive. For each new transaction associated with a particular card or account, the system constructs a transaction sequence using the most recent historical records and feeds it into the trained model to compute a fraud probability score. Since inference involves only a forward pass through the neural network, predictions can be generated efficiently and integrated into existing fraud monitoring pipelines. This design ensures that the computationally intensive generative modeling stage does not affect real-time inference performance, making the proposed approach compatible with practical fraud detection environments.
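The per-card sequence construction described above can be sketched with a bounded history buffer. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, the history is held in memory, and `model` stands in for any callable that maps a transaction sequence to a fraud score.

```python
from collections import defaultdict, deque

MAX_LEN = 50  # maximum sequence length, as in the implementation details


class SequenceBuffer:
    """Keeps the most recent transactions for each card or account."""

    def __init__(self, max_len=MAX_LEN):
        self._history = defaultdict(lambda: deque(maxlen=max_len))

    def append_and_get(self, card_id, transaction):
        """Add the new transaction and return the current sequence."""
        history = self._history[card_id]
        history.append(transaction)  # oldest entry is dropped automatically
        return list(history)


def score_transaction(buffer, model, card_id, transaction):
    """Build the sequence for this card and run one forward pass."""
    sequence = buffer.append_and_get(card_id, transaction)
    return model(sequence)


# Stand-in "model" for illustration only: returns the relative sequence length.
def dummy_model(sequence):
    return len(sequence) / MAX_LEN


buffer = SequenceBuffer()
for i in range(60):
    last_score = score_transaction(buffer, dummy_model, "card-A", {"id": i})
other_score = score_transaction(buffer, dummy_model, "card-B", {"id": 0})
latest_sequence = buffer.append_and_get("card-A", {"id": 60})
```

A production system would typically back this buffer with a low-latency key-value store so that history survives restarts and is shared across scoring workers.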
5.10.2. Computational Cost and Resource Requirements
The proposed framework introduces additional computational cost during the training stage due to the use of generative modeling and deep neural architectures. Training the CTGAN model to generate synthetic fraud samples requires iterative adversarial optimization, while the time-aware Transformer model involves multi-head attention computations for sequential modeling. These training processes may require GPU acceleration for efficient optimization on large datasets. However, the generative model is used only during the training phase to address class imbalance. Once the model is trained, the deployed system relies solely on the time-aware Transformer for inference. Prediction therefore involves only a forward pass through the neural network, which can be executed efficiently and integrated into existing fraud monitoring pipelines. Consequently, the proposed approach remains practical for real-world fraud detection systems where low-latency predictions are required.
6. Conclusions and Future Works
This paper presents a hybrid fraud detection framework that integrates generative adversarial networks with time-aware Transformer architectures. While the individual modeling components are not entirely new, our contribution lies in the systematic integration of generative tabular augmentation and temporal sequence modeling within a unified fraud detection pipeline. By formulating fraud detection as a sequential classification problem and modeling transaction histories as temporal sequences, our approach captures complex behavioral patterns and temporal dependencies that point-wise classifiers cannot detect. The Time-Aware Transformer explicitly incorporates temporal encodings and time-aware attention mechanisms to distinguish legitimate spending patterns from fraudulent anomalies, while CTGAN generates synthetic fraud samples that enrich the minority-class training distribution and help address the severe class imbalance inherent in fraud detection datasets. Experiments on the IEEE-CIS Fraud Detection dataset demonstrate strong performance against the representative baselines considered in this study, achieving 0.982 AUC-ROC and 0.883 F1-score. Ablation studies further confirm the contributions of both the generative oversampling and sequential modeling components, and attention visualizations provide interpretability for fraud investigation.
Despite these promising results, several important limitations should be explicitly acknowledged. First, the current evaluation is conducted on a single benchmark dataset (IEEE-CIS). While this dataset is widely used and reflects realistic transaction scenarios, relying on a single dataset constrains the extent to which the results can be generalized to other fraud detection environments. Therefore, the empirical findings of this work should be interpreted within this experimental setting, and broader validation across multiple datasets remains necessary. Second, the experimental comparisons are limited to traditional machine learning models, standard deep learning approaches, and supervised sequential architectures. We do not include several strong contemporary directions, such as graph neural network-based fraud detection, self-supervised or contrastive sequence learning, and diffusion-based tabular generative models. While this design allows for controlled comparison within a supervised sequential learning framework, it also limits the scope of empirical claims. As such, the reported performance should not be interpreted as establishing superiority over all existing approaches, but rather within the specific evaluation scope considered in this study. Third, the integration of CTGAN-generated synthetic samples into temporal transaction sequences introduces a conceptual limitation. The generative model operates at the level of individual tabular transactions and does not explicitly enforce sequence-level temporal or behavioral coherence. Although synthetic transactions are reorganized into chronological sequences during preprocessing, there is no guarantee that generated fraud events are fully consistent with the surrounding legitimate transaction context. As a result, the Time-Aware Transformer is trained on partially synthetic sequences that may not perfectly reflect real-world behavioral dynamics. 
This limitation highlights an important gap between tabular generative modeling and sequence-aware data generation. Fourth, while the reported performance is strong and stable across multiple runs, reproducibility is an important consideration. To facilitate transparency and independent verification, we provide the implementation code as
Supplementary Material accompanying this submission, and it will be made publicly available upon publication. In addition to these limitations, the framework requires sufficient transaction history for effective sequential modeling, which may reduce performance for new accounts with limited behavioral records. Error analysis also indicates that sophisticated fraud patterns closely mimicking legitimate behavior and cases involving fully compromised credentials remain challenging to detect using transaction features alone. Furthermore, the computational overhead introduced by generative modeling may pose challenges for scenarios requiring frequent retraining or real-time adaptation.
Future research directions include several promising avenues. First, extending the framework to incorporate relational inductive biases through graph neural networks could better capture interactions among entities such as cards, merchants, and devices. Second, developing sequence-aware generative models that explicitly preserve temporal coherence may address the limitations of tabular-only generation. Third, evaluating the framework across multiple public datasets would strengthen its generalizability. Fourth, exploring self-supervised or contrastive pretraining for transaction sequences may improve representation learning under limited labels. Fifth, investigating alternative generative models such as diffusion-based tabular synthesis could further enhance sample quality. Finally, federated and continual learning strategies may enable adaptive and privacy-preserving fraud detection in real-world deployment settings. Beyond fraud detection, the proposed paradigm of combining generative augmentation with temporal sequence modeling may generalize to other imbalanced sequential classification problems, including cybersecurity intrusion detection, rare disease diagnosis from patient histories, predictive maintenance, and anomaly detection in financial systems.