Temporal Transformer with Conditional Tabular GAN for Credit Card Fraud Detection: A Sequential Deep Learning Approach

Chen, Jiaying; Liang, Yiwen; Liu, Jingyi; Zhou, Mengjie

doi:10.3390/math14071183

¹ SC Johnson Graduate School of Management, Cornell University, Ithaca, NY 14853, USA

² Department of Computer Science, University of Bristol, Bristol BS8 1SS, UK

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(7), 1183; https://doi.org/10.3390/math14071183

Submission received: 9 February 2026 / Revised: 26 March 2026 / Accepted: 27 March 2026 / Published: 1 April 2026

Abstract

Credit card fraud detection remains a critical challenge in financial security, characterized by severe class imbalance and the need to capture complex temporal patterns in transaction sequences. Traditional machine learning approaches treat transactions as independent events, failing to model the sequential nature of user behavior and suffering from inadequate handling of minority class samples. In this paper, we propose an integrated framework that combines generative modeling and time-aware sequential learning for credit card fraud detection. Our approach addresses two fundamental limitations: (1) we model transaction histories as temporal sequences using a Transformer-based architecture that captures both long-term dependencies and abrupt behavioral changes through multi-head self-attention mechanisms, and (2) we employ CTGAN to generate high-quality synthetic fraudulent samples, providing more effective oversampling than conventional techniques like SMOTE. The Time-Aware Transformer incorporates temporal encoding and position-aware attention to preserve transaction order and time intervals, while CTGAN learns the complex conditional distributions of fraudulent transactions to produce realistic synthetic samples. We evaluate our framework on the IEEE-CIS Fraud Detection dataset, demonstrating significant improvements over representative classical and sequential deep-learning baselines. Experimental results show that our method achieves superior performance with an AUC-ROC of 0.982, precision of 0.891, recall of 0.876, and F1-score of 0.883, outperforming the representative baselines considered in this study, including traditional machine learning models, standalone deep learning architectures, and supervised sequential neural models. Ablation studies confirm the individual contributions of both the sequential modeling component and the generative oversampling strategy. 
Our work demonstrates that combining temporal sequence modeling with generative synthesis provides a robust solution for imbalanced fraud detection, with potential applications extending to other domains requiring sequential pattern recognition under extreme class imbalance.

Keywords: credit card fraud detection; temporal transformer; Generative Adversarial Networks; sequential modeling; class imbalance; deep learning

MSC: 68T07; 68T09; 62P05

1. Introduction

Credit card fraud has become an increasingly severe threat to financial institutions and consumers worldwide, with global losses exceeding $32 billion annually [1]. The rapid digitalization of payment systems and the proliferation of e-commerce platforms have created unprecedented opportunities for fraudulent activities, making automated fraud detection systems indispensable for modern financial security [2]. However, detecting fraudulent transactions remains a formidable challenge due to several inherent characteristics: extreme class imbalance (typically less than 1% of transactions are fraudulent), evolving fraud patterns as attackers adapt to detection systems, and the need for real-time decision-making with minimal false positives [3,4]. Another important challenge in fraud detection is concept drift, where fraudulent behaviors evolve over time as attackers adapt their strategies to evade detection systems. This leads to distribution shifts between historical training data and future transactions, often causing models to degrade when evaluated on out-of-time data. To better reflect this real-world scenario, our framework models transaction histories as temporal sequences and the experiments adopt a chronological train–validation–test split, where models are trained on earlier transactions and evaluated on later ones.

Traditional machine learning approaches to fraud detection, including logistic regression [4], random forests [5], and support vector machines [4], have demonstrated reasonable performance on balanced datasets. However, these methods predominantly treat each transaction as an independent event, ignoring the rich temporal context embedded in user transaction histories [6]. In reality, fraudulent behavior often manifests as anomalous sequences of transactions that deviate from a user’s established spending patterns [7]. A genuine cardholder’s transaction sequence typically exhibits temporal coherence and behavioral consistency, whereas fraudulent sequences frequently display sudden changes in transaction amounts, geographical locations, merchant categories, or transaction frequencies [8].

Recent advances in deep learning have shown promising results in fraud detection [9]. Convolutional Neural Networks (CNNs) [10] and Recurrent Neural Networks (RNNs) [6] have been applied to extract complex features from transaction data. Long Short-Term Memory (LSTM) networks [11] and Gated Recurrent Units (GRUs) [8] have been employed to model sequential dependencies in transaction streams. However, these architectures face significant limitations: RNN-based models suffer from vanishing gradients when processing long sequences and cannot effectively parallelize during training [12]. Moreover, existing deep learning approaches often struggle with the extreme class imbalance problem, where fraudulent transactions constitute only 0.1–2% of the total dataset [3].

The class imbalance problem has been addressed through various resampling techniques. The Synthetic Minority Over-sampling Technique (SMOTE) [13] and its variants generate synthetic minority samples by interpolating between existing instances. Adaptive Synthetic Sampling (ADASYN) [14] adjusts the generation of synthetic samples based on local density distributions. However, these interpolation-based methods produce simplistic synthetic samples that fail to capture the complex, multi-modal distributions characteristic of fraudulent transactions [15]. Furthermore, they may introduce noise by generating samples in overlapping regions between classes [16].

Generative Adversarial Networks (GANs) [17] have emerged as powerful tools for learning complex data distributions and generating high-quality synthetic samples. Conditional Tabular GAN (CTGAN) [18] specifically addresses the challenges of generating tabular data with mixed discrete and continuous variables, making it particularly suitable for financial transaction data. Recent work has demonstrated the effectiveness of GANs in handling imbalanced classification tasks [19,20]. However, existing GAN-based fraud detection systems typically apply generative models in isolation, without leveraging the sequential nature of transaction data [19].

Transformer architectures [12], originally designed for natural language processing, have revolutionized sequence modeling through their self-attention mechanisms. Unlike RNNs, Transformers process sequences in parallel and can capture long-range dependencies without suffering from gradient vanishing [21]. BERT [22] demonstrated the power of bidirectional Transformers for learning contextual representations. Recently, Transformers have been successfully applied to time-series forecasting [23,24] and anomaly detection [25]; however, their application to fraud detection with explicit temporal awareness remains underexplored.

In this paper, we propose an integrated framework that synergistically combines a Time-Aware Transformer encoder with CTGAN for credit card fraud detection. Our approach makes the following contributions:

1. Sequential Transaction Modeling: We formulate fraud detection as a sequence classification problem, where each user’s transaction history is modeled as a temporal sequence. We develop a Time-Aware Transformer encoder that incorporates explicit temporal encoding to capture both the chronological order and time intervals between transactions, enabling the model to learn complex temporal dependencies and detect abrupt behavioral changes indicative of fraud.
2. Generative Oversampling Strategy: We employ CTGAN to generate synthetic fraudulent transactions that capture important statistical characteristics and conditional structure of fraud cases in tabular data. Unlike interpolation-based methods, CTGAN learns a generative model of the minority class through adversarial training, producing diverse synthetic samples that effectively augment the training set in our experimental setting.
3. Unified Framework: We integrate the generative oversampling and sequential modeling components into a cohesive training pipeline. The synthetic samples generated by CTGAN are organized into realistic transaction sequences, which are then used to train the Time-Aware Transformer, creating a robust detector that benefits from both improved class balance and enhanced sequential pattern recognition.
4. Extensive Evaluation: We conduct extensive experiments on the IEEE-CIS Fraud Detection dataset and compare our approach against representative traditional machine learning baselines, standalone deep learning models, and supervised sequential architectures. Through ablation studies, we validate the individual contributions of each component and analyze the framework’s behavior under various configurations. We note that stronger contemporary directions, such as graph neural network-based fraud detection [26,27,28], self-supervised/contrastive sequence learning [29,30,31,32], and diffusion-based tabular generation [33,34], are not included as direct baselines in the present study and remain important directions for future comparison.

The remainder of this paper is organized as follows: Section 2 reviews related work in fraud detection, sequential modeling, and generative approaches for imbalanced learning. Section 3 provides preliminary background on Transformer architectures and GANs. Section 4 details our proposed methodology, including the Time-Aware Transformer design and the CTGAN-based oversampling strategy. Section 5 presents experimental results and analysis. Section 6 concludes the paper with discussions on limitations and future research directions.

2. Related Works

This section reviews the existing literature in three key areas relevant to our work: credit card fraud detection methodologies, sequential modeling approaches for temporal data, and generative models for handling class imbalance.

2.1. Credit Card Fraud Detection

Credit card fraud detection has been extensively studied using various machine learning paradigms. Early approaches relied on rule-based systems and statistical methods [35], which required manual feature engineering and domain expertise. The introduction of machine learning techniques marked a significant advancement, with researchers exploring logistic regression [4] and ensemble methods such as random forests [5] and gradient boosting machines [36]. While these traditional methods achieved reasonable performance on balanced datasets, they struggled with the extreme class imbalance inherent in fraud detection scenarios.

To address class imbalance, various sampling strategies have been proposed. Dal Pozzolo et al. [3] conducted a comprehensive practitioner study highlighting the importance of appropriate evaluation metrics and the challenges posed by concept drift in fraud patterns. Cost-sensitive learning approaches [37] assign higher misclassification costs to fraudulent transactions, guiding the model to prioritize minority class detection. Ensemble methods combining multiple weak learners have also shown promise, with techniques like BalanceCascade [38] and EasyEnsemble demonstrating improved detection of rare fraud cases.

Graph-based approaches have emerged as a powerful paradigm for fraud detection by leveraging relational information. Van Vlasselaer et al. [7] proposed APATE, which constructs transaction networks and applies network-based features to enhance detection performance. Weber et al. [39] introduced Graph Convolutional Networks (GCNs) for credit card fraud detection, demonstrating that modeling relationships between cardholders, merchants, and transactions improves classification accuracy. However, these methods often treat transactions as static snapshots and do not fully exploit the temporal dynamics of transaction sequences.

Deep learning has gained traction in fraud detection due to its ability to automatically learn hierarchical feature representations. Fu et al. [10] applied CNNs to extract spatial patterns from transaction data, while autoencoders have been used for unsupervised anomaly detection [40]. Roy et al. [41] proposed a deep learning framework combining multiple neural architectures, demonstrating improved performance over traditional methods. However, these approaches generally do not explicitly model the sequential nature of user behavior, treating each transaction independently or with limited temporal context.

2.2. Sequential Modeling for Fraud Detection

Recognizing that fraudulent behavior manifests as anomalous transaction sequences, several researchers have explored sequential modeling techniques. Jurgovsky et al. [6] pioneered the application of sequence classification to fraud detection, demonstrating that modeling transaction histories as sequences significantly improves detection accuracy. They employed LSTM networks to capture temporal dependencies, showing that sequential models outperform traditional feature-based classifiers on real-world datasets.

RNN-based architectures, including LSTMs [11] and GRUs [8], have been widely adopted for modeling transaction sequences. Altman et al. [42] proposed a bidirectional LSTM architecture that processes transaction sequences in both forward and backward directions, capturing richer contextual information. Wang et al. [43] introduced an attention-based LSTM model that learns to focus on relevant transactions within a sequence, improving interpretability and performance. Despite these advances, RNN-based models face inherent limitations: they suffer from vanishing gradients when processing long sequences, cannot effectively parallelize during training, and struggle to capture very-long-range dependencies.

Temporal Point Processes (TPPs) offer an alternative approach for modeling irregularly-spaced transaction events. Du et al. [44] applied recurrent marked temporal point processes to credit card fraud detection, modeling the continuous-time dynamics of transaction arrivals. While TPPs provide a principled probabilistic framework, they typically require complex inference procedures and may not scale well to large transaction volumes.

The success of Transformer architectures in natural language processing [12] and time-series forecasting [23,24] has inspired their application to fraud detection. Li et al. [45] proposed a Transformer-based model for sequential recommendation that can be adapted to fraud detection, demonstrating the effectiveness of self-attention mechanisms for capturing dependencies in user behavior sequences. However, existing Transformer applications to fraud detection often lack explicit temporal encoding mechanisms to preserve critical time interval information between transactions, which is crucial for detecting time-sensitive fraud patterns. While several recent studies have introduced time-interval-aware attention mechanisms for recommendation systems and banking event streams, our Time-Aware Transformer is designed specifically for irregular credit card transaction sequences, where both temporal gaps and behavioral deviations across spending histories are critical for fraud detection. In this sense, the proposed framework emphasizes temporally informed behavioral anomaly detection rather than general sequential recommendation or event prediction.

More recently, self-supervised [29,30] and contrastive [31,32] sequence learning have emerged as strong paradigms for temporal representation learning, especially in settings where labeled fraud samples are scarce. These methods typically pretrain sequence encoders using masking- or augmentation-based objectives and then fine-tune them for downstream classification. Such approaches are highly relevant to fraud detection because they can leverage large-scale unlabeled transaction histories. However, they introduce an additional pretraining stage and augmentation design choices, and are therefore beyond the scope of the present study. We regard them as an important direction for future benchmarking against our supervised time-aware Transformer framework.

2.3. Generative Models for Imbalanced Learning

The severe class imbalance in fraud detection has motivated extensive research on data-level solutions. Traditional oversampling techniques, particularly SMOTE [13] and its variants such as Borderline-SMOTE [46] and ADASYN [14], generate synthetic minority samples through interpolation. While computationally efficient, these methods produce simplistic samples that may not capture the complex, multi-modal distributions of fraudulent transactions and can introduce noise in overlapping class regions [16].
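To make the interpolation argument concrete, the following minimal Python sketch generates one SMOTE-style sample. It is an illustration, not the reference SMOTE implementation: the function name `smote_sample` is hypothetical, and the neighbor is passed in directly instead of being drawn from the k nearest minority neighbors.

```python
import random

def smote_sample(x, neighbor, rng):
    """Return a point on the line segment between x and one minority neighbor."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [a + u * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)                      # fixed seed for repeatability
fraud_a = [1.0, 10.0]                       # toy minority sample (hypothetical values)
fraud_b = [3.0, 6.0]                        # toy minority neighbor (hypothetical values)
synthetic = smote_sample(fraud_a, fraud_b, rng)
```

Every synthetic point lies on the segment joining two existing fraud samples, which is why such methods cannot place samples in modes of the distribution that the original minority points do not already span.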

Generative Adversarial Networks [17] have revolutionized synthetic data generation by learning to model complex data distributions through adversarial training. Conditional GANs [47] extend the basic GAN framework by conditioning generation on class labels, enabling targeted synthesis of minority class samples. CTGAN [18], specifically designed for tabular data, addresses unique challenges such as mixed data types, multi-modal distributions, and imbalanced categorical variables through mode-specific normalization and conditional vector sampling.

Several studies have explored GANs for fraud detection and imbalanced learning. Fiore et al. [19] demonstrated that GAN-generated synthetic fraud samples improve classification performance when combined with traditional machine learning models. Douzas and Bacao [48] proposed using GANs for oversampling in imbalanced classification tasks, showing improvements over SMOTE across multiple datasets. Engelmann and Lessmann [20] introduced Conditional Wasserstein GAN (CWGAN) for tabular data oversampling, demonstrating superior sample quality and downstream classification performance.

More recent work has explored diffusion models [33] and hybrid architectures [34] as alternative generative approaches. Diffusion models have shown remarkable success in image generation and have been adapted for tabular data synthesis, for example in TabDDPM [49]. They have demonstrated strong performance in modeling complex tabular distributions and often provide improved training stability compared with GAN-based generators, making them a promising option for synthetic data generation in imbalanced classification problems. However, they typically require longer training times and more computational resources than GANs, which may limit their practical applicability in fraud detection scenarios requiring frequent model updates.

Despite these advances, existing GAN-based fraud detection systems typically apply generative models in isolation, generating synthetic samples without considering the sequential structure of transaction data. Furthermore, most approaches focus solely on balancing class distributions without integrating generative oversampling with advanced sequential modeling architectures. In addition, diffusion-based tabular generators constitute a strong contemporary alternative to GAN-based oversampling. Although we discuss their potential advantages and computational trade-offs, we do not include them as direct baselines in the current experiments and leave such comparison to future work. Our work addresses these gaps by synergistically combining CTGAN-based oversampling with Time-Aware Transformer modeling, creating a unified framework that leverages both improved class balance and enhanced sequential pattern recognition. While these techniques have been studied independently in the prior literature, their combined application for large-scale tabular transaction fraud detection remains relatively underexplored.

2.4. Publicly Available Fraud Detection Datasets

Several publicly available datasets have been widely used in fraud detection research. One of the most commonly used benchmarks is the European cardholder transaction dataset [50], which contains real credit card transactions with a highly imbalanced fraud distribution and anonymized features derived from principal component analysis. Another widely used dataset is PaySim [51], a synthetic financial transaction dataset that simulates mobile money transactions and fraudulent activities in a payment system environment. In addition, the Elliptic Bitcoin dataset [39] models transactions on the Bitcoin network and is frequently used for fraud and illicit transaction detection using graph-based approaches. The IEEE-CIS Fraud Detection dataset [52], used in this study, provides a large-scale real-world benchmark containing transactional and identity-related features from online payments. Compared with many earlier datasets, IEEE-CIS offers richer feature diversity and a more realistic transaction environment, making it suitable for evaluating modern machine learning and deep learning models for fraud detection.

3. Preliminaries

This section provides the necessary background on the key technical components underlying our proposed framework: Transformer architecture and Generative Adversarial Networks.

3.1. Transformer Architecture

The Transformer architecture [12] revolutionized sequence modeling by replacing recurrent connections with self-attention mechanisms. Given an input sequence $X = [x_{1}, x_{2}, \dots, x_{n}]$, where each $x_{i} \in \mathbb{R}^{d}$ represents a transaction embedding, the Transformer encoder applies multi-head self-attention to capture dependencies between all positions in the sequence.

Self-Attention Mechanism: To model dependencies between transactions in a sequence, we adopt the standard self-attention mechanism, which computes attention weights between all pairs of positions in the sequence. For an input $X$, three matrices are computed through linear projections:

$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}, \tag{1}$$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d_{k}}$ are learnable projection matrices that produce the queries, keys, and values, respectively. The attention output is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V, \tag{2}$$

where the scaling factor $\sqrt{d_{k}}$ prevents the dot products from growing too large in magnitude.
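The attention computation in Eq. (2) can be sketched in a few lines of plain Python. This is a minimal illustration on toy matrices, not the implementation used in the paper; the helper names `softmax`, `matmul`, and `attention` are our own, and the query, key, and value matrices are assumed to be already projected.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K
    scores = matmul(q, k_t)                       # n x n similarity scores
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy sequence of three "transactions" with d_k = 2 (hypothetical values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(q, k, v)
```

Each row of `weights` sums to one. In this toy input the first query has the same dot product with the first and third keys, so those two positions receive equal weight.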

Multi-Head Attention: Instead of performing a single attention function, multi-head attention employs $h$ parallel attention heads, each with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O}, \tag{3}$$

where $\mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$ and $W^{O}$ is the output projection matrix. This allows the model to jointly attend to information from different representation subspaces.
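As a shape check on Eq. (3), the dimensions can be written out explicitly. The per-head width is not stated above, so the following assumes the convention of the original Transformer [12]: each head projects the $n \times d$ inputs with $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in \mathbb{R}^{d \times d_{k}}$ and $d_{k} = d / h$.

$$\mathrm{head}_{i} \in \mathbb{R}^{n \times d_{k}}, \qquad \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) \in \mathbb{R}^{n \times h d_{k}}, \qquad W^{O} \in \mathbb{R}^{h d_{k} \times d}.$$

Under this assumption the output has the same shape $n \times d$ as the input. For instance, $d = 512$ and $h = 8$ give $d_{k} = 64$.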

Position Encoding: Since self-attention has no built-in notion of order (permuting the input positions simply permutes the outputs), positional information must be explicitly injected. The original Transformer uses sinusoidal position encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \tag{4}$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right), \tag{5}$$

where $pos$ is the position index and $i$ is the dimension index. These encodings are added to the input embeddings to provide sequential order information.
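Eqs. (4) and (5) translate directly into a small lookup table. The sketch below is a plain-Python illustration; the function name `positional_encoding` is our own, and it assumes an even embedding dimension so that sine and cosine entries pair up.

```python
import math

def positional_encoding(num_positions, d):
    """Return a num_positions x d table of sinusoidal encodings (d must be even)."""
    if d % 2 != 0:
        raise ValueError("d must be even so sine/cosine entries pair up")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            row[2 * i] = math.sin(angle)       # even dimensions, Eq. (4)
            row[2 * i + 1] = math.cos(angle)   # odd dimensions, Eq. (5)
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d=6)
```

Position 0 maps to alternating zeros and ones, since $\sin(0) = 0$ and $\cos(0) = 1$, while later positions rotate through each frequency at a different rate.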

3.2. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [17] consist of two neural networks trained in an adversarial manner: a generator G that learns to produce synthetic samples, and a discriminator D that learns to distinguish real samples from generated ones.

Adversarial Training: The generator $G$ takes random noise $z \sim p_{z}(z)$ as input and produces synthetic samples $x_{fake} = G(z)$. The discriminator $D$ outputs the probability that a given sample is real. The training objective is formulated as a minimax game:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))]. \tag{6}$$

The discriminator maximizes its ability to correctly classify real and fake samples, while the generator minimizes the discriminator’s ability to distinguish its outputs from real data.
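The value function in Eq. (6) can be estimated from a batch of discriminator outputs. The following sketch is illustrative only: `gan_value` is a hypothetical helper, and the probabilities are made-up numbers, not outputs of a trained model.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function from outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated samples well.
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this quantity up and the generator pushes it down. When the discriminator outputs 0.5 for every sample, the estimate equals $-\log 4$, which is lower than the value obtained by the confident discriminator.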

Conditional Tabular GAN (CTGAN): CTGAN [18] extends the basic GAN framework to handle the unique challenges of tabular data, including mixed data types and imbalanced categorical variables. For conditional generation targeting a specific class $c$, the generator receives both noise $z$ and class label $c$ as input:

$$x_{fake} = G(z, c). \tag{7}$$

CTGAN employs mode-specific normalization to handle multi-modal continuous distributions. For each continuous variable, a Gaussian mixture model is fitted, and the variable is transformed into a one-hot encoded mode indicator and a normalized scalar value. For categorical variables with imbalanced distributions, CTGAN uses conditional vector sampling to ensure all categories are adequately represented during training. The discriminator is also conditioned on the class label, learning to distinguish real samples of class $c$ from generated samples. To improve training stability for tabular data generation, we use a Wasserstein objective with gradient penalty:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}(x \mid c)}[D(x, c)] - \mathbb{E}_{z \sim p_{z}}[D(G(z, c), c)] - \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}, c) \rVert_{2} - 1\right)^{2}\right], \tag{8}$$

where $\lambda$ is the gradient penalty coefficient and $\hat{x}$ denotes samples interpolated between real and generated data points. The penalty enters with a negative sign because the discriminator is the maximizing player; equivalently, it is added to the loss that the discriminator minimizes. The gradient penalty enforces the Lipschitz constraint on the discriminator, which improves training stability and helps mitigate mode collapse in tabular data generation. This conditional framework enables CTGAN to generate synthetic samples that preserve important statistical properties and conditional structures of the target class, making it particularly suitable for generating fraudulent transaction samples in imbalanced fraud detection scenarios.
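The mode-specific normalization step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the mixture parameters are hypothetical and hard-coded instead of fitted, the most probable mode is selected deterministically (as we understand it, CTGAN samples the mode from the posterior), and the scaling by four standard deviations follows our reading of the CTGAN paper [18].

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mode_specific_normalize(x, weights, means, stds):
    """Encode scalar x as (one-hot mode indicator, scalar normalized within that mode)."""
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    k = max(range(len(densities)), key=densities.__getitem__)  # most probable mode
    one_hot = [1 if j == k else 0 for j in range(len(densities))]
    alpha = (x - means[k]) / (4.0 * stds[k])                   # assumed 4-sigma scaling
    return one_hot, alpha

# Hypothetical two-mode mixture for a transaction-amount column.
weights, means, stds = [0.7, 0.3], [20.0, 500.0], [5.0, 100.0]
one_hot, alpha = mode_specific_normalize(30.0, weights, means, stds)
```

An amount of 30 falls in the first mode and is encoded relative to that mode's mean and spread, so small and large transaction regimes are each normalized on their own scale.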

4. Methodology

In this section, we present our proposed framework that synergistically combines Time-Aware Transformer encoding with CTGAN-based generative oversampling for credit card fraud detection. The framework addresses two fundamental challenges: capturing complex temporal dependencies in transaction sequences and mitigating severe class imbalance through high-quality synthetic sample generation. We first describe the overall architecture, followed by detailed descriptions of the Time-Aware Transformer encoder and the CTGAN-based oversampling strategy, concluding with the integrated training pipeline.

4.1. Framework Overview

Our framework operates in two main stages: generative oversampling and sequential classification, as shown in Figure 1. In the first stage, CTGAN learns the complex conditional distribution of fraudulent transactions from the minority class and generates synthetic fraud samples to balance the training dataset. The motivation for this design is that traditional oversampling methods like SMOTE produce simplistic interpolations that fail to capture the multi-modal, high-dimensional distributions characteristic of real fraud patterns. By employing adversarial training, CTGAN learns the underlying data manifold and generates diverse, realistic samples that preserve critical statistical properties.

In the second stage, we construct temporal transaction sequences for each user by aggregating their historical transactions in chronological order. These sequences, now balanced through the inclusion of synthetic fraudulent transactions, are fed into a Time-Aware Transformer encoder that explicitly models both the sequential order and temporal intervals between transactions. The rationale behind this sequential formulation is that fraud detection should not treat transactions as isolated events but rather as manifestations of user behavior patterns that evolve over time. Legitimate users typically exhibit consistent spending patterns, while fraudulent activities often introduce abrupt deviations in transaction amounts, merchant categories, geographical locations, or temporal frequencies. The Time-Aware Transformer’s self-attention mechanism enables it to identify such anomalous subsequences within the broader context of user history.
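The sequence construction step can be sketched as a group-and-sort over transaction records. The field names `user_id` and `timestamp`, the helper `build_sequences`, and the optional `max_len` truncation are all assumptions made for illustration; how user identity and sequence length are defined on the IEEE-CIS data is not specified in this section.

```python
from collections import defaultdict

def build_sequences(transactions, max_len=None):
    """Group transactions by user and sort each group chronologically."""
    by_user = defaultdict(list)
    for tx in transactions:
        by_user[tx["user_id"]].append(tx)
    sequences = {}
    for user, txs in by_user.items():
        ordered = sorted(txs, key=lambda tx: tx["timestamp"])
        # Keep only the most recent max_len transactions when a limit is given.
        sequences[user] = ordered[-max_len:] if max_len else ordered
    return sequences

# Toy records with hypothetical fields and values.
txs = [
    {"user_id": "u1", "timestamp": 30, "amount": 5.0},
    {"user_id": "u2", "timestamp": 10, "amount": 7.5},
    {"user_id": "u1", "timestamp": 10, "amount": 99.0},
    {"user_id": "u1", "timestamp": 20, "amount": 12.0},
]
seqs = build_sequences(txs, max_len=2)
```

Sorting by timestamp before truncation ensures that each sequence ends with the user's most recent activity, which is the context the classifier needs when scoring the latest transaction.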

The synergy between these two components is crucial: CTGAN provides the model with sufficient fraudulent examples to learn discriminative patterns, while the Time-Aware Transformer leverages sequential context to distinguish subtle behavioral anomalies that point-wise classifiers would miss. The framework outputs a binary classification indicating whether a given transaction sequence contains fraudulent activity.

4.2. Time-Aware Transformer Encoder

The Time-Aware Transformer encoder extends the standard Transformer architecture [12] with explicit temporal awareness to capture the time-dependent nature of transaction sequences. Given a user’s transaction sequence $S = [(x_{1}, t_{1}), (x_{2}, t_{2}), \dots, (x_{n}, t_{n})]$, where $x_{i} \in \mathbb{R}^{d_{x}}$ represents the feature vector of the $i$-th transaction and $t_{i}$ denotes its timestamp, our goal is to learn a representation that encodes both transactional features and temporal dynamics.

4.2.1. Temporal Feature Embedding

We first construct a comprehensive embedding for each transaction that incorporates both raw features and temporal information. Each transaction’s feature vector $x_{i}$ contains attributes such as transaction amount, merchant category code, device information, and geographical indicators extracted from the IEEE-CIS dataset. We apply a linear projection to map these features to a higher-dimensional space:

$$f_{i} = W_{f} x_{i} + b_{f}, \tag{9}$$

where $W_{f} \in \mathbb{R}^{d_{model} \times d_{x}}$ and $b_{f} \in \mathbb{R}^{d_{model}}$ are learnable parameters, and $d_{model}$ is the model dimension.

From a modeling perspective, transaction histories in fraud detection can be viewed as irregular event sequences rather than uniformly sampled time series. The temporal spacing between transactions therefore contains important behavioral information. In particular, sudden bursts of activity or unusual temporal gaps often indicate deviations from normal spending behavior. To capture such irregular temporal dynamics, we explicitly model the time interval between consecutive transactions. The time difference $\Delta t_{i} = t_{i} - t_{i-1}$ captures the spending frequency, which is a critical indicator for fraud detection since fraudulent transactions often occur in rapid succession. We encode these intervals using a temporal embedding:

$$e_{\Delta t_{i}} = W_{t} \phi(\Delta t_{i}) + b_{t}, \tag{10}$$

where $\phi(\cdot)$ is a non-linear transformation function that maps time differences to a fixed-dimensional representation. Specifically, we use a logarithmic scaling combined with binning to handle the wide range of possible time intervals:

\begin{matrix} ϕ (Δ t) = [\mathbb{1}_{b_{1}} (Δ t), \mathbb{1}_{b_{2}} (Δ t), \dots, \mathbb{1}_{b_{m}} (Δ t), \log (1 + Δ t)], \end{matrix}

(11)

where

\mathbb{1}_{b_{j}} (Δ t)

are indicator functions for predefined time bins (e.g., within 1 h, 1–24 h, 1–7 days, etc.), and the logarithmic term captures fine-grained temporal variations. This design is motivated by the observation that fraud patterns exhibit different characteristics at different time scales: some fraud involves rapid consecutive transactions, while others maintain seemingly normal intervals to avoid detection. The temporal bin boundaries are chosen to reflect typical transaction time scales observed in credit card usage patterns, distinguishing rapid transaction bursts from normal daily or weekly spending intervals. Such discretization provides a simple yet effective way to capture behavioral differences across time scales while keeping the temporal representation compact.
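As a minimal sketch of Eq. (11), the snippet below maps a time gap to one-hot bin indicators followed by the logarithmic term. It uses the ten boundaries listed in the implementation details (Section 5.1.4). Several choices are assumptions of this sketch rather than details stated in the paper: gaps are measured in seconds, a month is taken as 30 days and a year as 365 days, each boundary is treated as the upper edge of a bin, and gaps longer than one year fall into the last bin. The function name `phi` is illustrative.

```python
import math
from bisect import bisect_left

HOUR = 3600.0
DAY = 24 * HOUR

# Upper bin edges in seconds (assumed unit): 1 h, 6 h, 1 d, 3 d, 1 w, 2 w,
# 1 month (30 d), 3 months, 6 months, 1 year (365 d).
BIN_EDGES = [1 * HOUR, 6 * HOUR, 1 * DAY, 3 * DAY, 7 * DAY,
             14 * DAY, 30 * DAY, 90 * DAY, 180 * DAY, 365 * DAY]


def phi(delta_t):
    """Return [bin indicators..., log(1 + delta_t)] for a non-negative gap."""
    if delta_t < 0:
        raise ValueError("time gaps must be non-negative")
    # Index of the first edge that is >= delta_t; longer gaps go to the last bin.
    idx = min(bisect_left(BIN_EDGES, delta_t), len(BIN_EDGES) - 1)
    indicators = [0.0] * len(BIN_EDGES)
    indicators[idx] = 1.0
    return indicators + [math.log1p(delta_t)]


if __name__ == "__main__":
    print(phi(30 * 60))   # a 30-minute gap activates the first bin
    print(phi(2 * DAY))   # a 2-day gap activates the "1-3 days" bin
```

The resulting vector has dimension m + 1 (here 11), which is the input size that the projection W_t in Eq. (10) would need to accept.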

We also incorporate absolute positional encoding to preserve the sequential order of transactions within the sequence. Following the Transformer architecture, we use sinusoidal position encodings:

\begin{matrix} p_{i} & = [\sin (i / 10000^{0 / d_{model}}), \cos (i / 10000^{0 / d_{model}}), \dots, \\ & \sin (i / 10000^{(d_{model} - 1) / d_{model}}), \cos (i / 10000^{(d_{model} - 1) / d_{model}})] . \end{matrix}

(12)–(13)

The final embedding for each transaction combines feature, temporal, and positional information:

\begin{matrix} h_{i}^{(0)} = f_{i} + e_{Δ t_{i}} + p_{i} . \end{matrix}

(14)

This additive composition allows the model to simultaneously reason about transaction content, temporal dynamics, and sequential order.
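The additive composition in Eq. (14) can be illustrated with a short sketch. The positional encoding below uses the exponent convention 2k/d_model from the original Transformer, which yields exactly d_model entries; this convention is an assumption of the sketch. The helper names `sinusoidal_position` and `combine_embeddings` are hypothetical.

```python
import math


def sinusoidal_position(i, d_model):
    """Sinusoidal encoding of position i with d_model entries (d_model even)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos entries")
    encoding = []
    for k in range(d_model // 2):
        angle = i / (10000 ** (2 * k / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


def combine_embeddings(f_i, e_dt, p_i):
    """Element-wise sum h_i = f_i + e_dt + p_i, as in Eq. (14)."""
    if not (len(f_i) == len(e_dt) == len(p_i)):
        raise ValueError("all three embeddings must share the model dimension")
    return [a + b + c for a, b, c in zip(f_i, e_dt, p_i)]


if __name__ == "__main__":
    p = sinusoidal_position(3, 8)
    # Placeholder feature and temporal embeddings with the same dimension.
    h0 = combine_embeddings([0.1] * 8, [0.2] * 8, p)
    print(len(h0))
```

Because the three terms are summed rather than concatenated, all of them must live in the same d_model-dimensional space, which is why both the feature and temporal projections map to d_model.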

4.2.2. Time-Aware Multi-Head Self-Attention

The core of our encoder is a modified multi-head self-attention mechanism that explicitly incorporates temporal awareness. Standard self-attention computes attention weights based solely on content similarity between query and key vectors. However, in fraud detection, the temporal relationship between transactions is equally important: a large transaction immediately following another large transaction may indicate fraud, while the same pair separated by weeks may be legitimate.

We augment the attention computation with temporal bias terms that modulate attention weights based on time intervals. For the l-th layer, we compute queries, keys, and values as:

\begin{matrix} Q^{(l)} = H^{(l - 1)} W_{Q}^{(l)}, K^{(l)} = H^{(l - 1)} W_{K}^{(l)}, V^{(l)} = H^{(l - 1)} W_{V}^{(l)}, \end{matrix}

(15)

where

H^{(l - 1)} \in R^{n \times d_{model}}

is the output of the previous layer.

The temporal bias is computed based on the time difference between transactions i and j:

\begin{matrix} b_{i j} = w_{b}^{T} \tanh (W_{b} ϕ (| t_{i} - t_{j} |)), \end{matrix}

(16)

where

W_{b}

and

w_{b}

are learnable parameters. This bias allows the model to learn to attend more strongly to temporally proximate transactions, which is crucial for detecting fraud patterns that unfold over short time windows.

The time-aware attention for each head h is computed as:

\begin{matrix} {Attention}_{h} (Q, K, V, B) = softmax (\frac{Q_{h} K_{h}^{T}}{\sqrt{d_{k}}} + B) V_{h}, \end{matrix}

(17)

where

B \in R^{n \times n}

is the temporal bias matrix with entries

b_{i j}

. The outputs from all H heads are concatenated and linearly transformed:

\begin{matrix} MultiHead (Q, K, V, B) = Concat ({Attention}_{1}, \dots, {Attention}_{H}) W_{O} . \end{matrix}

(18)

Following the attention sublayer, we apply layer normalization and a position-wise feed-forward network:

\begin{matrix} H^{(l)} = LayerNorm (H^{(l - 1)} + FFN (LayerNorm (H^{(l - 1)} + MultiHead (\cdot)))), \end{matrix}

(19)

where

FFN (x) = \max (0, x W_{1} + b_{1}) W_{2} + b_{2}

is a two-layer feed-forward network with ReLU activation.
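To make the effect of the additive bias in Eq. (17) concrete, the following single-head sketch adds a bias matrix B to the scaled dot-product scores before the softmax. In the model the bias is learned through Eq. (16); here a hand-written stand-in, the negative absolute time difference, is used purely for illustration. Function names and the toy timestamps are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_aware_attention(Q, K, V, B):
    """Single-head attention: softmax(Q K^T / sqrt(d_k) + B) V."""
    d_k = len(K[0])
    weights, outputs = [], []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + B[i][j]
            for j, k in enumerate(K)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


if __name__ == "__main__":
    timestamps = [0.0, 1.0, 10.0]                      # toy values
    # Stand-in bias: closer in time means a larger (less negative) bias.
    B = [[-abs(a - b) for b in timestamps] for a in timestamps]
    Q = K = [[0.0, 0.0] for _ in timestamps]           # identical content
    V = [[1.0], [2.0], [3.0]]
    _, weights = time_aware_attention(Q, K, V, B)
    print(weights[0])
```

With identical queries and keys, the content term is constant, so the attention weights are determined entirely by B: the first transaction attends most to itself, then to the transaction one time unit away, and least to the distant one.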

4.2.3. Sequence-Level Classification

After L transformer layers, we obtain contextualized representations for all transactions in the sequence

H^{(L)} = [h_{1}^{(L)}, h_{2}^{(L)}, \dots, h_{n}^{(L)}]

. To perform sequence-level classification, we aggregate these representations with a learned attention-pooling mechanism. Because the pooling weights are computed from the contextualized representations, the model can emphasize the most informative transactions, which in fraud detection are often the most recent ones. We compute attention weights over the sequence:

\begin{matrix} α_{i} = \frac{\exp (w_{a}^{T} \tanh (W_{a} h_{i}^{(L)}))}{\sum_{j = 1}^{n} \exp (w_{a}^{T} \tanh (W_{a} h_{j}^{(L)}))}, \end{matrix}

(20)

and aggregate the representations:

\begin{matrix} s = \sum_{i = 1}^{n} α_{i} h_{i}^{(L)} . \end{matrix}

(21)

The final classification is performed through a multi-layer perceptron:

\begin{matrix} \hat{y} = σ (W_{c} ReLU (W_{s} s + b_{s}) + b_{c}), \end{matrix}

(22)

where

σ

is the sigmoid function and

\hat{y}

represents the predicted probability of fraud.
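Equations (20)–(22) can be sketched in a few lines of plain Python. The weights below are small hand-picked matrices rather than trained parameters, and the function names `attention_pool` and `classify` are hypothetical; the sketch only illustrates the order of operations.

```python
import math


def attention_pool(H, W_a, w_a):
    """Return (s, alpha): pooled vector and softmax weights, Eqs. (20)-(21)."""
    scores = []
    for h in H:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W_a]
        scores.append(sum(a * b for a, b in zip(w_a, hidden)))
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    s = [sum(alpha[i] * H[i][c] for i in range(len(H))) for c in range(len(H[0]))]
    return s, alpha


def classify(s, W_s, b_s, W_c, b_c):
    """Sigmoid(W_c ReLU(W_s s + b_s) + b_c), Eq. (22), for a scalar output."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, s)) + b)
        for row, b in zip(W_s, b_s)
    ]
    logit = sum(w * h for w, h in zip(W_c, hidden)) + b_c
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0]]            # two contextualized transactions
    W_a = [[1.0, 0.0], [0.0, 1.0]]          # toy projection
    w_a = [1.0, 0.0]                        # scores only the first coordinate
    s, alpha = attention_pool(H, W_a, w_a)
    print(alpha, classify(s, W_a, [0.0, 0.0], [1.0, -1.0], 0.0))
```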

4.3. CTGAN-Based Generative Oversampling

The severe class imbalance in fraud detection datasets poses significant challenges for training effective classifiers. Traditional oversampling methods generate synthetic samples through simple interpolation, which fails to capture the complex distributions of fraudulent transactions. We employ CTGAN [18] to learn the intricate conditional distribution of fraud samples and generate high-quality synthetic transactions.

From a probabilistic perspective, the generative model approximates the conditional distribution

p (x ∣ y = fraud)

of minority-class transactions. By learning this distribution from observed fraud samples, CTGAN can generate synthetic instances that preserve joint statistical dependencies among heterogeneous tabular variables, including both numerical and categorical attributes. This distributional modeling capability is particularly important in fraud detection, where fraudulent patterns often arise from complex interactions among multiple transaction features rather than simple local variations that can be captured by interpolation-based oversampling methods.

4.3.1. Mode-Specific Normalization

Credit card transaction data exhibits multi-modal continuous distributions. For example, transaction amounts may cluster around common price points such as gas station fills, grocery purchases, or online subscription fees. Standard normalization techniques like min–max scaling or z-score normalization fail to preserve these multi-modal characteristics, leading to poor generative quality.

CTGAN addresses this through mode-specific normalization. For each continuous variable x, we fit a Gaussian Mixture Model with K components:

\begin{matrix} p (x) = \sum_{k = 1}^{K} π_{k} N (x | μ_{k}, σ_{k}^{2}), \end{matrix}

(23)

where

π_{k}

,

μ_{k}

, and

σ_{k}

are the mixture weight, mean, and standard deviation of the k-th component. In our implementation, K is treated as a fixed upper bound shared across continuous features for modeling consistency. However, the effective number of active modes may vary by feature, since components with negligible mixture weights contribute little to the final representation. Thus, K provides sufficient flexibility for complex multi-modal variables without forcing all features to use the same effective level of complexity. Each value x is then represented by a tuple

(m, v)

where m is a one-hot vector indicating the most likely mode:

\begin{matrix} m = \arg \max_{k} \frac{π_{k} N (x | μ_{k}, σ_{k}^{2})}{\sum_{j = 1}^{K} π_{j} N (x | μ_{j}, σ_{j}^{2})}, \end{matrix}

(24)

and v is the normalized value within that mode:

\begin{matrix} v = \frac{x - μ_{m}}{σ_{m}} . \end{matrix}

(25)

This transformation preserves the multi-modal structure during generation. When sampling from the generator, we first read the selected mode from the generated one-hot vector and then denormalize the continuous value using that mode’s parameters, i.e., x = v σ_{m} + μ_{m}.
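The encoding in Eqs. (24) and (25) and its inverse can be sketched as follows, assuming the mixture parameters have already been fitted. The two-component mixture for transaction amounts is made up for illustration, and the function names are hypothetical. Since the denominator in Eq. (24) is the same for every component, the sketch takes the argmax over the numerators only.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def encode(x, weights, means, stds):
    """Return (one-hot mode, normalized value) for a continuous value x."""
    numerators = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    k = max(range(len(numerators)), key=numerators.__getitem__)
    one_hot = [1 if j == k else 0 for j in range(len(numerators))]
    v = (x - means[k]) / stds[k]
    return one_hot, v


def decode(one_hot, v, means, stds):
    """Invert the encoding: x = v * sigma_m + mu_m for the selected mode m."""
    k = one_hot.index(1)
    return v * stds[k] + means[k]


if __name__ == "__main__":
    # Illustrative mixture: small purchases near 10, larger ones near 100.
    weights, means, stds = [0.5, 0.5], [10.0, 100.0], [2.0, 10.0]
    mode, v = encode(95.0, weights, means, stds)
    print(mode, v, decode(mode, v, means, stds))
```

The round trip recovers the original amount, which is the property the generator relies on when its outputs are mapped back to the data space.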

4.3.2. Conditional Vector Sampling

Fraudulent transactions often exhibit highly imbalanced categorical attributes. For instance, certain merchant category codes or device types may be heavily associated with fraud but represent only a small fraction of the overall data. To ensure comprehensive coverage during training, CTGAN employs conditional vector sampling.

For each categorical variable with categories

{c_{1}, c_{2}, \dots, c_{M}}

, we construct a conditional vector

cond \in {0, 1}^{M}

that specifies which category should be generated. During training, we sample categories with probability proportional to their log-frequency:

\begin{matrix} p (c_{i}) \propto \log (freq (c_{i}) + 1), \end{matrix}

(26)

where

freq (c_{i})

is the frequency of category

c_{i}

in the training data. This logarithmic weighting ensures that rare categories receive sufficient training signal without completely ignoring frequent categories.
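A short sketch of the log-frequency weighting in Eq. (26) shows how strongly it lifts rare categories. The category labels and counts below are invented for illustration, and `log_frequency_probs` is a hypothetical helper name.

```python
import math
from collections import Counter


def log_frequency_probs(values):
    """Sampling probabilities proportional to log(freq(c) + 1), as in Eq. (26)."""
    counts = Counter(values)
    weights = {c: math.log(n + 1) for c, n in counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


if __name__ == "__main__":
    # Toy column: one frequent and one rare category.
    column = ["common"] * 1000 + ["rare"] * 10
    probs = log_frequency_probs(column)
    print(probs)
```

In this toy column the rare category makes up about 1% of the rows but is selected roughly a quarter of the time, while the frequent category remains the more likely choice.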

In our framework, this conditional sampling mechanism generates individual synthetic fraudulent transactions rather than complete transaction sequences. Each synthetic sample follows the same tabular schema as the original data, including the attributes later used for sequence construction, such as card-related identifiers and the transaction timestamp feature. After generation, the synthetic fraudulent transactions are merged with the real training transactions, and the augmented data are reorganized into sequences using the same procedure applied to the original dataset: transactions are grouped by card identifiers, sorted chronologically according to TransactionDT, and the temporal gaps between consecutive transactions are recomputed. The Time-Aware Transformer therefore operates on reconstructed chronological sequences in the augmented training set, not on sequences generated directly by CTGAN. This reconstruction ensures that every sequence is chronologically ordered with consistently recomputed time gaps. However, because each synthetic transaction is generated independently, the generator itself does not explicitly enforce sequence-level temporal coherence, which remains an important direction for future work.

4.3.3. Generator and Discriminator Architecture

The generator G takes as input a random noise vector

z \sim N (0, I)

and a conditional vector

cond

specifying the target class (fraud) and categorical constraints. The generator outputs a synthetic transaction:

\begin{matrix} \tilde{x} = G (z, cond) . \end{matrix}

(27)

The discriminator D receives both real and generated transactions along with their conditional vectors and learns to distinguish between them:

\begin{matrix} D (x, cond) \to [0, 1] . \end{matrix}

(28)

The training objective combines adversarial loss with conditional consistency:

\begin{matrix} L_{D} & = - E_{x \sim p_{fraud}} [\log D (x, cond)] - E_{z \sim p_{z}} [\log (1 - D (G (z, cond), cond))], \end{matrix}

(29)

\begin{matrix} L_{G} & = - E_{z \sim p_{z}} [\log D (G (z, cond), cond)] . \end{matrix}

(30)

To stabilize training, we employ gradient penalty regularization following the Wasserstein GAN framework. The discriminator loss is augmented with:

\begin{matrix} L_{G P} = λ E_{\hat{x}} [(∥ \nabla_{\hat{x}} D (\hat{x}, cond) ∥_{2} - 1)^{2}], \end{matrix}

(31)

where

\hat{x} = ϵ x + (1 - ϵ) G (z, cond)

is a random interpolation between real and generated samples, and

λ

controls the penalty strength.

4.4. Integrated Training Pipeline

The complete framework integrates CTGAN-based oversampling with Time-Aware Transformer classification through a two-stage training procedure. In the first stage, we train CTGAN exclusively on fraudulent transactions to learn their distribution. Once CTGAN converges, we generate synthetic fraud samples to balance the training dataset. The number of synthetic samples is determined to achieve a target fraud ratio, typically between 30 and 50% of the total training set, which we find provides sufficient signal without overwhelming the model with purely synthetic data.
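The number of synthetic samples implied by a target fraud ratio ρ follows from requiring (n_fraud + s) / (n_legit + n_fraud + s) ≥ ρ, which gives s ≥ (ρ n_legit − (1 − ρ) n_fraud) / (1 − ρ). The sketch below evaluates this bound with exact rational arithmetic. The example plugs in the full-dataset counts from Section 5.1.1 purely for illustration; in the framework the calculation applies to the training split, whose class counts are not reported here. The function name is hypothetical.

```python
import math
from fractions import Fraction


def synthetic_fraud_needed(n_legit, n_fraud, target_ratio):
    """Smallest number of synthetic fraud rows giving a fraud share >= target_ratio."""
    rho = Fraction(str(target_ratio))  # exact value, avoids float error in ceil
    if not 0 < rho < 1:
        raise ValueError("target_ratio must lie strictly between 0 and 1")
    needed = (rho * n_legit - (1 - rho) * n_fraud) / (1 - rho)
    return max(0, math.ceil(needed))


if __name__ == "__main__":
    n_legit, n_fraud = 590_540 - 20_663, 20_663   # full-dataset counts, illustration only
    n_syn = synthetic_fraud_needed(n_legit, n_fraud, 0.40)
    print(n_syn)                                   # 359255
    print((n_fraud + n_syn) / (n_legit + n_fraud + n_syn))
```

At a 40% target, synthetic fraud samples would outnumber real fraud samples by more than an order of magnitude under these counts, which illustrates why the ratio has to be chosen carefully.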

In the second stage, we construct temporal transaction sequences by grouping transactions by card identifier, which we use as a proxy for the user, and sorting them chronologically. For users with fraudulent transactions (both real and synthetic), we create sequences that include contextual legitimate transactions preceding the fraud event, simulating realistic fraud scenarios. For legitimate users, we randomly sample transaction windows. This sequence construction is motivated by the need to provide the model with realistic behavioral context: fraudulent transactions rarely occur in isolation but are preceded by legitimate usage patterns that make the anomaly detectable.

The Time-Aware Transformer is then trained on these sequences using binary cross-entropy loss with class weights to account for any remaining imbalance after oversampling:

\begin{matrix} L_{cls} = - \frac{1}{N} \sum_{i = 1}^{N} [w_{fraud} \cdot y_{i} \log ({\hat{y}}_{i}) + w_{legit} \cdot (1 - y_{i}) \log (1 - {\hat{y}}_{i})], \end{matrix}

(32)

where

w_{fraud}

and

w_{legit}

are class weights inversely proportional to class frequencies in the balanced dataset.
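The weighted loss in Eq. (32) can be written out directly. The exact normalization of the class weights is not specified in the text; the sketch below assumes the common heuristic w_c = N / (2 N_c), and clips probabilities to avoid log(0). Both choices, as well as the function names, are assumptions of this illustration.

```python
import math


def inverse_frequency_weights(y_true):
    """Return (w_fraud, w_legit) with w_c = N / (2 * N_c) (assumed normalization)."""
    n = len(y_true)
    n_fraud = sum(y_true)
    n_legit = n - n_fraud
    if n_fraud == 0 or n_legit == 0:
        raise ValueError("both classes must be present")
    return n / (2 * n_fraud), n / (2 * n_legit)


def weighted_bce(y_true, y_prob, w_fraud, w_legit, eps=1e-12):
    """Class-weighted binary cross-entropy averaged over N samples, Eq. (32)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithms finite
        total += w_fraud * y * math.log(p) + w_legit * (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    labels = [1, 0, 0, 0]
    probs = [0.7, 0.2, 0.1, 0.4]
    w_fraud, w_legit = inverse_frequency_weights(labels)
    print(w_fraud, w_legit, weighted_bce(labels, probs, w_fraud, w_legit))
```

With unit weights the expression reduces to the standard binary cross-entropy, so the weights only matter to the extent that the classes remain imbalanced after oversampling.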

The complete training algorithm is summarized in Algorithm 1.

Algorithm 1 Time-Aware Transformer with CTGAN Training.

1: Input: Transaction dataset $D = \{(x_{i}, y_{i}, t_{i}, u_{i})\}$, target fraud ratio $ρ$
2: Output: Trained Time-Aware Transformer model $f_{θ}$
3: // Stage 1: Generative Oversampling
4: Extract fraud transactions: $D_{fraud} = \{(x_{i}, t_{i}) \mid y_{i} = 1\}$
5: Train CTGAN on $D_{fraud}$ until convergence
6: Generate synthetic fraud samples $D_{synthetic}$ to achieve ratio $ρ$
7: $D_{balanced} = D \cup D_{synthetic}$
8: // Stage 2: Sequential Modeling
9: Group transactions in $D_{balanced}$ by user: $S_{u} = \{(x_{i}, t_{i}, y_{i}) \mid u_{i} = u\}$ for each user u
10: for each user u with fraud transactions do
11:     Create sequence with context: $S_{u} = [(x_{i - k}, t_{i - k}), \dots, (x_{i}, t_{i})]$
12:     Label sequence: $y_{S_{u}} = 1$ if contains fraud, else 0
13: end for
14: for each user u without fraud do
15:     Sample random transaction window: $S_{u}$, label $y_{S_{u}} = 0$
16: end for
17: Initialize Time-Aware Transformer parameters $θ$
18: while not converged do
19:     Sample batch of sequences $\{S_{j}, y_{S_{j}}\}$
20:     Compute predictions ${\hat{y}}_{j} = f_{θ} (S_{j})$
21:     Update $θ$ to minimize $L_{cls}$
22: end while
23: return $f_{θ}$

This integrated approach leverages the complementary strengths of both components: CTGAN provides diverse, high-quality fraudulent samples that enrich the training distribution, while the Time-Aware Transformer exploits sequential context to identify subtle anomalies that distinguish fraud from legitimate behavior patterns. Because the two stages are trained separately, either one can be retrained as new fraud patterns emerge, making the framework adaptable to the evolving nature of financial fraud.

5. Experiments

In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed framework. We first describe the dataset and experimental setup, then present comparative results against baseline methods, followed by ablation studies and detailed analysis of key components.

5.1. Experiment Setup

5.1.1. Dataset Description

We evaluate our framework on the IEEE-CIS Fraud Detection dataset (https://www.kaggle.com/c/ieee-fraud-detection/data (accessed on 1 February 2026)) released through the IEEE Computational Intelligence Society (IEEE-CIS) fraud detection competition on Kaggle. The dataset was provided by Vesta Corporation (Weatogue, CT, USA), a payment services company, and contains real-world online transaction records collected from an e-commerce fraud prevention system. Although many variables have been anonymized to protect sensitive financial information and user privacy, the dataset originates from real transaction logs rather than synthetic or generative data. It represents one of the most comprehensive and realistic credit card fraud detection benchmarks available. The dataset comprises two main tables: transaction information and identity information.

The transaction table contains 590,540 records and 394 columns in the raw release, including transaction-related attributes, engineered variables, identifier fields, and the target label. Some repository descriptions report 393 transaction features depending on whether the target label is counted separately; in this work, we refer to the raw transaction table as released, which contains 394 columns including the label field. Key features include transaction amount (TransactionAmt), product code (ProductCD), transaction date and time (TransactionDT), and various anonymized categorical and numerical features labeled as C1–C14 (counting features), D1–D15 (time delta features), M1–M9 (match features), and V1–V339 (Vesta engineered features). The identity table contains 144,233 records and 41 columns, including TransactionID and identity-related attributes that provide additional information about user devices and digital signatures, such as DeviceType, DeviceInfo, and identity verification features (id_01 to id_38). Among these identity features, some variables are numerical while others are treated as categorical attributes during preprocessing.

The dataset exhibits severe class imbalance with only 20,663 fraudulent transactions out of 590,540 total transactions, resulting in a fraud ratio of approximately 3.5%. This imbalance ratio is realistic for production fraud detection systems but poses significant challenges for model training. We merge the transaction and identity tables using TransactionID as the key, resulting in a unified dataset where each transaction contains both transactional and identity features when available. For reproducibility, we note that the raw feature counts refer to the original released columns before preprocessing. After merging the two tables, identifier fields and the target label are excluded from the model input, categorical attributes are encoded using standard categorical encoding, numerical missing values are imputed with the mean, and categorical missing values are imputed with the mode. The resulting processed feature matrix is then used as the input to both the CTGAN module and the sequential classifier.

For temporal sequence construction, we group transactions by card identifiers (derived from card1–card6 features) and sort them chronologically using TransactionDT. The average sequence length is 12.4 transactions per card, with a standard deviation of 8.7, minimum of 1, and maximum of 156 transactions. We filter out cards with fewer than 3 transactions to ensure sufficient sequential context. The final dataset comprises 89,432 transaction sequences, with 8941 sequences containing at least one fraudulent transaction (10% sequence-level fraud ratio).
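The sequence construction described above can be sketched with plain dictionaries. `TransactionDT` and `isFraud` are column names from the dataset; the key `card_id` is a hypothetical stand-in for the identifier derived from card1–card6. Two further choices are assumptions of this sketch: truncation keeps the most recent 50 transactions, and the first transaction of each sequence receives a time gap of 0.

```python
from collections import defaultdict


def build_sequences(transactions, min_len=3, max_len=50):
    """Group by card, sort by time, drop short histories, recompute time gaps."""
    by_card = defaultdict(list)
    for tx in transactions:
        by_card[tx["card_id"]].append(tx)

    sequences = {}
    for card, txs in by_card.items():
        if len(txs) < min_len:
            continue  # not enough sequential context
        ordered = sorted(txs, key=lambda t: t["TransactionDT"])[-max_len:]
        deltas = [0] + [
            later["TransactionDT"] - earlier["TransactionDT"]
            for earlier, later in zip(ordered, ordered[1:])
        ]
        sequences[card] = {
            "transactions": ordered,
            "deltas": deltas,
            "label": int(any(t["isFraud"] for t in ordered)),
        }
    return sequences


if __name__ == "__main__":
    toy = [
        {"card_id": "A", "TransactionDT": 300, "isFraud": 0},
        {"card_id": "A", "TransactionDT": 100, "isFraud": 0},
        {"card_id": "A", "TransactionDT": 160, "isFraud": 1},
        {"card_id": "B", "TransactionDT": 50, "isFraud": 0},
        {"card_id": "B", "TransactionDT": 70, "isFraud": 0},
    ]
    print(build_sequences(toy))
```

Applying the same function to the augmented training set is what recomputes the gaps after synthetic transactions have been merged in.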

We split the data temporally to simulate realistic deployment scenarios where models are trained on historical data and evaluated on future transactions. Specifically, we use the first 70% of transactions (chronologically) for training, 10% for validation, and the remaining 20% for testing. This temporal split ensures that the model does not have access to future information during training, reflecting real-world constraints where fraud patterns may evolve over time. While it is possible to construct synthetic fraud datasets using transaction simulators or generative models, such datasets may not fully capture the complex statistical properties of real financial systems. Therefore, we focus on evaluating the proposed framework using the publicly available IEEE-CIS dataset, which provides anonymized real-world transaction records.
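A chronological 70/10/20 split can be sketched as follows. The percentages are passed as integers so that the split sizes are computed with exact integer arithmetic; the function name and the toy timestamps are illustrative.

```python
def temporal_split(transactions, train_pct=70, val_pct=10, time_key="TransactionDT"):
    """Split chronologically into train / validation / test partitions."""
    if train_pct + val_pct >= 100:
        raise ValueError("train and validation shares must leave room for a test set")
    ordered = sorted(transactions, key=lambda t: t[time_key])
    n = len(ordered)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    toy = [{"TransactionDT": t} for t in [50, 10, 90, 30, 70, 20, 100, 40, 80, 60]]
    train, val, test = temporal_split(toy)
    print(len(train), len(val), len(test))
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information reaches the model during training.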

5.1.2. Baseline Methods

We compare our proposed framework against three categories of baselines: traditional machine learning methods, deep learning approaches without sequential modeling, and existing sequential models.

Traditional Machine Learning: We implement Logistic Regression (LR) as a linear baseline, Random Forest (RF) [5] with 200 trees as a strong ensemble baseline, and XGBoost (XGB) [53] as a representative gradient boosting method. For these methods, we engineer aggregate features from transaction sequences including statistical summaries (mean, standard deviation, minimum, maximum of transaction amounts), temporal features (average time interval, transaction frequency), and recency features (features from the most recent transaction). We apply SMOTE [13] oversampling to handle class imbalance for fair comparison.

Deep Learning without Sequences: We implement a Multi-Layer Perceptron (MLP) with three hidden layers as a neural baseline operating on engineered aggregate features. We also implement CTGAN + MLP, which uses our CTGAN oversampling strategy combined with the MLP classifier to isolate the contribution of generative oversampling from sequential modeling.

Sequential Models: We compare against LSTM [6], a widely used recurrent architecture for sequence modeling in fraud detection, and vanilla Transformer without temporal awareness to demonstrate the importance of our time-aware modifications. We also implement GRU [8] as an alternative recurrent baseline and Temporal Convolutional Network (TCN) [54] to represent convolutional approaches to sequential modeling.

All baseline methods are carefully tuned using grid search on the validation set to ensure fair comparison. For methods requiring class imbalance handling, we experiment with both oversampling (SMOTE) and class weighting, selecting the better-performing strategy for each method. We selected these baselines to cover classical tabular classifiers, oversampling-based neural models, and supervised sequential architectures under a unified experimental pipeline. We note, however, that the present evaluation does not include several strong contemporary alternatives, including graph neural network-based fraud detection, self-supervised or contrastive sequence models, and diffusion-based tabular generators. Accordingly, we interpret our empirical findings within this evaluation scope and leave broader benchmarking against these methods to future work.

5.1.3. Evaluation Metrics

Given the severe class imbalance in fraud detection, accuracy is insufficient as a sole metric since a trivial classifier that labels all transactions as legitimate would achieve over 96% accuracy. We therefore employ multiple complementary metrics that capture different aspects of model performance.

We report Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which measures the model’s ability to rank fraudulent transactions higher than legitimate ones across all possible classification thresholds. This metric is particularly valuable for fraud detection where the operating threshold may be adjusted based on business requirements. We also compute Area Under the Precision–Recall Curve (AUC-PR), which is more informative than AUC-ROC for highly imbalanced datasets as it focuses on the positive class performance.

For threshold-dependent metrics, we report Precision (the fraction of predicted fraud cases that are actual fraud), Recall (the fraction of actual fraud cases that are detected), and F1-Score (the harmonic mean of precision and recall). We select the classification threshold on the validation set to maximize F1-score, which balances precision and recall. Additionally, we report False Positive Rate (FPR) at 90% recall, which is a critical operational metric indicating how many false alarms the system generates while maintaining high fraud detection coverage.
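The two threshold-related procedures, selecting the F1-maximizing threshold and reading off the FPR at a target recall, can be sketched without external libraries. The scores below are toy values, the function names are hypothetical, and a sample is predicted as fraud when its score is at least the threshold (an assumed convention).

```python
def f1_at(y_true, y_prob, threshold):
    """F1-score when predicting fraud for scores >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(y_true, y_prob):
    """Candidate threshold (taken from the observed scores) with the highest F1."""
    return max(sorted(set(y_prob)), key=lambda t: f1_at(y_true, y_prob, t))


def fpr_at_recall(y_true, y_prob, target_recall=0.9):
    """FPR at the highest threshold whose recall reaches target_recall."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        if tp / n_pos >= target_recall:
            return fp / n_neg
    return 1.0


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0]
    scores = [0.9, 0.6, 0.7, 0.2, 0.1]
    print(best_f1_threshold(labels, scores), fpr_at_recall(labels, scores))
```

In practice the threshold would be chosen on validation scores and then applied unchanged to the test set, as described above.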

All experiments are repeated with 5 different random seeds to account for stochasticity in model initialization and training. We report mean performance with standard deviations to provide statistically meaningful comparisons.

5.1.4. Implementation Details

Our framework is implemented in PyTorch 2.0. The Time-Aware Transformer encoder consists of 4 layers with model dimension

d_{model} = 256

, 8 attention heads, and feed-forward dimension of 1024. We use dropout with rate 0.1 after each sublayer for regularization. The temporal encoding uses 10 time bins with boundaries at 1 h, 6 h, 1 day, 3 days, 1 week, 2 weeks, 1 month, 3 months, 6 months, and 1 year. Sequences are padded or truncated to a maximum length of 50 transactions.

For CTGAN, we use a generator with 2 residual blocks of dimension 256 and a discriminator with 2 layers of dimension 256. We train CTGAN for 300 epochs using the Adam optimizer with learning rate

2 \times 10^{- 4}

and batch size 500. The gradient penalty coefficient is set to 10. We generate synthetic fraud samples to achieve a 40% fraud ratio in the balanced training set, which we found optimal through validation set tuning.

The Time-Aware Transformer is then trained on the augmented sequences using the Adam optimizer with learning rate

1 \times 10^{- 4}

, batch size 32, and early stopping with patience of 10 epochs based on validation AUC-ROC. We use class weights inversely proportional to class frequencies in the balanced dataset. Training converges within 50–80 epochs on average. All experiments are conducted on NVIDIA A100 GPUs with 40 GB memory.

5.2. Main Results

Table 1 presents a comprehensive comparison between our proposed framework and all baseline methods. Our Time-Aware Transformer with CTGAN (TAT-CTGAN) achieves the best performance across all metrics, demonstrating the effectiveness of combining generative oversampling with temporal sequential modeling.

Our method achieves an AUC-ROC of 0.982, significantly outperforming the vanilla Transformer (0.956) and the best RNN baseline LSTM (0.931). The improvement is even more pronounced in AUC-PR (0.911 vs 0.834), demonstrating superior precision–recall trade-offs. The F1-score of 0.883 represents a substantial improvement over XGBoost (0.775), the best traditional method, indicating that our approach better balances precision and recall.

Notably, our method achieves an FPR of only 3.8% at 90% recall, which is critically important for production deployment where excessive false alarms can overwhelm fraud investigation teams. The vanilla Transformer achieves 6.1% FPR at the same recall level, while LSTM reaches 8.9%, demonstrating the practical advantage of our time-aware design and generative oversampling strategy.

The comparison between CTGAN + MLP (0.903 AUC-ROC) and MLP + SMOTE (0.874 AUC-ROC) isolates the contribution of CTGAN oversampling, showing a 2.9 percentage point improvement. Similarly, comparing vanilla Transformer (0.956) with LSTM (0.931) demonstrates the advantage of attention mechanisms over recurrent architectures. Our full framework combines both advantages, achieving synergistic improvements beyond either component alone.

The relatively low standard deviations across all metrics (typically less than 1% of the mean) indicate that our results are robust to random initialization and training stochasticity. Traditional methods show slightly higher variance, particularly in precision and recall, suggesting greater sensitivity to the randomness introduced across the five seeded runs.

We further evaluate the sensitivity of the framework to key hyperparameters, including the number of temporal bins, Transformer embedding dimension, and GAN training iterations. The results indicate that the proposed framework remains stable across a reasonable range of parameter settings.

5.3. Ablation Study

To understand the individual contributions of our framework’s components, we conduct comprehensive ablation studies. Table 2 presents results for various configurations.

Replacing CTGAN with SMOTE reduces AUC-ROC from 0.982 to 0.961, indicating that CTGAN-based oversampling is more effective than interpolation-based oversampling in our downstream fraud detection pipeline. The gap is even larger in AUC-PR (6.2 percentage points), indicating that CTGAN-generated samples particularly improve precision. Removing oversampling entirely drops performance to 0.943 AUC-ROC, demonstrating that some form of class balancing is essential for optimal performance.

Removing temporal encoding decreases AUC-ROC by 1.4 percentage points to 0.968, showing that explicit time interval modeling is valuable for capturing fraud patterns. The temporal bias in attention contributes an additional 0.9 percentage points, indicating that allowing the model to modulate attention based on temporal proximity helps focus on relevant transaction windows. Interestingly, removing positional encoding has a larger impact (2.3 percentage point drop) than removing temporal encoding, suggesting that sequential order is even more critical than absolute time intervals. This makes intuitive sense: the sequence of events matters more than their exact timing.

Using standard attention without any temporal modifications reduces performance to 0.956, essentially matching the vanilla Transformer baseline. This confirms that our time-aware modifications—temporal encoding, temporal bias, and positional encoding—collectively contribute the improvement over standard Transformers.

Architecture depth analysis shows that reducing to 2 layers decreases AUC-ROC by 1.1 percentage points, indicating that deeper representations are beneficial. However, increasing to 6 layers provides only marginal additional improvement (0.2 percentage points) while significantly increasing computational cost, suggesting that 4 layers represents a good trade-off between performance and efficiency.

5.4. Impact of Synthetic Sample Ratio

We investigate how the ratio of synthetic fraud samples affects model performance. Figure 2 shows performance curves as we vary the target fraud ratio in the balanced dataset from 10% (minimal oversampling) to 70% (aggressive oversampling).

Performance improves rapidly as we increase the fraud ratio from 10% to 40%, with AUC-ROC rising from 0.951 to 0.982 and F1-score increasing from 0.846 to 0.883. This demonstrates that the model benefits substantially from additional synthetic fraud examples up to a certain point. However, beyond 40–50% fraud ratio, performance plateaus and even slightly decreases at 70% fraud ratio (0.979 AUC-ROC, 0.878 F1-score).

This pattern suggests an optimal balance between synthetic and real samples. With too few synthetic samples, the model still suffers from class imbalance and cannot learn the full diversity of fraud patterns. With too many synthetic samples, the model may overfit to the synthetic data distribution, which inevitably contains some artifacts from the generative process. The plateau around 40–50% indicates that at this ratio, the model has sufficient fraud examples to learn robust patterns without being dominated by synthetic artifacts.

Interestingly, precision shows more sensitivity to the synthetic ratio than recall. At 70% fraud ratio, precision drops to 0.872 while recall remains at 0.881, suggesting that excessive synthetic samples may introduce false patterns that increase false positive rates. This reinforces the importance of carefully tuning the oversampling ratio based on validation set performance rather than maximizing fraud representation.

5.5. Sequence Length Analysis

Transaction sequence length varies widely across users, ranging from 3 to 156 transactions in our dataset. We analyze how sequence length affects detection performance by stratifying the test set into length buckets and evaluating performance separately for each bucket. Figure 3 presents the results.

For very short sequences (3–5 transactions), AUC-ROC is 0.941 with F1-score of 0.827, indicating that limited historical context constrains the model’s ability to identify behavioral anomalies. As sequence length increases to 6–10 transactions, performance improves to 0.968 AUC-ROC and 0.859 F1-score. The improvement continues for sequences of 11–20 transactions (0.981 AUC-ROC, 0.881 F1-score) and 21–30 transactions (0.986 AUC-ROC, 0.892 F1-score).

Beyond 30 transactions, performance stabilizes with minimal further improvement, suggesting that most relevant behavioral patterns are captured within a 30-transaction window. Very long sequences (50+ transactions) show comparable performance (0.985 AUC-ROC, 0.889 F1-score) to medium-length sequences, indicating that our attention mechanism successfully focuses on relevant segments even when processing extensive histories.

This analysis has practical implications for deployment: the model can achieve strong performance with moderate sequence lengths (20–30 transactions), which most users accumulate within a few weeks of activity. New users with limited history will have reduced but still reasonable detection capability, while long-term users benefit from richer behavioral context.
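
The stratified evaluation described in this section can be sketched as follows. This is an illustrative outline rather than the authors' evaluation code: the bucket edges follow the ranges named in the text, the 31–49 bucket is an assumption (the text only mentions "beyond 30" and "50+"), and F1 is computed from hard predictions for simplicity.

```python
from collections import defaultdict

# (low, high) inclusive; high=None means open-ended. The 31-49 bucket is assumed.
BUCKETS = [(3, 5), (6, 10), (11, 20), (21, 30), (31, 49), (50, None)]

def bucket_label(length: int) -> str:
    """Map a sequence length to the label of its length bucket."""
    for lo, hi in BUCKETS:
        if length >= lo and (hi is None or length <= hi):
            return f"{lo}+" if hi is None else f"{lo}-{hi}"
    raise ValueError(f"sequence length {length} is below the minimum of 3")

def f1_score(y_true, y_pred) -> float:
    """F1 for binary labels, where 1 marks fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_by_bucket(lengths, y_true, y_pred) -> dict:
    """Group test sequences by length bucket and score each group separately."""
    groups = defaultdict(lambda: ([], []))
    for n, t, p in zip(lengths, y_true, y_pred):
        truths, preds = groups[bucket_label(n)]
        truths.append(t)
        preds.append(p)
    return {label: f1_score(t, p) for label, (t, p) in groups.items()}
```

A call such as `f1_by_bucket(lengths, y_true, y_pred)` returns one score per populated bucket, which is the shape of result summarized in Figure 3.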

Comparing our method to LSTM across the same length buckets reveals an interesting pattern: LSTM performance degrades slightly for very long sequences (50+ transactions), likely due to vanishing gradient issues, whereas our Transformer-based approach maintains stable performance. This supports our architectural choice for handling long-range dependencies.

5.6. Temporal Dynamics Visualization

To understand what temporal patterns the model learns, we visualize the attention weights for selected fraud cases. Figure 4 shows attention heatmaps for two representative sequences.

In the first case (Figure 4a), we observe a burst pattern where a user’s card was stolen and used for three rapid transactions within 2 h. The attention heatmap shows that these fraudulent transactions (positions 8–10) attend strongly to each other, with attention weights exceeding 0.3 for intra-fraud connections compared to less than 0.1 for connections to legitimate transactions. This demonstrates that the model has learned to identify rapid succession as a fraud indicator, consistent with known fraud patterns where stolen cards are exploited quickly before being blocked.

The second case (Figure 4b) illustrates a more sophisticated pattern where a fraudster gradually escalates transaction amounts over several days to avoid triggering rule-based systems. The final fraudulent transaction (position 15, amount 1,847) attends most strongly to previous transactions at positions 11 (723), 13 (1,124), and 14 (1,456), all of which show increasing amounts. The attention weights to these transactions (0.28, 0.31, 0.35) are significantly higher than to earlier legitimate transactions (typically below 0.15), indicating that the model recognizes the escalation pattern as anomalous.

Interestingly, legitimate transactions show more diffuse attention patterns, distributing attention relatively evenly across recent history. This suggests the model interprets consistent, stable patterns as indicators of legitimate behavior, whereas focused attention on specific subsequences signals anomalies.
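
One way to quantify the contrast between diffuse and focused attention is the entropy of an attention row. The sketch below is an assumption about how such a measure could be computed, not something reported in the paper; the example rows are made up, except that the three large weights in `focused` reuse the values quoted for Figure 4b.

```python
import math

def attention_entropy(weights) -> float:
    """Shannon entropy (in nats) of one attention row, after renormalizing."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(weights) -> float:
    """Entropy divided by its maximum log(n): 1.0 is uniform, near 0 is peaked."""
    n = len(weights)
    return attention_entropy(weights) / math.log(n) if n > 1 else 0.0

# Illustrative rows only: a uniform row and a row concentrated on three keys.
diffuse = [0.1] * 10
focused = [0.005] * 12 + [0.28, 0.31, 0.35]

print(round(normalized_entropy(diffuse), 3), round(normalized_entropy(focused), 3))
```

A low normalized entropy flags rows whose attention mass sits on a few positions, which matches the qualitative description of the fraud cases above.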

These visualizations provide interpretability for fraud investigators, who can inspect which historical transactions influenced the model’s decision, facilitating more efficient case review and validation.

5.7. Comparison of Oversampling Quality

To validate that CTGAN generates higher-quality synthetic samples than SMOTE, we conduct a quantitative comparison of the synthetic data distributions. We measure distribution similarity using Maximum Mean Discrepancy (MMD) between synthetic and real fraud samples, and evaluate discriminability using a held-out classifier’s ability to distinguish synthetic from real fraud (lower accuracy indicates more realistic synthetic samples).

Table 3 shows that CTGAN achieves substantially lower MMD (8.3) than SMOTE (18.4) and ADASYN (16.7), indicating that CTGAN-generated samples more closely match the distribution of real fraud cases. The held-out classifier’s accuracy on CTGAN samples is 54.7%, barely better than random guessing, whereas SMOTE samples are correctly identified 73.2% of the time. This demonstrates that CTGAN produces samples that are much harder to distinguish from real fraud.
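
For readers unfamiliar with MMD, the following sketch shows one common way to estimate it. The text does not state which kernel, bandwidth, or scaling was used, so the Gaussian (RBF) kernel, the `gamma` value, and the biased estimator here are all assumptions, and the sketch is not expected to reproduce the values in Table 3.

```python
import math

def rbf_kernel(x, y, gamma: float) -> float:
    """Gaussian kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(xs, ys, gamma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy 2-D points standing in for real and synthetic fraud feature vectors.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
close = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.9)]
far = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]

print(mmd_squared(real, close) < mmd_squared(real, far))
```

The estimate is zero when both samples are identical and grows as the two distributions move apart, which is why a lower value is read as better synthetic fidelity.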

Feature coverage measures the fraction of unique categorical feature combinations observed in synthetic samples compared to real fraud. CTGAN covers 89.2% of real fraud feature combinations, substantially higher than SMOTE’s 64.1%. This indicates that CTGAN better captures the diversity of fraud patterns, including rare combinations that SMOTE’s interpolation cannot generate.
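
The feature coverage metric defined above reduces to a set intersection. The sketch below illustrates it on toy rows; the column names `card_type` and `product` are hypothetical placeholders, not the dataset's actual fields.

```python
def feature_coverage(real_rows, synthetic_rows, categorical_cols) -> float:
    """Fraction of distinct categorical combinations in real fraud that
    also appear at least once in the synthetic sample."""
    def combos(rows):
        return {tuple(row[c] for c in categorical_cols) for row in rows}
    real = combos(real_rows)
    if not real:
        raise ValueError("real_rows must contain at least one row")
    return len(real & combos(synthetic_rows)) / len(real)

# Toy rows with hypothetical column names.
real_fraud = [
    {"card_type": "credit", "product": "W"},
    {"card_type": "debit", "product": "W"},
    {"card_type": "credit", "product": "C"},
    {"card_type": "credit", "product": "W"},  # duplicate combination
]
synthetic_fraud = [
    {"card_type": "credit", "product": "W"},
    {"card_type": "debit", "product": "W"},
    {"card_type": "debit", "product": "H"},  # not present in real fraud
]

print(feature_coverage(real_fraud, synthetic_fraud, ["card_type", "product"]))
```

Note that combinations appearing only in the synthetic data do not raise the score, so the metric measures recall of real combinations rather than the plausibility of new ones.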

We also visualize the distributions using t-SNE dimensionality reduction in Figure 5, projecting real fraud, synthetic fraud from CTGAN, synthetic fraud from SMOTE, and legitimate transactions into 2D space. The visualization confirms that CTGAN samples occupy similar regions to real fraud samples, including the multi-modal structure visible in the real fraud distribution. SMOTE samples, by contrast, concentrate in a tighter cluster that represents the interpolated region between real fraud samples but misses the full distributional complexity. Legitimate transactions form a separate cluster, validating that fraud and legitimate transactions have distinct feature distributions that our model can learn to distinguish.

5.8. Computational Efficiency

While our framework achieves superior detection performance, computational efficiency is critical for practical deployment. Table 4 compares training and inference times across methods.

Our complete framework requires 5.4 h for full training, including 1.5 h for CTGAN pre-training and 3.9 h for Transformer training. While this is longer than XGBoost (42 min) or LSTM (2.8 h), the training is a one-time offline cost. For production deployment with periodic retraining (e.g., weekly), this overhead is acceptable given the substantial performance gains.

Inference time is 0.71 s per 1000 sequences, corresponding to 1408 sequences per second or approximately 17,000 individual transactions per second (given average sequence length of 12). This throughput is sufficient for real-time fraud detection in most payment processing scenarios. The Transformer architecture’s parallelizability enables efficient batch processing, with inference time scaling sub-linearly with batch size.
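
The throughput figures follow directly from the reported latency; the short calculation below reproduces them. The helper name is hypothetical, and the inputs are the values stated in the text.

```python
def throughput(seconds_per_batch: float, sequences_per_batch: int, avg_seq_len: float):
    """Return (sequences per second, transactions per second)."""
    seq_per_s = sequences_per_batch / seconds_per_batch
    return seq_per_s, seq_per_s * avg_seq_len

# 0.71 s per 1000 sequences, average sequence length of 12 (values from the text).
seq_per_s, txn_per_s = throughput(0.71, 1000, 12)
print(f"{seq_per_s:.0f} sequences/s, {txn_per_s:.0f} transactions/s")
# prints: 1408 sequences/s, 16901 transactions/s
```

The second figure rounds to the approximately 17,000 transactions per second quoted above.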

Memory requirements (9.6 GB) are moderate: they are well within the capacity of modern GPUs and can be accommodated on high-memory CPU servers if necessary. The model contains 5.1 million parameters, making it deployable on edge devices or cloud services without specialized infrastructure.

Compared to LSTM, our method incurs 36% longer inference time but delivers a 5.1-percentage-point higher AUC-ROC (0.982 vs. 0.931 in Table 1), representing a favorable accuracy–efficiency trade-off. For applications requiring maximum throughput, model distillation or quantization could further reduce inference costs while maintaining most of the performance gain.

5.9. Error Analysis

To understand the limitations of our approach, we analyze false positives and false negatives on the test set. We randomly sample 100 errors of each type and manually inspect their characteristics.

False Positives: The most common false positive pattern (38% of cases) involves legitimate users making unusual but valid purchases such as expensive one-time purchases (electronics, jewelry, travel bookings) that deviate significantly from their typical spending patterns. The model correctly identifies these as anomalous but cannot distinguish genuine lifestyle changes from fraud without additional verification signals. Another frequent pattern (27%) is shared card usage within families, where multiple individuals use the same card with different purchasing patterns. For example, a parent’s card used by a teenager for gaming purchases generates attention patterns similar to account takeover fraud. These cases highlight the limitation of transaction data alone without user identity verification. Geographic anomalies account for 19% of false positives, where users traveling to new locations generate legitimate transactions that appear suspicious due to sudden location changes. Cross-border transactions, in particular, trigger higher fraud scores even when legitimate. These observations suggest that incorporating additional contextual signals, such as long-term behavioral profiles, user identity verification features, or external risk indicators, may help reduce false positives and improve the robustness of fraud detection systems.

False Negatives: The most challenging missed fraud cases (41%) involve fraudsters who carefully mimic the victim’s normal spending patterns, making small purchases at familiar merchant categories. These attacks evade detection precisely because they avoid behavioral anomalies that our model relies on. Technical fraud using stolen valid card details accounts for 31% of false negatives. When fraudsters obtain complete card information including CVV and billing address, their transactions appear identical to legitimate ones in our feature space. Detecting these requires additional signals like device fingerprinting or behavioral biometrics, which are outside our model’s scope. Very-short-lived attacks (18%) where fraud occurs immediately after card issuance provide minimal historical context, limiting our sequential model’s effectiveness. These cases would benefit from incorporating card-not-present indicators, IP geolocation, and other auxiliary features.

This error analysis reveals opportunities for improvement through multi-modal fusion incorporating device signals, biometric authentication logs, and external fraud intelligence, which we discuss in the conclusion as future work directions.

5.10. Further Discussions

5.10.1. Practical Deployment for Real-Time Fraud Detection

In practical fraud detection systems, models are required to process incoming transactions with low latency. In the proposed framework, the CTGAN component is used only during the training stage to generate additional fraudulent samples and mitigate class imbalance. Once training is completed, the deployed system relies solely on the time-aware Transformer model for inference. During real-time operation, transactions are processed sequentially as they arrive. For each new transaction associated with a particular card or account, the system constructs a transaction sequence using the most recent historical records and feeds it into the trained model to compute a fraud probability score. Since inference involves only a forward pass through the neural network, predictions can be generated efficiently and integrated into existing fraud monitoring pipelines. This design ensures that the computationally intensive generative modeling stage does not affect real-time inference performance, making the proposed approach compatible with practical fraud detection environments.
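
The inference path described above can be sketched as a per-card rolling window. This is an illustrative outline only: the class name, the `max_len=30` default (motivated by the sequence-length analysis), and the `toy_model` stand-in are assumptions, and a deployed system would call the trained Transformer instead.

```python
from collections import defaultdict, deque

class SequenceScorer:
    """Keeps the most recent transactions per card and scores each new one."""

    def __init__(self, model, max_len: int = 30):
        self.model = model
        # One bounded history per card; the oldest entries are evicted automatically.
        self.history = defaultdict(lambda: deque(maxlen=max_len))

    def score(self, card_id, transaction) -> float:
        window = self.history[card_id]
        window.append(transaction)
        # Forward pass only: no generative model is involved at inference time.
        return self.model(list(window))

def toy_model(window) -> float:
    """Placeholder scorer: flags an amount far above the card's recent mean."""
    amounts = [t["amount"] for t in window]
    if len(amounts) < 2:
        return 0.0
    baseline = sum(amounts[:-1]) / (len(amounts) - 1)
    return 1.0 if amounts[-1] > 3 * baseline else 0.0

scorer = SequenceScorer(toy_model, max_len=3)
for amount in (10.0, 12.0, 11.0, 100.0):
    print(amount, scorer.score("card-A", {"amount": amount}))
```

Bounding the window keeps memory per card constant and fixes the maximum sequence length seen by the model, at the cost of discarding older history.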

5.10.2. Computational Cost and Resource Requirements

The proposed framework introduces additional computational cost during the training stage due to the use of generative modeling and deep neural architectures. Training the CTGAN model to generate synthetic fraud samples requires iterative adversarial optimization, while the time-aware Transformer involves multi-head attention computations for sequential modeling. These training processes may require GPU acceleration for efficient optimization on large datasets. This cost is confined to training: as described in Section 5.10.1, the generative model is not used at inference, and prediction involves only a forward pass through the Transformer. Consequently, the proposed approach remains practical for real-world fraud detection systems where low-latency predictions are required.

6. Conclusions and Future Work

This paper presents a hybrid fraud detection framework that integrates generative adversarial networks with time-aware Transformer architectures. While the individual modeling components are not entirely new, our contribution lies in the systematic integration of generative tabular augmentation and temporal sequence modeling within a unified fraud detection pipeline. By formulating fraud detection as a sequential classification problem and modeling transaction histories as temporal sequences, our approach captures complex behavioral patterns and temporal dependencies that point-wise classifiers cannot detect. The Time-Aware Transformer explicitly incorporates temporal encodings and time-aware attention mechanisms to distinguish legitimate spending patterns from fraudulent anomalies, while CTGAN generates synthetic fraud samples that enrich the minority-class training distribution and help address the severe class imbalance inherent in fraud detection datasets. Experiments on the IEEE-CIS Fraud Detection dataset demonstrate strong performance against the representative baselines considered in this study, achieving 0.982 AUC-ROC and 0.883 F1-score. Ablation studies further confirm the contributions of both the generative oversampling and sequential modeling components, and attention visualizations provide interpretability for fraud investigation.

Despite these promising results, several important limitations should be explicitly acknowledged. First, the current evaluation is conducted on a single benchmark dataset (IEEE-CIS). While this dataset is widely used and reflects realistic transaction scenarios, relying on a single dataset constrains the extent to which the results can be generalized to other fraud detection environments. Therefore, the empirical findings of this work should be interpreted within this experimental setting, and broader validation across multiple datasets remains necessary.

Second, the experimental comparisons are limited to traditional machine learning models, standard deep learning approaches, and supervised sequential architectures. We do not include several strong contemporary directions, such as graph neural network-based fraud detection, self-supervised or contrastive sequence learning, and diffusion-based tabular generative models. While this design allows for controlled comparison within a supervised sequential learning framework, it also limits the scope of empirical claims. As such, the reported performance should not be interpreted as establishing superiority over all existing approaches, but rather within the specific evaluation scope considered in this study.

Third, the integration of CTGAN-generated synthetic samples into temporal transaction sequences introduces a conceptual limitation. The generative model operates at the level of individual tabular transactions and does not explicitly enforce sequence-level temporal or behavioral coherence. Although synthetic transactions are reorganized into chronological sequences during preprocessing, there is no guarantee that generated fraud events are fully consistent with the surrounding legitimate transaction context. As a result, the Time-Aware Transformer is trained on partially synthetic sequences that may not perfectly reflect real-world behavioral dynamics. This limitation highlights an important gap between tabular generative modeling and sequence-aware data generation.

Fourth, while the reported performance is strong and stable across multiple runs, reproducibility is an important consideration. To facilitate transparency and independent verification, we provide the implementation code as Supplementary Material accompanying this submission, and it will be made publicly available upon publication.

In addition to these limitations, the framework requires sufficient transaction history for effective sequential modeling, which may reduce performance for new accounts with limited behavioral records. Error analysis also indicates that sophisticated fraud patterns closely mimicking legitimate behavior and cases involving fully compromised credentials remain challenging to detect using transaction features alone. Furthermore, the computational overhead introduced by generative modeling may pose challenges for scenarios requiring frequent retraining or real-time adaptation.

Future research directions include several promising avenues. First, extending the framework to incorporate relational inductive biases through graph neural networks could better capture interactions among entities such as cards, merchants, and devices. Second, developing sequence-aware generative models that explicitly preserve temporal coherence may address the limitations of tabular-only generation. Third, evaluating the framework across multiple public datasets would strengthen its generalizability. Fourth, exploring self-supervised or contrastive pretraining for transaction sequences may improve representation learning under limited labels. Fifth, investigating alternative generative models such as diffusion-based tabular synthesis could further enhance sample quality. Finally, federated and continual learning strategies may enable adaptive and privacy-preserving fraud detection in real-world deployment settings. Beyond fraud detection, the proposed paradigm of combining generative augmentation with temporal sequence modeling may generalize to other imbalanced sequential classification problems, including cybersecurity intrusion detection, rare disease diagnosis from patient histories, predictive maintenance, and anomaly detection in financial systems.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/math14071183/s1.

Author Contributions

Methodology, J.C. and Y.L.; Software, J.C.; Validation, J.L.; Writing—original draft, J.C., Y.L., J.L. and M.Z.; Writing—review & editing, M.Z.; Visualization, J.L.; Supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overview of the proposed framework integrating CTGAN-based generative oversampling with Time-Aware Transformer for credit card fraud detection.

Figure 2. Impact of synthetic fraud sample ratio on model performance. The shaded regions represent ±1 standard deviation across 5 runs. Performance plateaus around 40–50% fraud ratio, with diminishing returns beyond this point.

Figure 3. Model performance across different sequence lengths. Error bars indicate standard errors. Performance generally improves with longer sequences up to approximately 30 transactions, then stabilizes.

Figure 4. Attention weight visualizations for two fraud cases. Each row shows attention from one transaction (query) to all transactions (keys) in the sequence. Darker colors indicate stronger attention. (a) Rapid successive fraud: the fraudulent transactions (red boxes) at positions 8–10 attend strongly to each other, capturing the burst pattern. (b) Gradual escalation fraud: the final fraudulent transaction attends to a progression of increasing amounts, detecting the escalation pattern.

Figure 5. t-SNE visualization of transaction feature distributions. CTGAN-generated samples (orange) overlap substantially with real fraud (red), while SMOTE samples (purple) form a distinct, more concentrated cluster. Legitimate transactions (blue) are well-separated from fraud.

Table 1. Performance comparison on the IEEE-CIS Fraud Detection dataset. Results are averaged over 5 runs with different random seeds. Bold indicates best performance; underline indicates second best.

Method	AUC-ROC	AUC-PR	Precision	Recall	F1-Score	FPR@90%
LR + SMOTE	0.857 ± 0.008	0.612 ± 0.015	0.623 ± 0.019	0.701 ± 0.012	0.660 ± 0.011	0.182 ± 0.011
RF + SMOTE	0.891 ± 0.006	0.697 ± 0.012	0.724 ± 0.014	0.758 ± 0.009	0.740 ± 0.008	0.134 ± 0.008
XGB + SMOTE	0.912 ± 0.005	0.741 ± 0.011	0.762 ± 0.012	0.789 ± 0.008	0.775 ± 0.007	0.108 ± 0.007
MLP + SMOTE	0.874 ± 0.009	0.671 ± 0.014	0.695 ± 0.016	0.734 ± 0.011	0.714 ± 0.010	0.156 ± 0.010
CTGAN + MLP	0.903 ± 0.007	0.728 ± 0.013	0.751 ± 0.015	0.775 ± 0.010	0.763 ± 0.009	0.119 ± 0.009
LSTM	0.931 ± 0.006	0.789 ± 0.011	0.804 ± 0.013	0.821 ± 0.009	0.812 ± 0.008	0.089 ± 0.007
GRU	0.928 ± 0.006	0.783 ± 0.012	0.798 ± 0.014	0.816 ± 0.010	0.807 ± 0.009	0.092 ± 0.008
TCN	0.924 ± 0.007	0.774 ± 0.013	0.789 ± 0.015	0.809 ± 0.011	0.799 ± 0.010	0.097 ± 0.009
Transformer	0.956 ± 0.005	0.834 ± 0.010	0.847 ± 0.012	0.859 ± 0.008	0.853 ± 0.007	0.061 ± 0.006
TAT-CTGAN (Ours)	0.982 ± 0.003	0.911 ± 0.007	0.891 ± 0.009	0.876 ± 0.006	0.883 ± 0.005	0.038 ± 0.004
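
As a quick consistency check on Table 1, F1 is the harmonic mean of precision and recall, so each row's F1 column can be recomputed from the other two. The sketch below does this with the reported means; the names `TABLE_1`, `f1_from`, and `max_f1_gap` are illustrative and not taken from the paper.

```python
# Mean precision, recall, and F1 as reported in Table 1.
TABLE_1 = {
    "LR + SMOTE": (0.623, 0.701, 0.660),
    "RF + SMOTE": (0.724, 0.758, 0.740),
    "XGB + SMOTE": (0.762, 0.789, 0.775),
    "MLP + SMOTE": (0.695, 0.734, 0.714),
    "CTGAN + MLP": (0.751, 0.775, 0.763),
    "LSTM": (0.804, 0.821, 0.812),
    "GRU": (0.798, 0.816, 0.807),
    "TCN": (0.789, 0.809, 0.799),
    "Transformer": (0.847, 0.859, 0.853),
    "TAT-CTGAN (Ours)": (0.891, 0.876, 0.883),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def max_f1_gap(table: dict) -> float:
    """Largest absolute gap between reported and recomputed F1."""
    return max(abs(f1_from(p, r) - f1) for p, r, f1 in table.values())


if __name__ == "__main__":
    for name, (p, r, f1) in TABLE_1.items():
        print(f"{name:18s} reported={f1:.3f} recomputed={f1_from(p, r):.3f}")
```

Because the table reports means over 5 runs, the F1 recomputed from mean precision and mean recall need not match the mean of per-run F1 scores exactly; agreement to roughly three decimals is the most one should expect.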

Table 2. Ablation study analyzing the contribution of key components. Each row removes or modifies one component of the full framework. Bold indicates best performance.

Configuration	AUC-ROC	AUC-PR	F1-Score	FPR@90%
Full Model (TAT-CTGAN)	0.982 ± 0.003	0.911 ± 0.007	0.883 ± 0.005	0.038 ± 0.004
w/o CTGAN (use SMOTE)	0.961 ± 0.005	0.849 ± 0.010	0.861 ± 0.007	0.057 ± 0.005
w/o Oversampling	0.943 ± 0.007	0.807 ± 0.013	0.834 ± 0.009	0.078 ± 0.008
w/o Temporal Encoding	0.968 ± 0.004	0.872 ± 0.009	0.869 ± 0.006	0.049 ± 0.005
w/o Temporal Bias in Attention	0.973 ± 0.004	0.889 ± 0.008	0.875 ± 0.006	0.044 ± 0.004
w/o Positional Encoding	0.959 ± 0.005	0.841 ± 0.011	0.857 ± 0.008	0.060 ± 0.006
Use Standard Attention	0.956 ± 0.005	0.834 ± 0.010	0.853 ± 0.007	0.061 ± 0.006
Reduce to 2 Layers	0.971 ± 0.004	0.881 ± 0.009	0.872 ± 0.006	0.046 ± 0.005
Increase to 6 Layers	0.980 ± 0.004	0.907 ± 0.008	0.880 ± 0.006	0.040 ± 0.004
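
One way to read Table 2 is to rank the configurations by how far their AUC-PR falls below the full model. The sketch below does that arithmetic on the reported means only and ignores the ± spread, so small differences between neighboring rows should not be over-interpreted; the names `FULL_AUC_PR`, `ABLATIONS`, and `ranked_drops` are illustrative.

```python
# Mean AUC-PR values as reported in Table 2.
FULL_AUC_PR = 0.911
ABLATIONS = {
    "w/o CTGAN (use SMOTE)": 0.849,
    "w/o Oversampling": 0.807,
    "w/o Temporal Encoding": 0.872,
    "w/o Temporal Bias in Attention": 0.889,
    "w/o Positional Encoding": 0.841,
    "Use Standard Attention": 0.834,
    "Reduce to 2 Layers": 0.881,
    "Increase to 6 Layers": 0.907,
}


def ranked_drops(full: float, variants: dict) -> list:
    """Return (configuration, AUC-PR drop) pairs, largest drop first."""
    drops = {name: full - score for name, score in variants.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, drop in ranked_drops(FULL_AUC_PR, ABLATIONS):
        print(f"{name:32s} -{drop:.3f}")
```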

Table 3. Quality comparison of synthetic fraud samples generated by different methods. Lower MMD indicates a closer match to the real distribution. Discriminator accuracy near 50% indicates that synthetic samples are indistinguishable from real ones. Bold indicates best performance.

Method	MMD ($\times 10^{-3}$)	Discriminator Acc.	Feature Coverage
SMOTE	18.4 ± 2.1	0.732 ± 0.018	0.641
ADASYN	16.7 ± 1.9	0.698 ± 0.021	0.678
CTGAN	8.3 ± 1.2	0.547 ± 0.024	0.892
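
To make the MMD column concrete, the following is a minimal sketch of a squared maximum mean discrepancy estimate between two samples. It assumes a Gaussian RBF kernel with a fixed bandwidth parameter `gamma` and the biased estimator; this excerpt does not state which kernel or estimator the authors used, so those choices, the toy Gaussian data, and all function names here are assumptions for illustration.

```python
import math
import random


def rbf(u, v, gamma):
    """Gaussian RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(u, v, gamma) for u in xs for v in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, gamma=0.25):
    """Biased estimate of squared MMD between samples xs and ys."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


def gaussian_sample(rng, n, dim, shift):
    """Toy stand-in for a set of transaction feature vectors."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = gaussian_sample(rng, 100, 4, 0.0)    # stand-in for real fraud samples
close = gaussian_sample(rng, 100, 4, 0.0)   # synthetic set from the same distribution
far = gaussian_sample(rng, 100, 4, 1.5)     # synthetic set from a shifted distribution

mmd_close = mmd2_biased(real, close)
mmd_far = mmd2_biased(real, far)

if __name__ == "__main__":
    print(f"same distribution:    {mmd_close:.4f}")
    print(f"shifted distribution: {mmd_far:.4f}")
```

A synthetic set drawn from the same distribution as the real data yields a much smaller value than one drawn from a shifted distribution, which is the sense in which a lower MMD in Table 3 indicates a closer match.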

Table 4. Computational efficiency comparison. Training time is measured to convergence on the full dataset. Inference time is per 1000 sequences. All measurements were taken on an NVIDIA A100 GPU.

Method	Training Time	Inference Time	Memory (GB)	Params (M)
XGBoost	42 min	0.18 s	4.2	-
LSTM	2.8 h	0.52 s	6.8	3.2
Transformer	3.6 h	0.64 s	8.4	4.7
TAT-CTGAN	5.4 h	0.71 s	9.6	5.1
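
The per-1000-sequence inference times in Table 4 convert directly into throughput and an amortized per-sequence cost. The sketch below performs that conversion; the variable and function names are illustrative, and the per-sequence figure is a batch average rather than a measured single-request latency, which the table does not report.

```python
# Inference time in seconds per 1000 sequences, as reported in Table 4.
INFERENCE_S_PER_1000 = {
    "XGBoost": 0.18,
    "LSTM": 0.52,
    "Transformer": 0.64,
    "TAT-CTGAN": 0.71,
}


def throughput_per_second(seconds_per_1000: float) -> float:
    """Sequences scored per second at the reported rate."""
    return 1000.0 / seconds_per_1000


def amortized_ms_per_sequence(seconds_per_1000: float) -> float:
    """Average cost per sequence in milliseconds (batch-amortized)."""
    seconds_per_sequence = seconds_per_1000 / 1000.0
    return seconds_per_sequence * 1000.0


def relative_cost(method: str, baseline: str) -> float:
    """Ratio of inference time between two methods in the table."""
    return INFERENCE_S_PER_1000[method] / INFERENCE_S_PER_1000[baseline]


if __name__ == "__main__":
    for name, secs in INFERENCE_S_PER_1000.items():
        print(f"{name:12s} {throughput_per_second(secs):8.1f} seq/s "
              f"{amortized_ms_per_sequence(secs):.2f} ms/seq")
    print(f"TAT-CTGAN vs Transformer: {relative_cost('TAT-CTGAN', 'Transformer'):.3f}x")
```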

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, J.; Liang, Y.; Liu, J.; Zhou, M. Temporal Transformer with Conditional Tabular GAN for Credit Card Fraud Detection: A Sequential Deep Learning Approach. Mathematics 2026, 14, 1183. https://doi.org/10.3390/math14071183
