1. Introduction
As technology evolves rapidly and the volume of financial transactions surges, digital banking is experiencing unprecedented growth. Consequently, financial systems are increasingly vulnerable to fraudulent activities, such as credit card or online payment fraud [1,2,3]. Reliable fraud detection has, therefore, become a priority for financial institutions. Traditional methods often rely on point estimates and require large amounts of labeled data [4]. Class imbalance is mainly approached through oversampling and undersampling techniques [5,6,7].
Existing methods can be broadly divided into two groups: classifiers that use single-transaction features and sequential models that incorporate historical customer behavior [8,9]. The former often lack robustness, since what constitutes suspicious activity for one customer may be entirely normal for another. Sequential models usually rely on fixed-size time windows of transaction histories. Such windows may truncate long-term dependencies or, in the case of new accounts, include potentially misleading transactions from other customers. A more natural approach is to include each individual customer's full transaction history. However, these transaction time series are usually irregularly sampled and vary in length, which cannot be handled easily by existing approaches, highlighting the need for effective feature extraction techniques.
Handcrafted features, typically derived from expert knowledge or general statistics, such as mean, variance, skewness, and kurtosis, have been widely studied and can provide insights (see e.g., [2,10]). However, they generally fail to capture the temporal order of transactions, which is essential in anomaly detection for time series. To address these limitations, we propose encoding transaction histories using (log-)signatures. Log-signatures provide a mathematical representation of a path, gradually encoding details until fully characterizing it under mild conditions. They are robust to irregular sampling frequencies and varying time series lengths [11].
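For orientation (the full background is recapped in Section 3), the sketch below states the standard definition of the signature as the collection of iterated integrals of a path, and of the log-signature as its logarithm in the truncated tensor algebra; the notation here is ours and only illustrative.

```latex
% Standard definitions, stated only for orientation (see Section 3).
% For a d-dimensional path X : [0,T] -> R^d of bounded variation and truncation order m:
\[
  S^{(m)}(X)_{0,T}
  \;=\;
  \Bigl( 1,\;
    \int_{0<t_1<T} \mathrm{d}X_{t_1},\;
    \int_{0<t_1<t_2<T} \mathrm{d}X_{t_1}\!\otimes\mathrm{d}X_{t_2},\;
    \dots,\;
    \int_{0<t_1<\cdots<t_m<T} \mathrm{d}X_{t_1}\!\otimes\cdots\otimes\mathrm{d}X_{t_m}
  \Bigr),
\]
\[
  \operatorname{LogSig}^{(m)}(X)_{0,T} \;=\; \log\!\bigl( S^{(m)}(X)_{0,T} \bigr),
\]
% where the logarithm is taken in the truncated tensor algebra. The log-signature removes
% the algebraic redundancy of the signature and yields a fixed-length feature vector that
% does not depend on the number or spacing of the observations along the path.
```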
Classical machine learning models for time series classification (TSC) [12] typically rely on large labeled datasets, which are often costly to acquire. Semi-supervised learning (SSL) mitigates this by leveraging abundant unlabeled data to learn representations that capture shared structures and reduce dependence on scarce labeled data [13]. In practice, however, fraud datasets are constrained not only by the lack of labels but also by a shortage of unlabeled samples, in addition to severe class imbalance. Consequently, data augmentation has become an essential component of regularization-based SSL and classification tasks [6]. Inspired by successes in computer vision and natural language processing [14], we extend generative SSL techniques to sequential financial data. As common augmentation methods, such as rotation and cropping, may disrupt temporal dependencies, we employ generative models for data augmentation.
Another key challenge in fraud detection is evaluation. Global measures such as ROC AUC or PR-AUC provide useful overall summaries, but they do not reflect operational constraints, where only a very small fraction of transactions can be manually reviewed. To address this, we introduce a dual evaluation framework combining standard global metrics with domain-specific head metrics. In particular, we evaluate based on Precision@K, Recall@K, and a cost-sensitive Expected Cost@K measure that incorporates the monetary impact of missed frauds and false alerts within the top K% of alerts ranked by fraud predictions. This enables comparison not only in terms of statistical performance but also in terms of real-world business impact.
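To make the head metrics concrete, here is a minimal Python sketch of Precision@K, Recall@K, and an Expected Cost@K for a review budget of the top K% of transactions ranked by fraud score; the unit costs, the amount weighting, and the function name are illustrative assumptions rather than the exact cost specification used later.

```python
import numpy as np

def head_metrics(y_true, scores, amounts, k_pct=0.5,
                 cost_missed=1.0, cost_review=0.05):
    """Illustrative Precision@K / Recall@K / Expected Cost@K for a top-K% alert budget.

    y_true  : 1 for fraud, 0 for legitimate
    scores  : model fraud scores (higher = more suspicious)
    amounts : transaction amounts, used to weight missed frauds
    cost_missed, cost_review : hypothetical unit costs (not from the paper)
    """
    y_true, scores, amounts = map(np.asarray, (y_true, scores, amounts))
    n_alerts = max(1, int(np.ceil(len(scores) * k_pct / 100.0)))
    order = np.argsort(-scores)                  # rank by predicted fraud score
    alerted = np.zeros(len(scores), dtype=bool)
    alerted[order[:n_alerts]] = True             # top K% flagged for manual review

    tp = np.sum(y_true[alerted] == 1)
    precision_at_k = tp / n_alerts
    recall_at_k = tp / max(1, y_true.sum())

    # Expected Cost@K: losses from frauds that escape review plus the review cost of alerts.
    missed = (~alerted) & (y_true == 1)
    expected_cost = cost_missed * amounts[missed].sum() + cost_review * n_alerts
    return precision_at_k, recall_at_k, expected_cost
```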
Further, to quantify uncertainty in predictions, a vital aspect in high-risk areas such as fraud detection, we propose using Bayesian inference. In a Bayesian setting, a distribution is placed over network weights, resulting in a predictive distribution rather than point estimates. This allows fraud likelihood prediction while enabling uncertainty quantification. Such uncertainty is particularly valuable, as point estimates may appear highly confident, yet the underlying distribution can reveal substantial variance. By modeling this uncertainty, our approach provides calibrated confidence information and increases robustness in the top ranking of fraud predictions.
In this paper, we bring recent advances in generative semi-supervised learning from computer vision to credit card fraud detection. Specifically, we extend GAN-based SSL from the image domain to sequential transaction data. Our main contributions are as follows:
We propose a composite loss that integrates supervised risk with a Wasserstein-based IPM in the log-signature space, designed to unify input dimensions, effectively capture temporal dependencies in transaction sequences, and minimize discrepancies between generated and real unlabeled samples, while maximizing classification accuracy on labeled data.
We introduce a conditional generator to produce tailored augmentations by controlling categorical feature combinations (e.g., customer age or risk group), ensuring realistic and context-aware synthetic samples.
We enhance generalization and robustness by integrating Bayesian inference over network weights, thereby providing predictive distributions and principled uncertainty estimates for fraud detection.
We provide a comprehensive dual evaluation framework combining statistical and domain-specific, cost-sensitive metrics, ensuring practically relevant performance assessment.
Our approach is validated on the BankSim dataset [15] under varying proportions of labeled data. As access to real-world financial data is extremely restricted due to privacy and regulatory concerns, many studies rely on simulated data. In particular, BankSim is an agent-based simulator designed to mimic real banking behavior and has become a widely adopted benchmark in this domain. Results demonstrate improvements over benchmark models in terms of both global and domain-specific head metrics, while providing distributional outputs and uncertainty measures.
The remainder of this paper is organized as follows:
Section 2 reviews related literature on SSL and path signatures for time series modeling.
Section 3 briefly recaps the necessary background on GANs, Bayesian neural networks, and log-signatures.
Section 4 introduces the proposed network architecture, loss functions and training procedure for our approach.
Section 5 describes the BankSim dataset, outlines the evaluation process and presents the numerical results.
Section 6 draws the conclusions and discusses directions for future work.
2. Literature Review
Semi-supervised Learning (SSL): Methods for extracting additional information from unlabeled data range from deep label propagation [16] to more recent approaches [17] that unify regularization with pseudo-labels into a single framework. Despite their successes, most of these methods were not initially designed for TSC tasks and therefore ignore temporal relations [14].
SSL for Time Series: For time series data, SSL approaches typically fall into two categories: self-learning methods, where unlabeled samples are iteratively assigned pseudo-labels, and regularization-based methods that exploit shared structures across labeled and unlabeled data. Recent work explores enhanced data augmentation strategies [18] or trains a model jointly for supervised classification on labeled data and an auxiliary forecasting task on all samples [19]. GAN-based techniques extend this line by regenerating signals and combining unsupervised representation learning with supervised loss components [20].
GAN-based SSL: Ref. [21] proposes labeling generated samples as a new $(K{+}1)$-th class and solving a $(K{+}1)$-class classification problem using a GAN. Ref. [22] provides theoretical insights on suboptimal generators potentially improving SSL by moving the discriminator's decision boundaries to high-density areas of the data manifold. Since, in an SSL framework, the discriminator might perform well while the generator still produces visually unrealistic samples, ref. [23] introduced Triple GAN, containing three neural networks. Three-player GANs for missing value imputation were proposed by [24]. Ref. [25] provides a comprehensive overview of SSL-GAN training enhancements and proposes semi-supervised GANs with spatial coevolution for image datasets. Ref. [26] used a WGAN-based semi-supervised approach for anomaly detection.
Signatures for feature engineering: Signatures, first introduced by [27] in the 1950s, became widely recognized in the mathematical community through Terry Lyons' development of Rough Path theory [28]. More recently, they have gained considerable traction in the machine learning community. In particular, (log-)signatures have emerged as a non-parametric and mathematically principled dimension reduction technique for time series data [29], which has led to successful applications across a broad range of domains, including pricing derivatives [30], human action and gesture recognition [31,32] and, more recently, sports analytics [33]. In financial time series encoding and generative modeling, ref. [34] developed a market simulator trained on path signatures and [35] combined log-signatures with recurrent neural networks to learn neural stochastic differential equations. Other works exploit signatures to measure time series similarities [36], detect market anomalies [37] and enhance deep learning architectures, such as transformers, for time series modeling and deep hedging [38,39].
4. Proposed Model
To recap, our proposed GAN-based SSL approach relies on three main ideas: (1) constructing a conditional generator to simulate meaningful samples by controlling for categorical feature combinations; (2) introducing a novel loss function based on the Wasserstein distance and log-signatures that unifies input dimensions, efficiently extracts temporal features of time series data, and simultaneously minimizes the discrepancy between real and generated unlabeled samples while classifying real labeled samples in a supervised manner; and (3) placing distributions over network weights to avoid mode collapse, enhance the generalization of the discriminator, and estimate the probability of the target variable rather than a point estimate.
4.1. Network Architecture
Generator: For the generator, we generate samples directly representing log-signatures of augmented transaction histories, conditioned on a vector $c$ of categorical features. Log-signatures provide fixed-length encodings of transaction histories that are irregularly sampled and vary in length, allowing both G and D to operate in a uniform feature space. Without this encoding, G would be limited to producing histories of fixed length, thereby reducing data variability. To ensure a suitable combination of categorical values, $c$ is sampled from real training data. Formally, we train a network G that maps a latent vector $z$ and a conditioning vector $c$ to an output $G(z, c)$ using tanh activation functions and residual layers defined as follows.

Definition 4. Let $A$ be an affine transformation and $\phi$ a tanh function. Then, a residual layer is defined by
$$ \operatorname{Res}(x) \;=\; x + \phi\bigl(A(x)\bigr), $$
where $\phi$ is applied component-wise.

First, each categorical feature is passed through an embedding layer, which maps it to a vector whose dimension is determined by the number of distinct values of the feature in the training data, followed by a tanh activation function. Second, we concatenate the output with the latent vector $z$ and apply two residual layers followed by a tanh and a fully connected layer, as illustrated in Figure 1.
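As one concrete, simplified reading of this generator architecture, the PyTorch sketch below combines per-feature embeddings with tanh activations and two residual layers; the layer sizes, the embedding-dimension heuristic, and the output dimension are our own placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Residual layer as in Definition 4: x + tanh(A(x)) with A affine."""
    def __init__(self, dim):
        super().__init__()
        self.affine = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.tanh(self.affine(x))

class ConditionalGenerator(nn.Module):
    """Maps (latent z, categorical conditioning vector c) to a log-signature-sized output."""
    def __init__(self, latent_dim, cardinalities, out_dim):
        super().__init__()
        # One embedding per categorical feature; the embedding size here is a heuristic
        # tied to the feature's cardinality, not necessarily the paper's rule.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, min(card, 8)) for card in cardinalities
        )
        in_dim = latent_dim + sum(e.embedding_dim for e in self.embeddings)
        self.body = nn.Sequential(
            ResidualLayer(in_dim),
            ResidualLayer(in_dim),
            nn.Tanh(),
            nn.Linear(in_dim, out_dim),   # final fully connected layer
        )

    def forward(self, z, cats):
        # cats: LongTensor of shape (batch, n_categorical_features)
        embs = [torch.tanh(emb(cats[:, i])) for i, emb in enumerate(self.embeddings)]
        return self.body(torch.cat([z] + embs, dim=1))

# Example: 100-dim latent noise, two categorical features, placeholder output dimension
# (in practice, the dimension of the order-four log-signature of the augmented path).
G = ConditionalGenerator(latent_dim=100, cardinalities=[8, 3], out_dim=256)
fake_logsig = G(torch.randn(4, 100), torch.randint(0, 3, (4, 2)))
```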
Discriminator: Similar to [43], we aim for a discriminator D that takes class labels into account. We, however, propose a discriminator that returns a vector of raw scores in $\mathbb{R}^{K+1}$ instead of estimating the probability that a sample $X$ belongs to class $k \in \{0, 1, \dots, K\}$, where class label 0 represents a sample produced by the generator. To do so, we construct a feedforward neural network D using tanh activation functions and residual layers.
Further, given a real sample X containing both time series data and a categorical feature vector $c$, we add a preprocessing step to extract meaningful information. We first augment the time series using a time augmentation, a lead-lag augmentation and an invisibility-reset augmentation. Second, we apply a piece-wise linear interpolation and finally compute the truncated log-signature of order four. These augmentations follow prior literature on log-signature modeling [11,36]: they ensure the uniqueness of the log-signature, capture information on the quadratic variation of the process and add extra information about the starting point. The truncation at order four is a widely used compromise [36]: higher truncation orders capture more path information, but also increase dimensionality, which can reduce efficiency and stability. In this case, the dimension of the truncated log-signature of a d-dimensional path equals the number of Lyndon words of length at most four over d letters, i.e., $d + \binom{d}{2} + \frac{d^{3}-d}{3} + \frac{d^{4}-d^{2}}{4}$. A detailed overview of possible path augmentations and their classification is given by [11]. Finally, as in the generator, each categorical feature is passed through an embedding layer, mapping it to a vector whose dimension is determined by the number of distinct values of the feature. Concatenating the embeddings with the log-signature yields a final 271-dimensional representation for each transaction history. This preprocessing is applied consistently to both real and synthetic samples.
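A minimal sketch of this preprocessing step is given below. It implements one common variant of the time, lead-lag, and invisibility-reset augmentations in NumPy and assumes the iisignature package for the truncated log-signature; the exact augmentation conventions and the library choice are assumptions on our part.

```python
import numpy as np
import iisignature  # assumed library choice; esig or signatory could be used instead

def augment_path(values, times):
    """Time, lead-lag and invisibility-reset augmentations of a transaction path.

    values : (n, d) array of transaction features, times : (n,) array of timestamps.
    The conventions here are one common variant, not necessarily the paper's exact one.
    """
    values = np.asarray(values, dtype=float)
    t = np.asarray(times, dtype=float).reshape(-1, 1)
    path = np.hstack([t, values])                      # time augmentation

    # Lead-lag: interleave a "lead" and a "lag" copy of the path.
    lead = np.repeat(path, 2, axis=0)[1:]
    lag = np.repeat(path, 2, axis=0)[:-1]
    path = np.hstack([lead, lag])

    # Invisibility-reset: add a visibility coordinate, then drop it to zero and
    # reset the path, which keeps information about the starting point.
    path = np.hstack([path, np.ones((len(path), 1))])
    drop = np.hstack([path[-1, :-1], [0.0]])
    reset = np.zeros(path.shape[1])
    return np.vstack([path, drop, reset])

def logsig_features(values, times, depth=4):
    path = augment_path(values, times)
    prep = iisignature.prepare(path.shape[1], depth)   # basis for the log-signature
    return iisignature.logsig(path, prep)              # truncated log-signature of order 4
```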
For a given log-signature of length $l$ and a vector $c$ of $n$ categorical features, the discriminator network can be illustrated as in Figure 1.
4.2. Loss Functions
Let $\{X^{u}_i\}_{i=1}^{N}$ be $N$ unlabeled observations and $\{(X^{l}_j, y_j)\}_{j=1}^{M}$ be $M$ labeled real samples with class labels $y_j \in \{1, \dots, K\}$. We label generated data as class 0. For any given time series $X$, we form the augmented path $\tilde{X}$ and its truncated log-signature $S(X) := \operatorname{LogSig}^{(4)}(\tilde{X})$.
Discriminator: Our network $D_{\theta_d}$ outputs raw scores $D_{\theta_d}(S(X)) \in \mathbb{R}^{K+1}$. For the discriminator we adopt a composite loss consisting of (i) a supervised classification part on labeled data, (ii) an unsupervised Wasserstein-style IPM term on real vs. generated samples and (iii) a gradient penalty ensuring 1-Lipschitz continuity of the network [42].

We use a fixed linear projection $T$, which has operator norm $\lVert T \rVert_{\mathrm{op}} = 1$, to obtain the scalar critic $f := T \circ D_{\theta_d}$.

On labeled real samples, we train the class scores with cross-entropy:
$$ L_{\mathrm{sup}} \;=\; -\frac{1}{M} \sum_{j=1}^{M} \log p_{y_j}\bigl(X^{l}_j\bigr), \qquad p_k(X) \;=\; \operatorname{softmax}\bigl(D_{\theta_d}(S(X))\bigr)_k, $$
with $p_k(X)$ being the probability that sample $X$ belongs to class $k$.
On mini-batches of $m$ unlabeled real samples $S(X^{u}_i)$ and $m$ generated samples $G_{\theta_g}(z_i, c_i)$ we minimize an Integral Probability Metric (IPM) estimated with the scalar critic $f$:
$$ L_{\mathrm{IPM}} \;=\; \frac{1}{m} \sum_{i=1}^{m} f\bigl(G_{\theta_g}(z_i, c_i)\bigr) \;-\; \frac{1}{m} \sum_{i=1}^{m} f\bigl(S(X^{u}_i)\bigr). $$
By Kantorovich–Rubinstein duality [49], minimizing this objective is equivalent to maximizing the Wasserstein-1 distance in its dual formulation, restricted to the critic class $\{f = T \circ D_{\theta_d}\}$. We enforce 1-Lipschitzness of $f$ via a gradient penalty [42] on interpolated features $\hat{x}_i = \epsilon_i\, S(X^{u}_i) + (1-\epsilon_i)\, G_{\theta_g}(z_i, c_i)$ with $\epsilon_i \sim \mathcal{U}[0,1]$:
$$ L_{\mathrm{GP}} \;=\; \frac{1}{m} \sum_{i=1}^{m} \bigl( \lVert \nabla_{\hat{x}_i} f(\hat{x}_i) \rVert_2 - 1 \bigr)^{2}. $$
Note that this applies the penalty to the scalar critic $f$, avoiding ambiguity for vector-valued outputs.
Finally, for a latent input $z$, a categorical vector $c$, a sample $X$, scaling factors $\lambda_1$ and $\lambda_2$ and network weights $\theta_g$ and $\theta_d$, the loss function for the discriminator is defined as follows:
$$ L_D(\theta_d; \theta_g) \;=\; L_{\mathrm{sup}} \;+\; \lambda_1\, L_{\mathrm{IPM}} \;+\; \lambda_2\, L_{\mathrm{GP}}. $$
Together, these components can be interpreted as follows: $L_{\mathrm{sup}}$ performs empirical risk minimization on the scarce labeled data, $L_{\mathrm{IPM}}$ enforces distributional alignment between real and generated log-signatures through an IPM, which is a standard technique in semi-supervised learning and domain adaptation [21,50], while the gradient penalty $L_{\mathrm{GP}}$ enforces that the scalar critic remains 1-Lipschitz, a condition necessary for the IPM term to coincide with the Wasserstein-1 distance [42].
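A schematic PyTorch implementation of this composite discriminator loss is sketched below; the projection T, the weighting factors, and the tensor shapes follow the notation above, but the concrete values and the function signature are our own assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, T, real_lab, y_lab, real_unlab, fake,
                       lambda_ipm=1.0, lambda_gp=10.0):
    """Composite loss: supervised CE + Wasserstein-style IPM + gradient penalty.

    D : network returning raw class scores (class 0 = generated)
    T : fixed unit-norm linear projection, f = T(D(x)) is the scalar critic
    real_lab/real_unlab/fake : batches of log-signature features; y_lab : class labels
    lambda_ipm, lambda_gp : scaling factors (values here are placeholders)
    """
    def critic(x):
        return T(D(x)).squeeze(-1)               # scalar critic f = T o D

    # (i) supervised cross-entropy on labeled real samples
    loss_sup = F.cross_entropy(D(real_lab), y_lab)

    # (ii) IPM term: push critic scores of real samples up and of generated samples down
    loss_ipm = critic(fake).mean() - critic(real_unlab).mean()

    # (iii) gradient penalty on interpolated features, applied to the scalar critic
    eps = torch.rand(real_unlab.size(0), 1, device=real_unlab.device)
    x_hat = (eps * real_unlab + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    loss_gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()

    return loss_sup + lambda_ipm * loss_ipm + lambda_gp * loss_gp
```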
Generator: The generator is trained to produce samples that are scored as real by the scalar critic. Hence, for latent inputs $z_i$, categorical vectors $c_i$ and network weights $\theta_g$ and $\theta_d$, the loss for a mini-batch of $m$ samples is
$$ L_G(\theta_g; \theta_d) \;=\; -\frac{1}{m} \sum_{i=1}^{m} f\bigl(G_{\theta_g}(z_i, c_i)\bigr). $$
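Under the same conventions as the discriminator sketch above, the corresponding generator objective reduces to a few lines (again an illustrative reading rather than the reference implementation):

```python
def generator_loss(D, T, G, z, c):
    """Generator loss: maximize the scalar critic's score of generated log-signatures."""
    fake = G(z, c)                        # generated log-signature features
    f = T(D(fake)).squeeze(-1)            # scalar critic, as in the discriminator loss
    return -f.mean()
```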
Integral Probability Metric (IPM) and theoretical grounding: Our unlabeled objective can be formally understood as an Integral Probability Metric (IPM) in log-signature space. Given two distributions $P$ and $Q$ and a function class $\mathcal{F}$, the IPM is defined as
$$ d_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}} \; \Bigl| \, \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \, \Bigr|. $$
When $\mathcal{F}$ is the set of 1-Lipschitz functions, $d_{\mathcal{F}}$ equals the Wasserstein-1 distance by the Kantorovich–Rubinstein duality [49]. Our unlabeled loss takes the form
$$ L_{\mathrm{IPM}} \;=\; \frac{1}{m} \sum_{i=1}^{m} f\bigl(G_{\theta_g}(z_i, c_i)\bigr) \;-\; \frac{1}{m} \sum_{i=1}^{m} f\bigl(S(X^{u}_i)\bigr), $$
where $f = T \circ D_{\theta_d}$ is a scalar critic obtained by a fixed unit-norm projection $T$ of the discriminator outputs. Since $\lVert T \rVert_{\mathrm{op}} = 1$, $f$ is 1-Lipschitz whenever $D_{\theta_d}$ is, ensuring the constraint is enforced. In practice, the supremum is approximated by optimizing $D_{\theta_d}$ under a gradient penalty that enforces 1-Lipschitzness. Thus $L_{\mathrm{IPM}}$ is an empirical estimate of the Wasserstein-1 distance restricted to this critic class, providing a principled Wasserstein-based IPM tailored to log-signature features rather than an ad hoc design. In summary, the discriminator's composite loss can be interpreted as supervised risk minimization regularized by a Wasserstein discrepancy under a Lipschitz constraint.
Beyond Wasserstein-based IPMs, widely used distributional discrepancy measures, including MMD [51] or f-divergences [52], could be incorporated into our framework without changing the overall architecture. In the context of signature features, however, mainly MMD- and Wasserstein-based objectives have been explored: for example, refs. [53,54] introduced an MMD with a signature kernel for statistical tests and training generative models, while [36,48] integrate the Wasserstein-1 distance directly in signature space for time-series generation. These works highlight that both families of metrics are principled options in signature space. We adopt the Wasserstein formulation because it is well established in the literature, offers stable training dynamics, and aligns naturally with the geometric representation provided by log-signatures. A broader empirical comparison with alternative divergences, including signature-based MMD, remains an interesting direction for future work.
4.3. Posterior Sampling
To approximate the posterior over network weights, we follow the general Bayesian updating framework (see Section 3.3) and employ Stochastic Gradient Hamiltonian Monte Carlo (SGHMC), introduced in [44]. SGHMC is very closely related to momentum-based stochastic gradient descent (SGD), but injects calibrated Gaussian noise to simulate posterior samples while using mini-batches for scalability. Hence, parameter settings, such as learning rate and momentum terms, can be imported from standard optimizers.
Following [43], we set the parameters of the prior distributions $p(\theta_g)$ and $p(\theta_d)$ over the generator and discriminator weights, respectively. Initial weights $\theta_g$ and $\theta_d$ are drawn from these priors and updated with SGHMC (see Algorithm 1) using noise samples $z$ drawn from the latent distribution $p_z$ and a mini-batch of real data samples. Posterior samples yield predictive distributions; for evaluation against baselines, we use the posterior mean prediction in Section 5.5.1, while uncertainty analyses rely on the full posterior predictive in Section 5.5.3. For clarity, we present one SGHMC iteration in Algorithm 1 using a standard momentum-based SGD update. In practice, consistent with [43], we achieved more stable performance using ADAM-style momentum updates within the SGHMC framework. Choosing a prior distribution is a crucial part of Bayesian inference, which often relies on expert knowledge. We avoid any exogenous assumptions or domain expert knowledge and follow the Glorot normal initialization [55] to maintain a high information flow across layers. Network weights are randomly drawn from a centered normal distribution $\mathcal{N}(0, \sigma^2)$ with variance
$$ \sigma^{2} \;=\; g^{2}\, \frac{2}{\mathrm{fan\_in} + \mathrm{fan\_out}}, $$
with scaling factor $g$ and fan_in and fan_out denoting the input and output dimensions of a network layer. As we use small layers down to 32 nodes, we choose the scaling factor accordingly to keep the variance reasonably small. Hyperparameters were adopted from commonly used settings in the literature, with minimal additional tuning of the learning rate. Based on this tuning study (Table A2), we selected learning rates of 0.0001 for the generator and 0.005 for the discriminator in the final evaluation. The much higher learning rate for the discriminator D reflects the fact that it has to learn both classification and discrimination quickly, while the generator only learns through D's signal [56]. The mini-batch size is fixed for both G and D. For efficiency, we run one chain and train the generator and discriminator with alternating update steps, using a fixed number of discriminator updates per generator update, for 1000 iterations. Convergence is monitored heuristically via stabilization of losses.
Algorithm 1: One training epoch of SGHMC with an SGD-style update, friction term $\alpha$, learning rate $\eta$, parallel Markov chains, and previous posterior samples $\{\theta_g^{(k)}\}_k$ and $\{\theta_d^{(k)}\}_k$.

for each generator chain $k$ do
    sample a noise batch $z$ from the latent distribution $p_z$
    update $\theta_g^{(k)}$ using SGHMC with the generator loss $L_G$
    append $\theta_g^{(k)}$ to the generator sample set
end for
for each discriminator chain $k$ do
    for each discriminator update step do
        sample a mini-batch of real data $X$ from the training set
        sample a noise batch $z$ from $p_z$
        generate synthetic samples $G_{\theta_g}(z, c)$
        update $\theta_d^{(k)}$ using SGHMC with the discriminator loss $L_D$
        append $\theta_d^{(k)}$ to the discriminator sample set
    end for
end for
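To make one SGHMC step concrete, the sketch below shows an SGD-style update for a single parameter tensor; the friction and learning-rate values are placeholders, and the simplified noise term uses the idealized variance 2 · friction · lr, ignoring the empirical gradient-noise correction from [44].

```python
import torch

def sghmc_step(param, grad, momentum, lr=0.005, friction=0.05):
    """One SGD-style SGHMC update for a single parameter tensor (in-place).

    param    : parameter tensor theta
    grad     : stochastic gradient of the (mini-batch) negative log-posterior
    momentum : velocity buffer carried between iterations
    The injected noise has variance 2 * friction * lr, the idealized SGHMC choice,
    ignoring the empirical gradient-noise estimate for simplicity.
    """
    with torch.no_grad():
        noise = torch.randn_like(param) * (2.0 * friction * lr) ** 0.5
        momentum.mul_(1.0 - friction).add_(-lr * grad + noise)
        param.add_(momentum)
    return param, momentum
```

In practice, this update would be applied to every weight tensor of G or D after back-propagating the corresponding loss, and the resulting parameter snapshots are appended to the posterior sample sets as in Algorithm 1.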
Our approach constitutes an approximate Bayesian treatment of neural network weights via SGHMC. The resulting predictive distributions provide uncertainty estimates, which are central in fraud detection (e.g., for thresholding or human review). While a non-Bayesian variant may yield sharper point estimates, it would lack principled uncertainty estimates. We, therefore, regard Bayesian inference as an integral design choice, rather than an optional module for ablation.
6. Conclusions and Discussion
In this paper, we introduced a novel deep generative semi-supervised approach for time series classification that leverages conditional GANs, Bayesian inference, and log-signatures to address core challenges in financial fraud detection: irregularly sampled data of varying length, limited labeled samples and the need for probabilistic predictions with uncertainty quantification. Log-signatures provide a principled way to encode transaction histories of variable length, enabling robust learning where other sequence models struggle.
To provide a comprehensive performance assessment, we combined global statistical metrics (macro F1, PR-AUC, cross-entropy loss) with domain-specific head metrics (Precision@K, Recall@K, partial PR-AUC, Expected Cost@K). This dual evaluation framework reflects both overall statistical performance and real-world business impact, where only a small fraction of transactions can be reviewed. Our empirical evaluation on the BankSim dataset demonstrated that our proposed approach outperforms established baselines in the low-data regime, with particularly strong gains in cost-based performance at realistic operational settings, achieving up to 44% lower Expected Cost@K than strong statistical performers such as random forest in typical operational regimes. While fully supervised neural networks close the performance gap as more labeled samples become available, our approach maintains a clear advantage when labeled data are scarce.
Another key contribution lies in uncertainty quantification. By placing a distribution over network weights, our Bayesian framework produces predictive distributions rather than point estimates, allowing calibrated confidence intervals. Misclassified samples were shown to exhibit consistently higher uncertainty, highlighting the importance of uncertainty-aware predictions in high-risk domains where wrong decisions carry substantial cost. In addition, we evaluated subgroup performance by age and gender, finding that observed performance differences largely reflect prevalence and sample size rather than systematic bias introduced by the model. Relative model rankings remained stable across subgroups, suggesting that no group is systematically disadvantaged.
Nonetheless, limitations remain. Our approach relies on transaction histories, making predictions for new or low-activity customers challenging. Moreover, fraud detection assumes that fraud breaks behavioral patterns, which may fail in cases of repeated fraud resembling past activity. Finally, our evaluation relies on the BankSim dataset. While widely used in fraud detection for its realism, it remains synthetic, and future work should extend validation to real-world datasets as access becomes possible. As with most data-driven systems, model generalization depends on representative training data and evolving fraud tactics will require continuous retraining.
Several avenues for future research extend naturally from this work. First, extending evaluation to real-world datasets and in production environments offers the opportunity to confirm findings under operational constraints. Second, incorporating additional data modalities, such as customer demographics, merchant information, or network structures, could enhance the detection of rare or adaptive fraud cases. Third, fairness and interpretability remain important directions: while subgroup analyses suggest that performance differences primarily reflect data prevalence, future work should explore fairness-aware training objectives and explainable prediction mechanisms to strengthen trust in deployment. Finally, applying the approach to other domains, such as healthcare, is a natural next step to test the generalization of our approach to broader time series classification tasks beyond fraud detection.
Overall, our findings demonstrate that semi-supervised Bayesian generative models, combined with log-signatures for temporal feature encoding, can effectively handle variable-length, irregularly sampled sequences while providing robust, uncertainty-aware decision support. This synergy offers tangible benefits for fraud detection and broader time series classification tasks.