1. Introduction
With the rapid development of blockchain technologies and the expanding crypto-finance ecosystem, the cryptocurrency market has become an indispensable component of the global financial system [1,2]. The number of global crypto-asset participants has reached hundreds of millions, and trading activity exhibits increasingly high-frequency, complex, and cross-chain characteristics [3]. Unlike traditional financial markets, crypto-asset transactions are decentralized, anonymous, and programmable. While these attributes enhance transactional efficiency and financial innovation, they also provide covert channels for a wide spectrum of anomalous behaviors. To systematically address these risks, crypto-asset anomalies can be formally categorized into three distinct dimensions: Illicit Transaction Flows (e.g., money-laundering layering, dark-web payments, and terrorist financing) [4], Structural Behavioral Frauds (e.g., Ponzi schemes, phishing, and DeFi rug pulls), and Market Manipulation Patterns (e.g., wash trading and pump-and-dump schemes). Furthermore, the rise of the DeFi and NFT (non-fungible token) ecosystems has introduced cross-domain complexity, blurring the boundaries between these categories and posing substantial challenges to global financial regulation [5]. Under these circumstances, establishing an efficient, accurate, and real-time anomaly detection system [6,7] capable of identifying these multi-dimensional risks is of significant importance for maintaining market integrity and supporting anti-money laundering (AML), counter-terrorist financing (CFT), and financial risk control [8].
Early financial anomaly detection methods relied primarily on rule matching and statistical analysis [9]. Typical approaches involve expert-crafted rules such as abnormal trading frequency, sudden amount fluctuations, or frequent transfers between specific accounts [10]. These methods are intuitive and interpretable but depend heavily on expert experience and historical patterns, making them ineffective against emerging fraud strategies and complex relational behaviors [11]. Furthermore, rule updates usually lag behind market evolution, resulting in high false-negative rates and low recall [12]. Statistical modeling methods identify anomalies through probabilistic distributions or distance metrics, such as Gaussian mixture models (GMMs), the Mahalanobis distance, or Isolation Forest [13]. These approaches assume that normal transactions follow a stable distribution and that anomalies deviate from it in probabilistic terms [14]. However, crypto-asset markets exhibit strong non-stationarity and heavy-tailed distributions, with trading behavior heavily influenced by external shocks such as crashes, regulatory shifts, and cyberattacks. Consequently, conventional statistical methods often fail in dynamic environments [15].
With the broader adoption of machine learning in financial risk control [16], supervised and semi-supervised learning models have been introduced for anomaly detection [17]. Common models include support vector machines (SVMs), random forests (RFs), XGBoost, and ensemble-based methods [18,19]. Although these models can achieve reasonable performance when supported by sufficient feature engineering, they face two major limitations: (1) fraudulent transaction samples are extremely scarce and severely imbalanced, causing overfitting and a strong bias toward the majority class [20]; and (2) these models rely on manually designed low-dimensional features and thus cannot capture the complex nonlinear relationships between transaction graphs and temporal dynamics [21]. Consequently, their generalization capability and real-time performance are limited in highly dynamic and evolving blockchain environments [22]. To overcome these limitations in modeling high-dimensional complex data, deep learning techniques have increasingly become the mainstream solution for financial anomaly detection [23]. Yu et al. [24] proposed a GAN-based real-time transactional anomaly detection framework achieving high accuracy at low latency. James Uche et al. [25] integrated explainable AI with generative models for real-time fraud monitoring, improving robustness under adversarial perturbations. Dixit et al. [26] combined advanced generative models with temporal attention, integrating WGAN-GP, feature preservation, and adaptive thresholding to enhance detection performance while maintaining millisecond-level latency. Qu et al. [27] introduced MFGAN, a multimodal anomaly detection framework combining attention-enhanced autoencoders (AEs) and GANs, yielding a notable improvement in F1-score on real industrial sensor data. Chen et al. [28] proposed a multimodal anomaly detection method fusing time-domain and frequency-domain features, achieving high precision and F1-score in regional power grid monitoring. Moreover, hybrid architectures integrating Transformers with other deep learning branches have shown great potential in complex classification tasks; for instance, recent work has proposed a feature cross-layer interaction method based on Res2Net and Transformers to effectively extract and fuse complementary feature information [29]. However, several fundamental challenges remain: the scarcity and imbalance of fraudulent transaction samples, which limits deep model training; the difficulty of multimodal data fusion, owing to the heterogeneity and asynchrony between on-chain structural features and off-chain price dynamics; and interpretability and scalability constraints, as black-box deep models pose challenges for compliance auditing and regulatory adoption.
To address these issues, a multimodal real-time anomaly detection framework is proposed, referred to as the Real-time Multi-modal Anomaly Detection Framework for Crypto-assets (RMAD-Crypto). Specifically, the main innovations of this work include:
An integrated generation–detection mechanism: A GAN-based fraudulent sample generator is introduced to synthesize high-fidelity and diverse fraudulent transactions, mitigating data imbalance and overfitting; the discriminator further assists in anomaly confidence estimation during detection;
Multi-domain latent distribution modeling: A VAE-based feature encoding network is designed to map on-chain structures, behavioral patterns, and price dynamics into a unified latent space, where anomalies are quantified through reconstruction error and latent density estimation;
Cross-modal temporal detection: A dual-branch detection module combining Transformer prediction and online clustering is developed; the Transformer branch captures long-range, cross-modal dependencies, while the clustering branch performs real-time deviation detection, and their outputs are fused for robust anomaly assessment;
Real-time and scalable architecture: The framework supports streaming input and online updates, and its modular design enables deployment across multiple blockchains such as BTC, ETH, and BSC;
Empirical performance improvement: Experiments on real-world crypto-asset datasets demonstrate that recall improves by approximately 15% compared with traditional models, while maintaining millisecond-level latency.
4. Results and Discussion
4.1. Experiments Details
4.1.1. Evaluation Metrics
A comprehensive evaluation of the model performance is conducted using a diverse set of metrics to assess both the anomaly detection capabilities and the quality of the generative data augmentation. To measure the effectiveness of the detection framework, we employ Precision, Recall, the balanced F1-score, the area under the curve (AUC), the false positive rate (FPR), and detection latency. These metrics collectively reflect the accuracy, robustness, and real-time responsiveness of the model in the cryptocurrency anomaly detection task. The mathematical definitions for the detection metrics are provided as follows:

$$\mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},\qquad F1=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

$$\mathrm{FPR}=\frac{FP}{FP+TN},\qquad \mathrm{Latency}=\frac{1}{N}\sum_{i=1}^{N}\left(t_i^{\mathrm{det}}-t_i^{\mathrm{occ}}\right)$$

In these definitions, TP denotes the number of anomalous samples correctly identified as anomalous, while FP represents normal samples incorrectly classified as anomalous. The term FN indicates anomalous samples that are not detected, and TN represents normal samples correctly recognized. The quantity TPR denotes the true positive rate, $t_i^{\mathrm{det}}$ is the detected time of the i-th anomalous event, and $t_i^{\mathrm{occ}}$ is its actual occurrence time, with N representing the total number of anomalous events.
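For concreteness, these confusion-matrix definitions can be sketched in a few lines of Python (an illustrative implementation, not the code used in the experiments):

```python
# Illustrative detection-metric computation from confusion counts.

def detection_metrics(tp, fp, fn, tn):
    """Compute Precision, Recall, F1, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

def mean_detection_latency(detected_times, actual_times):
    """Average delay between each anomaly's occurrence and its detection."""
    n = len(detected_times)
    return sum(d - a for d, a in zip(detected_times, actual_times)) / n

# Example: 70 detected anomalies, 30 missed, 20 false alarms, 880 true negatives.
m = detection_metrics(tp=70, fp=20, fn=30, tn=880)
```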
To further evaluate the quality and diversity of the fraudulent samples synthesized by the GAN module, we utilize the Inception Score (IS) and the Fréchet Inception Distance (FID). The Inception Score is defined as:

$$\mathrm{IS}=\exp\!\left(\mathbb{E}_{x\sim p_g}\left[D_{\mathrm{KL}}\big(p(y\mid x)\,\|\,p(y)\big)\right]\right)$$

where $p_g$ represents the distribution of generated samples, $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, and $p(y\mid x)$ is the conditional class distribution predicted by a pre-trained classifier. The Fréchet Inception Distance is calculated by:

$$\mathrm{FID}=\left\|\mu_r-\mu_g\right\|^2+\mathrm{Tr}\left(\Sigma_r+\Sigma_g-2\left(\Sigma_r\Sigma_g\right)^{1/2}\right)$$

where $(\mu_r,\Sigma_r)$ and $(\mu_g,\Sigma_g)$ denote the mean vectors and covariance matrices of the feature representations for real and generated fraudulent samples, respectively.
The selection of IS and FID as evaluation metrics is critical for validating the proposed generative augmentation strategy. The Inception Score is employed to measure both the clarity and diversity of the generated transactions; a higher IS indicates that the generator produces samples that can be confidently classified as fraudulent while covering a diverse range of fraud patterns, thereby preventing mode collapse. Complementarily, the Fréchet Inception Distance quantifies the distributional discrepancy between real and synthetic fraudulent transactions in the high-dimensional feature space. A lower FID signifies that the generated samples possess statistical and structural properties, such as transaction graph topology and temporal volatility, that closely match real-world anomalies. This ensures that the downstream detector is trained on data that accurately reflects the manifold of true cryptocurrency fraud rather than unrealistic noise.
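As an illustration, both metrics can be sketched under simplifying assumptions: the class probabilities p(y|x) are supplied as a precomputed array, and the covariances are taken to be diagonal so the matrix square root in the FID reduces to elementwise square roots (the general case requires a full matrix square root):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_samples, n_classes) rows of p(y|x).
    IS = exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID restricted to diagonal covariances:
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g))."""
    return float(((mu_r - mu_g) ** 2).sum()
                 + (var_r + var_g - 2.0 * np.sqrt(var_r * var_g)).sum())
```

Identical real and generated statistics give an FID of zero, and a classifier that is maximally uncertain on every sample gives an IS of one, matching the interpretation above that higher IS and lower FID indicate better generation.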
4.1.2. Experimental Settings
The experimental environment consisted of a server equipped with dual Intel Xeon Gold 6348 CPUs and four NVIDIA A100 GPUs, each providing 80 GB of HBM2e memory to accelerate deep neural network training and inference. Regarding the software environment, Ubuntu 22.04 LTS served as the operating system, and the deep learning framework PyTorch 2.1 was used in combination with CUDA 12.2 and cuDNN 8.9 to fully exploit GPU acceleration. Regarding hyperparameter settings, the dataset was partitioned into training, validation, and testing subsets to maintain a balanced and representative evaluation. In the GAN module, the generator and discriminator were trained with separate learning rates and a batch size of 64, using the Adam optimizer with momentum parameters $\beta_1$ and $\beta_2$.
4.1.3. Baseline Methods
In the experimental design of this study, three categories of representative models were selected as baseline methods: classical machine learning models, SVM [50] and Random Forest [51]; single-modal deep models, LSTM [52], GCN [53], and Ensemble-GNN [54]; and an existing multimodal cryptocurrency anomaly detection model, MDST-GNN [55].
4.2. Comparison with Baseline Methods
This experiment systematically evaluates the proposed framework against representative baselines, particularly under conditions where fraudulent transactions are extremely scarce. Its objective is to determine whether adversarial generation can alleviate class imbalance, expand the support of rare abnormal patterns, and enhance the detector's ability to recognize complex fraudulent behaviors. To ensure a fair and comprehensive comparison, all methods are trained under identical data splits, random seeds, and optimization settings, and performance is reported using Precision, Recall, F1-score, AUC, and FPR.
The results are presented in Table 2, and the ROC curves are shown in Figure 4. Classical machine learning models exhibit limited detection capabilities: SVM suffers from low recall due to its inability to capture the nonlinear structure of high-dimensional multimodal features, while Random Forest achieves moderate precision but lacks the capacity to model temporal or structural dependencies, yielding an AUC of only 0.742. However, these lightweight models demonstrate the lowest inference latencies (5.2 ms and 6.1 ms) owing to their low computational complexity. Deep models show improved detection performance but at the cost of increased computation time: LSTM benefits from sequence modeling to outperform traditional models in Recall, and GCN effectively captures structural properties, with latencies rising to 12.4 ms and 10.7 ms, respectively. The multimodal MDST-GNN model achieves an AUC of 0.812 by jointly representing graph structure and market dynamics, which pushes its latency to 15.9 ms. Notably, the state-of-the-art Ensemble-GNN demonstrates strong competitiveness, achieving an F1-score of 0.661 and an AUC of 0.842, validating the effectiveness of integrating diverse graph architectures (GCN, GAT, GIN) to capture complex topological patterns. However, this performance gain comes with a significant computational penalty: the ensemble voting mechanism across multiple subnetworks results in the highest latency of 22.8 ms, potentially hindering deployment in high-frequency trading environments. In contrast, our proposed method achieves the highest detection performance, with a Recall of 0.703 and an AUC of 0.889. While the dual-branch architecture and adaptive fusion mechanism incur a latency of 19.2 ms, this is significantly more efficient than the heavy Ensemble-GNN baseline and remains well within the millisecond-level requirement for real-time financial risk control.
The results confirm that our framework strikes the optimal balance between security coverage and response speed, delivering superior accuracy without the prohibitive computational cost of ensemble approaches.
4.3. Sample Generation and Data Augmentation Analysis
The objective of this experiment is to evaluate the effectiveness of generative data augmentation in cryptocurrency anomaly detection, with particular focus on whether the issue of class imbalance—caused by the scarcity of fraudulent samples—can be alleviated through the introduction of high-fidelity synthetic data. The experiment begins with a baseline GAN and progressively incorporates feature-consistency constraints, multi-domain joint training, and the final optimized design proposed in this study. The influence of these enhancements is assessed using both generative quality metrics (FID, IS) and downstream detection metrics (recall, F1-score).
As shown in Table 3, the progressive enhancement of the generative model results in a monotonic decrease in FID and a continuous increase in IS, indicating that the fidelity and diversity of synthetic samples are notably improved. In terms of detection performance, both Recall and F1-score are significantly lower when no GAN is applied, whereas introducing a basic GAN yields considerable gains, demonstrating that synthetic samples effectively supplement the limited fraudulent data. After incorporating feature-consistency constraints, additional improvements are observed due to better alignment of contextual and structural properties between synthetic and real fraudulent behaviors. When multi-domain joint training is introduced, the GAN becomes capable of learning cross-modal behavioral patterns, producing synthetic samples that more naturally reflect transaction structures, price dynamics, and deviation patterns. The final optimized GAN-enhanced model achieves the best overall performance, suggesting that the generated samples closely approximate real fraudulent patterns and substantially strengthen detector learning.
From a theoretical standpoint, a basic GAN expands the support of the minority fraudulent class by approximating its marginal distribution through adversarial learning, thereby improving Recall. However, purely vector-level or surface-level generation fails to preserve the deeper structural properties of fraudulent transactions, leading to pronounced discrepancies on high-dimensional manifolds and correspondingly high FID values. Introducing feature-consistency constraints requires the generator to match not only statistical appearance but also graph-level structures, interaction patterns, and temporal volatility signatures, thereby aligning synthetic samples with real fraudulent semantics and enabling clearer decision boundaries for the detector. Multi-domain joint training further enhances this effect by explicitly injecting complementary correlations across the temporal, structural, and frequency domains, enabling the generator to cover a wider range of fraudulent modes. The final GAN design stabilizes training dynamics, feature mapping, and cross-modal representations, allowing the synthetic data distribution to closely approximate the true anomaly manifold and thus achieving optimal Recall and F1-score.
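The feature-consistency idea can be illustrated with a hypothetical generator objective: an adversarial term plus a penalty for mismatched feature statistics between real and synthetic batches. The weight `lam` and the simple mean-matching penalty are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def generator_loss(d_scores_fake, feats_real, feats_fake, lam=0.5):
    """Non-saturating adversarial loss plus a feature-matching penalty.

    d_scores_fake: discriminator outputs in (0, 1] for synthetic samples.
    feats_real / feats_fake: (batch, dim) feature representations.
    lam: assumed weight on the feature-consistency term.
    """
    eps = 1e-12
    adv = -np.log(d_scores_fake + eps).mean()          # reward fooling D
    feat_match = ((feats_real.mean(axis=0)
                   - feats_fake.mean(axis=0)) ** 2).sum()
    return float(adv + lam * feat_match)
```

When synthetic features match real feature statistics and the discriminator is fully fooled, the loss approaches zero; diverging feature statistics raise the loss even if the discriminator is deceived, which is exactly the alignment pressure described above.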
4.4. Anomaly Distribution Modeling Analysis
This experiment aims to assess the impact of different anomaly distribution modeling techniques on detecting anomalous cryptocurrency transactions, with emphasis on their ability to characterize abnormal behaviors in high-dimensional complex data. The evaluation includes traditional one-class classification methods, reconstruction-based deep generative models, and enhanced variational methods that incorporate synthetic data. The core objective is to determine whether the proposed multi-domain VAE can more accurately learn the latent distribution of normal behaviors and produce more stable anomaly scores.
As shown in Table 4, One-Class SVM produces the weakest results across all metrics, particularly Recall, demonstrating its inability to effectively capture the true manifold of normal transactions in high-dimensional nonlinear spaces. The AE improves upon One-Class SVM through reconstruction-based learning, yet its latent representation lacks distributional constraints, leading to unstable anomaly boundaries. The VAE achieves notable improvements in Recall, AUC, and FPR by modeling latent distributions through learnable mean and variance parameters, increasing sensitivity to rare anomalous deviations. When enhanced with GAN-generated fraudulent samples, the VAE exhibits further performance gains due to expanded coverage of minority-class regions. The proposed multi-domain VAE achieves the best performance across all metrics, significantly outperforming all alternatives.
4.5. Parameter Sensitivity Analysis
To systematically evaluate the robustness of the proposed framework and determine the optimal hyperparameter configuration, we conducted a sensitivity analysis targeting three critical components: the feature-consistency weight in the GAN objective (Equation (13)), the bandwidth h of the Kernel Density Estimator (KDE), and the fusion balancing coefficient (Equation (16)). The experiments were performed by varying one parameter within a specified range while keeping the others fixed at their default settings, using the F1-score as the primary evaluation metric. For the GAN feature-consistency weight, we explored a range of values to assess the trade-off between adversarial deception and feature matching. For the KDE bandwidth h, which controls the smoothness of the latent density estimation, a comparable sweep was conducted. Finally, the fusion weighting parameter, which balances reconstruction error against probabilistic rarity, was varied over its full range with a fixed step size.
The quantitative results presented in Table 5 reveal distinct performance patterns driven by the underlying theoretical properties of each parameter. First, regarding the feature-consistency weight, performance peaks at an intermediate value: overly small weights fail to enforce sufficient structural constraints, leading to invalid graph topologies, while excessive regularization causes the generator to over-fit statistical moments, reducing sample diversity. Second, the KDE bandwidth exhibits a classic bias–variance trade-off: a narrow bandwidth overfits to noise (a peaked distribution), whereas a wide bandwidth over-smooths the density, masking true anomalies; an intermediate bandwidth effectively captures the manifold geometry. Lastly, the fusion weight achieves its maximum at an intermediate setting, indicating that while structural reconstruction error is the dominant indicator of fraud, the latent probability density term provides critical complementary information about sample rarity. Relying solely on either reconstruction or density results in suboptimal detection, confirming the necessity of the dual-metric scoring mechanism.
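A minimal sketch of such a dual-metric score, using a one-dimensional Gaussian KDE for the density term (the `alpha` and `bandwidth` values stand in for the fusion weight and KDE bandwidth h swept above, and are illustrative rather than the tuned optima):

```python
import numpy as np

def kde_log_density(z, z_normal, bandwidth=0.5):
    """1-D Gaussian KDE log-density of point z, fitted on normal-sample latents."""
    d = (z - z_normal) / bandwidth
    kernels = np.exp(-0.5 * d ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return float(np.log(kernels.mean() + 1e-12))

def fused_score(recon_err, z, z_normal, alpha=0.6, bandwidth=0.5):
    """Weighted combination of reconstruction error and probabilistic rarity."""
    rarity = -kde_log_density(z, z_normal, bandwidth)
    return alpha * recon_err + (1.0 - alpha) * rarity
```

A latent point far from the mass of normal samples receives a higher rarity term and thus a higher fused score; narrowing the bandwidth sharpens the density around observed points, reproducing the bias–variance behavior discussed above.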
4.6. Multimodal Fusion and Real-Time Detection Performance
The objective of this experiment is to evaluate the effectiveness of multimodal fusion and the dual-branch detection architecture for identifying anomalous behaviors in cryptocurrency transactions, with emphasis on the complementary strengths of Transformer-based temporal modeling and clustering-based structural modeling in terms of both performance and real-time responsiveness. The experiment begins with single-branch models and progressively incorporates dual-branch structures and different fusion strategies to observe trends in detection performance, latency, and false positive rate (FPR).
As shown in Table 6, the Transformer-only model yields higher AUC and F1-score, but its long-range attention operations introduce substantial latency. The clustering-only branch achieves the lowest latency due to its lightweight distance-based computation, but its inability to capture long-term temporal dependencies results in a weaker F1-score. When combining the two branches without fusion, slight performance gains are observed, but the absence of an integration mechanism limits the utilization of complementary information. Fixed-weight fusion further improves performance but lacks adaptability, making it unstable under varying market conditions. The proposed adaptive risk fusion mechanism achieves the best performance across all metrics, reducing FPR while increasing detection accuracy, indicating that dynamically adjusting the importance of each branch produces more reliable anomaly judgments.
As shown in Figure 5, the Transformer branch models long-range temporal dependencies, and its global attention mechanism captures cumulative temporal deviations characteristic of manipulation-related or long-horizon anomalous behaviors. However, its computational complexity grows quadratically with sequence length, leading to higher inference latency. The clustering branch models density-based deviations in latent space, detecting isolated structural outliers efficiently through distance-to-center measurements. It is more sensitive to short-term jumps or abrupt behavioral changes and excels in real-time responsiveness, yet cannot fully capture cross-domain or cross-temporal composite anomalies. The proposed adaptive fusion method is mathematically equivalent to learning a nonlinear risk mapping in semantic space, enabling the model to autonomously emphasize the more reliable modality—assigning greater weight to the Transformer during volatile intervals and enhancing clustering constraints in stable regions. As the fused representation aligns more closely with the joint distribution of high-dimensional anomaly patterns, improvements in AUC, F1-score, and FPR reflect stronger theoretical separability of anomalous behaviors, ensuring robust and real-time detection in practical trading environments.
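The volatility-dependent weighting described above can be illustrated with a simple gated combination of the two branch scores; the sigmoid gate, its steepness `k`, and the 0.5 volatility midpoint are assumptions for illustration, not the paper's learned fusion:

```python
import numpy as np

def adaptive_fusion(score_transformer, score_cluster, volatility, k=5.0):
    """Blend branch scores with a gate that favors the Transformer branch
    when recent market volatility is high, and the clustering branch when
    conditions are stable. volatility is assumed normalized to [0, 1]."""
    w_t = 1.0 / (1.0 + np.exp(-k * (volatility - 0.5)))  # sigmoid gate in (0, 1)
    return w_t * score_transformer + (1.0 - w_t) * score_cluster
```

With volatility near 1 the fused score tracks the Transformer branch; near 0 it tracks the clustering branch, mirroring the adaptive reweighting behavior reported for the full mechanism.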
4.7. Ablation Studies
This experiment aims to systematically validate the contributions of individual components within the overall framework, clarifying each module’s role, sources of performance improvement, and collaborative interactions. Key modules are removed one by one from the full model, and the resulting changes in precision, recall, F1-score, and AUC are analyzed to reveal differences in data distribution modeling, feature-space representation, and anomaly characterization.
As shown in Table 7, the full model achieves the highest performance across all metrics, demonstrating the complementary benefits of generative augmentation, distribution modeling, multimodal temporal reasoning, and risk fusion. Removing the GAN module significantly reduces recall, indicating its critical role in expanding abnormal pattern coverage. Excluding the multi-domain + VAE module weakens anomaly distribution modeling, resulting in notable degradation across all metrics. Removing the Transformer branch reduces the ability to identify long-term or manipulation-related anomalies, while removing the clustering branch weakens sensitivity to short-term deviations. Without the fusion mechanism, the model can no longer leverage the complementary strengths of the two branches, resulting in inferior performance compared to the complete design.
The theoretical distinctions among these modules stem from their different modeling assumptions in high-dimensional behavior space, as shown in Figure 6. The GAN module expands the minority-class support region by approximating fraudulent marginal distributions, and its removal directly reduces the recall rate due to diminished boundary coverage. The multi-domain + VAE module constructs a smooth, continuous, high-density latent manifold for normal samples, while anomalies occupy low-density regions; removing this module disrupts the density-based discrimination mechanism, making anomalies harder to distinguish. The Transformer branch provides long-horizon temporal modeling, and its removal eliminates sensitivity to gradual deviations or multi-step manipulations. The clustering branch specializes in detecting localized structural outliers, and its removal impairs the detection of abrupt behavioral shifts. The fusion mechanism mathematically enables a nonlinear combination of temporal and structural modalities, allowing dynamic reweighting based on reliability; its removal breaks this adaptive balance, leading to systematic degradation of detection performance. These results confirm that the superior performance of the complete model arises from the coordinated interaction of multiple components across high-dimensional, multi-domain, and multi-scale representation spaces, and the absence of any single module disrupts this synergy, leading to consistent declines in detection accuracy.
4.8. Discussion
4.8.1. Convergence Diagnostics and Generative Stability Analysis
To ensure the reliability of the generative augmentation module, we conducted rigorous diagnostics on the training convergence and stability of the GAN architecture. The training dynamics were monitored by tracking the adversarial loss trajectories of both the generator and the discriminator. While initial epochs exhibited characteristic oscillations inherent to the min-max adversarial game, the losses eventually settled into a stable Nash equilibrium, indicating that the generator effectively learned to approximate the target distribution without divergence. Crucially, the feature-consistency loss demonstrated a monotonic decrease throughout the training process, confirming that the generator successfully internalized the structural constraints of the transaction graph and the statistical properties of the price sequences, rather than merely memorizing surface-level noise. Furthermore, we explicitly addressed the risk of mode collapse, a prevalent challenge in financial data synthesis where models may default to generating a single repetitive fraud pattern. Quantitative analysis using the Inception Score (IS) yielded consistently high values, indicating that the synthesized samples maintain significant diversity. Visual inspection of the latent space distribution via t-SNE projections further verified that the generated pseudo-fraudulent samples formed multiple distinct clusters. These clusters effectively covered the heterogeneous behavioral modes of real-world money laundering and market manipulation—such as varying subgraph topologies and temporal volatility signatures—rather than collapsing into a single trivial mode. Finally, potential discriminator overfitting was scrutinized to prevent the "over-optimization" trap, in which the discriminator dominates the game by memorizing training examples, causing the generator's gradients to vanish.
We continuously monitored the discriminator’s accuracy on a held-out validation set and observed that the performance gap between training and validation remained within a narrow bound. This generalization capability is attributed to the spectral normalization and noise injection mechanisms implemented in our architecture, which effectively regularized the network. Consequently, the discriminator provided meaningful and non-vanishing gradients throughout the training lifecycle, sustaining a healthy and robust adversarial learning signal.
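Spectral normalization, mentioned above as one of the discriminator regularizers, can be sketched with the standard power-iteration estimate of a weight matrix's largest singular value (an illustrative numpy version of a technique usually applied per layer inside the network):

```python
import numpy as np

def spectral_normalize(W, n_iter=20, eps=1e-12):
    """Divide W by an estimate of its top singular value, obtained with
    a few power-iteration steps, so the normalized matrix has spectral
    norm approximately 1 (the Lipschitz constraint used to regularize
    GAN discriminators)."""
    u = np.ones(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= (np.linalg.norm(v) + eps)
        u = W @ v
        u /= (np.linalg.norm(u) + eps)
    sigma = u @ W @ v            # estimated largest singular value
    return W / (sigma + eps)
```

Bounding every layer's spectral norm this way keeps the discriminator's gradients from exploding or collapsing, which is what sustains the "healthy adversarial signal" described above.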
4.8.2. Applicability in Real-World Cryptocurrency Scenarios
The real-time anomalous transaction detection framework proposed in this study demonstrates strong applicability and practical value across multiple representative scenarios in cryptocurrency markets. In centralized exchange risk control systems, platforms are required to identify suspicious behaviors within milliseconds, such as rapid inflows of numerous small transfers into a single target address, high-frequency arbitrage activities executed by coordinated bot clusters within narrow time windows, or price manipulation attempts conducted by repeatedly placing and canceling orders to influence market depth. Traditional threshold-based or offline analytical methods struggle to capture these patterns in time. By contrast, the proposed model, benefiting from multimodal feature fusion and temporal deviation modeling, is capable of detecting departures from normal behavioral patterns as soon as the transaction occurs, enabling exchanges to take immediate mitigation measures such as freezing accounts, suspending trading pairs, or initiating KYC (know-your-customer) verification.
In decentralized finance environments, smart contract platforms lack manual auditing mechanisms, and adversaries frequently launch complex attacks via flash loans. Such attacks typically involve multi-contract chained operations that manipulate liquidity pool prices, followed by rapid arbitrage or asset theft. These behaviors span multiple nodes in the transaction graph and leave only subtle anomalies in market price sequences. The proposed framework simultaneously analyzes both on-chain graph structures and temporal price dynamics, allowing the system to detect irregular fund flows that deviate from conventional paths or extreme short-term price shifts in the flash-loan attack chain, thereby enabling earlier activation of risk control responses.
In anti-money laundering monitoring tasks, illicit actors often rely on multi-hop transfers, structuring, and mixing techniques to obscure source identities and construct pseudo-normal behavioral patterns through cross-address transactions. The proposed generative adversarial augmentation module enables the simulation of diverse money-laundering trajectories during training, improving the model’s sensitivity to complex fund-flow patterns. Meanwhile, the clustering branch identifies behavioral clusters in latent space whose distance patterns deviate significantly from those of normal users; even when individual transactions appear benign, deviations emerge clearly at the level of behavioral sequences and transaction-network structures. Furthermore, in cross-chain surveillance tasks conducted by regulatory authorities, structural discrepancies across blockchains make certain anomalous patterns undetectable within a single chain. However, when cross-chain price trends and multi-chain transaction graphs are jointly modeled, abnormalities manifest in the form of cross-domain signatures. The proposed multi-domain joint modeling mechanism is designed precisely for such scenarios, extracting stable representations from temporal, structural, and frequency domains, causing cross-chain anomalies to exhibit stronger consistency in the latent space and providing regulators with more precise risk indicators.
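The clustering branch's streaming deviation check invoked in these scenarios can be illustrated with a hypothetical running-centroid detector; the class name, learning rate, and distance threshold are all illustrative choices, not the system's actual parameters:

```python
import numpy as np

class OnlineDeviationDetector:
    """Maintain a streaming centroid of normal behavior embeddings and
    flag transactions whose latent representation drifts too far from it."""

    def __init__(self, dim, lr=0.05, threshold=3.0):
        self.center = np.zeros(dim)
        self.lr = lr
        self.threshold = threshold

    def score(self, x):
        """Distance of embedding x to the current normality center."""
        return float(np.linalg.norm(x - self.center))

    def update(self, x):
        """Adapt the centroid on benign-looking samples only, then report
        whether x is anomalous. Returns True when x exceeds the threshold."""
        if self.score(x) <= self.threshold:
            self.center += self.lr * (x - self.center)
        return self.score(x) > self.threshold
```

Because the centroid only adapts to samples inside the threshold, individually benign transactions keep the normality model current, while sequences that drift away from typical behavior accumulate distance and trip the alarm, matching the behavioral-sequence detection described above.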
Crucially, to bridge the gap between algorithmic detection and regulatory enforcement, the framework provides an interpretable decision-making process aligned with global Anti-Money Laundering (AML) and Counter-Terrorist Financing (CFT) standards. By analyzing the attention weights assigned to specific transaction subgraphs and temporal frequency bands, the model generates granular evidence explaining why a transaction is flagged. This “white-box” transparency allows compliance officers to trace the specific risk factors—such as sudden structural divergence or cyclical laundering patterns—facilitating the efficient filing of Suspicious Activity Reports (SARs). Consequently, the system supports the Risk-Based Approach (RBA) recommended by the Financial Action Task Force (FATF), ensuring that automated alerts are not only accurate but also auditable and legally actionable.
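The attention-to-evidence mapping described above can be sketched as follows. `top_risk_evidence` is a hypothetical helper, assuming access to final-layer self-attention weights of shape (heads, seq, seq); it ranks positions in a transaction window by the total attention they receive, yielding the kind of granular evidence a compliance officer could attach to a Suspicious Activity Report:

```python
import numpy as np

def top_risk_evidence(attn, labels, k=3):
    """Rank window positions (e.g., individual transactions) by received attention.

    attn: (heads, seq, seq) self-attention weights from the final layer.
    labels: human-readable identifiers for each position.
    Returns the k positions the model attended to most when scoring the window.
    """
    received = attn.mean(axis=0).sum(axis=0)   # average over heads, total attention per position
    order = np.argsort(received)[::-1][:k]
    return [(labels[i], float(received[i])) for i in order]
```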
4.8.3. Module Synergy and Computational Efficiency Analysis
To further elucidate the internal logic of the framework, it is necessary to examine the holistic coordination among its generative, representational, and detection components. The architecture operates as a tightly coupled four-stage pipeline rather than a loose collection of models. At the foundational level, the GAN-based module functions primarily during the training phase, utilizing feature-consistency constraints to synthesize high-fidelity fraudulent samples; this effectively corrects the class imbalance in the feature space before any detection occurs, ensuring that subsequent modules are not biased toward the majority class. Following this, the Multi-domain VAE acts as the universal feature encoder, projecting heterogeneous on-chain graph structures and off-chain price dynamics into a unified latent manifold, thereby providing a standardized input representation and an initial anomaly score based on reconstruction probability.
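The VAE’s reconstruction-probability scoring can be illustrated with a simplified per-sample negative ELBO, assuming a Gaussian decoder and a standard-normal prior. This is a didactic sketch rather than the paper’s exact objective; `vae_anomaly_score` is a hypothetical name:

```python
import numpy as np

def vae_anomaly_score(x, x_recon, mu, logvar):
    """Per-sample anomaly score from a VAE: reconstruction error plus KL term.

    Higher scores indicate samples poorly explained by the learned
    normal-behavior manifold (Gaussian decoder assumed).
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)                          # squared reconstruction error
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=-1)   # KL(q(z|x) || N(0, I))
    return recon + kl
```

A transaction window that reconstructs cleanly and whose posterior stays close to the prior scores near zero, while structurally unusual inputs accumulate both error terms.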
The detection phase employs a multi-view strategy to avoid functional redundancy. The Multi-domain Time Series module utilizes Fast Fourier Transform and convolutional operations to extract explicit signal-level characteristics, such as periodic perturbations typical of automated bot activities. In parallel, the Transformer branch leverages self-attention mechanisms to model implicit semantic dependencies and long-term evolutionary trends, identifying logical inconsistencies in complex transaction chains. Complementing these temporal analyzers, the Online Clustering branch focuses on spatial density within the latent manifold, rapidly identifying structural outliers that deviate from local normality centers. These diverse signals—distributional, signal-based, semantic, and spatial—are finally integrated via the Adaptive Risk Fusion mechanism, which dynamically assigns weights based on the confidence of each branch, ensuring robust decision-making across varying market conditions.
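The final fusion step can be sketched minimally as a confidence-weighted average, assuming each branch reports a scalar anomaly score together with a non-negative confidence; the framework’s actual weighting scheme may differ, and `fuse_risk_scores` is an illustrative name:

```python
import numpy as np

def fuse_risk_scores(scores, confidences):
    """Adaptive fusion: confidence-weighted average of branch anomaly scores.

    scores: per-branch anomaly scores (distributional, signal, semantic, spatial).
    confidences: non-negative per-branch confidences; normalized into weights.
    """
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                      # normalize confidences into fusion weights
    return float(np.dot(w, scores))
```

Dynamically re-estimating the confidences per market regime lets a branch that is momentarily unreliable (e.g., the frequency branch during thin trading) be down-weighted without retraining.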
Regarding computational efficiency, the system design optimizes the trade-off between high-dimensional modeling and real-time responsiveness. It is important to note that the computationally intensive GAN module is restricted to the offline training phase and imposes zero overhead during online inference. The VAE encoder and the Multi-domain module (operating at log-linear complexity via FFT) are lightweight and suitable for high-throughput stream processing. The detection latency is primarily dominated by the Transformer’s self-attention mechanism, which scales quadratically with sequence length; however, by employing a sliding window strategy with a fixed localized horizon, the effective input length remains bounded, ensuring deterministic processing times. The Online Clustering branch maintains near-linear complexity using efficient distance metrics. Empirical testing reveals that the total inference latency per transaction averages approximately 19 milliseconds. This performance significantly outperforms traditional offline batch-processing pipelines, which typically exhibit latencies ranging from seconds to minutes, and remains competitive with lightweight single-modal detectors, making it well-suited for pre-confirmation risk checks and real-time AML monitoring.
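The sliding-window strategy that bounds the Transformer’s effective input length can be sketched as follows; the window and stride values are illustrative, not the deployed configuration. Because each window has fixed length, self-attention cost per window stays O(window²) regardless of total stream length:

```python
def sliding_windows(stream, window=256, stride=64):
    """Yield fixed-length windows over a transaction stream.

    Bounding the window keeps per-window self-attention cost constant,
    giving deterministic inference time on unbounded streams. Streams
    shorter than one window are yielded whole.
    """
    for start in range(0, max(len(stream) - window + 1, 1), stride):
        yield stream[start:start + window]
```

In streaming deployment the newest events always fall inside the most recent window, so latency per decision depends only on the window size, not on history length.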
4.9. Limitations and Future Work
Although the proposed multimodal real-time anomaly detection framework demonstrates strong detection capability and robustness in both experimental evaluations and practical deployment scenarios, several limitations remain. The model relies on on-chain transaction structures, price sequences, and multiple external data sources; while these modalities enhance detection accuracy, overall performance may degrade when some of them are missing or delayed under extreme market conditions, particularly during severe network congestion or temporary outages in exchange data feeds. Future research will therefore focus on robustness to missing modalities, as well as on scalability and cross-ecosystem adaptability. Moreover, incorporating self-supervised learning and causal modeling techniques may allow the system to autonomously identify anomaly-driving factors in partially or fully unlabeled settings, enhancing its ability to detect unknown risks and improving generalization and interpretability against emerging attack vectors. Finally, given the sensitivity of financial data, future iterations will explore privacy-preserving computation methods, such as homomorphic encryption and zero-trust architectures, which aim to enable anomaly detection on encrypted data without exposing raw sensitive information, thereby balancing rigorous risk control with user data confidentiality.