Article

Data-Driven Cross-Lingual Anomaly Detection via Self-Supervised Representation Learning

1 National School of Development, Peking University, Beijing 100871, China
2 School of International Education and Exchange, Beijing Sport University, Beijing 100084, China
3 China Agricultural University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 212; https://doi.org/10.3390/electronics15010212
Submission received: 11 December 2025 / Revised: 26 December 2025 / Accepted: 28 December 2025 / Published: 2 January 2026
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)

Abstract

Deep anomaly detection in multilingual environments remains challenging due to limited labeled data, semantic inconsistency across languages, and the unstable distribution of rare abnormal patterns. These challenges are particularly severe in low-resource scenarios (characterized by scarce labeled anomaly data and non-standardized terminology), where conventional supervised or transfer-based models suffer from semantic drift and feature mismatch. To address these limitations, a data-driven cross-lingual anomaly detection framework, LR-SSAD, is proposed. Targeting paired text and behavioral data without requiring parallel translation corpora, the framework is built upon the joint optimization of complementary self-supervised objectives. A cross-lingual masked prediction module is designed to capture language-invariant semantic structures to align semantic spaces, while a Mamba-based sequence reconstruction module leverages its linear computational complexity ($O(N)$) to efficiently model long-range dependencies in transaction histories, overcoming the computational bottlenecks of quadratic attention mechanisms. To further enhance robustness under noisy supervision, a noise-aware pseudo-label refinement mechanism is introduced. Evaluated on a newly constructed real-world financial dataset (spanning January–June 2023) comprising 1.2 million multilingual texts and 420,000 transaction sequences, experimental results demonstrate that LR-SSAD achieves substantial improvements over state-of-the-art baselines. The model achieves an accuracy of 0.932, a precision of 0.914, a recall of 0.891, and an F1-score of 0.902, with the Area Under the Curve (AUC) reaching 0.948. The proposed framework provides a scalable and data-efficient solution for anomaly detection in real-world multilingual environments.

1. Introduction

In the era of accelerating global digital integration, the cross-lingual expression and cross-regional propagation of anomalous patterns have become increasingly prominent challenges across diverse information systems [1]. As digital platforms expand globally, an increasing number of international organizations are required to process large-scale texts and behavioral data from multilingual users [2]. Within this broad landscape, the financial domain represents a critical and high-stakes environment where the identification of high-risk activities [3] has become one of the core challenges faced by intelligent risk control systems [4]. Particularly in low-resource language scenarios—a challenge prevalent in many globalized applications but acute in finance—the scarcity of linguistic resources, the lack of standardized terminology, and the diversity of semantic expressions cause anomalous behaviors to exhibit strong concealment, unstable distributions, and weak transferability, thereby imposing more stringent requirements on existing anomaly detection techniques [5].
Traditional anomaly detection approaches primarily rely on manually designed rule-based systems or statistical modeling techniques, such as threshold-based rule detection, clustering-based outlier identification, and probability distribution-based anomaly modeling [6]. Although these methods can achieve acceptable performance on structured data with homogeneous semantics and stable features, they often exhibit unsatisfactory performance when applied to real-world environments characterized by multilinguality and complex noise, as exemplified by cross-border financial systems [7]. On the one hand, low-resource language texts inherently suffer from limited resources, low lexical coverage, and pronounced semantic drift, making accurate modeling through handcrafted rules difficult [8]. On the other hand, anomalous behaviors in practical scenarios are typically highly covert, and their deep semantic correlations cannot be effectively captured through simple statistical features [9]. Furthermore, traditional methods heavily depend on expert knowledge and lack the ability to generalize automatically as data scale increases, which renders them particularly inadequate in rapidly evolving cross-regional risk environments.
In recent years, deep learning methods have been widely applied to text analysis and anomaly detection tasks due to their strong representation learning capabilities [10]. Cross-lingual understanding techniques based on pretrained language models (PLMs), such as mBERT and XLM-R, enable partial sharing of semantic spaces across multiple languages [11]. However, these models are primarily trained on high-resource languages, resulting in insufficient coverage of low-resource expressions and leading to frequent issues such as semantic bias [12], out-of-vocabulary (OOV) words, and semantic drift when transferred to low-resource scenarios [13]. In addition, most deep learning-based anomaly detection models depend on supervised data, including labeled anomaly samples, which are difficult to obtain in many real-world settings, especially in low-resource financial domains [14]. Even when transfer learning or few-shot learning strategies are introduced, negative transfer often occurs in the presence of large domain discrepancies and cross-lingual distribution inconsistencies, causing significant performance degradation in target languages [15]. To reduce reliance on manual annotations, self-supervised learning has emerged as a key research direction in anomaly detection [16]. Through pretext tasks such as masked prediction, contrastive learning, or reconstruction learning, semantic structures and behavioral patterns can be automatically learned from unlabeled data [17]. However, most existing self-supervised methods are designed for monolingual or single-modality scenarios, and their direct application to complex multilingual environments encounters two major bottlenecks. First, self-supervised objectives often fail to adequately capture semantic shifts across low-resource languages, resulting in insufficient cross-lingual correlation modeling [18]. Second, behavioral sequences play a critical role in dynamic risk control [19], yet existing self-supervised anomaly detection models tend to focus on textual or static features and exhibit limited sensitivity to time-series dependencies, making it difficult to reflect the temporal characteristics of anomalies [20]. Consequently, neither purely text-based self-supervision nor standalone sequence reconstruction is sufficient to address the complexity of low-resource anomaly detection tasks. Sehwag et al. [21] proposed SSD, which relies solely on unlabeled in-distribution data and constructs an anomaly detector by combining self-supervised representation learning with Mahalanobis distance measurement, achieving performance comparable to or even surpassing supervised detectors on multiple benchmarks. Wu et al. [22] proposed a self-supervised anomaly detection algorithm with interpretability, in which self-supervised learning is integrated with feature selection and anomaly score stability is used as the pretext task, enabling effective identification of anomalous samples and their types while highlighting key feature combinations responsible for anomalies, thereby significantly improving both detection accuracy and interpretability. Wang et al. [23] introduced SLA2P, an unsupervised anomaly detection framework that combines random projection-based pseudo-class self-supervised training with classification uncertainty measurement under adversarial perturbations, successfully achieving effective anomaly (outlier) detection using only normal (in-distribution) data. 
SLA2P outperformed traditional unsupervised approaches across multiple image, text, and tabular datasets and achieved state-of-the-art performance on several evaluation metrics. Pospieszny et al. [24] proposed ADALog, an unsupervised log anomaly detection framework based on a self-attention masked language model, token-level reconstruction probability, and adaptive thresholds. Without relying on log parsing, templates, or anomalous samples, ADALog can be trained solely on normal logs to achieve efficient, general, and highly interpretable anomaly detection across heterogeneous logs, demonstrating superior detection performance on multiple benchmark datasets.
To address the above challenges, a unified self-supervised anomaly detection framework, termed the low-resource self-supervised anomaly detector (LR-SSAD), is proposed. Moving beyond simple system-level integration, this work conceptually advances anomaly detection by establishing a dual-view synergy between cross-lingual semantic consistency and behavioral temporal dynamics. While applicable to the broader problem of anomaly detection in low-resource environments, this work validates the framework within the financial domain, aiming to learn robust anomaly-discriminative representations under extremely limited or even unlabeled conditions. In summary, the main contributions of this work can be summarized as follows:
  • Unlike existing methods that treat linguistic features and behavioral sequences in isolation, we construct a unified framework, LR-SSAD, that leverages the complementarity between semantic invariants and temporal regularities. This design conceptually resolves the bottleneck of feature sparsity in low-resource scenarios by allowing one modality to regularize the representation learning of the other.
  • We propose a joint optimization objective that integrates cross-lingual masked prediction with Mamba-based sequence reconstruction. Distinct from standard auto-encoding or masked modeling approaches, this strategy utilizes the linear complexity $O(N)$ of state space models to capture long-range behavioral dependencies, which serve as a stable anchor to mitigate the semantic drift often observed in low-resource language models.
  • To overcome the “confirmation bias” inherent in conventional self-training and pseudo-labeling frameworks, we design a noise-robust pseudo-label refinement mechanism. By dynamically re-weighting samples based on prototype uncertainty, this mechanism ensures stable optimization trajectories and prevents the accumulation of noise in scarce-label environments.
  • We leverage cross-lingual semantic alignment not just for representation matching, but to shape a consistent anomaly decision boundary across languages. This provides a new technical perspective for mitigating negative transfer in robust risk control systems, ensuring that anomaly definitions remain consistent even when transferring from high-resource to low-resource settings.
We will release the dataset and related code after the paper is accepted.

2. Related Work

2.1. Multilingual Financial Text Modeling Methods

In multilingual financial text modeling, cross-lingual pretrained language models (PLMs) such as mBERT and XLM-R have established a foundation based on shared vocabularies and unified semantic spaces [25,26]. These models typically employ masked language modeling (MLM) and cross-lingual alignment objectives to facilitate knowledge transfer from high-resource to low-resource languages [27]. While effective in general domains, their direct application to the financial sector faces significant limitations. First, general-purpose PLMs lack coverage of specialized financial terminology in low-resource languages, leading to frequent out-of-vocabulary (OOV) issues and semantic ambiguity [28]. Second, even with domain-adaptive pretraining (DAPT) on financial corpora, these models often fail to capture the subtle semantic nuances required for risk detection, as the scarcity of low-resource financial data prevents the formation of robust, alignment-friendly semantic structures, resulting in severe semantic drift during cross-lingual transfer [29].

2.2. Self-Supervised Learning and Anomaly Detection

Self-supervised learning has become a pivotal paradigm for anomaly detection, enabling the learning of normal patterns from unlabeled data through pretext tasks such as masked prediction, contrastive learning, and reconstruction [30,31]. These approaches leverage reconstruction errors or latent space distances to distinguish anomalies, mitigating the reliance on expensive labeled data [32,33]. However, existing methods predominantly focus on single-modality inputs (static text or images) and struggle to address the multi-dimensional complexity of financial fraud. Specifically, standard text-based self-supervision often overlooks the crucial temporal dynamics inherent in user transaction sequences, failing to detect anomalies that manifest as behavioral deviations over time rather than textual outliers [34]. Furthermore, in multilingual settings, current self-supervised objectives are not designed to handle cross-lingual semantic noise, making them unstable when detecting anomalies across heterogeneous language distributions [35].

2.3. Low-Resource Transfer Learning and Few-Shot Robust Modeling

Low-resource transfer learning aims to improve model performance under data-scarce conditions by transferring knowledge from high-resource languages via parameter sharing or meta-learning strategies [36]. Approaches such as prompt-based transfer and multilingual Transformer adaptation have achieved notable success in standard NLP tasks by extracting common linguistic knowledge [37,38]. Nevertheless, these strategies are highly susceptible to negative transfer in high-stakes financial scenarios. The domain discrepancy between rich-resource general corpora and low-resource financial texts often leads to feature mismatch, where the transferred knowledge acts as noise rather than a valid signal [39]. Additionally, existing few-shot modeling techniques typically assume a stable distribution in the target domain, a premise that rarely holds in financial anomaly detection where anomalous patterns are diverse, concealed, and distributionally unstable [40].

3. Materials and Method

3.1. Data Collection

The dataset used in this study was constructed around the practical application scenario of low-resource financial anomaly detection. Data collection was conducted continuously from January 2023 to June 2023, covering multiple consecutive business cycles to ensure temporal completeness and stability of financial behaviors and semantic distributions, as shown in Table 1. The overall data sources included multilingual financial textual data, structured transaction behavior sequences, and anomaly risk samples verified through multiple levels of validation. These three types of data were linked during the collection stage using unified timestamps and account identifiers, thereby supporting joint modeling of cross-lingual semantics and behavioral temporal patterns. Financial textual data were mainly collected from transaction remarks generated by cross-border payment systems, customer communication and complaint records, risk review documentation, and multilingual financial discussion texts from selected public online platforms, covering both high-resource languages such as English and multiple low-resource languages. During collection, all textual data were stored in raw log format, with language identifiers, character encodings, and collection timestamps fully preserved. No pre-filtering rules were applied, in order to avoid systematic omissions of low-frequency anomalous expressions.
Transaction behavior sequence data were obtained from structured logs generated by real cross-border payment systems and digital asset trading platforms, with the collection period synchronized with that of the textual data, spanning from January 2023 to June 2023. Each transaction record contained multiple attributes, including transaction amount, transaction direction, transaction category, account state encoding, and temporal intervals between consecutive transactions. These records were aggregated at the account level to construct behavior sequence samples with explicit temporal ordering. This data collection strategy preserved the dynamic evolution of transaction behaviors and provided complete temporal context for subsequent anomaly deviation modeling. Anomalous samples were constructed based on business rule trigger records, outputs from historical risk control systems, and results of manual compliance reviews, covering typical financial anomaly types such as fraudulent transactions, money laundering risks, and abnormal arbitrage patterns, with temporal distributions consistent with the overall data collection period. Normal samples were derived from accounts exhibiting stable behavior throughout the full collection period and without triggering any risk rules, thereby reducing label noise and selection bias. To address the inherent class imbalance in low-resource financial data, no simple down-sampling strategy was applied during data collection and organization. Instead, the original class proportions were fully retained to faithfully reflect the real-world challenges of sparse anomalies and high noise ratios in low-resource settings.

3.2. Dataset Processing and Enhancement

In low-resource financial anomaly detection tasks, data preprocessing and augmentation not only constitute the foundational stage of model training but also directly affect the effectiveness of subsequent representation learning and anomaly pattern modeling. Financial texts exhibit high heterogeneity in terms of language, structure, length, and noise distribution, while transaction behavior sequences present strong temporal dependencies and high-dimensional sparsity. Consequently, a unified data processing strategy is required to jointly account for cross-lingual semantic consistency, temporal stability of behavior sequences, and sensitivity to anomalous signals. This section focuses on text cleaning, subword modeling, outlier identification, sequence normalization, and cross-lingual data augmentation. Through formal definitions and mathematical formulations, the theoretical foundations of these mechanisms and their suitability within the LR-SSAD framework are clarified.
In multilingual text preprocessing, the primary challenge lies in constraining noisy elements. Financial corpora often contain numerical strings, currency symbols, special transaction codes, and cross-lingual mixed tokens. Without normalization, such elements may cause the model to learn meaningless sparse vectors in the subword space. Therefore, for an arbitrary raw text sequence $X = \{x_1, x_2, \ldots, x_n\}$, a cleaning function $f_{\text{clean}}(\cdot)$ is defined to map the abnormal character set $C_{\text{noise}}$ to unified symbols or remove it entirely. This operation can be expressed as
$$X' = f_{\text{clean}}(X) = \{\, f_{\text{map}}(x_i) \mid x_i \in C_{\text{noise}} \,\}.$$
For cross-lingual inputs, a language identification model $g_{\text{lang}}(\cdot)$ is employed to assign a language label $l$ to each sentence, and the prediction process is formalized as
$$l = \arg\max_{k} P(\text{lang}_k \mid X).$$
This language label is not only used in subsequent shared subword encoding but also serves as a conditional variable in cross-lingual data augmentation strategies, ensuring that augmented texts preserve language consistency.
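As a concrete illustration of this stage, the following minimal Python sketch implements a cleaning function in the spirit of $f_{\text{clean}}$ and a stand-in for the language-identification model $g_{\text{lang}}$. The noise patterns, placeholder tokens, and the script-based language stub are illustrative assumptions, not the exact rules used in the paper.

```python
import re

# Assumed noise inventory C_noise: control chars, currency symbols,
# transaction-code-like strings, and free-standing numbers.
NOISE_PATTERNS = [
    (re.compile(r"[\u0000-\u001f]"), ""),                 # control characters: remove
    (re.compile(r"[$€£¥₽]"), " <CUR> "),                   # currency symbols: unify
    (re.compile(r"\b[A-Z]{2}\d{6,}\b"), " <TXCODE> "),     # transaction-code-like strings
    (re.compile(r"\d+[.,]?\d*"), " <NUM> "),               # free-standing numbers
]

def f_clean(text: str) -> str:
    """Map noisy elements to unified placeholder tokens and normalise whitespace."""
    for pattern, replacement in NOISE_PATTERNS:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s+", " ", text).strip()

def g_lang(text: str) -> str:
    """Placeholder for the language-identification model g_lang(.);
    a trained classifier returning argmax_k P(lang_k | X) would be used in practice."""
    # Hypothetical stub: route Cyrillic text to 'ru', otherwise default to 'en'.
    return "ru" if re.search(r"[\u0400-\u04FF]", text) else "en"

if __name__ == "__main__":
    raw = "Transfer $1,200 to account AB1234567 за услуги"
    cleaned = f_clean(raw)
    print(cleaned, "| lang:", g_lang(cleaned))
```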
For subword encoding, tokenization strategies based on byte pair encoding or SentencePiece are adopted, where a vocabulary $V$ is trained to minimize the overall encoding cost. Given a character sequence $X$, the objective is to find a subword sequence $S = \{s_1, \ldots, s_m\}$ that maximizes the sequence probability
$$S^{*} = \arg\max_{S} \prod_{j=1}^{m} P(s_j \mid V).$$
Cross-lingual unified encoding effectively reduces embedding sparsity in low-resource languages and enhances cross-semantic space alignment within the LR-SSAD framework.
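A minimal sketch of training one shared subword vocabulary over the pooled multilingual corpus is shown below, using the public sentencepiece package (the paper names SentencePiece as one option). The corpus file name, vocabulary size, coverage, and model type are illustrative assumptions rather than the paper's exact settings.

```python
import sentencepiece as spm

# Train a single shared vocabulary V so high- and low-resource languages share subwords.
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",   # assumed: pooled cleaned texts, one sentence per line
    model_prefix="lr_ssad_sp",
    vocab_size=32000,
    character_coverage=0.9995,         # keep rare characters of low-resource scripts
    model_type="unigram",              # unigram LM objective, close to argmax_S prod_j P(s_j | V)
)

sp = spm.SentencePieceProcessor(model_file="lr_ssad_sp.model")
print(sp.encode("cross-border transfer flagged for review", out_type=str))
```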
For outlier detection and missing value imputation in transaction behavior sequences, statistical properties of time series data are utilized. Let a transaction sequence be defined as
$$T = \{t_1, t_2, \ldots, t_n\}, \quad t_i \in \mathbb{R}^{d},$$
where each vector $t_i$ consists of multiple features such as transaction amount, time interval, and transaction type. A mean- and standard deviation-based outlier detection strategy is employed, where points exceeding a predefined threshold are considered potential noise:
$$\text{outlier}(t_i) = \mathbb{I}\left( \lVert t_i - \mu \rVert_2 > \alpha \sigma \right),$$
with $\mu$ and $\sigma$ denoting the mean and standard deviation of the sequence, respectively, and $\alpha$ representing a tunable parameter. For missing values $\hat{t}_i$, linear interpolation is applied to obtain an estimated value
$$\hat{t}_i = t_{i-1} + \frac{t_{i+1} - t_{i-1}}{2},$$
which preserves sequence smoothness and gradual variation. Finally, normalization is performed to project feature distributions into a unified space:
$$t_i' = \frac{t_i - \mu}{\sigma}.$$
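The following NumPy sketch chains these three steps (outlier flagging, interpolation of the gaps, z-score normalization) for one account-level sequence. The threshold value, the treatment of NaN entries as missing values, and the use of the standardized deviation norm are assumptions made for illustration.

```python
import numpy as np

def preprocess_sequence(T: np.ndarray, alpha: float = 3.0) -> np.ndarray:
    """Sketch: flag samples whose standardised deviation norm exceeds alpha,
    re-estimate them (and any NaN entries) by linear interpolation, then z-score."""
    T = T.astype(float).copy()
    mu, sigma = np.nanmean(T, axis=0), np.nanstd(T, axis=0) + 1e-8

    # Mark statistical outliers as missing so they are re-estimated below.
    outlier_mask = np.linalg.norm((T - mu) / sigma, axis=1) > alpha
    T[outlier_mask] = np.nan

    # Linear interpolation between valid neighbours for each feature column.
    for j in range(T.shape[1]):
        col = T[:, j]
        missing = np.isnan(col)
        if missing.any() and (~missing).sum() >= 2:
            col[missing] = np.interp(np.flatnonzero(missing),
                                     np.flatnonzero(~missing), col[~missing])

    # Z-score normalisation into a unified feature space.
    return (T - np.nanmean(T, axis=0)) / (np.nanstd(T, axis=0) + 1e-8)

if __name__ == "__main__":
    amounts = [10.0, 11.0, 12.0, 10.0, 11.0, 950.0, 12.0, 11.0]   # one anomalous spike
    gaps = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.0]
    print(preprocess_sequence(np.column_stack([amounts, gaps]), alpha=2.5))
```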
In the cross-lingual data augmentation stage, the primary objective is to improve model robustness to noise, semantic drift, and cross-lingual transformations while preserving semantic consistency. Random word masking augmentation is derived from the principles of masked language modeling, where a proportion $p$ of subwords in the input sequence is randomly masked. Given a subword sequence $S = \{s_1, \ldots, s_m\}$, the augmented sequence can be expressed as
$$S_{\text{mask}} = \left\{ \tilde{s}_j \,\middle|\, \tilde{s}_j = \begin{cases} [\text{MASK}], & \text{with probability } p, \\ s_j, & \text{otherwise}. \end{cases} \right\}$$
This mechanism enhances the model’s ability to handle semantic omissions and incomplete inputs, which is particularly beneficial for low-resource languages characterized by higher noise levels and unstable syntactic structures.
Semantic perturbation augmentation is based on the continuity of embedding spaces. By injecting small perturbations into embedding vectors, the model is encouraged to capture variations within semantic neighborhoods. Let $e_j$ denote the embedding of subword $s_j$; the perturbed embedding is given by
$$\tilde{e}_j = e_j + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \beta^{2} I),$$
where β controls the magnitude of perturbation. This strategy is particularly important for low-resource languages, as their embedding spaces are often sparse and unstable. The introduction of perturbations helps smooth the embedding space and improves cross-lingual transferability.
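A compact PyTorch sketch of these two text-side augmentations follows. The [MASK] id, masking probability, and perturbation scale are placeholder assumptions for illustration.

```python
import torch

MASK_ID = 4  # assumed id of the [MASK] subword in the shared vocabulary

def mask_subwords(token_ids: torch.Tensor, p: float = 0.15) -> torch.Tensor:
    """Random word masking: each subword is replaced by [MASK] with probability p."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < p
    return torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)

def perturb_embeddings(e: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """Semantic perturbation: e_tilde = e + eps, with eps ~ N(0, beta^2 I)."""
    return e + beta * torch.randn_like(e)

ids = torch.randint(5, 32000, (2, 12))        # toy batch of subword ids
emb = torch.nn.Embedding(32000, 768)(ids)     # toy embedding lookup
print(mask_subwords(ids).shape, perturb_embeddings(emb).shape)
```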
For temporal masking and order perturbation augmentation of behavior sequences, the theoretical foundation lies in reconstruction sensitivity in anomaly detection tasks. Temporal masking is achieved by masking selected time steps corresponding to a randomly sampled index set $\Omega$, defined as
$$t_i^{\text{mask}} = \begin{cases} 0, & i \in \Omega, \\ t_i, & \text{otherwise}. \end{cases}$$
Under this setting, the behavior reconstruction module is forced to learn stronger global temporal dependencies. Order perturbation is realized through lightweight temporal transformations by randomly swapping partial indices of the original sequence, which can be expressed as
$$T_{\text{perm}} = \{\, t_{\pi(i)} \mid i = 1, \ldots, n \,\},$$
where $\pi(\cdot)$ denotes a perturbation permutation function. This operation aims to improve robustness to temporal disturbances and promotes the learning of stable behavioral structure features.
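The behavior-side counterparts can be sketched in the same spirit; the masking ratio and number of swapped pairs below are illustrative assumptions.

```python
import torch

def temporal_mask(T: torch.Tensor, mask_ratio: float = 0.2) -> torch.Tensor:
    """Zero out a random index set Omega of time steps (t_i^mask = 0 for i in Omega)."""
    omega = torch.rand(T.size(0)) < mask_ratio
    T_masked = T.clone()
    T_masked[omega] = 0.0
    return T_masked

def order_perturb(T: torch.Tensor, num_swaps: int = 2) -> torch.Tensor:
    """Lightweight order perturbation: swap a few randomly chosen pairs of time steps."""
    T_perm = T.clone()
    n = T.size(0)
    for _ in range(num_swaps):
        i, j = torch.randint(0, n, (2,)).tolist()
        T_perm[[i, j]] = T_perm[[j, i]]
    return T_perm

seq = torch.randn(50, 8)  # toy account sequence: 50 transactions x 8 features
print(temporal_mask(seq).shape, order_perturb(seq).shape)
```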
In summary, the proposed data preprocessing and augmentation pipeline constructs a unified data input space through multilingual text normalization, subword-consistent encoding, statistical modeling of transaction sequences, and cross-lingual augmentation strategies. This pipeline provides high-quality inputs for the self-supervised cross-lingual masked prediction and behavior sequence reconstruction tasks in LR-SSAD. Through multi-level preprocessing mechanisms, semantic stability of low-resource texts is enhanced, and model sensitivity to anomalous patterns in complex transaction behavior sequences is significantly improved, thereby establishing a solid data quality foundation for subsequent anomaly detection model training.

3.3. Proposed Method

3.3.1. Overall

The proposed LR-SSAD constructs a unified self-supervised learning pipeline for low-resource financial anomaly detection from a model architecture perspective. The core idea is to progressively learn robust and discriminative anomaly representations under the absence of explicit anomaly annotations through the collaborative optimization of cross-lingual semantic modeling and behavioral temporal modeling. After data normalization and encoding, multilingual financial texts and the corresponding transaction behavior sequences are first fed into two parallel yet tightly coupled encoding pathways. On the textual side, a multilingual text encoder is employed to model financial text sequences represented using a shared subword vocabulary. Through contextual modeling, word-level information is mapped into sentence-level or sequence-level semantic representations, while an internal cross-lingual masked prediction task is introduced as a self-supervised constraint, enabling the encoded representations to exhibit consistent and alignable semantic structures across different languages. In parallel, transaction sequences on the behavioral side are processed by a sequence encoder constructed based on the Mamba architecture, where account-level transaction evolution patterns are reconstructed along the temporal dimension. By minimizing reconstruction errors, normal behavior distributions are learned, thereby endowing the model with sensitivity to anomalous deviations. The intermediate representations from the text encoder and the behavior sequence encoder are subsequently projected into a unified latent semantic space. Within this space, cross-lingual textual semantics and behavioral temporal features are fused and aligned through shared representations, allowing language-level anomalies and behavior-level anomalies to be characterized at a consistent scale. The entire training process adopts a joint self-supervised strategy that simultaneously optimizes cross-lingual masked prediction objectives and behavior sequence reconstruction objectives, driving the model toward representations that encode both semantic consistency and behavioral stability in a label-free manner. Based on these representations, anomaly confidence scores are generated for each sample, and a confidence-aware pseudo-label updating mechanism is applied to dynamically reweight training samples, ensuring that high-confidence samples dominate the optimization process while the influence of uncertain samples is gradually suppressed. This design effectively prevents noise accumulation and amplification in low-resource settings. As training progresses, the textual semantic space, behavioral sequence distributions, and pseudo-label structures are continuously co-adapted within the unified framework, ultimately yielding an end-to-end self-supervised detection model capable of stably identifying anomalous behaviors in multilingual environments.
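To make the dual-pathway data flow concrete, the schematic sketch below wires a shared text encoder, a behavior sequence encoder, and a unified latent projection into one module. Dimensions are reduced so the toy example runs quickly (the paper reports a 12-layer, 768-dimensional text encoder and a 4-layer Mamba behavior encoder), and a GRU stands in for the Mamba blocks; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LRSSADSketch(nn.Module):
    """Schematic dual-pathway pipeline: text MLM head, behavior reconstruction head,
    and a shared latent space that produces per-sample anomaly confidence scores."""

    def __init__(self, vocab_size=32000, d_text=256, n_heads=4, n_layers=2,
                 seq_feats=8, d_seq=128, d_latent=128):
        super().__init__()
        # Text pathway: shared multilingual embedding + Transformer encoder + MLM head.
        self.embed = nn.Embedding(vocab_size, d_text)
        layer = nn.TransformerEncoderLayer(d_model=d_text, nhead=n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(d_text, vocab_size)        # cross-lingual masked prediction
        # Behavior pathway: sequence encoder + reconstruction head.
        self.seq_encoder = nn.GRU(seq_feats, d_seq, batch_first=True)
        self.recon_head = nn.Linear(d_seq, seq_feats)
        # Projections into the unified latent space used for anomaly scoring.
        self.text_proj = nn.Linear(d_text, d_latent)
        self.seq_proj = nn.Linear(d_seq, d_latent)
        self.score_head = nn.Linear(d_latent, 1)

    def forward(self, masked_ids, seq):
        h_text = self.text_encoder(self.embed(masked_ids))   # (B, L, d_text)
        mlm_logits = self.mlm_head(h_text)                    # feeds the masked-prediction loss
        h_seq, _ = self.seq_encoder(seq)                      # (B, T, d_seq)
        recon = self.recon_head(h_seq)                        # feeds the reconstruction loss
        z = self.text_proj(h_text.mean(dim=1)) + self.seq_proj(h_seq.mean(dim=1))
        return mlm_logits, recon, torch.sigmoid(self.score_head(z))  # anomaly confidence

model = LRSSADSketch()
scores = model(torch.randint(0, 32000, (2, 32)), torch.randn(2, 50, 8))[-1]
print(scores.shape)  # (2, 1): one anomaly confidence per sample
```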

3.3.2. Cross-Lingual Masked Prediction Module

The cross-lingual masked prediction module is designed to learn financial text representations that are insensitive to language differences within a unified semantic space. Language-invariant structures are enforced through a hybrid alignment strategy. Instead of relying on computationally expensive adversarial objectives or explicit alignment losses (which require parallel corpora), cross-lingual consistency is enforced through two mechanisms: implicit alignment via parameter sharing and explicit alignment via entity-level semantic injection.
Architecturally, the foundation of alignment is established through a deep Transformer-based encoding network that shares all parameters across languages. As shown in Figure 1, the text encoder consists of $L = 12$ stacked Transformer encoder layers, each comprising multi-head self-attention ($H = 12$) and feed-forward sub-networks (hidden dimension $d = 768$, expanded to $4d$). By processing multilingual financial text sequences through this single shared function, the model is architecturally constrained to map inputs from different languages onto a common high-dimensional feature manifold.
However, relying solely on a shared subword vocabulary is often insufficient for strictly aligning distinct language distributions. Therefore, to explicitly impose language invariance, we introduce an entity-driven cross-lingual alignment mechanism. Specifically, strictly aligned cross-lingual entity pairs (e.g., specific financial terms) are identified by clustering high-frequency tokens using K-means in a pretrained multilingual embedding space. During the input phase, we apply a randomized synonym mapping strategy: for a given entity in the input sequence, there is a probability of it being replaced by its semantic counterpart from a different language within the same cluster. This operation forces the encoder to produce identical hidden representations for semantically equivalent entities regardless of their surface language form, thereby acting as a strong regularization signal that actively pulls the distributions of different languages closer without requiring sentence-level parallel data.
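The randomized synonym mapping can be sketched as a simple input-time substitution over precomputed entity clusters, as below. The cluster contents, swap probability, and tokenized input are placeholder assumptions; in the paper the clusters come from K-means over a pretrained multilingual embedding space.

```python
import random

# Hypothetical cross-lingual entity clusters (stand-in for K-means clusters of
# high-frequency financial terms in a multilingual embedding space).
ENTITY_CLUSTERS = {
    "transfer": ["transfer", "перевод", "transferencia"],
    "refund":   ["refund", "возврат", "reembolso"],
}
CLUSTER_OF = {tok: key for key, toks in ENTITY_CLUSTERS.items() for tok in toks}

def swap_entities(tokens, p_swap=0.3, seed=None):
    """With probability p_swap, replace an entity by a counterpart from another
    language within the same cluster, forcing language-agnostic representations."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        key = CLUSTER_OF.get(tok.lower())
        if key is not None and rng.random() < p_swap:
            out.append(rng.choice(ENTITY_CLUSTERS[key]))
        else:
            out.append(tok)
    return out

print(swap_entities("urgent transfer requested before refund".split(), seed=1))
```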
Regarding the optimization objective, we employ a strictly probabilistic masking mechanism where a proportion $\rho = 0.15$ of input positions is perturbed (80% [MASK], 10% random, 10% original). The masked sequence $\tilde{X}$ is encoded to produce hidden representations $H = \mathrm{Encoder}(\tilde{X}; \theta)$, where $\theta$ denotes the shared parameters. The objective is to minimize the negative conditional log-likelihood over the unified vocabulary:
$$\mathcal{L}_{\text{clmp}} = - \sum_{i \in M} \log P(x_i \mid h_i),$$
where $M$ denotes the set of masked positions. It is crucial to note that no explicit alignment loss (such as mean squared error between parallel sentence embeddings) is used. Instead, alignment is an emergent property: since the prediction head $P(\cdot \mid h_i)$ is shared, the model must learn language-agnostic features $h_i$ to correctly predict the masked tokens from the shared vocabulary, effectively performing implicit domain adaptation during training.
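A minimal PyTorch sketch of the 80/10/10 corruption scheme and the masked-position loss is given below; the special-token id, vocabulary size, and the random logits standing in for the encoder output are assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID, VOCAB = 4, 32000  # assumed special-token id and vocabulary size

def apply_mlm_mask(ids: torch.Tensor, rho: float = 0.15):
    """BERT-style corruption: of the rho selected positions, 80% -> [MASK],
    10% -> random token, 10% kept; unselected positions get label -100 (ignored)."""
    labels = ids.clone()
    selected = torch.rand(ids.shape) < rho
    labels[~selected] = -100
    corrupted = ids.clone()
    decision = torch.rand(ids.shape)
    corrupted[selected & (decision < 0.8)] = MASK_ID
    rand_pos = selected & (decision >= 0.8) & (decision < 0.9)
    corrupted[rand_pos] = torch.randint(0, VOCAB, ids.shape)[rand_pos]
    return corrupted, labels

def clmp_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_clmp = -sum_{i in M} log P(x_i | h_i), averaged over the masked positions."""
    return F.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=-100)

ids = torch.randint(5, VOCAB, (2, 16))
corrupted, labels = apply_mlm_mask(ids)
logits = torch.randn(2, 16, VOCAB)   # stand-in for Encoder(X_tilde) + shared prediction head
print(clmp_loss(logits, labels).item())
```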
From a theoretical perspective, this mechanism minimizes the conditional entropy discrepancy between the semantic distributions of different languages. Let $Z$ denote the latent representations; the training effectively encourages
$$\mathbb{E}_{a,b}\left[ D_{\mathrm{KL}}\!\left( P(X \mid Z, a) \,\|\, P(X \mid Z, b) \right) \right] \to 0,$$
where $a$ and $b$ represent different language domains. Within the LR-SSAD framework, this module is jointly trained with the behavior sequence reconstruction module ($\mathcal{L} = \mathcal{L}_{\text{clmp}} + \lambda \mathcal{L}_{\text{rec}}$). This joint design ensures that cross-lingual semantic consistency provides a stable anchor for behavioral modeling, while behavioral patterns suppress superficial textual correlations, enabling robust anomaly detection even in low-resource settings.

3.3.3. Behavior Sequence Reconstruction Module

The design objective of the behavior sequence reconstruction module is to automatically learn the intrinsic temporal evolution patterns of normal transaction behaviors for financial accounts under the absence of reliable anomaly annotations, such that anomalous behaviors can be characterized as significant deviations from normal dynamic distributions. To address the high dimensionality, strong temporal dependency, and long-range correlations commonly observed in transaction behavior sequences, this module adopts the Mamba block based on state space models as the core modeling unit, achieving a balance between temporal expressiveness and computational efficiency. As shown in Figure 2, the input account-level transaction behavior sequences are first represented as multivariate time series, where each dimension corresponds to features such as amount variations, transaction intervals, or behavior codes. To enhance the model’s ability to perceive latent dynamical structures, a time-delay embedding strategy is introduced, expanding the original univariate sequences into state trajectory representations that incorporate historical context, thereby enabling approximate recovery of the underlying dynamical state space under limited observations. Subsequently, embeddings from different dimensions are fused along the feature dimension and segmented into fixed-length local temporal patches through a patching operation, allowing the model to capture fine-grained behavior patterns within local time windows while alleviating optimization instability caused by excessively long sequences.
After patching, the behavior segments are fed into a sequence modeling network composed of multiple Mamba blocks. In this module, the number of Mamba layers is set to $L = 4$, the hidden state dimension is set to $d = 256$, and residual connections are applied at each layer to maintain gradient stability. The Mamba block performs linear recursive modeling of time series via learnable state transition matrices, which can be regarded as approximating the input sequence $T = \{t_1, \ldots, t_n\}$ as a dynamical system. This design ensures that the output hidden states preserve temporal continuity and predictability along the time dimension. During the reconstruction phase, the module performs decomposed prediction over the input sequence by employing a sliding window strategy to forecast behavioral features for future time steps. The mean squared error is adopted as the basic reconstruction loss, such that the discrepancy between predicted behaviors and ground-truth behaviors is minimized.
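The patching and next-patch reconstruction loss can be sketched as follows. The patch length and hidden size are illustrative, and a GRU is used as a stand-in for the stack of Mamba blocks so the snippet runs without additional packages; it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def patchify(seq: torch.Tensor, patch_len: int = 10) -> torch.Tensor:
    """Split a (B, T, d) sequence into fixed-length local patches (B, T//patch_len, patch_len*d)."""
    B, T, d = seq.shape
    T_trim = (T // patch_len) * patch_len
    return seq[:, :T_trim].reshape(B, T_trim // patch_len, patch_len * d)

class ReconstructionHead(nn.Module):
    """Sliding-window one-step-ahead reconstruction over patch embeddings.
    `backbone` stands in for the L=4 Mamba blocks (d=256) with residual connections."""
    def __init__(self, patch_dim: int, d_model: int = 256):
        super().__init__()
        self.backbone = nn.GRU(patch_dim, d_model, num_layers=4, batch_first=True)
        self.predict = nn.Linear(d_model, patch_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        h, _ = self.backbone(patches[:, :-1])                       # states up to patch k
        pred_next = self.predict(h)                                 # predict patch k+1
        return nn.functional.mse_loss(pred_next, patches[:, 1:])    # reconstruction loss L_rec

seq = torch.randn(4, 120, 8)                 # toy batch of account behavior sequences
patches = patchify(seq)
print(ReconstructionHead(patch_dim=patches.size(-1))(patches).item())
```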
Furthermore, to prevent the model from achieving low reconstruction error merely through numerical fitting while neglecting structural behavioral properties, an attractor geometry preservation constraint is introduced. Specifically, the true transaction sequence and the model-predicted sequence can be viewed as corresponding to attractors of a dynamical system in the state space. This constraint minimizes the geometric discrepancy between the predicted and true attractors, enabling the model to reconstruct sequences not only in a point-wise error sense but also in terms of global dynamical consistency. From a theoretical perspective, this design ensures that the model learns the intrinsic mechanisms of how normal behaviors evolve over time, rather than simply memorizing short-term statistical patterns.
From the anomaly detection perspective, normal account behaviors typically follow relatively stable dynamical trajectories with continuous and predictable state evolution, whereas anomalous transactions disrupt such regularities, causing simultaneous shifts in reconstruction error and attractor structure. The behavior sequence reconstruction module amplifies such deviations, making anomalies manifest as significant outliers in both temporal reconstruction error and dynamical consistency. Particularly in low-resource financial scenarios, where textual semantics are inherently uncertain, behavior sequences provide a language-independent and stable signal source. This module forms a complementary relationship with the cross-lingual masked prediction module within a unified semantic space, enabling LR-SSAD to maintain reliable anomaly perception even under semantic perturbations and noisy environments.

3.3.4. Pseudo-Label Noise Suppression and Stable Training Module

In weakly supervised training for low-resource financial anomaly detection, pseudo-labels are inevitably affected by semantic instability, distributional shifts, and the sparsity of anomalous samples. Without effective constraints, noise tends to accumulate and be amplified throughout training iterations, leading to the well-known problem of confirmation bias, where the model overfits to its own erroneous predictions. This risk is exacerbated under severe class imbalance, as the majority class (normal samples) may dominate the pseudo-label generation, causing minority anomalies to be ignored. To address these issues, a pseudo-label noise suppression and stable training module is introduced. The core idea is to progressively separate high-confidence representations from noisy samples through mathematical uncertainty modeling and prototype constraints, without relying on ground-truth labels.
As shown in Figure 3, this module receives joint embedding representations from the cross-lingual masked prediction module and the behavior sequence reconstruction module. A lightweight foreground sample filtering network ($512 \to 256 \to 1$, sigmoid activation) is first applied to filter out samples with low anomaly confidence, reducing the interference of obvious background noise in subsequent stages.
To construct robust pseudo-labels, we employ K-means clustering on the filtered embedding set $\{z_i\}_{i=1}^{N}$ to establish a structured distribution with centers $\{c_k\}_{k=1}^{K}$. To ensure reproducibility and rigorous definition, the noise estimation and refinement process is mathematically formulated as follows. First, the probability $P_{i,k}$ that a sample $z_i$ belongs to prototype $c_k$ is calculated using a Student's t-distribution kernel to measure similarity in the embedding space:
$$P_{i,k} = \frac{\left(1 + \lVert z_i - c_k \rVert_2^{2} / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{j=1}^{K} \left(1 + \lVert z_i - c_j \rVert_2^{2} / \alpha \right)^{-\frac{\alpha+1}{2}}},$$
where $\alpha$ represents the degrees of freedom. Subsequently, the noise level of each sample is quantified by the information entropy of its assignment distribution:
$$H_i = - \sum_{k=1}^{K} P_{i,k} \log P_{i,k}.$$
High entropy indicates that the sample lies near the decision boundary or in a low-density region, suggesting high uncertainty (noise). To mitigate the risk of confirmation bias, we introduce an entropy-modulated weighting function $w_i = \exp(-\beta H_i)$, where $\beta$ controls the sensitivity. This strictly limits the gradient contribution of uncertain samples. The final weighted optimization objective is defined as
$$\mathcal{L}_{\text{proto}} = - \sum_{i=1}^{N} w_i \sum_{k=1}^{K} Q_{i,k} \log P_{i,k},$$
where $Q_{i,k}$ is the target auxiliary distribution derived by sharpening $P_{i,k}$ to encourage high-confidence cluster assignments.
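The chain from soft assignments to entropy weights and the weighted prototype loss can be sketched as follows. The DEC-style squared-and-renormalized sharpening used for $Q$ is an assumption, since the paper only states that $Q$ is obtained by sharpening $P$; the prototypes and embeddings below are random placeholders.

```python
import torch

def student_t_assignment(z, centers, alpha: float = 1.0):
    """P_{i,k} from the Student's t kernel over squared distances to prototypes c_k."""
    d2 = torch.cdist(z, centers).pow(2)                        # (N, K)
    num = (1.0 + d2 / alpha).pow(-(alpha + 1) / 2)
    return num / num.sum(dim=1, keepdim=True)

def refine_weights_and_loss(z, centers, beta: float = 1.0):
    """Entropy weights w_i = exp(-beta * H_i) and the weighted prototype loss L_proto."""
    P = student_t_assignment(z, centers)
    H = -(P * P.clamp_min(1e-12).log()).sum(dim=1)             # assignment entropy H_i
    w = torch.exp(-beta * H)                                    # down-weight uncertain samples
    Q = (P ** 2) / P.sum(dim=0, keepdim=True)                   # assumed DEC-style sharpening
    Q = Q / Q.sum(dim=1, keepdim=True)
    loss = -(w.unsqueeze(1) * Q * P.clamp_min(1e-12).log()).sum(dim=1).mean()
    return w, loss

z = torch.randn(256, 128)        # joint embeddings from the two encoders (toy)
centers = torch.randn(8, 128)    # K-means prototypes c_k (toy)
weights, proto_loss = refine_weights_and_loss(z, centers)
print(weights.mean().item(), proto_loss.item())
```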
The proposed mechanism specifically addresses the risks of confirmation bias and class imbalance through two strategies. First, the parameter separation strategy, in which the first two layers of the prototype classifier ($512 \to 256 \to 128$) are frozen, preserves the geometry of the initial high-confidence prototypes, preventing the decision boundary from rapidly shifting towards noisy pseudo-labels during early training. Second, unlike direct classification, which biases towards the majority class, the prototype-based approach models local density. By dynamically down-weighting high-entropy samples via $w_i$, the model effectively ignores ambiguous samples (often hard negatives or noisy anomalies) that contribute most to confirmation bias, ensuring that the optimization is driven only by reliable, high-confidence anchors.
In the overall LR-SSAD framework, this module operates in a joint optimization scheme. The cross-lingual masked prediction and sequence reconstruction modules learn the feature space, while $\mathcal{L}_{\text{proto}}$ acts as a cluster-compactness regularizer. This interaction ensures that the learned embeddings are not only semantically and temporally consistent but also discriminative, preventing the representation space from collapsing into a trivial solution under weak supervision.

4. Results and Discussion

4.1. Experimental Setup

4.1.1. Hardware and Software Infrastructure

All model training and evaluation procedures were executed on a high-performance deep learning server designed to handle large-scale cross-lingual pretraining and sequence modeling. The hardware infrastructure was equipped with dual Intel Xeon Scalable processors to ensure efficient multi-threaded data preprocessing and dynamic batch construction. To accelerate tensor computations, we utilized NVIDIA A100 GPUs with over 40 GB of video memory per card, effectively satisfying the memory constraints of the proposed LR-SSAD framework and large-scale baselines. The system memory was configured at 256 GB to facilitate extensive caching of multilingual text embeddings and intermediate states for the Mamba-based sequence reconstruction. Furthermore, high-speed NVMe solid-state drives were employed to maximize I/O throughput, preventing data loading bottlenecks during the processing of multi-source financial logs.
On the software side, the environment was deployed on Ubuntu Server 20.04 LTS to ensure system stability and driver compatibility. The deep learning pipeline was implemented using PyTorch 2.0, leveraging its dynamic computation graph and optimized operators for efficient gradient backpropagation. The cross-lingual masked prediction module utilized the Hugging Face Transformers library for tokenizer integration and baseline model loading (e.g., mBERT, XLM-R). For structured data processing, we employed NumPy and Pandas for transaction log parsing, while Scikit-learn was used to implement traditional anomaly detection baselines such as One-Class SVM and Isolation Forest. The entire environment was accelerated using CUDA 12.1 and cuDNN 8.9 to fully exploit GPU parallel computing capabilities.

4.1.2. Data Partitioning and Evaluation Protocol

To strictly evaluate the generalization capability of the model, the dataset was partitioned using a fixed stratified sampling strategy. Specifically, 70% of the data was allocated for training, 10% for validation, and the remaining 20% was held out for testing. This split ensures that the model learns from a sufficient distribution of normal and anomalous patterns while being evaluated on unseen samples to assess cross-lingual transferability. To mitigate performance fluctuations caused by the unstable distribution of rare anomalies in low-resource scenarios and to ensure statistical reliability, we adopted a 5-fold cross-validation strategy on the training set. For each fold, the training data was evenly partitioned, with subsets alternately serving as the validation set. To further ensure reproducibility and robustness, all experiments were repeated 5 times with distinct random seeds. We report the mean performance across these runs along with the standard deviation to quantify model stability. Furthermore, statistical significance was assessed using the paired t-test (or Wilcoxon signed-rank test), with a significance level set at p < 0.05, to validate that the improvements achieved by LR-SSAD over baselines are statistically significant rather than due to random chance.
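The significance check over repeated runs can be reproduced with a short SciPy sketch; the per-run AUC values below are placeholders used purely to illustrate the test, not reported results.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Paired per-run AUC scores for LR-SSAD and one baseline over 5 seeds (placeholders).
lr_ssad_auc = np.array([0.947, 0.949, 0.948, 0.946, 0.950])
baseline_auc = np.array([0.921, 0.924, 0.919, 0.922, 0.923])

t_stat, p_t = ttest_rel(lr_ssad_auc, baseline_auc)     # paired t-test on matched runs
w_stat, p_w = wilcoxon(lr_ssad_auc, baseline_auc)      # non-parametric alternative
print(f"paired t-test p={p_t:.4f}, Wilcoxon p={p_w:.4f}, significant={p_t < 0.05}")
```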

4.1.3. Implementation Details and Hyperparameters

The model was optimized using the AdamW optimizer with a weight decay coefficient of $\beta = 0.01$ to regularize the parameters and prevent overfitting on low-resource data. The initial learning rate was set to $\alpha = 2 \times 10^{-5}$, coupled with a cosine annealing scheduler to adjust the learning rate during training. The batch size was fixed at $B = 32$ to balance memory usage and gradient estimation stability. The training process was governed by a joint self-supervised objective. The weighting coefficients for the cross-lingual masked prediction loss ($\lambda_1$) and the behavioral sequence reconstruction loss ($\lambda_2$) were initialized at a ratio of 1:1. These coefficients were dynamically adjusted based on the magnitude of validation losses to balance the gradient contribution from both semantic and temporal tasks. The maximum number of training epochs was set to $E = 20$. To ensure efficient convergence under weakly supervised conditions, an early stopping mechanism was implemented: training was terminated if the validation loss did not improve for 3 consecutive epochs.
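These settings correspond to a standard PyTorch optimization loop, sketched below with a tiny stand-in model and a synthetic validation loss; only the optimizer, scheduler, and early-stopping logic reflect the configuration described above.

```python
import torch

model = torch.nn.Linear(256, 2)   # stand-in for the LR-SSAD parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

def validate() -> float:
    """Placeholder returning a synthetic validation loss."""
    return float(torch.rand(1))

best_val, stale_epochs, patience = float("inf"), 0, 3
for epoch in range(20):                                   # E = 20 maximum epochs
    optimizer.zero_grad()
    loss = model(torch.randn(32, 256)).pow(2).mean()      # stand-in for the joint loss
    loss.backward()
    optimizer.step()
    scheduler.step()                                      # cosine annealing of the learning rate
    val_loss = validate()
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:                      # stop after 3 epochs without improvement
            print(f"early stop at epoch {epoch}")
            break
```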

4.1.4. Baseline Models and Evaluation Metrics

For the comparative experiments, several representative baseline models were selected, including mBERT [41], XLM-R [42], BiLSTM [43], Isolation Forest [44], AutoEncoder [45], and One-Class SVM [46]. As a multilingual pretrained language model, mBERT processes texts from different languages within a shared vocabulary space and exhibits strong cross-lingual transfer capability. XLM-R is trained on significantly larger corpora and provides improved multilingual representation quality, with enhanced robustness for low-resource languages. BiLSTM, as a bidirectional recurrent architecture, effectively captures forward and backward contextual dependencies in text sequences and demonstrates strong sequence-level semantic modeling capability. Isolation Forest identifies anomalies through random space partitioning and can be applied without requiring large amounts of labeled data. AutoEncoder-based methods detect anomalies by leveraging reconstruction errors learned from latent distributions of normal samples, offering strong representational capacity for nonlinear features. One-Class SVM defines a hyperspherical boundary in high-dimensional space to characterize normal samples and maintains stable performance even under small-sample conditions. To ensure fair comparison, all baseline models were retrained or fine-tuned on the same multilingual financial anomaly detection dataset constructed in this study.
In anomaly detection research, metrics such as accuracy, precision, recall, F1-score, Area Under the Receiver Operating Characteristic Curve (AUROC), Average Precision (AP), and False Alarm Rate (FAR) are commonly adopted. These metrics evaluate model performance from complementary perspectives, including overall classification correctness, positive prediction quality, anomaly detection capability, comprehensive performance, probabilistic ranking ability, and false alarm behavior. From a mathematical standpoint, these metrics are derived from fundamental quantities in the confusion matrix, namely True Positives ( T P ), True Negatives ( T N ), False Positives ( F P ), and False Negatives ( F N ), and can therefore be described within a unified notational framework.
The mathematical definitions of the core evaluation metrics are provided as follows.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{FAR} = \frac{FP}{FP + TN}$$
$$\mathrm{AUROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$$
$$\mathrm{AP} = \sum_{k=1}^{n} P(k) \cdot \Delta R(k)$$
In these equations, $TP$ denotes the number of anomalous samples correctly identified as anomalies, $TN$ denotes the number of normal samples correctly classified, $FP$ denotes the number of normal samples incorrectly identified as anomalies, and $FN$ denotes the number of anomalous samples that were not detected. $\mathrm{TPR}$ represents the true positive rate, which is equivalent to Recall, while $\mathrm{FPR}$ denotes the false positive rate, defined as $FP / (FP + TN)$. $P(k)$ denotes the precision up to ranking position $k$, $\Delta R(k)$ represents the increment in recall at position $k$, and $n$ denotes the total number of detection instances.
To comprehensively evaluate the ranking capability of the model independent of specific decision boundaries, curve-based metrics are employed. AUROC evaluates the ability to discriminate between positive and negative classes across all possible thresholds, where a value of 1.0 indicates a perfect classifier. Given the extreme class imbalance typical in financial anomaly detection, where anomalies are rare, AUROC can sometimes be overly optimistic due to the large number of True Negatives. Therefore, Average Precision (AP), which approximates the Area Under the Precision–Recall Curve (AUPRC), is also reported to focus strictly on the positive class performance.
Since metrics such as Precision, Recall, and F1-score require converting continuous anomaly scores into binary predictions using a scalar threshold, a dynamic thresholding strategy is adopted to ensure fair comparison. The optimal threshold $\tau^{*}$ is selected to maximize the F1-score on the validation set, formulated as
$$\tau^{*} = \arg\max_{\tau} \left( \frac{2 \cdot \mathrm{Precision}(\tau) \cdot \mathrm{Recall}(\tau)}{\mathrm{Precision}(\tau) + \mathrm{Recall}(\tau)} \right).$$
This threshold is then applied to the test set to compute the final binary classification metrics.
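A minimal scikit-learn sketch of this thresholding and metric computation is shown below; the synthetic scores and labels are placeholders used only to demonstrate the procedure.

```python
import numpy as np
from sklearn.metrics import (precision_recall_curve, roc_auc_score,
                             average_precision_score, f1_score)

def select_threshold(val_scores, val_labels):
    """Pick tau* maximising F1 on the validation set via the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]   # the last PR point carries no threshold

rng = np.random.default_rng(0)              # toy validation/test data for illustration only
val_labels = rng.integers(0, 2, 500)
val_scores = val_labels * 0.6 + rng.random(500) * 0.5
tau = select_threshold(val_scores, val_labels)

test_labels = rng.integers(0, 2, 500)
test_scores = test_labels * 0.6 + rng.random(500) * 0.5
print("tau*", round(float(tau), 3),
      "F1", round(f1_score(test_labels, test_scores >= tau), 3),
      "AUROC", round(roc_auc_score(test_labels, test_scores), 3),
      "AP", round(average_precision_score(test_labels, test_scores), 3))
```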
Beyond detection accuracy, the practical deployability of the model in high-throughput financial systems is assessed using computational efficiency metrics. Inference time denotes the average latency incurred to process a single batch of transaction sequences during the testing phase, normalized to milliseconds per sample, which is critical for real-time risk control. Finally, GPU memory usage measures the peak memory consumption during the training and inference phases to determine the hardware constraints for deployment in resource-limited environments. The experimental results reported in this paper were obtained independently under the unified experimental configuration of this study.

4.2. Overall Performance Comparison

This experiment was conducted to comprehensively evaluate the overall performance of the proposed model in low-resource financial anomaly detection tasks and to systematically compare it with traditional methods, temporal models, and multilingual pretrained language models. Through this comparison, the effectiveness and stability of LR-SSAD in complex, low-resource, and multilingual scenarios were examined. By simultaneously reporting multiple evaluation metrics, including accuracy, precision, recall, F1-score, AUC, and AP, the experiment not only assessed overall classification correctness but also emphasized the model’s ability to identify anomalous samples, discriminate rankings, and maintain robustness under highly imbalanced class distributions.
As shown in Table 2, traditional methods based on statistical and geometric assumptions, such as Isolation Forest and One-Class SVM, exhibited limited overall performance. This limitation primarily arises because their core assumptions rely on approximate convexity or density separability in the feature space, which are insufficient for characterizing highly nonlinear, cross-modal, and semantically driven anomalous behavior structures in real financial data. Although AutoEncoder improves feature expressiveness through nonlinear mappings, its reconstruction objective mainly focuses on sample-level numerical errors and lacks explicit modeling of semantic-level anomalies and temporal dependencies, resulting in noticeable performance gaps in recall and AUC. BiLSTM introduces temporal modeling capability and achieves further improvements across multiple metrics, indicating that temporal dependencies in behavioral sequences play an important role in anomaly detection. However, its limitation lies in the absence of a unified representation space between sequence modeling and textual semantics, making it difficult to eliminate interference caused by cross-lingual semantic noise.
From the perspective of deep representation models, mBERT and XLM-R leverage large-scale multilingual pretraining to achieve substantially stronger semantic modeling capabilities than traditional methods. Particularly in precision and AUC, these models demonstrate enhanced discriminative power, suggesting that shared semantic spaces effectively reduce expression discrepancies among multilingual financial texts, as shown in Figure 4. Nevertheless, such models remain fundamentally centered on supervised or weakly supervised semantic matching and lack explicit mechanisms for modeling normal behavior distributions. Consequently, their detection capability becomes constrained when anomalous samples are highly concealed in the semantic space. In contrast, LR-SSAD achieves the best results across all evaluation metrics, with especially pronounced advantages in recall, F1-score, and AUC, reflecting stronger capacity in balancing false positives and false negatives and in shaping robust anomaly decision boundaries. From a mechanistic perspective, this advantage arises from the jointly designed self-supervised framework that integrates cross-lingual masked prediction and behavior sequence reconstruction, enabling stable learning of normal distribution structures in both semantic and temporal spaces and linking textual semantic drift with behavioral anomalies through a unified latent space. Moreover, the pseudo-label noise suppression and stable training mechanism further constrains the optimization trajectory, reducing the perturbation of decision boundaries caused by noisy samples in low-resource settings and mathematically ensuring stable convergence under high-dimensional, non-stationary distributions. The combined effect of these constraints allows LR-SSAD to achieve superior generalization performance in complex financial anomaly detection tasks.

4.3. Cross-Lingual Generalization Performance

This experiment was designed to systematically evaluate the generalization capability and stability of different anomaly detection models under cross-lingual transfer conditions, particularly in low-resource language scenarios. Unlike the overall performance comparison, explicit discrepancies were introduced between training and testing languages to characterize the models’ ability to preserve anomalous patterns under substantial semantic distribution shifts. By jointly considering accuracy, F1-score, AUC, AP, and FAR, performance degradation during cross-lingual transfer was assessed from multiple perspectives, including detection completeness, ranking discrimination, and false alarm control.
As shown in Table 3, traditional unsupervised methods such as Isolation Forest and One-Class SVM experience significant performance degradation under cross-lingual conditions. This behavior can be attributed to their reliance on distance-based or boundary-based structures in the feature space, which are directly disrupted by semantic distribution shifts caused by cross-lingual transfer, leading to substantial overlap between anomalous and normal samples in the target language. Although AutoEncoder exhibits certain nonlinear modeling capabilities, its reconstruction objective essentially learns numerical distributions from the training data, which undergo systematic shifts across languages. As a result, some anomalous samples are misclassified as reconstructable, limiting overall performance improvement. In contrast, mBERT and XLM-R benefit from large-scale multilingual pretraining and exhibit a degree of language invariance at the semantic space level, outperforming traditional methods across all metrics. Their relative stability in AUC and AP indicates that shared multilingual semantics effectively reduce transfer loss.
Nevertheless, models exclusively based on pretrained language representations continue to face challenges in low-resource settings, where semantic coverage is insufficient and contextual signals are incomplete. Their semantic alignment capability mainly stems from statistical patterns learned from high-resource languages, and decision boundaries may become unstable when expression forms differ substantially in the target language. By comparison, LR-SSAD consistently achieves superior performance across all metrics, with particularly notable improvements in F1-score and FAR, indicating that the model simultaneously enhances anomaly recall while effectively reducing false alarms induced by cross-lingual transfer. From a theoretical perspective, this performance gain arises from the joint stabilization of both semantic and behavioral spaces. The cross-lingual masked prediction mechanism drives semantic representations from different languages toward consistent distributions under shared parameters, weakening the impact of linguistic discrepancies on the discriminative structure. Meanwhile, the behavior sequence reconstruction module introduces temporal dynamical constraints that provide language-independent stability, allowing anomalies to be detected through behavioral deviations rather than relying solely on textual cues. Additionally, the pseudo-label noise suppression and stable training strategy effectively limits the influence of uncertain samples on model updates during cross-lingual transfer, ensuring that decision boundaries remain smooth and continuous in the target language. The synergy of these mechanisms reduces the sensitivity of the decision function to cross-domain distribution changes, thereby endowing LR-SSAD with strong cross-lingual generalization capability.

4.4. Ablation Study of Different Components

The ablation study was designed to systematically verify the independent contributions and collaborative effects of each core component in LR-SSAD on overall anomaly detection performance. By progressively removing key modules and observing performance variations, the experiment aimed to identify whether the performance gains originate from specific structural designs rather than parameter scale or incidental optimization effects.
As shown in Table 4, a notable degradation was observed in accuracy, recall, and AUC after the cross-lingual masked prediction module was removed, with the most pronounced decline occurring in AUC. This phenomenon indicates that, without cross-lingual semantic constraints, the model becomes more susceptible to language distribution discrepancies, leading to reduced ranking and discrimination capability. When the behavior sequence reconstruction module was removed, an even larger performance drop was observed, with both F1-score and AUC falling below the levels obtained after removing the semantic module. This result suggests that, in financial anomaly detection tasks, reliance on textual semantics alone is insufficient to fully characterize anomalous behaviors, and that stable information provided by temporal behavioral structures plays a critical role in anomaly identification. In contrast, removing the pseudo-label noise suppression module resulted in a relatively smaller performance decrease, yet consistent degradation was still observed across precision, recall, and AUC, indicating that pseudo-label stability exerts a sustained and cumulative influence on final convergence quality.
As shown in Figure 5, the performance impact of each module is closely related to the mathematical constraints introduced into the model. The cross-lingual masked prediction module imposes consistent prediction constraints across different languages within a shared parameter space, effectively reducing the sensitivity of semantic representations to language-specific conditions and ensuring smoother variations of the decision function under cross-lingual inputs. When this module is absent, the distribution gap among samples from different languages in the feature space increases, causing instability in anomaly decision boundaries across languages. The behavior sequence reconstruction module captures the temporal evolution of normal transaction behaviors, introducing a language-independent low-dimensional structural constraint that makes anomalies manifest as deviations from this structure. When this module is removed, the model relies solely on semantic similarity for discrimination, significantly reducing its ability to detect subtle and progressive anomalies. Although the pseudo-label noise suppression module does not directly alter feature representations, it limits the gradient contributions of uncertain samples during optimization, thereby mathematically attenuating noise-induced perturbations in parameter update trajectories. Its absence causes the model to gradually deviate from the optimal discriminative solution in later training stages.
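To make the reconstruction-based constraint concrete, the following minimal sketch scores sequences by their reconstruction error; a lightweight GRU autoencoder stands in for the Mamba-based module used in LR-SSAD purely for brevity, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class SeqReconstructor(nn.Module):
    """Stand-in for the behavior sequence reconstruction module (GRU, not Mamba)."""
    def __init__(self, feat_dim=16, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                      # x: (batch, seq_len, feat_dim)
        _, h = self.encoder(x)                 # h: (1, batch, hidden_dim)
        # Repeat the summary state at every step and decode back to features.
        dec_in = h.transpose(0, 1).repeat(1, x.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)

def anomaly_score(model, x):
    """Deviation from the learned normal temporal structure = reconstruction error."""
    with torch.no_grad():
        recon = model(x)
    return ((recon - x) ** 2).mean(dim=(1, 2))  # one score per sequence

# Illustrative usage: higher scores flag sequences that deviate from normal behavior.
model = SeqReconstructor()
scores = anomaly_score(model, torch.randn(8, 50, 16))
```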

4.5. Discussion

The proposed LR-SSAD exhibits strong practical relevance in real-world financial risk control scenarios, particularly in high-risk and low-annotation environments such as cross-border payment systems, digital asset trading platforms, and multilingual customer service systems. In cross-border payment operations, banks and third-party payment providers are required to process transaction descriptions, account behaviors, and customer communications originating from multiple countries and regions, where a substantial portion of low-resource language transactions lack historical annotations. Traditional rule-based systems or models trained primarily on high-resource languages are prone to missed detections and false alarms in such settings. By constructing a unified representation in a multilingual semantic space and jointly modeling account-level transaction dynamics, LR-SSAD enables the identification of potentially anomalous accounts without reliance on fine-grained labels, thereby providing more precise candidate selections for manual review. In digital asset trading scenarios, anomalous arbitrage or money laundering activities often emerge through multi-step operations rather than isolated transactions, and the behavior sequence reconstruction module is capable of capturing long-term behavioral drift, thus enhancing sensitivity to concealed risks. Meanwhile, cross-lingual user descriptions of transaction purposes often exhibit diverse linguistic expressions, and the cross-lingual masked prediction mechanism effectively mitigates the influence of surface-level language differences, allowing risk assessment to focus on underlying semantic intent. In practical deployment, the pseudo-label noise suppression mechanism further alleviates the impact of distribution shifts introduced by continuously updated data, enabling the model to maintain stable performance when encountering new business types, emerging languages, or evolving transaction patterns.
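As a hypothetical post-processing step illustrating how model scores could feed manual review, the sketch below ranks accounts by anomaly score under a fixed review budget; the function and field names are assumptions and not part of LR-SSAD itself.

```python
import numpy as np
import pandas as pd

def select_review_candidates(account_ids, anomaly_scores, budget=100):
    """Rank accounts by anomaly score and return the top-`budget` for manual review."""
    ranked = pd.DataFrame({"account_id": account_ids, "score": anomaly_scores})
    ranked = ranked.sort_values("score", ascending=False).head(budget)
    return ranked.reset_index(drop=True)

# Illustrative usage with synthetic scores.
ids = np.arange(1000)
scores = np.random.default_rng(1).random(1000)
print(select_review_candidates(ids, scores, budget=20))
```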

4.6. Limitation and Future Work

Although LR-SSAD demonstrates strong robustness and cross-lingual generalization capability in low-resource financial anomaly detection tasks, several directions remain worthy of further investigation. On one hand, the integration of cross-lingual semantic modeling and behavior sequence reconstruction results in a relatively complex architecture, which imposes certain computational and latency requirements when deployed in large-scale financial systems, indicating potential for further optimization to balance detection accuracy and inference efficiency. On the other hand, the current study focuses primarily on textual and behavioral sequence data, whereas real-world financial scenarios also involve multi-source information such as graph-structured relationships, geographic attributes, and device fingerprints. Extending LR-SSAD toward a multimodal anomaly detection framework may further enhance comprehensive risk perception and better support the long-term operation of real cross-border risk control systems.

5. Conclusions

With the rapid development of cross-border financial services and digital asset trading, multilingual financial data have grown continuously, while concealed anomalous behaviors and severe label scarcity in low-resource language scenarios have become increasingly prominent, placing higher demands on traditional risk identification methods. To address these challenges, a unified self-supervised anomaly detection framework, LR-SSAD, is proposed for low-resource financial anomaly detection. The framework targets effective risk identification under weakly supervised or even unlabeled conditions from two key perspectives, namely cross-lingual semantic modeling and behavioral temporal modeling. In terms of methodological design, cross-lingual masked prediction is employed to enforce semantic consistency across languages, while a behavior sequence reconstruction mechanism is introduced to characterize the temporal evolution of normal transactions. Extensive experimental evaluations demonstrate that the proposed method achieves significant performance advantages in the overall comparison, with accuracy reaching 0.932, precision and recall reaching 0.914 and 0.891, respectively, and an F1-score of 0.902, while its AUC and AP consistently outperform those of multiple traditional anomaly detection methods and multilingual pretrained models.

Author Contributions

Conceptualization, M.W., N.W., L.M. and M.L.; Data curation, X.L. and S.H.; Formal analysis, Y.L.; Funding acquisition, M.L.; Investigation, Y.L.; Methodology, M.W., N.W. and L.M.; Project administration, M.L.; Resources, X.L. and S.H.; Software, M.W., N.W. and L.M.; Supervision, M.L.; Validation, Y.L.; Visualization, X.L. and S.H.; Writing—original draft, M.W., N.W., L.M., Y.L., X.L., S.H. and M.L. M.W., N.W. and L.M. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Federici, F.M. Translating hazards: Multilingual concerns in risk and emergency communication. Translator 2022, 28, 375–398. [Google Scholar] [CrossRef]
  2. Liu, X.Y.; Wang, G.; Yang, H.; Zha, D. Fingpt: Democratizing internet-scale data for financial large language models. arXiv 2023, arXiv:2307.10485. [Google Scholar]
  3. Chen, A.; Wei, Y.; Le, H.; Zhang, Y. Learning by teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education. Br. J. Educ. Technol. 2024, 27, 275–298. [Google Scholar] [CrossRef]
  4. Inaltong, N.U. Anti-Money Laundering Practices in the Scope of Risk Mitigation and Comparison with Anti-Money Laundering Regulations. SSRN Electron. J. 2025. Available online: https://ssrn.com/abstract=5215578 (accessed on 15 September 2025).
  5. Saxena, C. Identifying transaction laundering red flags and strategies for risk mitigation. J. Money Laund. Control 2024, 27, 1063–1077. [Google Scholar] [CrossRef]
  6. Komadina, A.; Martinić, M.; Groš, S.; Mihajlović, Ž. Comparing threshold selection methods for network anomaly detection. IEEE Access 2024, 12, 124943–124973. [Google Scholar] [CrossRef]
  7. Chiu, Y.T.; Bai, Z.H. Translation or multilingual retrieval? evaluating cross-lingual search strategies for traditional chinese financial documents. In Proceedings of the FinTech in AI CUP Special Session, Tokyo, Japan, 10 June 2025. [Google Scholar]
  8. Guo, P.; Ren, Y.; Hu, Y.; Li, Y.; Zhang, J.; Zhang, X.; Huang, H.Y. Teaching large language models to translate on low-resource languages with textbook prompting. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 15685–15697. [Google Scholar]
  9. Ogueji, K.; Zhu, Y.; Lin, J. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic, 11 November 2021; pp. 116–126. [Google Scholar]
  10. Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; VDE: Berlin, Germany, 2021; pp. 1–8. [Google Scholar]
  11. Barbieri, F.; Anke, L.E.; Camacho-Collados, J. XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 258–266. [Google Scholar]
  12. Mazumder, M.T.R.; Shourov, M.S.H.; Rasul, I.; Akter, S.; Miah, M.K. Anomaly Detection in Financial Transactions Using Convolutional Neural Networks. J. Econ. Financ. Account. Stud. 2025, 7, 195–207. [Google Scholar] [CrossRef]
  13. Aliyu, Y.; Sarlan, A.; Danyaro, K.U.; Rahman, A.S. Comparative Analysis of Transformer Models for Sentiment Analysis in Low-Resource Languages. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 353. [Google Scholar] [CrossRef]
  14. Freire, M.B. Unsupervised Deep Learning to Supervised Interpretability: A Dual-Stage Approach for Financial Anomaly Detection. Master’s Thesis, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil, 2024. [Google Scholar]
  15. Niu, S.; Liu, Y.; Wang, J.; Song, H. A decade survey of transfer learning (2010–2020). IEEE Trans. Artif. Intell. 2021, 1, 151–166. [Google Scholar] [CrossRef]
  16. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071. [Google Scholar] [CrossRef]
  17. Kumar, P.; Rawat, P.; Chauhan, S. Contrastive self-supervised learning: Review, progress, challenges and future research directions. Int. J. Multimed. Inf. Retr. 2022, 11, 461–488. [Google Scholar] [CrossRef]
  18. Tondji, G. Linguistic (In) Security and Persistence in Doctoral Studies: A Mixed-Methods Study of the Impact of Metapragmatic Discourses on the Persistence of Multilingual Doctoral Students. Ph.D. Thesis, The University of Texas Rio Grande Valley, Edinburg, TX, USA, 2025. [Google Scholar]
  19. Li, Q.; Ren, J.; Zhang, Y.; Song, C.; Liao, Y.; Zhang, Y. Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  20. Zhang, K.; Wen, Q.; Zhang, C.; Cai, R.; Jin, M.; Liu, Y.; Zhang, J.Y.; Liang, Y.; Pang, G.; Song, D.; et al. Self-supervised learning for time series analysis: Taxonomy, progress, and prospects. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6775–6794. [Google Scholar] [CrossRef] [PubMed]
  21. Sehwag, V.; Chiang, M.; Mittal, P. Ssd: A unified framework for self-supervised outlier detection. arXiv 2021, arXiv:2103.12051. [Google Scholar] [CrossRef]
  22. Wu, Z.; Yang, X.; Wei, X.; Yuan, P.; Zhang, Y.; Bai, J. A self-supervised anomaly detection algorithm with interpretability. Expert Syst. Appl. 2024, 237, 121539. [Google Scholar] [CrossRef]
  23. Wang, Y.; Qin, C.; Wei, R.; Xu, Y.; Bai, Y.; Fu, Y. Self-supervision meets adversarial perturbation: A novel framework for anomaly detection. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 4555–4559. [Google Scholar]
  24. Pospieszny, P.; Mormul, W.; Szyndler, K.; Kumar, S. ADALog: Adaptive Unsupervised Anomaly detection in Logs with Self-attention Masked Language Model. arXiv 2025, arXiv:2505.13496. [Google Scholar]
  25. Chi, Z.; Dong, L.; Wei, F.; Yang, N.; Singhal, S.; Wang, W.; Song, X.; Mao, X.L.; Huang, H.Y.; Zhou, M. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 3576–3588. [Google Scholar]
  26. Alhawiti, K.M. Multi-Modal Decentralized Hybrid Learning for Early Parkinson’s Detection Using Voice Biomarkers and Contrastive Speech Embeddings. Sensors 2025, 25, 6959. [Google Scholar] [CrossRef] [PubMed]
  27. Ma, Y. Cross-language text generation using mbert and xlm-r: English-chinese translation task. In Proceedings of the 2024 International Conference on Machine Intelligence and Digital Applications, Ningbo, China, 30–31 May 2024; pp. 602–608. [Google Scholar]
  28. Goyal, N.; Du, J.; Ott, M.; Anantharaman, G.; Conneau, A. Larger-scale transformers for multilingual masked language modeling. arXiv 2021, arXiv:2105.00572. [Google Scholar] [CrossRef]
  29. Al-Laith, A. Exploring the Effectiveness of Multilingual and Generative Large Language Models for Question Answering in Financial Texts. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 230–235. [Google Scholar]
  30. Han, Y.; Qi, Z.; Tian, Y. Anomaly classification based on self-supervised learning and its application. J. Radiat. Res. Appl. Sci. 2024, 17, 100918. [Google Scholar] [CrossRef]
  31. Wettig, A.; Gao, T.; Zhong, Z.; Chen, D. Should you mask 15% in masked language modeling? In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 2985–3000. [Google Scholar]
  32. Yu, C.W.; Chuang, Y.S.; Lotsos, A.N.; Meier, T.; Haase, C.M. The More Similar, the Better? Associations Between Latent Semantic Similarity and Emotional Experiences Differ Across Conversation Contexts. arXiv 2025, arXiv:2309.12646. [Google Scholar] [CrossRef]
  33. Tan, X.; Qin, T.; Bian, J.; Liu, T.Y.; Bengio, Y. Regeneration learning: A learning paradigm for data generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 22614–22622. [Google Scholar]
  34. Tang, Y.; Khatchadourian, R.; Bagherzadeh, M.; Singh, R.; Stewart, A.; Raja, A. An empirical study of refactorings and technical debt in machine learning systems. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 25–28 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 238–250. [Google Scholar]
  35. Li, S.; Zhang, L.; Wang, Z.; Wu, D.; Wu, L.; Liu, Z.; Xia, J.; Tan, C.; Liu, Y.; Sun, B.; et al. Masked modeling for self-supervised representation learning on vision and beyond. arXiv 2023, arXiv:2401.00897. [Google Scholar]
  36. Huang, K.H.; Ahmad, W.; Peng, N.; Chang, K.W. Improving zero-shot cross-lingual transfer learning via robust training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1684–1697. [Google Scholar]
  37. Han, W.; Pang, B.; Wu, Y.N. Robust transfer learning with pretrained language models through adapters. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International joint Conference on Natural Language Processing (Volume 2: Short Papers), Virtual Event, 1–6 August 2021; pp. 854–861. [Google Scholar]
  38. Jafari, A.R.; Heidary, B.; Farahbakhsh, R.; Salehi, M.; Jalili, M. Transfer Learning for Multi-lingual Tasks—A Survey. arXiv 2021, arXiv:2110.02052. [Google Scholar]
  39. Zhang, W.; Deng, L.; Zhang, L.; Wu, D. A survey on negative transfer. IEEE/CAA J. Autom. Sin. 2022, 10, 305–329. [Google Scholar] [CrossRef]
  40. Hong, S.; Lee, S.; Moon, H.; Lim, H.S. MIGRATE: Cross-Lingual Adaptation of Domain-Specific LLMs through Code-Switching and Embedding Transfer. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 9184–9193. [Google Scholar]
  41. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  42. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar]
  43. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
  44. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Washington, DC, USA, 15–19 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 413–422. [Google Scholar]
  45. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  46. Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the support of a high-dimensional distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration of the cross-lingual masked prediction module.
Figure 2. Illustration of the behavior sequence reconstruction module.
Figure 3. Illustration of the pseudo-label noise suppression and stable training module.
Figure 4. Overall performance comparison with baseline anomaly detection methods.
Figure 5. Ablation study of different components in LR-SSAD.
Table 1. Overview of the collected multilingual financial anomaly detection dataset.

Data Type | Source | Language Coverage | Sample Size
Financial texts | Cross-border payment logs, customer records, online platforms | High- and low-resource languages | 1,200,000
Transaction sequences | Payment systems and digital asset platforms | Language-independent | 420,000 accounts
Labeled anomalies | Rule-based systems and expert review | Multilingual | 38,500
Normal samples | Long-term stable accounts | Multilingual | 381,500
Table 2. Overall performance comparison with baseline anomaly detection methods. The results are reported as “Mean ± Standard Deviation” across 5 independent runs. The best results are highlighted in bold. * indicates statistical significance (p < 0.05) compared to the second-best-performing model (XLM-R) based on paired t-tests.

Method | Accuracy | Precision | Recall | F1-Score | AUC | AP | Inf. Time (ms) | Memory (GB)
Isolation Forest [44] | 0.842 ± 0.005 | 0.811 ± 0.006 | 0.768 ± 0.007 | 0.789 ± 0.006 | 0.861 ± 0.005 | 0.803 ± 0.006 | 0.8 ± 0.1 | 0.1 ± 0.0
One-Class SVM [46] | 0.856 ± 0.004 | 0.832 ± 0.005 | 0.781 ± 0.006 | 0.806 ± 0.005 | 0.874 ± 0.004 | 0.819 ± 0.005 | 1.2 ± 0.2 | 0.2 ± 0.0
AutoEncoder [45] | 0.871 ± 0.004 | 0.845 ± 0.004 | 0.802 ± 0.005 | 0.823 ± 0.004 | 0.891 ± 0.003 | 0.836 ± 0.004 | 2.5 ± 0.3 | 1.1 ± 0.1
BiLSTM [43] | 0.884 ± 0.003 | 0.862 ± 0.003 | 0.817 ± 0.004 | 0.839 ± 0.003 | 0.903 ± 0.003 | 0.852 ± 0.003 | 3.1 ± 0.2 | 1.4 ± 0.1
mBERT [41] | 0.893 ± 0.003 | 0.871 ± 0.002 | 0.836 ± 0.003 | 0.853 ± 0.002 | 0.912 ± 0.002 | 0.864 ± 0.002 | 11.5 ± 0.8 | 4.2 ± 0.2
XLM-R [42] | 0.901 ± 0.002 | 0.879 ± 0.002 | 0.842 ± 0.003 | 0.860 ± 0.002 | 0.919 ± 0.002 | 0.871 ± 0.002 | 12.8 ± 0.9 | 4.8 ± 0.3
LR-SSAD (Ours) | 0.932 ± 0.002 * | 0.914 ± 0.002 * | 0.891 ± 0.003 * | 0.902 ± 0.002 * | 0.948 ± 0.001 * | 0.923 ± 0.002 * | 4.8 ± 0.4 | 2.6 ± 0.1
Table 3. Cross-lingual generalization performance on low-resource languages. The results are reported as “Mean ± Standard Deviation” across 5 independent runs. The best results are highlighted in bold. * indicates statistical significance (p < 0.05) compared to the second-best-performing model (XLM-R) based on paired t-tests.

Method | Accuracy | F1-Score | AUC | AP | FAR ↓ | Inf. Time (ms) | Memory (GB)
Isolation Forest | 0.801 ± 0.007 | 0.742 ± 0.008 | 0.823 ± 0.006 | 0.761 ± 0.007 | 0.148 ± 0.005 | 0.8 ± 0.1 | 0.1 ± 0.0
One-Class SVM | 0.814 ± 0.006 | 0.758 ± 0.007 | 0.836 ± 0.005 | 0.774 ± 0.006 | 0.136 ± 0.004 | 1.2 ± 0.2 | 0.2 ± 0.0
AutoEncoder | 0.829 ± 0.005 | 0.779 ± 0.006 | 0.851 ± 0.004 | 0.791 ± 0.005 | 0.129 ± 0.003 | 2.5 ± 0.3 | 1.1 ± 0.1
mBERT | 0.846 ± 0.004 | 0.801 ± 0.004 | 0.867 ± 0.003 | 0.814 ± 0.004 | 0.116 ± 0.003 | 11.5 ± 0.8 | 4.2 ± 0.2
XLM-R | 0.858 ± 0.003 | 0.816 ± 0.003 | 0.879 ± 0.002 | 0.827 ± 0.003 | 0.109 ± 0.002 | 12.8 ± 0.9 | 4.8 ± 0.3
LR-SSAD (Ours) | 0.901 ± 0.002 * | 0.862 ± 0.002 * | 0.914 ± 0.001 * | 0.889 ± 0.002 * | 0.081 ± 0.001 * | 4.8 ± 0.4 | 2.6 ± 0.1
Table 4. Ablation study of different components in LR-SSAD. The results are reported as “Mean ± Standard Deviation” across 5 independent runs. The best results are highlighted in bold. * indicates statistical significance (p < 0.05) compared to the best-performing variant (LR-SSAD w/o PLNS) based on paired t-tests.

Model Variant | Accuracy | Precision | Recall | F1-Score | AUC | AP
LR-SSAD w/o CLMP | 0.894 ± 0.004 | 0.871 ± 0.004 | 0.843 ± 0.005 | 0.857 ± 0.004 | 0.906 ± 0.003 | 0.875 ± 0.004
LR-SSAD w/o BSR | 0.887 ± 0.005 | 0.865 ± 0.005 | 0.832 ± 0.006 | 0.848 ± 0.005 | 0.899 ± 0.004 | 0.862 ± 0.005
LR-SSAD w/o PLNS | 0.902 ± 0.003 | 0.881 ± 0.003 | 0.854 ± 0.004 | 0.867 ± 0.003 | 0.912 ± 0.002 | 0.885 ± 0.003
LR-SSAD (full model) | 0.932 ± 0.002 * | 0.914 ± 0.002 * | 0.891 ± 0.003 * | 0.902 ± 0.002 * | 0.948 ± 0.001 * | 0.923 ± 0.002 *