Article

MSF-Net: A Data-Driven Multimodal Transformer for Intelligent Behavior Recognition and Financial Risk Reasoning in Virtual Live-Streaming

1 School of Management, Guangzhou College of Technology and Business, Guangzhou 510850, China
2 National School of Development, Peking University, Beijing 100871, China
3 China Agricultural University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4769; https://doi.org/10.3390/electronics14234769
Submission received: 13 November 2025 / Revised: 27 November 2025 / Accepted: 2 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)

Abstract

With the rapid advancement of virtual human technology and live-streaming e-commerce, virtual anchors have increasingly become key interactive entities in the digital economy. However, emerging issues such as fake reviews, abnormal tipping, and illegal transactions pose significant threats to platform financial security and user privacy. To address these challenges, a multimodal emotion–finance fusion security recognition framework (MSF-Net) is proposed, which integrates visual, audio, textual, and financial transaction signals to achieve cross-modal feature alignment and multi-signal risk modeling. The framework consists of three core modules: the multimodal alignment transformer (MAT), the fake review detection (FRD) module, and the multi-signal fusion decision module (MSFDM), enabling deep integration of semantic consistency modeling and emotion–behavior collaborative recognition. Experimental results demonstrate that MSF-Net achieves superior performance in virtual live-streaming financial security detection, reaching a precision of 0.932, a recall of 0.924, an F1-score of 0.928, an accuracy of 0.931, and an area under curve (AUC) of 0.956, while maintaining a real-time inference speed of 60.7 FPS, indicating outstanding precision and responsiveness. The ablation experiments further verify the necessity of each module, as the removal of any component leads to an F1-score decrease exceeding 4%, confirming the structural validity of the model’s hierarchical fusion design. In addition, a lightweight version of MSF-Net was developed through parameter distillation and quantization pruning techniques, achieving real-time deployment on mobile devices with an average latency of only 19.4 milliseconds while maintaining an F1-score of 0.923 and an AUC of 0.947. The results indicate that MSF-Net exhibits both innovation and practicality in multimodal deep fusion and security risk recognition, offering a scalable solution for intelligent risk control in data-driven artificial intelligence applications across financial and virtual interaction domains.

1. Introduction

In recent years, with the rapid advancement of digital technologies, virtual humans have gradually emerged as important carriers in the fields of entertainment, marketing, and education, particularly demonstrating broad application prospects in live e-commerce and digital tourism promotion [1]. Virtual humans can attract user attention through anthropomorphic appearances and vivid interactive behaviors, enhancing user experience and purchase conversion rates while exhibiting higher controllability and scalability compared with traditional human anchors [2]. However, along with the commercial value they bring, virtual human live-streaming platforms also expose a series of potential financial security risks [3]. Examples include fake reviews, malicious tipping, abnormal payment transactions, and virtual gift money laundering, which not only disrupt market order but may also lead to user property losses, privacy leakage, and reputational damage to platforms [4]. In virtual live environments, users usually find it difficult to assess information authenticity, and traditional approaches relying on manual review or rule-based detection fail to achieve full-scale and efficient risk prevention [5]. Therefore, achieving multimodal financial security recognition in virtual live-streaming scenarios has become a critical issue that needs to be addressed by both academia and industry.
In early studies of financial security and virtual live-streaming, conventional methods mainly relied on rule-based and statistical feature detection [6]. For instance, in fake review detection, lexicon-based or frequency-statistical approaches have been employed to analyze textual patterns, identify repeated phrases, abnormal rating distributions, or keyword frequencies to determine authenticity [7]. For payment and tipping behaviors, typical techniques include transaction amount thresholding, user activity statistics, and behavioral path analysis [8]. Although these approaches exhibit a certain degree of interpretability and intuitiveness, they face significant limitations in practice. First, traditional rule-based methods suffer from poor adaptability to complex and dynamic environments, making them inadequate for handling the highly diverse interaction data in virtual live streams [9]. Second, single-modality features fail to capture cross-modal dependencies; for example, abnormal tipping behaviors are often correlated with the anchor’s facial expressions, tone of voice, and interactive comments—information typically ignored by traditional approaches [10]. Moreover, such methods often depend on expert knowledge and manual feature engineering, which cannot efficiently process large-scale, heterogeneous, multi-source data and are vulnerable to newly emerging fraud patterns [11]. With the increasing complexity of virtual human content and user interactions, traditional single-modality or rule-based financial risk detection methods can no longer meet the requirements of real-time response, accuracy, and scalability.
To address these limitations, deep learning technologies have been introduced into the field of financial security recognition for virtual live-streaming [12]. Deep neural networks, particularly convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformer architectures, have achieved remarkable success in text sentiment analysis, speech emotion recognition, and visual behavior understanding [13]. Xu et al. [14] proposed MEMF, which integrates multiple entities such as anchors, products, and audiences with multimodal information including visual, audio, textual, and transactional signals, significantly improving live e-commerce sales forecasting accuracy. Song et al. [15] introduced a dynamic multimodal data fusion framework for digital financial risk perception and intelligent prevention, achieving millisecond-level real-time risk identification and adaptive control verification. Karjee et al. [16] presented a lightweight multimodal fusion computing model for Emotional Streaming on edge platforms, realizing real-time emotion stream recognition using visual, auditory, and textual data with lightweight deployment and low-latency inference. Sun et al. [17] proposed a two-level multimodal fusion strategy for sentiment analysis in public security, achieving superior accuracy through data- and decision-level fusion across text, audio, and video modalities. Nevertheless, these methods still demonstrate deficiencies in financial security scenarios [18]. On one hand, they primarily focus on user experience or emotional understanding while lacking the ability to model financial signals such as payment behaviors or transaction flows [19]. On the other hand, existing multimodal approaches often struggle to achieve fine-grained alignment of cross-modal features and temporal sequences, thereby failing to fully capture dynamic correlations among behavioral, emotional, and transactional cues in virtual live streams [20]. Consequently, effectively integrating multimodal sentiment information with financial behavioral data to achieve real-time and accurate risk identification remains a critical research challenge.
To address the aforementioned issues, a multimodal sentiment-finance fusion network (MSF-Net) tailored for virtual human live-streaming scenarios is proposed in this study. The method integrates multi-source data including visual, audio, textual, and transactional streams, and employs a multimodal Transformer for cross-modal feature alignment combined with transactional behavior modeling to jointly detect fake reviews, abnormal tipping, and illicit payments in live commerce. Specifically, MSF-Net consists of three key modules: (1) the multimodal alignment transformer (MAT), which performs semantic alignment of visual, auditory, and textual features through cross-modal attention and captures temporal dependencies between behaviors and transaction signals via a temporal transformer; (2) the fake review detection module (FRD), which utilizes a dual-channel semantic–affective network combining contrastive learning and transaction frequency features to identify fake reviews and anomalous interactions; and (3) the multi-signal fusion decision module (MSFDM), which integrates emotional, behavioral, and financial signals through multi-channel attention fusion to output a comprehensive financial security index for real-time alerting and risk management. Compared with traditional single-modality or rule-based methods, MSF-Net enables deep cross-modal fusion and temporal correlation modeling, thereby achieving high-precision and low-latency security recognition in complex virtual live environments. The main contributions of this work are summarized as follows:
  • We construct, to the best of our knowledge, the first standardized multimodal financial security dataset tailored for virtual live-streaming, integrating synchronized visual, audio, textual, and transactional streams, thereby providing a reproducible benchmark for subsequent research on virtual human risk analysis.
  • We design a cross-signal feature alignment network based on a multimodal Transformer, which performs unified temporal modeling over virtual human behaviors and transaction sequences, enabling the capture of subtle and long-range behavior–finance anomalies that cannot be represented by single-modality or transaction-only models.
  • We propose a fake review detection and anti-fraud fusion mechanism that jointly leverages semantic content, affective dynamics, and transactional frequency patterns, allowing MSF-Net to simultaneously address fake reviews, abnormal tipping, and illicit payment behaviors within a single coherent framework.
  • We implement MSF-Net and conduct comprehensive evaluations in controlled virtual live-streaming environments, as well as on an external multimodal benchmark, demonstrating promising performance in terms of accuracy, real-time capability, and interpretability compared with representative financial risk detection baselines.
Through the above contributions, this study not only expands the theoretical boundary of multimodal sentiment recognition and financial security detection but also provides practical technological solutions for risk prevention and control in virtual human live-streaming platforms.

2. Related Work

2.1. Virtual Human Live Streaming and Multimodal Analysis

Virtual human and digital human technologies have rapidly advanced across entertainment, marketing, and education applications, and have become increasingly integrated into live e-commerce platforms [21]. Existing systems typically rely on two major technical foundations—vision-driven and voice-driven interaction frameworks [22]. Vision-driven approaches focus on facial expression analysis, lip synchronization, and pose generation by mapping visual frames into high-dimensional feature representations [23], where deep models extract facial key points, action units (AUs), and body-motion dynamics [24]. Similarly, voice-driven models capture prosodic cues such as tone, rhythm, and intensity to enhance the perceived expressiveness and realism of virtual anchors.
Although multimodal information has been widely used to improve virtual human interactivity and emotional expressiveness, prior work primarily concentrates on content generation, animation quality, and user engagement. Only limited research explores how multimodal cues—such as emotional fluctuations, linguistic inconsistencies, or behavior–transaction correlations—can be leveraged for risk perception or security assessment in live-streaming environments [25]. Existing virtual human live-streaming systems generally lack mechanisms to monitor suspicious behaviors such as abnormal tipping, coordinated fake reviews, bot-driven interactions, or laundering-like virtual gifting patterns. More critically, prior work does not provide a unified framework that jointly models visual, audio, textual, and financial signals, nor does it establish cross-modal alignment mechanisms necessary for detecting subtle fraudulent behaviors. These limitations highlight the need for a multimodal, behavior-aware financial security system capable of integrating emotional expression, interaction semantics, and transactional dynamics—an area that remains largely underexplored and that MSF-Net aims to address.

2.2. Sentiment Recognition and Cross-Modal Feature Fusion

Multimodal sentiment recognition has been extensively explored in virtual human interaction and social media analysis [26,27], where the goal is to jointly model visual, audio, and textual cues to understand user or anchor emotions [28,29]. Transformer-based fusion architectures such as MMBT, CLIP, and multimodal Transformers introduce cross-modal self-attention to capture correlations among facial expressions, vocal prosody, and semantic content, achieving notable advances in emotion classification accuracy and robustness [30]. Despite these successes, existing methods are primarily developed for entertainment, affective computing, or advertising contexts, and they exhibit several limitations when transferred to financial security analysis in virtual live-streaming environments.
First, current sentiment models do not incorporate financial transactional signals—such as payment bursts, tipping irregularities, or anomalous money-flow patterns—which are essential for identifying high-risk behaviors [31]. Their feature space focuses on affective and semantic cues rather than behavior–transaction interactions, restricting their ability to detect fraudulent or coordinated manipulations. Second, most existing fusion strategies operate within fixed or short temporal windows, making it difficult to capture long-range dependencies across emotional trajectories, user–anchor interactions, and evolving transaction streams [32]. This constraint limits their capacity to model cross-modal causality, such as emotional shifts that precede abnormal gifting or inconsistent sentiment patterns associated with synthetic reviews. These limitations highlight the gap between general multimodal sentiment recognition and the demands of real-time financial risk detection. Addressing this gap requires a unified architecture capable of aligning multimodal affective cues with transactional dynamics and performing temporal reasoning over long-term multimodal patterns—a challenge that the proposed MSF-Net is specifically designed to tackle.

2.3. Financial Risk Detection and Deep Security Learning

In financial security research, deep learning has been broadly applied to fraud detection, anti-money laundering, and abnormal transaction identification [33]. Early approaches relied on manually crafted transactional features—such as amount, frequency, and inter-transaction intervals—and applied classical models such as logistic regression or decision trees. Although effective for simple patterns, these methods struggle to capture nonlinear dependencies and multi-factor behavioral structures [34]. More recent models based on LSTM or Transformer architectures have improved anomaly detection by learning temporal dynamics in transaction sequences [35]. However, these approaches remain limited when transferred to virtual live-streaming environments, where transactional behaviors are closely intertwined with emotional expressions, vocal cues, interactive semantics, and user–anchor behaviors.
Existing financial anomaly detection models focus almost exclusively on transaction logs, lacking the ability to incorporate visual or audio emotional signals that often precede or accompany fraudulent patterns [36]. As a result, they fail to detect multi-source inconsistencies, such as emotionally incongruent tipping spikes, coordinated bot-driven commenting, or laundering-like virtual gifting behaviors. Furthermore, current models typically treat financial signals as isolated sequences and do not provide mechanisms for cross-modal alignment or joint reasoning over behavior–emotion–transaction interactions—capabilities essential for real-time security monitoring in virtual human live-streaming platforms. These limitations underscore the need for a multimodal financial risk detection framework that explicitly integrates affective cues, semantic interactions, and transactional dynamics. The proposed MSF-Net is designed to address this gap through unified multimodal alignment, cross-signal fusion, and behavior–finance co-reasoning, providing a more comprehensive and context-aware solution for financial security detection in virtual live-streaming ecosystems.

3. Materials and Methods

3.1. Data Collection

The experimental data were obtained from a fully simulated virtual human live commerce platform constructed for this study, which operated from September 2024 to March 2025 to systematically collect multimodal data of virtual anchors in live-streaming scenarios, as summarized in Table 1. The platform was designed to mimic the core interaction patterns of commercial live-streaming while preserving full controllability over data generation and annotation. Four modalities were recorded in a time-synchronized manner—visual, audio, text, and financial transactions—with the goal of capturing the dynamic coupling between virtual human behaviors, user interactions, and payment activities under diverse virtual live-streaming scenarios.
For the visual modality, the system generated virtual human video streams based on the Unity live-streaming engine and a three-dimensional facial driving model (FaceRig SDK). Multiple virtual anchor avatars with distinct appearances and personalities (e.g., energetic host, calm explainer, promotional salesperson, gaming companion, and virtual idol) were created. During each session, the rendering engine randomized scene backgrounds (indoor room, studio, store, gaming environment), lighting conditions, and camera parameters (viewpoint, focal length, distance) to enhance scene diversity. Facial expressions, gestures, and postures were controlled by scripted behavior graphs combined with stochastic perturbations to simulate natural variability. Video frames were rendered at 120 frames per second and processed by OpenPose and FaceMesh for key-point extraction and expression intensity estimation, resulting in approximately 1.2 TB of visual data at a resolution of 1920 × 1080 . Each live broadcast lasted on average 40 min, yielding 200 virtual live events in total.
For the audio modality, speech streams were obtained from a mixture of pre-written scripts and semi-improvised speech recorded by voice actors to emulate typical live-stream commentary (e.g., product introduction, promotional slogans, interactive responses). The audio was recorded in stereo at a sampling rate of 44.1 kHz. To approximate realistic acoustic conditions, background noise (such as low-volume music or ambient crowd sounds) from public ambience corpora was mixed into the clean speech, with noise-to-signal ratios uniformly sampled in the range [ 0.05 , 0.25 ] . Voice Activity Detection (VAD) was applied to segment active speech segments, and the Librosa library was used to extract Mel-spectrograms, energy envelopes, and pitch contours. For emotion labeling, 3–5 s audiovisual clips were independently annotated by three raters into three categories (positive, neutral, negative) based on tone, rhythm, facial expression, and semantic context. Inter-annotator agreement measured by Cohen’s κ reached 0.82; disagreements were resolved by majority voting. The total duration of the audio data was approximately 480 h, with a total file size of about 360 GB.
The text modality primarily consisted of real-time comments and barrage messages captured via the simulation API, including comment content and user ID, synchronized with timestamps to align with the audio and visual streams. To enrich linguistic diversity, comment sequences were generated using a hybrid scheme: approximately 30% of comments were produced by sentiment-conditioned templates, 50% by stochastic generators (e.g., trigram/Markov models trained on publicly available live chat corpora), and 20% were manually authored to introduce rare phenomena such as slang, code-switching, emojis, and intentional typos. Comments were further categorized as genuine or fake based on their generation mechanism and semantic–affective consistency with the surrounding audiovisual context. During annotation, each comment was aligned to a ± 2 s time window of the corresponding video and audio, and labeled by two annotators as “genuine” or “manipulated” considering repetitiveness, lexical anomalies, and mismatch with the ongoing emotional tone; only comments with consistent labels were retained. In total, about 680,000 text entries were collected, with an average length of 18 words. After deduplication and tokenization, the texts were converted into BERT embeddings for subsequent semantic and sentiment modeling.
The financial modality data were generated by a built-in payment and gifting simulation engine that emulates typical live commerce transaction flows, including virtual gift purchases, currency recharges, red-packet transfers, and withdrawals. User accounts were instantiated according to three behavioral archetypes: regular users, high-activity users, and bots/malicious users. The temporal evolution of transactions followed a mixture of Poisson processes to capture both background interaction and bursty activities around events such as flash sales, product demonstrations, and emotionally salient moments. Each transaction record contained a transaction ID, user account, transaction amount, timestamp, payment method, and a pseudo device fingerprint. Abnormal and fraudulent patterns were injected according to a rule-based protocol inspired by documented online fraud behaviors, including (1) sudden large-value tipping that is inconsistent with local emotional and interaction context, (2) high-frequency repeating micro-transactions within short time windows, (3) cyclic transfer chains resembling laundering-like flows, and (4) synchronized gifting behaviors driven by bot accounts. All automatically flagged anomalous events were reviewed by two annotators to ensure contextual plausibility before being used as ground truth for training and evaluation.
All data streams were transmitted through secure encrypted channels within the simulation environment and stored on local encrypted servers to ensure data integrity. A unified global clock was used to synchronize visual, audio, text, and financial streams, and a sliding-window mechanism was employed to align multimodal features at the frame or segment level. The resulting dataset offers a diverse and richly annotated multimodal mapping between emotional behaviors, user interactions, and financial events in a controlled virtual setting, providing a high-quality foundation for training and analyzing the proposed MSF-Net, while its synthetic nature and differences from real-world operational data are explicitly acknowledged and further discussed in the limitation and generalization analysis.

3.2. Data Preprocessing and Augmentation

In the processing of multimodal data from virtual human live streaming, data preprocessing and augmentation are critical steps to ensure training stability and improve model generalization capability. Since each modality exhibits heterogeneous characteristics, the preprocessing methods must be designed to address specific signal properties and noise distributions.

3.2.1. Image Augmentation

For the visual modality, the input data consisted of frame sequences extracted from virtual human live-streaming videos, denoted as $I = \{I_1, I_2, \ldots, I_T\}$. The original videos may include variations in illumination, background interference, and resolution differences, all of which can affect the quality of visual feature extraction. To mitigate these effects, random cropping, brightness perturbation, and background mixup augmentations were applied. Random cropping was performed by selecting subregions $I_t^c$ from the original frame $I_t$, defined as:
$$I_t^c = I_t[x : x + w,\; y : y + h],$$
where $(x, y)$ represents the starting coordinates of the cropping window, and $(w, h)$ denote its width and height. Random sampling of these parameters ensures diversity across training samples. Brightness perturbation was applied to simulate varying illumination conditions and can be formulated as follows:
$$I_t^b = \alpha I_t + \beta, \quad \alpha \sim U(0.8, 1.2), \quad \beta \sim U(-0.1, 0.1),$$
where $\alpha$ is a brightness scaling factor and $\beta$ is a bias term, both sampled from a uniform distribution $U$, enhancing the model's robustness to illumination fluctuations. Additionally, background mixup augmentation was introduced by blending the current frame with a randomly sampled background $B$:
$$I_t^m = \lambda I_t + (1 - \lambda) B, \quad \lambda \sim \mathrm{Beta}(\gamma, \gamma),$$
which simulates background interference in complex visual scenes and improves generalization of visual representations. Here, $\lambda$ follows a Beta distribution, and $\gamma$ is a hyperparameter controlling the degree of blending.
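For concreteness, the following is a minimal NumPy sketch of the three visual augmentations above (random cropping, brightness perturbation, and background mixup). The function name, the fixed crop ratio, and the assumption that the background image is at least as large as the crop are illustrative choices, not part of the original pipeline.

```python
import numpy as np

def augment_frame(frame: np.ndarray, background: np.ndarray,
                  gamma: float = 0.4, crop_ratio: float = 0.8) -> np.ndarray:
    """Random crop, brightness perturbation, and background mixup for one RGB frame."""
    H, W, _ = frame.shape

    # Random cropping: sample a sub-window I_t[x:x+w, y:y+h]
    w, h = int(W * crop_ratio), int(H * crop_ratio)
    x = np.random.randint(0, W - w + 1)
    y = np.random.randint(0, H - h + 1)
    crop = frame[y:y + h, x:x + w]

    # Brightness perturbation: alpha ~ U(0.8, 1.2), beta ~ U(-0.1, 0.1)
    alpha = np.random.uniform(0.8, 1.2)
    beta = np.random.uniform(-0.1, 0.1)
    bright = np.clip(alpha * crop.astype(np.float32) / 255.0 + beta, 0.0, 1.0)

    # Background mixup: lambda ~ Beta(gamma, gamma); assumes background >= crop size
    lam = np.random.beta(gamma, gamma)
    bg = background[:h, :w].astype(np.float32) / 255.0
    mixed = lam * bright + (1.0 - lam) * bg
    return (mixed * 255.0).astype(np.uint8)
```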

3.2.2. Audio Augmentation

For the audio modality, the live-streaming audio signal $A(t)$ is often affected by environmental noise, microphone characteristics, and echo distortion. Therefore, denoising and augmentation are required. Spectral subtraction was employed for noise reduction, where the clean signal spectrum $\hat{S}(f, t)$ is expressed as follows:
$$\hat{S}(f, t) = \max\bigl(S(f, t) - \alpha N(f, t),\, 0\bigr),$$
with $S(f, t)$ and $N(f, t)$ representing the signal and noise spectra, respectively, and $\alpha$ denoting the noise suppression coefficient controlling the subtraction intensity. To enhance temporal robustness, a time masking strategy was applied, in which continuous audio frames within $[t_1, t_2]$ were randomly masked:
$$\tilde{A}(t) = \begin{cases} 0, & t_1 \le t \le t_2, \\ A(t), & \text{otherwise}. \end{cases}$$
Furthermore, pitch shifting was introduced to augment frequency variations, defined as follows:
$$\hat{A}(t) = A\bigl(t \cdot 2^{\Delta p / 12}\bigr),$$
where $\Delta p$ denotes the semitone shift, randomly sampled during training to improve the model's adaptability to pitch variations in speech signals.
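A compact sketch of these three operations is given below, assuming librosa is available. The time-averaged noise-spectrum estimate, the half-second mask length, the ±2-semitone range, and the requirement that the clip and noise excerpt are long enough are illustrative assumptions rather than settings reported in the paper.

```python
import numpy as np
import librosa

def augment_audio(y: np.ndarray, sr: int, noise: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Spectral subtraction, random time masking, and pitch shifting for one audio clip."""
    # Spectral subtraction: |S_hat| = max(|S| - alpha * |N|, 0); the noisy phase is reused
    S = librosa.stft(y)
    N = librosa.stft(noise[: len(y)])
    mag = np.maximum(np.abs(S) - alpha * np.mean(np.abs(N), axis=1, keepdims=True), 0.0)
    y_clean = librosa.istft(mag * np.exp(1j * np.angle(S)), length=len(y))

    # Time masking: zero out a random contiguous segment [t1, t2] (~0.5 s here)
    t1 = np.random.randint(0, len(y_clean) - sr // 2)
    y_clean[t1 : t1 + sr // 2] = 0.0

    # Pitch shifting by a random semitone offset delta_p
    delta_p = np.random.uniform(-2.0, 2.0)
    return librosa.effects.pitch_shift(y_clean, sr=sr, n_steps=delta_p)
```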

3.2.3. Series-Data Augmentation

For the textual modality, processing primarily involved semantic representation and alignment. The live comments and bullet texts $T = \{t_1, t_2, \ldots, t_N\}$ are often short and noisy, sometimes containing spelling errors. A BERT-based pre-trained model was adopted to embed each text into a fixed-dimensional semantic vector space:
$$F_t = \mathrm{BERT}(t_i), \quad F_t \in \mathbb{R}^{d_t},$$
where $d_t$ denotes the feature dimension of the textual representation. To achieve cross-modal alignment, textual features were synchronized with visual $F_v$, auditory $F_a$, and transactional $F_f$ features along the temporal axis. For a given time step $t$, the alignment operation was defined as follows:
$$\tilde{F}_t = \frac{1}{|N(t)|} \sum_{t' \in N(t)} F_{t'},$$
where $N(t)$ represents the neighboring feature set around time $t$. Weighted averaging or interpolation was applied to handle cases of missing modality data. The financial transaction modality included user tipping, payment logs, and virtual gift streams $F_f = \{f_1, f_2, \ldots, f_T\}$, whose features are often unevenly distributed and contain outliers. To stabilize model training, standardization was applied as follows:
$$\hat{f}_t = \frac{f_t - \mu}{\sigma},$$
where $\mu$ and $\sigma$ represent the mean and standard deviation of the feature distribution. To capture temporal dynamics, transaction data were segmented into sliding windows of size $w$, forming sequential subseries $\{F_f(t-w+1 : t)\}$ for subsequent modeling by sequence-based networks such as LSTM or Transformer.
For multimodal synchronization and missing data handling, differences in sampling rates among visual, audio, text, and transaction modalities may lead to temporal misalignment or missing data. Linear interpolation and neighbor-weighted estimation were applied to recover missing modality features $F_m(t)$ based on neighboring valid samples $V(t)$:
$$\hat{F}_m(t) = \frac{\sum_{t' \in V(t)} w_{t'} F_m(t')}{\sum_{t' \in V(t)} w_{t'}},$$
where $w_{t'}$ represents the weight assigned according to temporal distance or modality importance. This ensures that the interpolated features remain smooth and temporally consistent with other modalities. Through the above preprocessing and augmentation procedures, visual, audio, textual, and transactional data were temporally aligned, enhancing the model's robustness to noise, illumination variations, pitch shifts, and transactional anomalies, thereby providing a reliable foundation for MSF-Net in multimodal feature fusion and financial risk recognition.
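A short NumPy sketch of the transaction standardization, sliding-window segmentation, and neighbor-weighted recovery described above follows. The exponential distance weighting and the helper names are illustrative assumptions, since the text only specifies that weights depend on temporal distance or modality importance.

```python
import numpy as np

def standardize(f: np.ndarray) -> np.ndarray:
    """Z-score normalization of a transaction feature sequence: (f - mu) / sigma."""
    return (f - f.mean()) / (f.std() + 1e-8)

def sliding_windows(f: np.ndarray, w: int) -> np.ndarray:
    """Segment a 1-D sequence into overlapping windows F_f(t-w+1 : t) for sequence models."""
    return np.stack([f[t - w + 1 : t + 1] for t in range(w - 1, len(f))])

def interpolate_missing(features: np.ndarray, valid: np.ndarray, t: int, tau: float = 2.0) -> np.ndarray:
    """Recover a missing feature at step t as a distance-weighted average of valid neighbors."""
    idx = np.where(valid)[0]                      # indices of valid samples V(t)
    weights = np.exp(-np.abs(idx - t) / tau)      # weight decays with temporal distance (assumed form)
    return (weights[:, None] * features[idx]).sum(0) / weights.sum()
```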

3.3. Proposed Method

3.3.1. Overall

The overall methodology is organized around the architectural flow of MSF-Net, in which visual, audio, textual, and financial streams are progressively aligned, integrated, and jointly inferred. These four modalities are selected because they capture complementary dimensions of behavior in virtual live-streaming environments: visual signals reflect facial expressions and posture associated with emotional intensity; audio signals encode prosodic variations such as tone and rhythm that indicate affective shifts; textual signals reveal semantic intent and interaction patterns, including potential inconsistencies or synthetic comments; and financial transactions directly reflect tipping bursts, abnormal transfers, and other high-risk monetary behaviors. After initial normalization, the three perceptual modalities (visual, audio, text) are encoded by modality-specific embedding layers and processed by MAT, where cross-modal attention enforces semantic correspondence and temporal synchronization, yielding a unified multimodal interaction representation that captures evolving emotional and behavioral dynamics. This aligned representation is then propagated along two paths: one feeds directly into the fusion layers of MSFDM for high-level cross-signal reasoning, and the other interacts with textual features in FRD to form a dual semantic–affective pipeline that estimates comment authenticity and outputs credibility embeddings and comment-level abnormality indicators. In the final stage, the temporally aligned representation from MAT, the credibility and consistency features from FRD, and financial signals (e.g., transaction intensity, fluctuation patterns, account profiling) are jointly processed by MSFDM, where multi-channel attention and gated aggregation adaptively recalibrate cross-signal contributions and generate a unified risk embedding, followed by a dual-branch prediction head that outputs behavior category labels and continuous risk scores. Key hyperparameters and tensor shape transitions across these modules are summarized in Table 2. Training is performed with a joint objective that balances classification loss, risk regression loss, and contrastive learning for fake-review supervision, maintaining equilibrium between semantic consistency modeling, affective coupling, and financial anomaly detection. During inference, MSF-Net operates under a sliding temporal window to update representations in real time, while thresholding and calibration layers—combined with platform-specific rules—trigger alerts and log suspicious cases, enabling robust, interpretable, and low-latency financial risk recognition in virtual live-streaming environments.
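To make this data flow concrete, the following PyTorch-style skeleton sketches how the three modules are wired together. The class and argument names are illustrative placeholders; the actual MSF-Net implementation is not specified at this level of detail in the paper.

```python
import torch.nn as nn

class MSFNetSkeleton(nn.Module):
    """Structural sketch of the MSF-Net data flow: MAT -> (FRD, MSFDM) -> dual-branch head."""

    def __init__(self, mat: nn.Module, frd: nn.Module, msfdm: nn.Module):
        super().__init__()
        self.mat, self.frd, self.msfdm = mat, frd, msfdm

    def forward(self, visual, audio, text, finance):
        # 1. Cross-modal alignment of the three perceptual streams
        h_mat = self.mat(visual, audio, text)            # temporally aligned representation
        # 2. Dual semantic-affective pipeline estimating comment credibility
        h_frd = self.frd(text, h_mat)                    # credibility / consistency features
        # 3. High-level fusion with financial signals and dual-branch prediction
        behavior_logits, risk_score = self.msfdm(h_mat, h_frd, finance)
        return behavior_logits, risk_score
```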

3.3.2. Multimodal Alignment Transformer

The multimodal alignment Transformer (MAT) is the core encoder of MSF-Net, responsible for unifying visual, audio, and textual representations and correcting their inherent temporal and semantic misalignment. After initial feature extraction, all modalities are projected into a shared $d$-dimensional space, producing $P_v$, $P_a$, and $P_t$ for the visual, audio, and textual streams, respectively.
As illustrated in Figure 1, the visual stream uses a lightweight CNN with patch embeddings to model frame-level cues; the audio stream processes Mel-spectrograms through temporal convolution and a 1D Transformer to capture prosodic patterns; and the textual stream employs a BERT encoder to obtain contextual semantic features. Since these modalities differ in sampling rate, temporal resolution, and expressive structure, MAT adopts a hierarchical encoder composed of two key sublayers per block: (1) an intra-modal self-attention layer for learning temporal dependencies within each modality, and (2) a cross-modal attention layer that explicitly aligns heterogeneous features.
The cross-modal attention module uses visual queries to attend to audio and text features, enabling direct interaction across modalities:
$$Z_v = \mathrm{Softmax}\!\left(\frac{Q_v K_a^{T}}{\sqrt{d}}\right) V_a + \mathrm{Softmax}\!\left(\frac{Q_v K_t^{T}}{\sqrt{d}}\right) V_t,$$
where $(Q_v, K_a, V_a)$ and $(Q_v, K_t, V_t)$ are the linear projections of the visual–audio and visual–text pairs, respectively. Similar operations are applied symmetrically for the audio and text streams. This multi-head formulation allows MAT to learn fine-grained correspondences among facial expressions, prosodic cues, and linguistic semantics in separate subspaces.
To enhance long-range temporal coherence, each block incorporates a Mamba module, which provides gated state-space dynamics and strengthens the model's ability to capture gradual emotional shifts and temporally extended behavioral patterns. Stacking four such encoder layers with hidden dimension $d = 512$ yields the aligned multimodal representation $H_{mat} \in \mathbb{R}^{T \times d}$.
Conceptually, MAT approximates the optimization of a semantic discrepancy objective:
$$\mathcal{L}_{align} = \sum_i \bigl\| f_v(V_i) - f_a(A_i) \bigr\|^2 + \bigl\| f_a(A_i) - f_t(T_i) \bigr\|^2,$$
where attention acts as an adaptive mechanism to enforce correspondence in both temporal structure and semantic meaning. Compared with early- or late-fusion schemes, MAT offers three advantages: (1) adaptive cross-modal weighting suppresses modality-specific noise (e.g., facial occlusions or unstable audio); (2) the hierarchical design balances local-window modeling and global context reasoning, enabling it to capture abrupt emotional transitions as well as long-term behavioral trends; and (3) the integration of Mamba dynamics strengthens MAT’s sensitivity to subtle emotional anomalies and cross-modal inconsistencies that often precede suspicious or fraudulent activities in live-streaming.
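A single-head sketch of the visual-query cross-modal attention in the equation above is shown below. The actual MAT is multi-head and augmented with Mamba state-space blocks, so this is only a minimal illustration of the attention pattern; the layer names and the default dimension of 512 are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Z_v = softmax(Q_v K_a^T / sqrt(d)) V_a + softmax(Q_v K_t^T / sqrt(d)) V_t (single head)."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k_a, self.v_a = nn.Linear(d, d), nn.Linear(d, d)
        self.k_t, self.v_t = nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** 0.5

    def forward(self, p_v, p_a, p_t):
        q = self.q(p_v)                                                      # (B, T_v, d)
        attn_a = F.softmax(q @ self.k_a(p_a).transpose(-2, -1) / self.scale, dim=-1)
        attn_t = F.softmax(q @ self.k_t(p_t).transpose(-2, -1) / self.scale, dim=-1)
        # Visual queries aggregate audio and text values into the aligned representation
        return attn_a @ self.v_a(p_a) + attn_t @ self.v_t(p_t)               # (B, T_v, d)
```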

3.3.3. Fake Review Detection Module

The fake review detection module (FRD) aims to identify bot-generated comments and emotionally inconsistent interactions by jointly modeling textual semantics and dynamic affective cues from the live-stream context. As illustrated in Figure 2, FRD contains two complementary branches: a semantic channel that focuses on the linguistic content of comments, and an affective channel that encodes moment-by-moment emotional patterns from the audio–visual streams. The goal of FRD is to assess the bidirectional consistency between what is written and how the virtual anchor behaves emotionally, which is a strong indicator of review authenticity in live-streaming environments.
The semantic channel receives BERT-encoded comment embeddings $T_s \in \mathbb{R}^{L \times 768}$, which are compressed through two lightweight convolutional layers (kernel sizes 3 and 1, with output dimensions 256 and 128). These layers extract local contextual patterns and reduce redundancy in the original high-dimensional representation. A Transformer encoder is then applied to capture long-range semantic dependencies, producing the compact semantic matrix $S_t \in \mathbb{R}^{L \times 128}$.
The affective channel operates on the fused audio–visual embeddings $E_a \in \mathbb{R}^{T \times 512}$ provided by MAT. Three temporal convolutions with progressively smaller kernels (5, 3, 3) extract emotion trajectories at multiple time scales. A gated attention mechanism then identifies emotionally salient moments and aggregates them into a global affective descriptor $E_g \in \mathbb{R}^{L \times 64}$, which is aligned in temporal resolution with the textual stream.
To integrate both branches, FRD employs a cross-dimensional interaction module that encourages semantic–affective correspondence. Denoting $S_t = [s_1, \ldots, s_L]$ and $E_g = [e_1, \ldots, e_L]$, the fused representation is computed as follows:
$$H = \tanh(W_1 S_t + W_2 E_g + b),$$
where $W_1$ and $W_2$ project semantic and affective cues into a shared latent space. An affective-guided selection mask,
$$M = \sigma(W_m H^{T}),$$
is then applied to emphasize comment tokens that align with the emotional state of the stream. The final authenticity prediction is
$$y = \sigma\bigl(W_f (H \odot M)\bigr),$$
where the Hadamard product highlights emotionally consistent regions.
To further improve discrimination between genuine and synthetic reviews, FRD introduces a semantic–affective contrastive loss:
$$\mathcal{L}_{frd} = -\frac{1}{N} \sum_i \log \frac{\exp\bigl(\mathrm{sim}(H_i^{+}, H_i)/\tau\bigr)}{\sum_j \exp\bigl(\mathrm{sim}(H_i, H_j)/\tau\bigr)},$$
which pulls context-consistent comment features closer and pushes fake ones apart. When combined with a semantic–affective consistency constraint
$$C(H) = \| S_t - E_g \|^2,$$
minimizing $\mathcal{L}_{frd} + C(H)$ encourages the two representations to become approximately collinear, enabling FRD to detect subtle inconsistencies between textual sentiment and audiovisual affect—an important signal of manipulated or bot-generated comments.
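The gating and contrastive components can be sketched as follows. The tensor shapes, the use of cosine similarity for sim(·,·), and the explicit positive/negative banks are simplifying assumptions, since the paper does not specify how the similarity function or negative sampling is implemented.

```python
import torch
import torch.nn.functional as F

def semantic_affective_gate(s_t, e_g, W1, W2, Wm, b):
    """H = tanh(W1 S_t + W2 E_g + b); M = sigmoid(Wm H^T); returns the masked fusion H * M."""
    h = torch.tanh(s_t @ W1.T + e_g @ W2.T + b)          # (L, d_h) shared latent representation
    m = torch.sigmoid(h @ Wm.T)                          # (L, 1) affective-guided token mask
    return h * m                                         # emphasize emotionally consistent tokens

def contrastive_frd_loss(h, h_pos, h_neg, tau: float = 0.07):
    """InfoNCE-style loss pulling context-consistent comments toward their positives."""
    pos = F.cosine_similarity(h, h_pos, dim=-1) / tau                  # (N,)
    neg = F.cosine_similarity(h.unsqueeze(1), h_neg, dim=-1) / tau     # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                 # positive sits in column 0
    targets = torch.zeros(h.size(0), dtype=torch.long, device=h.device)
    return F.cross_entropy(logits, targets)
```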
Integrated with MAT, which provides temporally aligned, emotion-rich features, FRD achieves three advantages: (1) fine-grained authenticity assessment through dual semantic–affective pathways; (2) explicit modeling of cross-signal consistency via learned alignment; and (3) improved robustness against noisy or adversarially generated comments through contrastive supervision. Together, these properties make FRD an effective component for identifying fake reviews within the MSF-Net framework.

3.3.4. Multi-Signal Fusion Decision Module

The multi-signal fusion decision module (MSFDM) operates as the high-level reasoning component of MSF-Net. Its objective is to combine three complementary information streams—(1) temporally aligned multimodal embeddings from MAT, (2) semantic–affective credibility features from FRD, and (3) real-time financial transaction signals—and to jointly infer potential financial risks and behavioral anomalies in live-streaming environments. As depicted in Figure 3, MSFDM adopts a multi-channel Transformer-based fusion architecture that focuses on capturing correlations between emotional dynamics, textual interactions, and transaction behaviors.
The three input streams—$H_{mat} \in \mathbb{R}^{T \times 512}$, $H_{frd} \in \mathbb{R}^{L \times 128}$, and normalized financial vectors $F \in \mathbb{R}^{M \times 64}$—are projected into a shared embedding space of $d = 512$ before entering the fusion layers. Each fusion layer contains the following: (1) a self-attention block that refines intra-signal patterns and (2) a cross-signal attention block that models interactions across emotional, semantic, and financial modalities.
To capture behavior–transaction correlations, financial embeddings are used as queries, while the fused audio–visual–textual features from MAT and FRD serve as keys and values:
$$Z = \mathrm{Softmax}\!\left(\frac{Q_F K_{VA}^{T}}{\sqrt{d}}\right) V_{VA},$$
where $Q_F = W_Q F$, $K_{VA} = W_K [H_{mat}; H_{frd}]$, and $V_{VA} = W_V [H_{mat}; H_{frd}]$. This formulation enables MSFDM to attend to emotionally abnormal segments, inconsistent comments, or abrupt reward spikes that may indicate high-risk behavior. Residual connections and LayerNorm are used to stabilize optimization across the four stacked fusion layers.
The decoder adopts a dual-branch design. The first branch performs regression to produce a continuous risk score $r_s$, while the second branch conducts multi-class classification of behavioral states (normal, suspicious, anomalous). Here, normal denotes regular interaction and payment patterns consistent with the ongoing emotional and semantic context; suspicious corresponds to borderline or weakly irregular behaviors (e.g., mild transaction bursts or partially inconsistent comments) that may require further review; and anomalous captures clearly abnormal or high-risk patterns such as abrupt high-value tipping, laundering-like transfer chains, or strongly inconsistent bot-like comments. Both branches contain three fully connected layers with GELU activations. The joint learning objective is
$$\mathcal{L}_{msfdm} = \lambda_1 \| r_s - \hat{r}_s \|^2 - \lambda_2 \sum_i y_i \log p_{c,i},$$
where $y_i$ is the behavior label and $p_{c,i}$ is the predicted probability. This formulation simultaneously enforces accurate risk estimation and stable behavioral categorization. From a functional perspective, MSFDM leverages the complementary strengths of its inputs. MAT provides temporally synchronized semantic–affective information, while FRD supplies interaction credibility cues that highlight inconsistent or synthetic comment behavior. By treating financial flows as queries in the fusion process, the module learns to focus on cross-modal patterns that often precede fraudulent or anomalous events, forming a closed-loop interaction among emotion, behavior, and transaction signals.
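The financial-query attention and joint objective can be illustrated with the following sketch. The mean pooling of the risk embedding, the single-layer prediction heads, and the loss weighting are simplifications; the paper's MSFDM uses stacked fusion layers and three-layer GELU heads, and both input streams are assumed to be already projected into the shared dimension.

```python
import torch
import torch.nn.functional as F

def fuse_and_score(fin_q, h_mat, h_frd, Wq, Wk, Wv, reg_head, cls_head,
                   risk_target, labels, lam1: float = 1.0, lam2: float = 1.0):
    """Financial queries attend over concatenated MAT/FRD features; joint regression + classification loss."""
    kv = torch.cat([h_mat, h_frd], dim=1)                                  # (B, T+L, d)
    q, k, v = fin_q @ Wq.T, kv @ Wk.T, kv @ Wv.T
    attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B, M, T+L)
    z = (attn @ v).mean(dim=1)                                             # pooled risk embedding (B, d)

    risk_score = reg_head(z).squeeze(-1)                                   # continuous risk branch
    logits = cls_head(z)                                                   # normal / suspicious / anomalous
    loss = lam1 * F.mse_loss(risk_score, risk_target) + lam2 * F.cross_entropy(logits, labels)
    return risk_score, logits, loss
```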
Compared with traditional financial risk models that rely solely on transaction statistics, MSFDM dynamically adapts cross-modal weights based on real-time emotional and textual context, improving resilience to noisy interactions and abrupt behavioral fluctuations. Its bounded attention output and hierarchical normalization strategy further ensure stable gradients and low-latency inference, making MSFDM a reliable and efficient component for financial security detection in virtual live-streaming environments.

4. Results and Discussion

4.1. Experimental Configuration

4.1.1. Hardware and Software Platform

In terms of the experimental hardware and software environment, the hardware platform consisted primarily of high-performance computing servers and graphics processing units (GPU). The servers were equipped with Intel Xeon Gold series CPUs operating at a base frequency of 2.6 GHz and 256 GB of memory to ensure efficient processing of large-scale multimodal data. For graphical computation, multiple NVIDIA A100 GPUs with 40 GB of memory per card were employed, supporting mixed-precision training and parallel computation for large-scale Transformer models.
Regarding the software platform, the experiments were conducted under the Linux operating system (Ubuntu 20.04 ), with the deep learning framework PyTorch 2.1 utilized to enable GPU acceleration and distributed training. CUDA 12.1 and cuDNN 8.9 were applied to accelerate matrix operations during training and inference. The AdamW optimizer was adopted, combined with mixed-precision training to enhance computational efficiency. Matplotlib 3.8.0 and Seaborn 0.13.0 were employed for visualization, while TensorBoard was used to monitor loss curves, accuracy trends, and attention weights during training, ensuring the stability and interpretability of the model optimization process.
For hyperparameter settings and dataset partitioning, the entire dataset was divided into training, validation, and testing subsets with proportions of 70%, 20%, and 10%, respectively, ensuring balanced distributions of modality data and annotated events across all subsets. The learning rate $\alpha$ was set to $2 \times 10^{-5}$, the batch size $B$ was 16, and the dropout probability $p$ was 0.1 to prevent overfitting. To further assess the robustness and generalization capability of the model, a 5-fold cross-validation strategy was applied. The training set was split into 5 subsets, where 4 subsets were used for training and the remaining one for validation in each iteration. The process was repeated five times, and the average performance metrics were reported to evaluate the overall stability of MSF-Net under different data partitions.
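The reported split and training settings translate into a straightforward cross-validation loop, sketched below. The build_model and train_and_evaluate helpers are hypothetical placeholders for the MSF-Net constructor and training routine; only the learning rate, batch size, dropout, and 5-fold protocol come from the text.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

LR, BATCH_SIZE, DROPOUT = 2e-5, 16, 0.1   # reported hyperparameters

def run_cross_validation(build_model, train_dataset, epochs: int = 10) -> float:
    """5-fold cross-validation over the training split, averaging fold-level scores."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kfold.split(np.arange(len(train_dataset))):
        model = build_model(dropout=DROPOUT)                         # hypothetical MSF-Net factory
        optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
        train_loader = torch.utils.data.DataLoader(
            torch.utils.data.Subset(train_dataset, train_idx), batch_size=BATCH_SIZE, shuffle=True)
        val_loader = torch.utils.data.DataLoader(
            torch.utils.data.Subset(train_dataset, val_idx), batch_size=BATCH_SIZE)
        # train_and_evaluate is a hypothetical helper returning a validation score (e.g., F1)
        scores.append(train_and_evaluate(model, optimizer, train_loader, val_loader, epochs))
    return float(np.mean(scores))
```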

4.1.2. Baseline Models and Evaluation Metrics

In the comparative experiments, several representative baseline models were selected to comprehensively verify the performance advantages of the proposed MSF-Net in the multimodal sentiment–finance fusion scenario. The single-modality models included Text-BERT [37], Audio-ResNet [38,39], and Visual-ViT [40]. The multimodal models comprised MMBT [41], LXMERT [42], and CLIP [43]. In addition, GraphSAGE [44] and FinBERT [45] were selected as financial detection baselines for comparison.
A variety of evaluation metrics were employed to comprehensively assess model performance, including classification metrics such as precision, recall, F1-score, accuracy, and area under curve (AUC), as well as real-time performance metrics such as frames per second (FPS) and inference latency (ms). These metrics jointly reflect the effectiveness of the model in virtual human live-streaming financial security recognition tasks. The corresponding mathematical definitions are provided as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d\,\mathrm{FPR},$$
$$\mathrm{FPS} = \frac{N_{\mathrm{frames}}}{T_{\mathrm{process}}}, \quad \mathrm{Latency} = \frac{T_{\mathrm{process}}}{N_{\mathrm{frames}}} \times 1000\ \mathrm{ms},$$
where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively. $\mathrm{TPR}$ and $\mathrm{FPR}$ represent the true positive rate and false positive rate; $N_{\mathrm{frames}}$ indicates the number of processed video frames, and $T_{\mathrm{process}}$ represents the total processing time (in seconds). Each metric emphasizes different aspects of model performance: precision evaluates the correctness of positive predictions, recall measures the model's ability to identify true positive cases, and the F1-score provides a harmonic balance between precision and recall. Accuracy reflects the overall prediction correctness, AUC quantifies the classification capability across different thresholds, while FPS and latency measure the model's real-time responsiveness. In the context of virtual human live-streaming financial security, the combined use of these metrics enables a comprehensive assessment of the model's accuracy, stability, and deployment feasibility.
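These definitions map directly onto standard scikit-learn utilities, as the following sketch shows. The binary risk-detection setting and the 0.5 decision threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

def evaluate(y_true, y_score, total_time_s: float, n_frames: int, threshold: float = 0.5) -> dict:
    """Classification and real-time metrics as defined above (binary risk detection assumed)."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
        "fps": n_frames / total_time_s,                    # frames processed per second
        "latency_ms": total_time_s / n_frames * 1000.0,    # average per-frame latency
    }
```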

4.2. Performance Comparison

This experiment aims to evaluate the overall performance of MSF-Net on the proposed virtual live-streaming financial security detection task and to assess its advantages over traditional single-modal models, multimodal fusion models, and financial risk detection models. In addition, to examine the generalization capability of MSF-Net beyond the simulated environment, we further validate its multimodal alignment and semantic–affective consistency modeling on the external CMU-MOSEI benchmark. The comprehensive performance is evaluated using precision, recall, F1-score, accuracy, AUC, and FPS, which collectively measure detection precision, recall ability, classification stability, and real-time efficiency. As shown in Table 3 and Figure 4, single-modal models such as Text-BERT, Audio-ResNet, and Visual-ViT exhibit limited performance, with F1-scores below 0.83, indicating that relying solely on text, audio, or visual information fails to capture the complex emotional and transactional dependencies inherent in live-streaming scenarios. Multimodal fusion models including MMBT, LXMERT, and CLIP achieve noticeable improvements in F1-score (around 0.87) due to enhanced semantic understanding and inter-modal feature interactions, demonstrating that multimodal attention mechanisms can establish more stable correlations among heterogeneous signals. Financial detection models such as GraphSAGE and FinBERT perform relatively well in identifying anomalous payments and fraudulent transactions, achieving AUC values of 0.887 and 0.919, respectively; however, they remain limited in modeling emotional signals and behavioral interactions, and thus fail to fully represent cross-modal anomaly patterns within virtual live-stream ecosystems. In contrast, MSF-Net achieves the best overall performance, with an F1-score of 0.928, an AUC of 0.956, and a real-time speed of 60.7 FPS, reflecting a superior balance between precision and computational efficiency.
From a theoretical perspective, the performance discrepancies among models stem from their mathematical differences in feature modeling and attention structures. Single-modal models are constrained by limited input space and concentrated feature distributions, lacking the capacity to establish cross-modal correlations in high-dimensional subspaces. Multimodal Transformer-based models introduce cross-attention mechanisms to project heterogeneous features into shared latent spaces but often suffer from uneven feature weighting and unstable temporal synchronization. The graph neural network model GraphSAGE captures local structural dependencies through neighborhood aggregation but demonstrates limited generalization in unstructured semantic contexts. FinBERT, despite its strong language comprehension ability based on pretraining on financial corpora, lacks joint modeling of emotional and behavioral dynamics. The superiority of MSF-Net lies in its multi-signal coupled Transformer architecture, which establishes cross-modal feature alignment mappings, enabling visual, audio, textual, and financial streams to achieve temporal consistency and semantic convergence within a unified vector space. By leveraging multi-channel attention distributions, MSF-Net performs gradient coordination across tasks within the optimization objective, leading to more stable convergence toward local optima in the risk prediction space. This structurally optimized combination of multimodal alignment and multi-signal fusion represents the fundamental reason for MSF-Net’s superior performance both theoretically and practically.
As shown in Table 4, MSF-Net achieves the best overall performance among all compared models on CMU-MOSEI, with the highest precision, recall, F1-score, accuracy, and AUC, while maintaining a competitive inference speed in terms of FPS. Compared with multimodal Transformer-based baselines such as MMBT, LXMERT, and CLIP, MSF-Net yields an F1-score improvement of approximately 2–5 percentage points, indicating that its multimodal alignment and semantic–affective consistency modeling mechanisms transfer effectively to real-world audiovisual–textual data. Single-modal models (Text-BERT, Audio-ResNet, Visual-ViT) and structurally constrained models such as GraphSAGE and FinBERT lag behind across most metrics, reflecting their inability to fully leverage cross-modal dependencies in user-generated videos. In contrast, MSF-Net benefits from MAT-driven temporal alignment, which explicitly synchronizes visual expressions, vocal prosody, and textual semantics into a unified latent space, and from FRD, which enhances sensitivity to consistency between spoken content, tone, and facial cues. The consistently strong performance on CMU-MOSEI thus provides external evidence that the core representational mechanisms of MSF-Net possess substantial cross-domain generalization ability beyond the simulated financial security environment, while the full end-to-end system remains tailored to the proposed multimodal finance–emotion risk detection task.

4.3. Ablation Study

This experiment aims to evaluate the individual contributions of the core components of MSF-Net and analyze the influence of multimodal collaboration through systematic ablation comparisons. By sequentially removing the multimodal alignment transformer (MAT), fake review detection module (FRD), and multi-signal fusion decision module (MSFDM), as well as restricting input modality combinations, variations in precision, recall, F1-score, AUC, and latency were examined to assess the role of each component in the model’s overall performance.
As shown in Table 5 and Figure 5, removing any individual module of MSF-Net leads to a noticeable decline in performance across all evaluation metrics. The degradation, however, is not uniform across tasks. The exclusion of the MSFDM module results in the most substantial performance drop, reducing the F1-score from 0.928 to 0.874. This is primarily because MSFDM is responsible for integrating heterogeneous signals from emotion cues, textual interactions, and transactional dynamics; without this high-level fusion layer, the model becomes less capable of capturing cross-source inconsistencies that are essential for both fraud detection and emotion–behavior alignment. Fraudulent behaviors such as burst gifting or suspicious transfer chains rely heavily on the joint reasoning of multimodal cues, and the absence of MSFDM disproportionately weakens the model’s ability to identify these patterns.
Removing the MAT module also yields a significant reduction in performance, with the AUC dropping to 0.935. MAT plays a crucial role in aligning temporal features across visual, audio, and textual streams. Without MAT, emotion-related tasks—such as detecting abrupt affective changes preceding abnormal transactions or identifying semantic–prosodic mismatches—are particularly affected. The cross-modal drift introduced by removing MAT leads to inconsistencies between facial expressions, vocal cues, and comment semantics, which downstream modules cannot fully compensate for. Consequently, the detection of emotion-driven fraudulent events (e.g., sudden high-value tipping unrelated to affective state) becomes less reliable.
The removal of the FRD module results in a decrease of the F1-score to 0.887 and particularly harms text-centered tasks, such as identifying fake reviews, bot-generated interactions, and sentiment–context contradictions. Since FRD explicitly models the coherence between textual semantics and multimodal emotional cues, its removal makes the system less sensitive to mismatches between user comments and the ongoing affective state of the stream. As a result, scenarios involving coordinated fake reviews or emotionally inconsistent bot comments are more difficult for the model to detect.
Restricting the model to partial modalities (Text+Financial or Visual+Audio) further degrades performance. Text–financial inputs alone struggle with emotion-driven fraud patterns, while visual–audio inputs lack transactional context and thus cannot capture abnormal payment behaviors. These observations confirm that single-modality or dual-modality systems fail to represent the complex multimodal dependencies inherent in virtual live-streaming fraud scenarios.
In contrast, the full MSF-Net configuration achieves the highest precision and recall with the lowest latency (21.3 ms), benefiting from the complementary strengths of all components. This demonstrates that the coordinated interaction among MAT, FRD, and MSFDM is critical not only for enhancing overall accuracy but also for maintaining robustness across both emotion recognition and financial anomaly detection tasks, ultimately achieving an optimal balance between multimodal expressiveness and real-time performance.

4.4. Interpretability Analysis and Visualization

To validate the interpretability and decision transparency of MSF-Net in virtual live-streaming financial security detection, analyses were conducted from three perspectives: visual heatmaps, risk evolution curves (Figure 6), and attention weight visualization (Figure 7).
By superimposing anomaly detection heatmaps on live-stream video frames, the model’s focus areas during potential risk identification can be intuitively observed. The results indicate that when abnormal tipping, fake reviews, or sudden emotional shifts occur, the model’s attention concentrates notably on the virtual anchor’s facial expressions, voice energy fluctuations, and high-frequency comment segments. The heatmap color transition from blue to red reflects the intensity evolution of abnormal behaviors, demonstrating that MSF-Net effectively captures nonlinear correlations across multimodal signals and leverages them for behavioral risk prediction. Furthermore, the risk evolution curve illustrates the continuous evaluation process of the model’s risk index over the live-stream timeline. The curve remains stable during normal interaction periods but rises sharply when suspicious transactions or semantic anomalies occur, indicating the system’s temporal sensitivity and real-time responsiveness. Finally, the visualization of multi-head attention weights within the Transformer reveals diversified focus patterns among different heads: certain heads emphasize the emotional resonance between visual and audio modalities, while others highlight the alignment between textual semantics and transactional features. This hierarchical attention distribution reveals the cross-modal information decomposition and reweighting mechanisms embedded within MSF-Net’s architecture, enabling not only highly accurate risk identification but also traceable causal explanations that provide reliable analytical support for platform security supervision and risk intervention.

4.5. Practical Deployment and Case Analysis

During the deployment stage, MSF-Net was integrated into the security monitoring subsystem of the virtual live-streaming platform, and a lightweight version was developed for real-time execution on mobile and edge devices to satisfy low-power and high-throughput requirements. The deployment pipeline consists of four stages: model distillation, structural pruning, parameter quantization, and heterogeneous acceleration.

First, knowledge distillation maps the multimodal feature space of the original MSF-Net into a lightweight student model $\mathrm{MSF}_{\mathrm{lite}}$, where the teacher output $H_T$ and the student output $H_S$ are constrained by minimizing the distribution discrepancy term

$$\mathcal{L}_{\mathrm{distill}} = \left\| H_T - H_S \right\|_2^2,$$

which preserves consistency in multimodal alignment and risk prediction capability. Next, sparsity-based pruning is applied to the multi-head attention layers of the Transformer to remove low-importance channels from the weight distribution, using the L1-norm regularization term

$$\mathcal{L}_{\mathrm{sparse}} = \lambda \sum_{i=1}^{N} \left\| W_i \right\|_1,$$

where $\lambda$ denotes the sparsity regularization coefficient controlling the pruning ratio. This step substantially reduces the storage complexity of the attention mapping matrices, compressing the total parameter count from 180 million to 62 million. The convolutional and linear weights are then quantized with a symmetric 8-bit fixed-point mapping that converts floating-point weights $w_f$ into integer weights $w_q$, with quantization error defined as

$$E_q = \frac{1}{n} \sum_{i=1}^{n} \left( w_{f,i} - s \cdot w_{q,i} \right)^2,$$

where the scaling factor $s$ is chosen to minimize $E_q$, balancing computational precision against latency. The final model achieves an average inference time of 19.4 ms per frame on Qualcomm Snapdragon 8 Gen 2 and Apple A17 Pro chipsets, a 2.7× acceleration over the original model. To validate practical applicability, back-testing was conducted on typical fraudulent events within the platform, where the lightweight MSF-Net maintained an AUC of 0.947 and an F1-score of 0.923 under real-time streaming conditions. The shared multimodal embeddings and quantized distillation are designed to approximately preserve the linear separability of the parameter matrices, so the decision boundary remains close to convex in the reduced-dimensional space. Consequently, computational complexity and energy consumption are significantly reduced with minimal performance degradation, enabling efficient and scalable deployment for mobile-oriented financial security monitoring.
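For concreteness, the sketch below expresses the three compression steps in PyTorch: the feature-distillation loss, the L1 sparsity regularizer used for pruning, and a symmetric 8-bit quantizer whose scale is chosen by minimizing the quantization error defined above. It is a minimal illustration under assumed tensor layouts, not the production deployment code, and all function and variable names are hypothetical.

```python
# Sketch of the compression steps: distillation loss, L1 sparsity for
# pruning, and symmetric 8-bit quantization with an error-minimizing scale.
import torch
import torch.nn.functional as F

def distillation_loss(h_teacher: torch.Tensor, h_student: torch.Tensor) -> torch.Tensor:
    """L_distill = ||H_T - H_S||_2^2 on aligned multimodal feature tensors."""
    return F.mse_loss(h_student, h_teacher, reduction="sum")

def l1_sparsity(attn_weights, lam: float = 1e-4) -> torch.Tensor:
    """L_sparse = lambda * sum_i ||W_i||_1 over attention projection matrices."""
    return lam * sum(w.abs().sum() for w in attn_weights)

def symmetric_int8_quantize(w_f: torch.Tensor):
    """Symmetric 8-bit quantization; the per-tensor scale s is selected by
    grid search to minimize the mean squared quantization error E_q."""
    candidates = torch.linspace(0.5, 1.0, steps=20) * w_f.abs().max() / 127.0
    best_s, best_err, best_q = None, float("inf"), None
    for s in candidates:
        w_q = torch.clamp(torch.round(w_f / s), -127, 127)
        err = ((w_f - s * w_q) ** 2).mean().item()  # E_q for this scale
        if err < best_err:
            best_s, best_err, best_q = s.item(), err, w_q.to(torch.int8)
    return best_q, best_s
```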

4.6. Differences from Real Data and Implications for Generalization

Although the multimodal dataset constructed in this study offers substantial advantages in terms of scale, temporal alignment, and controllability, it inevitably differs from real commercial live-streaming data in several important aspects, which may affect the generalization capability of MSF-Net. First, user behaviors in real-world platforms exhibit significantly higher randomness and noise. Emotional expressions, interactive patterns, and comment content vary across individuals, languages, and cultural backgrounds, resulting in complex distributions that are difficult to fully reproduce in a simulation environment. While our simulated comments and emotional dynamics were designed based on prior literature and empirical characteristics of live-streaming activities, they may not capture the full variability and spontaneity of user-generated content observed in commercial platforms. Furthermore, financial fraud behaviors in real systems often possess adversarial, evolving, and highly irregular properties. Abnormal tipping, virtual-gift laundering, and bot-generated interactions follow long-tailed and rapidly shifting distributions shaped by attackers’ strategies. In contrast, the anomalous transaction patterns in our simulated dataset are generated through controlled, rule-based mechanisms, which—while useful for creating reproducible benchmarks—cannot fully emulate the adaptive and strategic nature of real fraudulent activities. Additionally, real platforms are influenced by external factors such as recommendation algorithms, marketing campaigns, and platform policies, all of which introduce additional variability absent in the simulated environment.
These differences imply that the strong performance demonstrated by MSF-Net on the simulated dataset primarily reflects its effectiveness under controlled conditions. When applied to real-world scenarios, the model would likely require domain adaptation, threshold recalibration, or fine-tuning with a small amount of labeled real data to mitigate distributional shifts.
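As a minimal illustration of the threshold-recalibration step, the sketch below selects a new decision threshold that maximizes F1 on a small labeled sample of real-platform risk scores, using scikit-learn; the data variables are hypothetical placeholders.

```python
# Sketch: recalibrate the decision threshold on a small real labeled sample
# to compensate for the simulated-to-real score distribution shift.
import numpy as np
from sklearn.metrics import precision_recall_curve

def recalibrate_threshold(real_scores: np.ndarray, real_labels: np.ndarray) -> float:
    """Return the threshold that maximizes F1 on the real labeled sample."""
    precision, recall, thresholds = precision_recall_curve(real_labels, real_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    # precision/recall carry one extra trailing element relative to thresholds
    return float(thresholds[np.argmax(f1[:-1])])
```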

4.7. Limitations and Future Work

Although MSF-Net demonstrates superior experimental performance and real-time capability in virtual live-streaming financial security recognition, several limitations remain that merit further investigation. First, the dataset utilized in this study was primarily constructed from a simulated virtual live-streaming platform. While representative in terms of data scale and scene complexity, it still diverges from real-world commercial environments, particularly regarding cross-platform behavioral discrepancies and linguistic diversity, which may affect generalization. Second, although MSF-Net’s multimodal fusion architecture theoretically captures higher-order dependencies between emotional and financial signals, extreme data imbalance (e.g., rare fraud instances) may still lead to attention bias and false alarms during feature aggregation. Future research will focus on cross-platform and cross-lingual multimodal transfer learning to enhance the robustness and scalability of MSF-Net across diverse virtual ecosystems. In addition, the integration of adversarial training based on generative defense mechanisms will be explored to strengthen model resilience through synthetic fraudulent transaction generation. Furthermore, combining federated learning with secure multi-party computation will be investigated to enable distributed multimodal security detection while ensuring data privacy, thereby improving the system’s intelligence and trustworthiness in real-world financial applications.

5. Conclusions

This study investigates financial security recognition in virtual live-streaming scenarios and proposes MSF-Net, a multimodal emotion–finance fusion framework for detecting fake reviews, abnormal tipping, and illegal transactions. MSF-Net integrates MAT for cross-modal alignment, FRD for comment authenticity assessment, and MSFDM for multi-signal risk inference, enabling joint modeling of visual, audio, textual, and financial streams. Experiments on a large-scale simulated dataset show that MSF-Net consistently outperforms representative baselines in terms of precision, recall, F1-score, accuracy, and AUC, while maintaining real-time performance. Ablation results further confirm that each component makes a non-trivial contribution to both emotion-related and fraud-related tasks, validating the effectiveness of the hierarchical fusion design. Lightweight deployment experiments indicate that a compressed version of MSF-Net can still achieve competitive performance with low latency, suggesting potential for use on mobile and embedded devices. At the same time, several challenges remain for future work. Because the dataset is simulated, transferring MSF-Net to real commercial platforms will require careful handling of domain shift, more diverse user behaviors, and privacy-constrained financial logs. In addition, deploying MSF-Net on edge devices involves balancing model complexity, latency, and limited computational resources, which motivates further research on hardware-aware model compression, streaming inference, and robust adaptation to evolving fraud patterns. Addressing these issues will be essential to bridge the gap between controlled experimental results and large-scale practical deployment.

Author Contributions

Conceptualization, Y.S., L.Z., R.Z. and M.L.; Data curation, M.D. and X.H.; Formal analysis, H.Z. and R.C.; Funding acquisition, M.L.; Investigation, H.Z. and R.C.; Methodology, Y.S., L.Z. and R.Z.; Project administration, M.L.; Resources, M.D. and X.H.; Software, Y.S., L.Z. and R.Z.; Supervision, M.L.; Validation, H.Z. and X.H.; Visualization, M.D. and R.C.; Writing—original draft, Y.S., L.Z., R.Z., H.Z., M.D., X.H., R.C. and M.L.; Y.S., L.Z., and R.Z. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The architecture of the multimodal alignment transformer (MAT) module.
Figure 2. The architecture of the fake review detection module (FRD).
Figure 3. The architecture of the multi-signal fusion decision module (MSFDM).
Figure 4. Performance comparison of different models.
Figure 5. Bar chart of MSF-Net ablation results under different module and modality configurations.
Figure 6. Risk evolution curves over time, where dashed lines denote key live events and shaded regions indicate detected high-risk intervals.
Figure 7. Multi-head attention weight visualization across modalities and time steps; each sub-plot corresponds to one attention head focusing on different modality–time patterns.
Table 1. Statistics of the multimodal virtual human live-streaming financial security dataset.

| Data | Data Source | Scale | Main Features |
| --- | --- | --- | --- |
| Visual | Virtual human video stream (Unity + FaceRig) | 1.2 TB (200 sessions) | Facial expressions, motion, and posture sequences |
| Audio | Anchor voice recordings (44.1 kHz stereo) | 360 GB (480 h) | Mel-spectrogram, pitch contour, and energy envelope |
| Text | Real-time comments and barrage API logs | 680k entries | Comment semantics, sentiment tags, and user interactions |
| Financial | Platform payment and gifting simulation | 450k records | Transaction amount, timestamp, payment method, and anomaly labels |
Table 2. Key hyperparameters and tensor shape transitions across MSF-Net modules.

| Module | Operation | Input Shape | Output Shape / Hyperparameters |
| --- | --- | --- | --- |
| MAT | Visual patch embedding | (T, H, W, 3) | (T, 512); patch size 16, stride 16 |
| MAT | Audio temporal encoder | (T, 128) Mel-spectrogram | (T, 512); 1D-Transformer depth 2 |
| MAT | Text encoder (BERT) | (L, 768) | (L, 512); hidden dim d = 512 |
| MAT | Cross-modal attention | (T, 512) | (T, 512); 8 heads, FFN dim 2048 |
| MAT | Mamba block | (T, 512) | (T, 512); state dim 256 |
| FRD | Text conv encoder | (L, 768) | (L, 256) → (L, 128); kernels 3, 1 |
| FRD | Affective temporal conv | (T, 512) | (L, 64); kernels 5, 3, 3 |
| FRD | Cross-dimensional fusion | (L, 128) + (L, 64) | (L, d); d = 256 |
| FRD | Contrastive projection | (L, 256) | (L, 128); temperature τ = 0.2 |
| MSFDM | Input projection | (T, 512), (L, 128), (M, 64) | All (·, 512) |
| MSFDM | Cross-signal attention | (M, 512) query | (M, 512); 8 heads |
| MSFDM | Fusion layers (4×) | (M, 512) | (M, 512); FFN dim 2048 |
| MSFDM | Decision heads | (M, 512) | Risk: (1); Class: (3) |
| Final Output | Risk score + behavior class | (M, 512) | $\hat{r}_s \in \mathbb{R}$, $p_c \in \mathbb{R}^3$ |
Table 3. Performance comparison of different models on the proposed virtual live-streaming financial security detection task.

| Model | Precision | Recall | F1-Score | Accuracy | AUC | FPS |
| --- | --- | --- | --- | --- | --- | --- |
| Text-BERT | 0.842 | 0.816 | 0.829 | 0.835 | 0.871 | 45.3 |
| Audio-ResNet | 0.794 | 0.781 | 0.787 | 0.802 | 0.846 | 52.6 |
| Visual-ViT | 0.821 | 0.807 | 0.814 | 0.826 | 0.859 | 49.1 |
| MMBT | 0.865 | 0.849 | 0.857 | 0.861 | 0.893 | 41.8 |
| LXMERT | 0.872 | 0.864 | 0.868 | 0.873 | 0.902 | 38.7 |
| CLIP | 0.881 | 0.873 | 0.877 | 0.882 | 0.913 | 40.4 |
| GraphSAGE | 0.853 | 0.841 | 0.847 | 0.851 | 0.887 | 56.2 |
| FinBERT | 0.889 | 0.875 | 0.882 | 0.887 | 0.919 | 50.9 |
| MSF-Net (Proposed) | 0.932 | 0.924 | 0.928 | 0.931 | 0.956 | 60.7 |
Table 4. Generalization evaluation on CMU-MOSEI.

| Model | Precision | Recall | F1-Score | Accuracy | AUC | FPS |
| --- | --- | --- | --- | --- | --- | --- |
| Text-BERT | 0.728 | 0.670 | 0.698 | 0.712 | 0.743 | 120.3 |
| Audio-ResNet | 0.651 | 0.594 | 0.621 | 0.634 | 0.671 | 128.5 |
| Visual-ViT | 0.675 | 0.618 | 0.645 | 0.657 | 0.692 | 115.2 |
| MMBT | 0.741 | 0.683 | 0.711 | 0.724 | 0.758 | 82.7 |
| LXMERT | 0.758 | 0.700 | 0.728 | 0.736 | 0.773 | 78.4 |
| CLIP | 0.769 | 0.711 | 0.739 | 0.748 | 0.781 | 80.1 |
| GraphSAGE | 0.701 | 0.655 | 0.677 | 0.689 | 0.732 | 110.4 |
| FinBERT | 0.742 | 0.684 | 0.712 | 0.725 | 0.752 | 118.7 |
| MSF-Net | 0.793 | 0.735 | 0.763 | 0.772 | 0.812 | 84.6 |
Table 5. Ablation study results of MSF-Net under different module and modality configurations.

| Configuration | Precision | Recall | F1-Score | Accuracy | AUC | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| w/o MAT module | 0.904 | 0.892 | 0.898 | 0.901 | 0.935 | 28.3 |
| w/o FRD module | 0.895 | 0.879 | 0.887 | 0.889 | 0.927 | 26.5 |
| w/o MSFDM module | 0.882 | 0.867 | 0.874 | 0.879 | 0.919 | 24.7 |
| Text + Financial only | 0.886 | 0.872 | 0.879 | 0.881 | 0.922 | 22.4 |
| Visual + Audio only | 0.873 | 0.864 | 0.868 | 0.870 | 0.911 | 23.1 |
| Full model (MSF-Net) | 0.932 | 0.924 | 0.928 | 0.931 | 0.956 | 21.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
