Article

Dual-Stream Transformer with LLM-Empowered Symbol Drift Modeling for Health Misinformation Detection

by Jingsheng Wang 1, Zhengjie Fu 2, Chenlu Jiang 2,3, Manzhou Li 2,3 and Yan Zhan 2,4,*
1 School of Journalism and Communication, Northwest University, Xi’an 710127, China
2 National School of Development, Peking University, Beijing 100871, China
3 China Agricultural University, Beijing 100083, China
4 Artificial Intelligence Research Institute, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 9992; https://doi.org/10.3390/app15189992
Submission received: 19 August 2025 / Revised: 6 September 2025 / Accepted: 8 September 2025 / Published: 12 September 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In the era of big-data-driven multi-platform and multimodal health information dissemination, the rapid spread of false and misleading content poses a critical threat to public health awareness and decision making. To address this issue, a dual-stream Transformer-based multimodal health misinformation detection framework is presented, incorporating a symbol drift detection module, a symbol-aware text graph neural network, and a crossmodal alignment fusion module. The framework enables precise identification of implicit misleading health-related symbols, comprehensive modeling of textual dependency structures, and robust detection of crossmodal semantic conflicts. A domain-specific health-symbol-sensitive lexicon is constructed, and contextual drift intensity is quantitatively measured and embedded as explicit features into the text GNN. Bidirectional cross-attention and contrastive learning are further employed to enhance crossmodal semantic alignment. Extensive experiments on a large-scale real-world multimodal health information dataset, encompassing heterogeneous data sources typical of big data environments, demonstrate that the proposed method consistently outperforms state-of-the-art baselines in CTR prediction, multimodal recommendation, and ranking tasks. The results indicate substantial improvements in both accuracy and ranking quality, while ablation studies further verify the contributions of symbol drift modeling, graph-structured representation, and crossmodal fusion. Overall, the proposed approach advances big data analytics for multimodal misinformation detection and provides an interpretable and scalable solution for public health communication governance.

1. Introduction

With the rapid expansion of new media content and the increasing complexity of user behavior patterns, short video platforms and news portals have imposed higher requirements on personalized recommendation systems [1]. Click-Through Rate (CTR) prediction, as a core metric in recommendation ranking, has traditionally relied on static features (e.g., user profiles, content tags), which are limited in capturing the dynamic evolution of user behavior. These challenges become particularly critical in the context of health information dissemination, where misinformation spreads rapidly on social media and video platforms by exploiting sensational language, emotionally charged symbols, and multimodal presentations (e.g., pairing misleading text with persuasive images). Unlike conventional e-commerce or entertainment scenarios, health misinformation often relies on subtle shifts in terminology—such as the misleading use of “sugar-free,” “detox,” or “natural”—to attract clicks and influence decision making. Traditional CTR models, with their reliance on static or low-order features, are unable to capture such semantic drifts or the crossmodal inconsistencies that arise when textual claims conflict with visual cues. Early CTR prediction models, primarily based on traditional machine learning methods such as logistic regression, further highlight these limitations by relying heavily on manually designed features, including user demographics (e.g., age, gender, geographic location), content-related attributes (e.g., category labels, keyword tags), and simple interaction statistics (e.g., historical click counts or frequency of visits) [2,3,4]. While such features can partially reflect user interests and content attributes, they fail to capture the dynamic evolution of user behavior and the temporal propagation of content. Linear models assume feature independence and cannot represent nonlinear interest dynamics, while factorization machines (FMs) represent an improvement by capturing second-order interactions but remain insufficient for modeling higher-order or sequential dependencies [5,6,7,8].
Short video and news content inherently possess significant temporal propagation characteristics and multimodal fusion features (text, image, audio, interaction, etc.), which influence both actual propagation potential and user preference response. For example, studies have indicated that the dissemination effectiveness of short videos from scientific journals is affected by image content, audio, publishing time, and duration, with the probability of a user clicking on the same video exhibiting an exponential decay trend over time. Static feature-based models cannot adapt to such time sensitivity, whereas multimodal discourse analysis provides theoretical and methodological support for enhancing the scientific and communicative value of short-video-based knowledge services in scientific journals [9]. In short video multi-label classification tasks, multimodal feature extraction and fusion techniques have been employed to improve classification performance. By integrating multimodal information such as images, text, and audio, models can more accurately comprehend video content, thereby enhancing classification accuracy and robustness [10]. These findings demonstrate that multimodal fusion can enhance content-level propagation potential assessment, providing richer feature support for recommendation systems. The key challenge of multimodal fusion in short video recommendation lies in crossmodal alignment and temporal synchronization. For example, a natural temporal correlation exists between the audio content and image frames of a video, yet existing methods often process each modality independently, leading to information loss. Recent research has proposed Dynamic Time Warping (DTW) techniques to align temporal features between audio and image modalities, significantly improving the accuracy of multimodal content propagation potential assessment [11]. Furthermore, contrastive learning has been utilized to enhance the discriminability of multimodal features, such as by constructing image–text contrastive pairs to enforce the learning of more robust crossmodal representations [12].
In recent years, deep learning techniques have achieved remarkable progress in recommendation systems [13]. Models such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and Transformers have been widely applied to CTR prediction tasks [14]. These models can automatically capture complex relationships and nonlinear interactions in data, enabling the acquisition of higher-order, more abstract feature representations. However, existing methods still exhibit limitations in modeling dynamic changes in user behavior and in integrating multimodal content [15]. For example, the IARM model integrates user interest modeling with multi-head attention to effectively capture dynamic changes in user interests [16]. The HyperCTR framework constructs hypergraphs at the user and item levels to model temporal dependencies in user behavior and group features [17]. The ContentCTR model employs a multimodal Transformer architecture to achieve frame-level CTR prediction, enabling the capture of dynamic changes in live-stream content [18]. The Multi-Head Temporal Latent Attention (MTLA) method compresses key–value caches to significantly improve Transformer inference speed while maintaining performance, making it suitable for real-time recommendation scenarios [19]. These studies suggest that recommendation frameworks integrating temporal attention mechanisms with multimodal features can substantially enhance CTR prediction performance and recommendation accuracy. In summary, a unified recommendation framework integrating temporal attention mechanisms and multimodal Transformer architecture is proposed, which jointly models user behavior evolution and content propagation potential, optimizing CTR prediction and ranking performance. The main contributions of this framework are as follows:
  • Construction of a Transformer-based sequential modeling network for click behavior to simulate the evolution of user interests and content click dynamics.
  • Design of a multimodal propagation potential perception module that integrates text, image, audio, and user bullet comment information to enhance content-level propagation potential assessment.
  • Joint incorporation of propagation potential factors and click sequence features to optimize recommendation ranking, achieving precise recommendations driven by both interest and popularity.

2. Related Work

2.1. CTR Prediction and Temporal Modeling Methods

CTR prediction, a core task in recommendation systems, has evolved from traditional statistical models to deep learning approaches. Early methods such as logistic regression (LR) and factorization machines (FMs) played important roles: LR predicts clicks through linear combinations of handcrafted features, while the FM captures second-order feature interactions [20]. However, both struggle with complex feature relationships, interest drift, and sequential dependencies [8]. With the introduction of deep learning, models such as MLP, CNN, and RNN can automatically learn nonlinear patterns and temporal dependencies. Combining these with Transformer architectures enables effective modeling of long-term dependencies, significantly improving CTR performance. Representative models include DeepFM, which integrates FM and DNN to capture both low- and high-order feature interactions [21].
Deep learning breakthroughs mainly lie in dynamic interest modeling and long-term dependency modeling. DIN and DIEN enhance interest representation by using attention and interest evolution mechanisms [22]; Transformer4Rec applies self-attention to efficiently capture ultra-long behavior sequences [23]; and TPP-CTR leverages temporal point processes to model burstiness and periodicity of user interests, improving cold-start scenarios. Despite these advances, challenges remain: Transformers face $O(n^2)$ complexity in very long sequences, existing methods struggle to capture abrupt interest shifts (e.g., news-driven clicks), and the integration of temporal and static features still requires further optimization [24].

2.2. Multimodal Content Understanding and Propagation Modeling

The development of multimodal fusion has undergone technical iterations from early concatenation to deep semantic alignment. Early methods (e.g., MMGCN, GNN-CON) directly concatenated feature vectors from modalities such as text and image but ignored semantic heterogeneity between modalities, leading to low recall rates in crossmodal retrieval  [25,26]. In the deep fusion stage, ViLT performs interactions between image patches and text tokens, achieving end-to-end vision–language modeling and improving multimodal classification accuracy [27,28]; FLAVA learns unified cross-modal representations through pretraining, significantly increasing video title matching recall rates, with an average score of 69.92 on multimodal tasks, outperforming other models [29]; in domain adaptation, VideoBERT treats video frames as “visual tokens” to capture temporal semantics, improving CTR in short video recommendation [30]; CLIP employs image–text contrastive learning to learn joint crossmodal representations, greatly enhancing multimodal content understanding and generation capabilities [31].
In terms of breakthroughs in propagation modeling, the traditional SIR model cannot account for user heterogeneity. While SIR can reasonably describe topic dissemination in web forums, its homogeneity assumption limits applicability to real-world populations with complex spatial and group structures [32]. In contrast, diffusion models have been used to predict content propagation potential with much higher accuracy than SIR [33]. Additionally, popularity prediction and diffusion models can aid recommendation systems in assessing potential content impact by analyzing propagation paths and speeds. For instance, some studies have simulated content dissemination processes using diffusion models, significantly improving propagation potential estimation accuracy for short video popularity [34].

2.3. Content Popularity and Propagation Potential Modeling in Recommendation Systems

Modeling content popularity and propagation potential is critical for enhancing the dynamism and precision of recommendation systems, with the core challenge being the capture of temporal evolution patterns of content relevance and the interaction-driven mechanisms of user behavior [35,36]. In recent years, researchers have integrated deep learning, temporal analysis, and network modeling techniques to establish multidimensional content dynamics evaluation frameworks, greatly improving system responsiveness to fluctuations in user interests and trends in content dissemination.
Propagation potential assessment requires combining content semantic features with user interaction networks. Early studies evaluated propagation power by measuring watch time, likes, and similar metrics, but they ignored inherent semantic associations in content and collaborative effects in user behavior [37]. Recently, multimodal fusion and graph neural network techniques have refined propagation modeling. In recommendation systems, multimodal fusion can integrate diverse content features such as text, images, and audio, while contrastive learning can enhance a model’s ability to distinguish between different levels of content popularity and user preferences, thereby improving recommendation precision [31,38]. In addition, dynamic sparse activation and mixture-of-experts (MoE) models combine sparse activation with expert subnetworks to reduce computation and improve performance. In content recommendation scenarios, MoE models can dynamically allocate computational resources according to different content types and user groups, capturing changes in content popularity and user interests more efficiently [39].

3. Materials and Method

3.1. Data Collection

The dataset was constructed through systematic acquisition, cleaning, and annotation of health-related multimodal information from multiple platforms, encompassing dimensions such as content sources, collection periods, data modalities, and semantic labels, as shown in Table 1. The collection process was conducted continuously from January 2023 to December 2024 to ensure that the samples reflected the most recent trends and characteristics of health information dissemination. Data sources included two major social media platforms—Weibo and Xiaohongshu. For Weibo, original data were obtained from topic pages, post content, and user comment sections through the official open API and customized web crawlers. Search queries covered core health-related terms such as “sugar-free”, “weight loss”, and “light diet” and were expanded using logical retrieval to include semantically related terms such as “natural”, “low-fat”, and “healthy eating”. For Xiaohongshu, data were collected via keyword search result pages and recommendation feeds, covering the titles, main text, images, tags, and comments of posts. The collection process strictly preserved the pairing between images and corresponding text, with mismatched samples eliminated through a combination of automated image–text matching algorithms and manual inspection to ensure semantic consistency between the two modalities. For text data, original punctuation, emojis, hashtags, and unit symbols were retained to facilitate subsequent symbolic semantic drift modeling. For image data, original resolution storage was maintained to preserve fine details, and metadata were recorded, including collection time, anonymized author ID, as well as counts of likes, comments, and shares, to support subsequent multimodal propagation potential modeling. The annotation process was carried out by three members with expertise in health communication, clinical nutrition, and public health. The labeling criteria were defined into three categories: authentic health information, which must conform to authoritative guidelines or peer-reviewed scientific literature; potentially misleading information, which often contains unverified health claims, exaggerated comparisons, or selectively cited scientific findings; and explicitly false information, which involves statements directly contradicting scientific facts or entirely incorrect health assertions. Multiple rounds of cross-annotation and consistency checks were conducted, achieving a Cohen’s κ value of 0.87. The final multimodal fake health information detection dataset maintained balanced platform distribution, label proportions, and modality types, covering both text and image modalities. This dataset provides a robust foundation for symbolic drift analysis, text graph modeling, and multimodal semantic alignment experiments.

3.2. Data Preprocessing and Data Augmentation

In the construction of multimodal recommendation systems, raw data often exhibit substantial modality differences, distribution imbalance, and scale inconsistency. Without scientifically designed preprocessing and augmentation procedures, model training may fail to achieve stable convergence and effective generalization. To address this, a comprehensive preprocessing pipeline was developed for multimodal data obtained from short video platforms (e.g., Douyin, Bilibili) and news platforms, including audio, image, and text modalities. This pipeline incorporates feature alignment and normalization, user behavior sequence modeling, and multi-strategy data augmentation to ensure that all modality features have consistent scales, compatible semantic spaces, and enhanced robustness before being fed into the model.
For the audio modality, each raw audio signal $x_{\text{audio}}$ was converted into a mel-spectrogram to improve the utility of spectral features. Specifically, $x_{\text{audio}}$ was first transformed using the short-time Fourier transform (STFT) and subsequently mapped through a mel filter bank defined in the following equation:
$$M_{\text{mel}} = \mathrm{STFT}_{\text{mel}}(x_{\text{audio}}),$$
where $M_{\text{mel}}$ denotes the mel-spectrogram matrix, $x_{\text{audio}}$ is the raw audio signal, and $\mathrm{STFT}_{\text{mel}}(\cdot)$ represents the operation of applying time–frequency transformation followed by mel-scale filtering aligned with human auditory perception. This processing preserves local time–frequency characteristics while reducing high-frequency redundancy, thereby enhancing the model’s ability to identify audio patterns.
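A minimal sketch of this step is shown below, assuming the mel-spectrogram is computed with librosa; the sampling rate, n_fft, hop_length, and n_mels values are illustrative choices not specified in the text, and the final log compression is added as common practice rather than taken from the equation above.

```python
# Sketch of the STFT + mel-filter-bank step; parameter values are illustrative assumptions.
import numpy as np
import librosa

def audio_to_mel(x_audio: np.ndarray, sr: int = 16000,
                 n_fft: int = 1024, hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Return a (log-)mel spectrogram M_mel for the raw waveform x_audio."""
    stft = librosa.stft(x_audio, n_fft=n_fft, hop_length=hop_length)      # time-frequency transform
    power = np.abs(stft) ** 2                                             # power spectrogram
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)   # mel-scale filtering
    return librosa.power_to_db(mel)                                       # log compression for stability
```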
For the image modality, each input image $x_{\text{image}}$ was processed using a pretrained CNN to extract high-level visual features using the following equation:
$$F_{\text{image}} = \mathrm{CNN}(x_{\text{image}}),$$
where $F_{\text{image}}$ is the image feature vector, and $\mathrm{CNN}(\cdot)$ denotes a feature extraction network composed of convolutional, pooling, and nonlinear activation layers. This process enables the automatic capture of key information such as edge structures, texture distributions, and color patterns while leveraging pretrained weights to improve feature generalization.
For the text modality, each text sequence $x_{\text{text}}$ was encoded using a BERT model to generate contextual embeddings defined as follows:
$$E_{\text{text}} = \mathrm{BERT}(x_{\text{text}}),$$
where $E_{\text{text}}$ denotes the contextual embedding representation, and $\mathrm{BERT}(\cdot)$ employs a multi-layer bidirectional Transformer architecture to model global semantic dependencies, capturing complex semantic features. To ensure comparability during feature fusion, all modality features were standardized to zero mean and unit variance.
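The text-encoding step can be sketched with the Hugging Face transformers library as follows; the checkpoint name and the per-sample standardization are assumptions, since the text only states that a BERT model is used and that features are standardized.

```python
# Sketch of BERT contextual encoding followed by zero-mean / unit-variance standardization.
# The checkpoint name is an assumption; the paper only specifies "a BERT model".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_text(x_text: str) -> torch.Tensor:
    """Return standardized token-level contextual embeddings E_text of shape (L, 768)."""
    inputs = tokenizer(x_text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        e_text = encoder(**inputs).last_hidden_state.squeeze(0)
    return (e_text - e_text.mean()) / (e_text.std() + 1e-6)   # zero mean, unit variance
```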
For user behavior data modeling, the publication time of news items or the upload time of videos was treated as a global temporal anchor, and user actions such as clicks, likes, and comments were ordered chronologically to form complete behavior sequences defined as follows:
$$B_u = [b_1, b_2, \ldots, b_T],$$
where $B_u$ represents the complete behavior sequence of user $u$, $b_i$ denotes the $i$-th chronologically ordered behavior, and $T$ is the sequence length. To satisfy the model’s fixed input length requirements and improve training efficiency, each sequence was segmented into subsequences of length $L$ as defined below:
$$B_u^{(i)} = [b_i, b_{i+1}, \ldots, b_{i+L-1}],$$
where $B_u^{(i)}$ denotes the subsequence starting from the $i$-th behavior, and $L$ is the fixed subsequence length. This slicing strategy preserves temporal dependencies while highlighting interest fluctuation patterns within short time windows.
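The slicing of behavior sequences into fixed-length windows can be sketched as below; the sliding stride is an assumption, as the text does not state whether consecutive subsequences overlap.

```python
# Sketch of segmenting a chronologically ordered behavior sequence B_u into windows of length L.
from typing import List, Sequence

def slice_behavior_sequence(b_u: Sequence, L: int, stride: int = 1) -> List[list]:
    """Return all subsequences [b_i, ..., b_{i+L-1}]; stride controls window overlap."""
    return [list(b_u[i:i + L]) for i in range(0, len(b_u) - L + 1, stride)]

# Example: six behaviors with L = 4 yield three overlapping windows.
windows = slice_behavior_sequence(["click", "like", "click", "comment", "click", "share"], L=4)
```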
In the data augmentation stage, three strategies were implemented to improve model robustness and generalization under varying data distributions: temporal perturbation, click sparsity simulation, and multimodal information missing simulation. In temporal perturbation, Gaussian noise was added to behavior timestamps using the following equation:
$$t_i' = t_i + \delta_i, \qquad \delta_i \sim \mathcal{N}(0, \sigma^2),$$
where $t_i$ is the original timestamp of the $i$-th behavior, $t_i'$ is the perturbed timestamp, and $\delta_i$ is a normally distributed random variable with mean 0 and variance $\sigma^2$. This simulates natural temporal variations in user behavior and prevents overfitting to fixed time intervals.
In click sparsity simulation, a proportion $p$ of behavior events was randomly removed using the following equation:
$$b_i' = \begin{cases} b_i & \text{with probability } 1-p, \\ 0 & \text{with probability } p, \end{cases}$$
where $b_i'$ denotes the $i$-th behavior after augmentation, and $0$ indicates a removed event. This forces the model to learn effective preference patterns even under sparse interaction conditions.
In multimodal information missing simulation, image features were randomly occluded, and text embeddings were randomly masked using the following equation:
$$F_{\text{image}}' = \mathrm{Occlude}(F_{\text{image}}, p), \qquad E_{\text{text}}' = \mathrm{Replace}(E_{\text{text}}, p, [\mathrm{MASK}]),$$
where $F_{\text{image}}'$ is the occluded image feature at probability $p$, $\mathrm{Occlude}(\cdot)$ denotes random masking of feature regions, $E_{\text{text}}'$ is the masked text embedding at probability $p$, and $\mathrm{Replace}(\cdot, p, [\mathrm{MASK}])$ denotes replacing a proportion of tokens with the $[\mathrm{MASK}]$ symbol. This augmentation simulates real-world recommendation scenarios where multimodal information may be incomplete.
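The three augmentation strategies can be summarized in the following sketch; the values of sigma, p, and the masking value are tunable assumptions rather than reported settings.

```python
# Sketch of temporal perturbation, click-sparsity simulation, and modality masking.
import numpy as np

rng = np.random.default_rng(0)

def perturb_timestamps(t: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """t'_i = t_i + delta_i, with delta_i ~ N(0, sigma^2)."""
    return t + rng.normal(0.0, sigma, size=t.shape)

def simulate_click_sparsity(b: np.ndarray, p: float = 0.1) -> np.ndarray:
    """Keep each behavior with probability 1 - p; removed events are set to 0."""
    keep = rng.random(b.shape) >= p
    return b * keep

def mask_modalities(f_image: np.ndarray, e_text: np.ndarray, p: float = 0.15,
                    mask_value: float = 0.0):
    """Randomly occlude image-feature entries and replace a proportion p of text-token rows."""
    f_aug = f_image.copy()
    f_aug[rng.random(f_image.shape) < p] = mask_value            # occlude feature regions
    e_aug = e_text.copy()
    e_aug[rng.random(e_text.shape[0]) < p] = mask_value          # token-level [MASK]-style replacement
    return f_aug, e_aug
```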

3.3. Proposed Method

3.3.1. Overall

The proposed method is designed from the model architecture perspective, taking as input an aligned text sequence and its corresponding image. The process begins with the symbol drift detection module, in which candidate symbols in the text are retrieved and scored based on a domain-specific health symbol-sensitive lexicon and contextual semantic embeddings, yielding both word-level drift weights and a sample-level drift score. The drift weights are concatenated with the word embeddings as explicit features to form a “drift-aware” embedding matrix, and the sample-level score is propagated as global side information throughout subsequent layers.
The input then proceeds to the Symbol-Aware Text GNN module, where a word-level dependency graph is constructed for the same text, with node features formed by concatenating BERT contextual embeddings and drift weights. The graph edges are defined by syntactic dependencies and intra-paragraph global connections. Attention-based GCN layers iteratively aggregate neighborhood information, ensuring that high-drift nodes exert greater influence on semantic propagation. Multi-scale pooling then produces a global text representation, while retaining interpretable node importance distributions. In parallel, the visual branch encodes the image into region-level visual tokens and a global image vector using ViT.
Both branches merge in the image–text alignment multimodal fusion module: First, a dual-tower similarity learning mechanism maps the global text and image representations into a shared semantic space, using in-batch negatives to provide discriminative signals; then, cross-attention is applied to achieve fine-grained alignment, where text tokens query image region features and image regions reciprocally query text tokens, yielding both local consistency and conflict evidence. In the fusion stage, the aligned global text and image representations, crossmodal interaction vectors, and the sample-level drift score are concatenated and passed through multiple nonlinear projection layers to produce a fused representation, which is fed into a three-class classifier to predict “authentic,” “potentially misleading,” or “explicitly false.”
The training objective is multi-task: Cross-entropy optimizes the main classification performance; a contrastive similarity loss draws matched samples closer while pushing mismatched ones apart; a consistency constraint suppresses over-alignment for image-true/text-false cases; and a drift regularizer encourages sparse and discriminative attention distributions on symbol nodes. During inference, the same forward path is followed, and the model outputs the predicted class, sample-level consistency score, and word/region-level interpretability heatmaps, enabling end-to-end joint modeling and explainable discrimination of symbol drift, text structure, and image–text consistency.

3.3.2. Symbol-Sensitive Lexicon and Drift Score Extraction

The implementation of the symbol-sensitive lexicon and drift score extraction module centers on explicitly modeling misleading health-related symbols from a semantic perspective and quantifying their semantic shift in a given context, thereby providing high-discriminability features for subsequent text structural modeling and multimodal fusion. As shown in Figure 1, a symbol-sensitive lexicon for the health domain is constructed through a combination of manual curation and data-driven expansion. The initial vocabulary is compiled by experts in health communication and nutrition, selecting approximately 300 high-frequency health-related terms prone to misleading use, such as “sugar-free,” “low-fat,” and “natural.” This seed list is then expanded by calculating cosine similarity between seed entries and candidate terms in a large-scale health corpus (drawn from PubMed abstracts, verified health portals, and social media health discussions) using BERT embeddings. A similarity threshold of 0.6 is applied to retain semantically related terms, and PMI co-occurrence scores are further employed to filter candidates that frequently appear in misleading or ambiguous contexts. Terms not meeting the threshold or judged irrelevant during expert review are discarded. After denoising and manual validation, the final lexicon $V_s$ is obtained, ensuring both domain coverage and practical relevance for health misinformation detection.
In the feature extraction phase, for an input token sequence $\{w_1, w_2, \ldots, w_n\}$, symbol words matching $V_s$ are identified, and their positions $P_s$ are recorded. BERT is then used to generate contextual embeddings $e(w_i)$ for each token. To compute a symbol’s drift degree in the current context, a reference semantic representation $e_{\mathrm{ref}}(w)$ from a control corpus $C_{\mathrm{ref}}$ is retrieved for the same symbol. The semantic difference is computed as $\Delta(w_i) = \lVert e(w_i) - e_{\mathrm{ref}}(w_i) \rVert_2$ and combined with its normalized term-frequency weight $\alpha(w_i)$ in the text to yield the symbol drift strength $S(w_i) = \alpha(w_i) \cdot \Delta(w_i)$. The sample-level drift score $D_{\text{sample}}$ is then obtained as the weighted average of all symbol drift strengths in the sample.
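A compact sketch of the drift computation follows; the contextual and reference embeddings are assumed to be precomputed, and the sample-level score is the frequency-weighted sum (i.e., the weighted average) of the per-symbol strengths.

```python
# Sketch of Delta(w_i), S(w_i), and D_sample as defined above; inputs are assumed precomputed.
import numpy as np

def symbol_drift_scores(e_ctx: np.ndarray, e_ref: np.ndarray, tf: np.ndarray):
    """
    e_ctx: (k, d) contextual embeddings of matched symbol tokens
    e_ref: (k, d) reference embeddings of the same symbols from the control corpus C_ref
    tf:    (k,)   term frequencies of the symbols in the current text
    """
    delta = np.linalg.norm(e_ctx - e_ref, axis=1)       # Delta(w_i) = ||e(w_i) - e_ref(w_i)||_2
    alpha = tf / (tf.sum() + 1e-12)                     # normalized term-frequency weights
    s = alpha * delta                                   # S(w_i) = alpha(w_i) * Delta(w_i)
    d_sample = float(s.sum())                           # weighted average over all symbols
    return s, d_sample
```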
Architecturally, the module consists of an embedding layer, a symbol matching and masking layer, a BERT encoder, a difference computation unit, and a fully connected mapping layer. The BERT encoder comprises 12 Transformer layers with hidden dimension 768 and 12 attention heads. Its output is concatenated with the drift score and compressed via a linear mapping to produce a fixed-length feature vector $d_{\text{sym}}$. This vector is used both as the initialization attribute for symbol nodes in the text graph and as a global auxiliary feature input to the multimodal fusion module.
From a mathematical standpoint, this design treats symbol drift modeling as a problem of measuring displacement in a cross-domain semantic space, using the authoritative health domain as the reference domain, and quantifying displacement magnitude via embedding differences. This approach offers two key advantages: first, the drift measurement is strongly context-dependent, allowing dynamic adaptation to shifts in symbol meaning across different textual environments, thus avoiding the generalization limits of static lexicon methods; second, by injecting symbol drift features into subsequent networks along both node-level and global paths, the design highlights the importance of high-drift nodes in local structure updates while also providing explicit discriminative signals in global classification, thereby improving the detection of implicitly misleading “scientific veneer”-type false health information.

3.3.3. Symbol-Aware Text GNN

The Symbol-Aware Text GNN module is designed to integrate the contextual dependency structure of the text with symbol drift features into a unified graph representation, enabling the simultaneous capture of local semantic dependencies and symbol importance during iterative graph updates. As shown in Figure 2, the input to this module is a word-level embedding matrix $X \in \mathbb{R}^{n \times d}$ derived from the symbol-sensitive lexicon and drift score extraction stage, in which each node vector is formed by concatenating the BERT-encoded contextual representation with the scalar drift intensity. The construction of the text graph is based on dependency parsing results to generate an adjacency matrix $A \in \mathbb{R}^{n \times n}$, with paragraph-level fully connected edges added to enhance global information flow across sentences.
The network architecture employs a three-layer symbol-aware graph convolutional network (Symbol-Aware GCN), where each layer consists of graph convolution, symbol-weighted attention, and residual connections. The first layer takes $H^{(0)} = X$ as input, with hidden dimension $d_1 = 256$; the second layer maintains $d_2 = 256$; the third layer expands to $d_3 = 512$ to strengthen global representation capacity. The update in each layer is computed as
$$H^{(l+1)} = \sigma\!\left(\tilde{A} H^{(l)} W^{(l)} + \beta \cdot M_s H^{(l)}\right) + H^{(l)},$$
where $\tilde{A} = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-loops, $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ is the learnable weight matrix for layer $l$, $\sigma$ is the ReLU activation, $M_s$ is a diagonal symbol weight matrix whose elements correspond to the node-level symbol drift intensities, and $\beta$ is the symbol attention coefficient that explicitly amplifies the influence of high-drift nodes on their neighbors during updates.
To show that this structure effectively integrates symbol information with graph-structural features, consider node $i$ at layer $l+1$, whose representation can be expanded as
$$h_i^{(l+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\, h_j^{(l)} W^{(l)} + \beta\, m_{s,i}\, h_i^{(l)}\right) + h_i^{(l)},$$
where $\tilde{d}_i$ is the normalized degree of node $i$, and $m_{s,i}$ is its symbol weight. It follows that when $m_{s,i}$ is large, the node’s features receive greater emphasis both through the residual term and neighborhood updates, thereby exerting a stronger influence on the global representation over multiple propagation steps. The symmetric normalization in $\tilde{A}$ ensures numerical stability in feature propagation, while residual connections mitigate over-smoothing in deeper graph convolution layers.
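A minimal PyTorch sketch of one Symbol-Aware GCN layer is given below. It restates the layer update above; applying $W^{(l)}$ to the symbol-weighted term and projecting the residual when the input and output dimensions differ are assumptions made here for dimensional consistency, not details stated in the text.

```python
# Sketch of one Symbol-Aware GCN layer: normalized aggregation + symbol-weighted term + residual.
import torch
import torch.nn as nn

class SymbolAwareGCNLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, beta: float = 1.0):
        super().__init__()
        self.w = nn.Linear(d_in, d_out, bias=False)                                  # W^(l)
        self.proj = nn.Linear(d_in, d_out, bias=False) if d_in != d_out else nn.Identity()
        self.beta = beta                                                             # symbol attention coefficient

    def forward(self, h: torch.Tensor, adj: torch.Tensor, m_s: torch.Tensor) -> torch.Tensor:
        """h: (n, d_in) node features; adj: (n, n) adjacency; m_s: (n,) drift intensities."""
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)                      # A + I (self-loops)
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]                   # D^-1/2 (A+I) D^-1/2
        msg = a_norm @ self.w(h)                                                     # graph convolution term
        sym = self.beta * m_s[:, None] * self.w(h)                                   # drift-weighted self term
        return torch.relu(msg + sym) + self.proj(h)                                  # activation + residual
```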
In the global representation generation phase, global average pooling and max pooling are applied to the outputs of each of the three layers, and the resulting vectors are concatenated to form the final text representation $f_{\text{text}} \in \mathbb{R}^{2 d_3}$. This vector encodes long-range semantic relationships captured via dependency structure, while preserving the salience of symbol drift at the local node level. This module is jointly utilized with the downstream image–text alignment multimodal fusion module via the shared $f_{\text{text}}$ and the global symbol drift score vector $d_{\text{sym}}$, enabling explicit use of symbol information for semantic conflict detection in crossmodal alignment. The overall algorithm flow can be found in Appendix A (Algorithm A1).
This design confers significant advantages for the task at hand: Symbol drift information is encoded into node features and preserved and amplified during graph convolution propagation, aiding in the detection of highly covert patterns such as “true image, false text” and “scientific veneer” in deceptive health information; the structured dependency graph enables consistent context modeling in long texts; and through co-optimization with the multimodal fusion module, both text classification accuracy and crossmodal consistency modeling are improved, supporting robust detection of multimodal health misinformation.

3.3.4. Multimodal Fusion Module for Image–Text Alignment

As shown in Figure 3, this module takes as input the global text vector $f_{\text{text}} \in \mathbb{R}^{512}$ from the Symbol-Aware Text GNN and the image representation from the visual encoder. Through a shared projection head, bidirectional cross-attention, and discriminative contrastive learning, image–text alignment and fusion are achieved, resulting in a fused vector for three-way classification. On the visual side, a ViT-B/16 structure is employed, where the input image is resized to $224 \times 224$ and divided into $16 \times 16$ patches, producing $14 \times 14 = 196$ visual tokens with channel dimension $C_{\text{img}} = 768$. After 12 Transformer layers, the CLS vector $g_{\text{img}} \in \mathbb{R}^{768}$ and the region token matrix $R \in \mathbb{R}^{196 \times 768}$ are extracted. On the textual side, $f_{\text{text}}$ is linearly projected to $C_{\text{txt}} = 768$ to obtain $g_{\text{txt}} \in \mathbb{R}^{768}$, and the token sequence $T \in \mathbb{R}^{L \times 768}$ ($L \le 128$) from the word-level graph readout is retained. Both sides are first passed through a shared pre-projection head (two-layer MLP, $768 \to 1024 \to 768$, GELU activation with LayerNorm, dropout = 0.1) to enter the alignment space, followed by two layers of bidirectional cross-attention blocks (each with 8 attention heads, head dimension $d_h = 64$, model dimension 768, feed-forward width 3072, dropout = 0.1) to generate fine-grained interaction features.
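The visual branch can be sketched with a standard ViT-B/16 backbone from Hugging Face; the specific checkpoint name is an assumption, and only the split into the CLS vector and the 196 patch tokens is taken from the description above.

```python
# Sketch of extracting the global CLS vector g_img and region-token matrix R with a ViT-B/16 encoder.
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")   # checkpoint is an assumption
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def encode_image(image):
    """Return (g_img, R): the CLS vector (768,) and the 196 x 768 region-token matrix."""
    inputs = processor(images=image, return_tensors="pt")        # resize to 224x224 and normalize
    with torch.no_grad():
        hidden = vit(**inputs).last_hidden_state.squeeze(0)      # (1 + 196, 768)
    return hidden[0], hidden[1:]
```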
To incorporate symbol drift into the alignment process, the sample-level drift score $D_{\text{sym}}$ is added as a bias modulation term to the similarity and attention scores of matching pairs, thereby enhancing the discriminability of conflict regions in “true image, false text” scenarios. For clarity, define text queries $Q_t = T W_Q$, image keys $K_i = R W_K$, and values $V_i = R W_V$ (each $W_{*}$ has dimension $768 \times 768$). The drift-aware attention scores and crossmodal readouts are computed as
$$A_{t \to i} = \mathrm{softmax}\!\left(\frac{Q_t K_i^{\top}}{\sqrt{d_h}} + \gamma D_{\text{sym}} \cdot \mathbf{1}\right), \qquad Z_{t \to i} = A_{t \to i} V_i,$$
where $\gamma > 0$ is the modulation coefficient, and $\mathbf{1}$ is an all-ones matrix used to uniformly raise the attention temperature for matching pairs. Symmetrically, $A_{i \to t}$ and $Z_{i \to t}$ are defined. Both readouts are average-pooled, passed through residual projection, and concatenated with the global vectors to form the fused representation
$$h_{\text{fuse}} = \left[\, g_{\text{txt}} \,;\, g_{\text{img}} \,;\, \mathrm{pool}(Z_{t \to i}) \,;\, \mathrm{pool}(Z_{i \to t}) \,;\, D_{\text{sym}} \,\right] \in \mathbb{R}^{4 \cdot 768 + 1}.$$
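The drift-aware attention and the fusion concatenation can be sketched as follows. The sketch is single-head and scales scores by the full model dimension rather than the per-head dimension $d_h$, and pooling is implemented as a simple mean; these simplifications are assumptions made for brevity.

```python
# Sketch of drift-aware cross-attention (text queries attending to image regions) and fusion.
import torch
import torch.nn.functional as F

def drift_aware_cross_attention(t_tokens, r_tokens, w_q, w_k, w_v, d_sym, gamma=0.1):
    """t_tokens: (L_t, 768); r_tokens: (L_v, 768); w_*: (768, 768); d_sym: scalar drift score."""
    q, k, v = t_tokens @ w_q, r_tokens @ w_k, r_tokens @ w_v
    scores = q @ k.T / (q.size(-1) ** 0.5) + gamma * d_sym        # uniform drift bias on matching pairs
    return F.softmax(scores, dim=-1) @ v                          # Z_{t->i}

def fuse(g_txt, g_img, z_ti, z_it, d_sym):
    """h_fuse = [g_txt ; g_img ; pool(Z_{t->i}) ; pool(Z_{i->t}) ; D_sym], a (4*768 + 1)-dim vector."""
    return torch.cat([g_txt, g_img, z_ti.mean(dim=0), z_it.mean(dim=0),
                      torch.tensor([float(d_sym)])], dim=-1)
```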
The classification head is a two-layer MLP ($1024 \to 512 \to 3$, GELU + LayerNorm, dropout = 0.2) that outputs the three-class probability vector. To obtain a discriminative shared representation and jointly optimize crossmodal consistency and classification performance, the training objective combines a temperature-scaled contrastive alignment loss with cross-entropy. Let $u_j = W_u g_{\text{txt}}$ and $v_j = W_v g_{\text{img}}$ ($W_u, W_v \in \mathbb{R}^{768 \times 768}$) be the projections of the $j$-th matched image–text pair, and let $s_{jk} = u_j^{\top} v_k$ denote their similarity. The contrastive loss is then
$$\mathcal{L}_{\text{align}} = -\frac{1}{B} \sum_{j=1}^{B} \left[ \log \frac{\exp\!\big((s_{jj} + \lambda D_{\text{sym}})/\tau\big)}{\sum_{k=1}^{B} \exp(s_{jk}/\tau)} + \log \frac{\exp\!\big((s_{jj} + \lambda D_{\text{sym}})/\tau\big)}{\sum_{k=1}^{B} \exp(s_{kj}/\tau)} \right],$$
where $\tau$ is the temperature, and $\lambda > 0$ controls the weight of drift on the positive-pair score. The classification cross-entropy is denoted as $\mathcal{L}_{\text{cls}}$, and the overall objective is $\mathcal{L} = \mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{align}} + \beta \mathcal{L}_{\text{cons}}$, where the consistency regularization term is $\mathcal{L}_{\text{cons}} = \lVert g_{\text{txt}} - \phi(Z_{t \to i}) \rVert_2^2 + \lVert g_{\text{img}} - \psi(Z_{i \to t}) \rVert_2^2$, with $\phi, \psi$ as shared linear mappings.
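The multi-task objective can be sketched as below, following the loss definitions above; τ, λ, α, and β are hyperparameters whose values are not fixed here, and the shared mappings φ and ψ are assumed to have been applied to the pooled readouts before this function is called.

```python
# Sketch of the training objective: drift-biased contrastive alignment + cross-entropy + consistency.
import torch
import torch.nn.functional as F

def alignment_loss(u, v, d_sym, tau=0.07, lam=0.5):
    """u, v: (B, d) projected text/image embeddings; d_sym: (B,) sample-level drift scores."""
    s = u @ v.T                                                   # s_{jk} = u_j^T v_k
    pos = (torch.diag(s) + lam * d_sym) / tau                     # biased positive logits
    denom_t2i = torch.logsumexp(s / tau, dim=1)                   # log sum_k exp(s_jk / tau)
    denom_i2t = torch.logsumexp(s / tau, dim=0)                   # log sum_k exp(s_kj / tau)
    return -(pos - denom_t2i).mean() - (pos - denom_i2t).mean()

def total_loss(logits_cls, labels, u, v, d_sym, g_txt, z_t, g_img, z_i, alpha=1.0, beta=0.5):
    """L = L_cls + alpha * L_align + beta * L_cons; z_t, z_i are the mapped pooled readouts."""
    l_cls = F.cross_entropy(logits_cls, labels)
    l_cons = ((g_txt - z_t) ** 2).sum(-1).mean() + ((g_img - z_i) ** 2).sum(-1).mean()
    return l_cls + alpha * alignment_loss(u, v, d_sym) + beta * l_cons
```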
This design works in conjunction with the “symbol-sensitive lexicon and drift extraction” and “Symbol-Aware Text GNN” modules: $D_{\text{sym}}$ modulates the separability of similarity and attention, while $f_{\text{text}}$, having explicitly encoded the importance of drift nodes via graph convolution, directs crossmodal attention to image regions and text segments related to misleading symbols. In the contrastive term, adding $\lambda D_{\text{sym}}$ to the numerator ensures that when $D_{\text{sym}} > 0$, the softmax probability of the positive pair monotonically increases, which can be formally expressed as
$$\frac{\partial}{\partial D_{\text{sym}}} \log \frac{\exp\!\big((s_{jj} + \lambda D_{\text{sym}})/\tau\big)}{\sum_{k} \exp(s_{jk}/\tau)} = \frac{\lambda}{\tau} \cdot \frac{\exp\!\big((s_{jj} + \lambda D_{\text{sym}})/\tau\big)}{\sum_{k} \exp(s_{jk}/\tau)} > 0,$$
thus encouraging the model during training to learn an “attention proportional to drift” alignment strategy. With bidirectional isomorphic constraints in cross-attention and consistency regularization mapping local readouts back into the global representation space, the objective function is equivalent to maximizing the margin of matched pairs in the shared alignment space while minimizing local–global deviations, ultimately enabling stable conflict visualization and higher discriminative margins in “true image, false text” and “true text, false image” scenarios. The implementation details and parameter layout correspond to the “shared pre-projection—bidirectional cross-attention—fusion head” pipeline and paired alignment branch in the architecture diagram. The specific implementation of crisscross attention can be found in Appendix A (Algorithm A2).

4. Results and Discussion

4.1. Experimental Setup and Evaluation Metrics

In this study, a series of experiments was designed to validate the effectiveness and superiority of the proposed dual-stream Transformer recommendation framework. The following subsections detail the evaluation metrics and baseline models adopted for comparison.

4.1.1. Evaluation Metrics

Multiple evaluation metrics were employed to comprehensively assess the performance of the model, capturing both the accuracy and relevance of the recommendation results from different perspectives. The metrics used include the AUC (Area Under the ROC Curve), NDCG@10 (Normalized Discounted Cumulative Gain at 10), Precision@10, Recall@10, F1-score, and Hit Rate. These indicators collectively measure the classification and ranking performance of the model. Specifically, AUC evaluates the capability of distinguishing between positive and negative samples, with values closer to 1 indicating superior performance. NDCG@10 measures the quality of the top-10 recommendation results, considering both relevance and rank position. Precision@10 calculates the proportion of correctly predicted items within the top 10 recommendations. Recall@10 determines the proportion of relevant items correctly retrieved within the top 10. The F1-score, as the harmonic mean of Precision and Recall, assesses the balance between these two metrics. The Hit Rate measures whether the target item is included in the recommendation list, reflecting the success rate of the recommendation. The corresponding formulas are as follows:
$$\mathrm{AUC} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i},$$
$$\mathrm{NDCG@10} = \sum_{i=1}^{10} \frac{\mathrm{rel}_i}{\log_2(i+1)},$$
$$\mathrm{Precision@10} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall@10} = \frac{TP}{TP + FN},$$
$$\text{F1-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\text{Hit Rate} = \frac{\text{Number of Hits}}{\text{Total Number of Trials}}.$$
Here, $TP$ denotes true positives, $FP$ denotes false positives, $FN$ denotes false negatives, and $\mathrm{rel}_i$ denotes the relevance score of the $i$-th recommended item.
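For reference, two of the ranking metrics can be computed as in the sketch below; the NDCG implementation includes the normalization by the ideal DCG that the metric's name implies, which the displayed formula leaves implicit.

```python
# Sketch of NDCG@10 and Hit Rate as used above; rel is the graded relevance of the ranked items.
import math

def ndcg_at_10(rel):
    """DCG over the top 10 positions, divided by the ideal DCG of the same relevance list."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel[:10]))
    ideal = sum(r / math.log2(i + 2) for i, r in enumerate(sorted(rel, reverse=True)[:10]))
    return dcg / ideal if ideal > 0 else 0.0

def hit_rate(num_hits: int, num_trials: int) -> float:
    """Hit Rate = number of hits / total number of trials."""
    return num_hits / num_trials if num_trials else 0.0
```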

4.1.2. Baseline Models

To verify the effectiveness of the proposed model, comparisons were made with a range of baseline models representing different architectures and methodologies in deep learning for CTR prediction and recommendation ranking tasks, thereby enabling a comprehensive performance evaluation.
For CTR prediction models, the baselines included Wide&Deep [40], DeepFM [21], DIN [22], and Transformer4Rec [23]. Wide&Deep combines wide linear models with DNNs to capture both low-order and high-order feature interactions. DeepFM integrates FM with DNNs to jointly capture low- and high-order interactions. DIN applies attention mechanisms to model the dynamic evolution of user interests. Transformer4Rec leverages the Transformer architecture to model sequential dependencies in user behavior.
For multimodal recommendation models, the baselines included MMoE [41], CLIP4Rec [42], MMRec [43], and VideoBERT [30]. MMoE (Multi-gate Mixture-of-Experts) utilizes multiple gating units to dynamically allocate weights to different expert networks, addressing task correlation in multi-task learning. CLIP4Rec builds on the CLIP model, employing image–text contrastive learning to improve recommendation performance. MMRec fuses multiple modalities, such as text and images, to enhance recommendation accuracy. VideoBERT pretrains sequential models on video segments to capture both temporal and semantic aspects of video content.
For ranking models, the baselines included LambdaRank [44], RankNet [45], and LightGBM-Rank [46]. LambdaRank optimizes ranking-specific metrics such as NDCG. RankNet learns ranking functions by comparing the relative order of sample pairs. LightGBM-Rank employs gradient boosting decision trees to perform efficient ranking.
These baseline models span from traditional approaches such as LR and FM to deep CTR models based on DNNs, RNNs, and Transformers, as well as recent advances in multimodal recommendation and ranking methods. By comparing against these baselines, the performance advantages of the proposed model across different tasks can be thoroughly evaluated.

4.2. CTR Prediction Baselines vs. Proposed Model

This experiment was designed to validate the effectiveness of the proposed dual-stream Transformer recommendation framework in CTR prediction tasks and to comprehensively evaluate its advantages in accuracy and ranking performance by comparison with classic CTR prediction models of different structural types.
As shown in Table 2 and Figure 4 and Figure 5, the conventional Wide&Deep model achieved the lowest performance across all metrics, with an AUC of only 0.882, reflecting its limited ability to combine linear and nonlinear features effectively. DeepFM surpassed Wide&Deep by modeling both low- and high-order feature interactions, but its improvements remained modest due to a lack of sequential interest modeling. DIN introduced an attention mechanism to capture dynamic changes in user interests, yielding further gains, particularly in Precision and Recall. Transformer4Rec leveraged the Transformer architecture to model long-range user behavior dependencies, achieving higher performance than DIN and demonstrating advantages in sequential modeling. In contrast, the proposed model outperformed all baselines on every metric, with especially notable improvements in NDCG@10, Precision@10, and Recall@10, which directly reflect ranking quality and click prediction accuracy. These performance differences can be attributed to how each model handles feature interaction and sequential dependency. Wide&Deep relies on static linear and deep nonlinear features, making it less adaptive to evolving user interests. DeepFM introduces factorization for higher-order interactions but still struggles with temporal dynamics. DIN’s attention mechanism enables non-uniform weighting of historical behaviors, allowing relevant actions to contribute more, yet it remains limited to short-range dependencies. Transformer4Rec’s self-attention provides global dependency modeling and captures diverse interaction patterns but lacks explicit optimization for multimodal data or symbol drift. Importantly, misleading content tends to exhibit higher average drift scores than authentic content, and models without explicit drift-awareness are less effective in distinguishing such cases. Our proposed dual-stream Transformer explicitly fuses symbol drift features with multimodal semantics, enabling both symbol sensitivity and crossmodal alignment. This allows the model to adapt to drift-induced inconsistencies and improves robustness in detecting misleading content. Furthermore, multitask optimization enhances ranking consistency. Together, these structural enhancements enable simultaneous improvements in click prediction accuracy and ranking quality, resulting in consistently superior performance.

4.3. Multimodal and Ranking Baselines vs. Proposed Model

This experiment was conducted to assess the overall performance of the proposed model in multimodal recommendation and ranking tasks, comparing it with representative multimodal and ranking baseline models to verify its advantages in multimodal feature alignment, crossmodal semantic modeling, and high-precision ranking.
As shown in Table 3 and Figure 6, among multimodal models, MMoE exploited a multi-gate expert mechanism to handle multitask correlations, outperforming single-modal CTR models but still lagging behind stronger alignment-based methods in AUC and NDCG@10. CLIP4Rec leveraged image–text contrastive learning to enhance crossmodal consistency, achieving higher performance. MMRec integrated multiple modalities, obtaining the best results among multimodal baselines. VideoBERT captured both temporal and semantic video information, performing stably on ranking-related metrics. For ranking models, LambdaRank, RankNet, and LightGBM-Rank achieved competitive results on NDCG but were constrained in overall AUC and Recall due to limited deep feature interaction capabilities. In contrast, the proposed method achieved the highest values in all metrics, with substantial advantages in NDCG@10, Precision@10, and Recall@10, which measure ranking quality and recommendation accuracy, indicating its strong capabilities in multimodal alignment and ranking optimization.
From a theoretical perspective, the performance differences stem mainly from the depth of feature modeling, the presence or absence of crossmodal alignment mechanisms, and the design of ranking optimization strategies. MMoE mathematically allocates feature subspaces to different tasks via gating units but does not explicitly optimize intermodal alignment. CLIP4Rec and MMRec employ contrastive learning to minimize intermodal distances in the embedding space, outperforming other multimodal methods in modeling crossmodal semantic consistency, but remain limited in combining structured symbol information with local context. VideoBERT excels in temporal information modeling but has limited capacity for global crossmodal consistency in static features. Ranking models such as LambdaRank and RankNet directly optimize metrics like NDCG, mathematically equivalent to adjusting pairwise sample order in gradient space, but lack intermodal semantic alignment and explicit noise suppression. The proposed method explicitly integrates symbol drift modeling, graph-neural-network-based textual structural representation, and visual Transformer encoding, applying bidirectional cross-attention for fine-grained alignment, and it incorporates contrastive alignment loss with consistency regularization into the optimization objective. This theoretically maximizes both modal consistency and ranking metrics, yielding overall performance superior to all baseline models.

4.4. Comparison of Ranking-Based Baselines and the Proposed Framework

This experiment was designed to evaluate the performance advantages of the proposed method in typical ranking tasks by comparing it with multiple classic ranking models, thereby verifying its comprehensive capabilities in optimizing ranking metrics, improving recommendation accuracy, and capturing user preferences.
As shown in Table 4, traditional ranking models such as LambdaRank, RankNet, and LightGBM-Rank demonstrated relatively strong performance on ranking-quality-oriented metrics such as NDCG@10 and Precision@10, with LightGBM-Rank achieving the best results among them due to the nonlinear modeling capability of gradient boosting decision trees. However, these methods exhibited significantly lower performance than the proposed method in metrics such as AUC and Recall, indicating their limitations in capturing crossmodal feature interactions, suppressing noise interference, and modeling global semantic consistency. In contrast, the proposed approach achieved the highest scores across all metrics, with particularly pronounced advantages in Recall@10 and F1-score, which jointly reflect the trade-off between recall and precision, demonstrating its ability to balance the coverage and accuracy of recommendation results.
From a mathematical perspective, LambdaRank explicitly optimizes the gradient of NDCG, theoretically enabling direct improvement of ranking relevance, but its reliance on manually designed features limits its capability to capture complex semantic relations and multimodal feature interactions. RankNet models the pairwise order of samples probabilistically, optimizing by minimizing the ranking error rate, but it is more susceptible to noise in high-dimensional feature spaces. LightGBM-Rank constructs additive models by splitting nodes, offering better nonlinear feature mapping than the previous two methods, but it lacks the ability to explicitly align semantic distributions across modalities. The proposed method incorporates symbol-sensitive features, textual graph neural network modeling, and visual Transformer encoding, while employing bidirectional cross-attention and contrastive learning to align text and image representations in a shared semantic space, with consistency regularization to maintain coherence between local and global representations. Mathematically, this design is equivalent to simultaneously maximizing the ranking relevance function and crossmodal similarity, which accounts for its ability to outperform all ranking-based baselines in experimental results.

4.5. Ablation Study of the Proposed Framework

This experiment was conducted to verify the independent contributions of each key module within the proposed framework and to analyze their functional roles in overall performance. By progressively removing components such as symbol drift detection, the textual graph neural network, crossmodal cross-attention, and contrastive alignment loss, the changes in model performance under different configurations were evaluated.
As shown in Table 5, the full model achieved the highest scores across all evaluation metrics, confirming that the synergy among its modules significantly enhances recommendation performance. Removing the symbol drift module resulted in a noticeable decline in the AUC, NDCG@10, and F1-score results, indicating that explicitly incorporating symbol drift features plays a crucial role in detecting semantic misleading cues and improving crossmodal alignment precision. Replacing the text GNN with pooled BERT embeddings further weakened the modeling of long-range dependencies and symbol importance, leading to simultaneous drops in Recall and Precision. Eliminating crossmodal cross-attention reduced the extent of multimodal semantic interaction, diminishing the alignment between local and global features. Removing the contrastive alignment loss weakened the discriminative capacity of the shared semantic space, lowering both the ranking relevance and matching accuracy. Removing both the drift detection and contrastive loss led to the most substantial degradation, bringing the performance close to that of certain multimodal baselines, thereby verifying the central role of these two modules in global discrimination and feature separability.
From a mathematical standpoint, the symbol drift module constructs a sensitive term weight matrix that explicitly amplifies the influence factors of high-drift nodes during feature propagation, which is equivalent to applying node-specific weighting on the adjacency matrix of the graph convolution to enhance aggregation of critical semantic units. The text GNN builds a sparse dependency graph and incorporates residual connections, which help stabilize gradient propagation and avoid over-smoothing, ensuring that global representations retain local salience during multi-hop propagation. Crossmodal cross-attention constructs conditional probability distributions in the shared space via query–key–value mechanisms, enabling bidirectional information flow between text and image, which is mathematically equivalent to introducing conditional dependency terms into the attention weight matrix to improve semantic alignment precision. The contrastive alignment loss optimizes an objective function that essentially maximizes the mutual information of matched pairs while minimizing the distributional overlap of unmatched pairs, thereby enhancing intermodal separability from a probabilistic distribution perspective. Consequently, the superiority of the full model in experiments arises from the complementary synergy of these modules in graph structural modeling, attention allocation, and similarity measurement, leading to greater robustness and discriminative power in complex multimodal recommendation tasks.

4.6. Sensitivity Analysis of Key Hyperparameters

In addition to module-level ablation, we conducted a sensitivity analysis to examine the effect of two critical hyperparameters on model performance: the symbol attention coefficient $\beta$ in the Symbol-Aware Text GNN and the drift weight $\lambda$ in the contrastive loss. $\beta$ controls the amplification of drift intensity during graph propagation, while $\lambda$ balances the contribution of drift-sensitive pairs in the crossmodal contrastive alignment. Analyzing the sensitivity to these parameters provides insights into the stability of the proposed framework and offers practical guidance for parameter tuning in future applications. We varied $\beta \in \{0.0, 0.25, 0.5, 1.0, 2.0\}$ and $\lambda \in \{0.0, 0.25, 0.5, 1.0, 2.0\}$ independently while keeping all other settings fixed. The performance was evaluated on the main test set with three runs for each configuration, and we report the mean AUC and NDCG@10 values.
From Table 6, we observe that removing symbol attention ($\beta = 0$) led to a notable performance drop, confirming the importance of drift-aware weighting in the GNN. The performance improved steadily as $\beta$ increased, with the best results obtained at $\beta = 1.0$. However, excessively large values ($\beta = 2.0$) slightly reduced performance, suggesting that over-amplifying drift signals introduces noise and degrades ranking quality. Similarly, Table 7 shows that $\lambda$ plays a crucial role in contrastive alignment. When $\lambda = 0$, the model lacked drift-aware boosting and performed worse than configurations with moderate values. The best results were achieved at $\lambda = 0.5$, after which further increases led to diminishing or slightly negative returns, indicating potential overfitting to drift-heavy instances. Overall, these results demonstrate that the model is relatively robust to moderate variations in $\beta$ and $\lambda$, and that balanced settings (around $\beta = 1.0$ and $\lambda = 0.5$) yield the most stable and effective performance. This analysis provides practical guidance for parameter selection in future deployments of the proposed framework.

4.7. Case Study

To further illustrate the interpretability and practical value of the proposed framework, we conducted a case study using real-world health-related posts. Figure 7 presents an example where the text claims that “a homemade herbal drink can cure COVID-19 in three days,” which is accompanied by an image of traditional medicine ingredients. Our model correctly identified this post as misleading. The attention visualization shows that the words “cure” and “COVID-19” are assigned high drift scores in the Symbol-Aware Text GNN, reflecting their strong association with misinformation contexts. On the image side, the visual attention map highlights the herbal mixture, which conflicts with verified medical knowledge. The bidirectional cross-attention mechanism aligns these symbolic and visual cues, leading to a consistent misinformation prediction. In contrast, another example in Figure 7 demonstrates a challenging case. The post contains a vague statement such as “boost your immunity naturally”, which is paired with a generic fruit image. Here, the textual content lacks explicit misleading symbols, and the visual cues are weakly informative. The model showed lower confidence in its prediction, which aligns with the error analysis indicating that ambiguous or low-signal posts are particularly challenging. These examples demonstrate that the proposed framework not only achieves strong quantitative performance but also provides interpretable evidence for its predictions. By highlighting which words or image regions trigger drift or semantic misalignment, the model enhances transparency and offers a practical tool for human moderators in real-world deployment scenarios.

5. Discussion

5.1. Practical Application Analysis

In practical application scenarios, the proposed multimodal symbol-aware recommendation and detection framework demonstrates directly deployable value within health information dissemination platforms. Taking large-scale social media platforms as an example, users browsing health-related content are often simultaneously exposed to textual descriptions and accompanying images, such as beverage advertisements labeled as “sugar-free” or promotional images of dietary supplements marketed under the “natural” label. Such content may appear visually authentic and credible, yet the textual component may contain semantic drift or implicitly misleading elements. The symbol drift detection module developed in this study can accurately locate and assign weights to potentially misleading symbols in text before the user views the content. Even when these symbols appear in complex contexts, the model can identify anomalies through contextual feature analysis and comparison with authoritative semantics in the health domain. Additionally, the Symbol-Aware Text GNN module, when processing long-form notes, user comments, or multi-paragraph health advice, preserves syntactic dependency structures and the importance of symbol nodes, thereby maintaining stable identification of key information in the presence of complex narratives or cross-sentence semantic associations.
During the multimodal alignment and fusion stage, the framework effectively addresses scenarios such as “image-true/text-false” or “text-true/image-false.” For instance, in a multimodal post where the image genuinely depicts low-fat ingredients but the text exaggerates their weight-loss effects, or where subtitles in a video emphasize the safety of a food product while the footage reveals inconsistent manufacturing conditions, the system can identify semantic inconsistencies. By incorporating crossmodal cross-attention and drift-aware modulation, the system aligns high-drift features from symbol-related text with corresponding image regions, quickly detecting contradictions between semantic claims and visual evidence, thereby enabling pre-recommendation risk alerts. This approach is equally applicable in environments such as news aggregation platforms, medical science communities, and online shopping platforms—for example, detecting false advertising of health foods in e-commerce, filtering patient-shared exaggerations of therapeutic effects in medical forums, or identifying unscientific wellness advice in short video platforms—thus contributing substantively to safeguarding public health awareness and improving the quality of the information ecosystem.

5.2. Computational Complexity Analysis

In addition to detection performance, the computational efficiency of multimodal misinformation detection frameworks is critical for large-scale deployment. We therefore analyzed and compared the time complexity of the proposed dual-stream Transformer with several representative baselines. Traditional CTR prediction models such as logistic regression and factorization machines exhibit low computational cost, with complexity typically bounded by O(n·d), where n is the number of training samples and d is the feature dimension. Deep-learning-based baselines such as Wide&Deep and DeepFM extend this cost to O(n·d·k) due to additional embedding and interaction layers, where k denotes the embedding dimension. Sequence modeling architectures, including DIN and Transformer4Rec, incur higher cost by modeling temporal dependencies. For Transformer-based architectures, the self-attention mechanism has complexity O(L²·d), where L is the sequence length, so cost grows quadratically with input length. Multimodal baselines such as CLIP4Rec or MMRec further increase complexity through crossmodal feature fusion, typically requiring O(L_t²·d_t + L_v²·d_v) operations, where L_t and L_v denote the lengths of the textual and visual sequences, respectively. The proposed dual-stream Transformer introduces additional modules for symbol drift modeling and crossmodal alignment. The Symbol-Aware Text GNN performs neighborhood aggregation over textual dependency graphs with complexity of approximately O(E_t·d_t), where E_t is the number of edges in the text graph. The bidirectional cross-attention mechanism across modalities incurs O(L_t·L_v·d) complexity, which is linear in each modality but multiplicative across them. While this adds overhead compared to unimodal baselines, the overall complexity remains polynomially bounded and comparable to existing multimodal Transformer architectures.
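To make these orders of growth concrete, the short calculation below plugs in representative sizes (assumed purely for illustration: L_t = 128 text tokens, L_v = 49 image regions, d = d_t = d_v = 256, and a dependency graph with roughly 3·L_t edges) and compares the dominant multiply–accumulate counts per sample.

# Representative sizes (assumptions for illustration, not measured settings).
L_t, L_v, d = 128, 49, 256       # text tokens, image regions, hidden size
E_t = 3 * L_t                    # dependency-graph edges, roughly linear in tokens

text_self_attention  = L_t ** 2 * d        # O(L_t^2 d): unimodal self-attention
image_self_attention = L_v ** 2 * d        # O(L_v^2 d)
graph_aggregation    = E_t * d             # O(E_t d): Symbol-Aware Text GNN
cross_attention      = 2 * L_t * L_v * d   # O(L_t L_v d), both directions

for name, ops in [("text self-attention", text_self_attention),
                  ("image self-attention", image_self_attention),
                  ("text GNN aggregation", graph_aggregation),
                  ("bidirectional cross-attention", cross_attention)]:
    print(f"{name:32s} ~{ops / 1e6:.1f}M ops")

Under these assumed sizes, the bidirectional cross-attention (about 3.2M operations) costs roughly as much as the unimodal text self-attention (about 4.2M), while the graph aggregation (about 0.1M) is comparatively negligible, which is consistent with the claim that the added modules keep the overall cost in the same regime as existing multimodal Transformers.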
In practice, the training and inference time of our framework is higher than traditional CTR prediction models but remains tractable for real-world deployment with GPU acceleration. Moreover, techniques such as model pruning, knowledge distillation, and efficient attention mechanisms could be adopted in future work to further reduce inference latency and computational cost, facilitating integration into large-scale platform moderation pipelines.

5.3. Error Analysis

Despite the overall superior performance of the proposed framework, several categories of misleading content remain challenging for accurate detection. First, ambiguous claims framed in vague or generalized language (e.g., “boost your immunity naturally” or “detox your body with simple habits”) often lack explicit symbolic drift cues, resulting in lower model confidence. Second, visually benign or generic images, such as fruits or landscapes, when paired with misleading text, provide weak crossmodal signals, which may cause the alignment module to underperform. Third, emerging health terminology or culturally specific expressions that are not yet represented in the symbol lexicon can cause drifted usage to go undetected or be misclassified. In addition, adversarially crafted posts that intentionally avoid strong symbolic markers, for example through euphemisms or indirect references, pose further challenges by exploiting the model’s reliance on explicit symbols. Finally, subtle or context-dependent multimodal contradictions, such as an image of medical equipment paired with an exaggerated health claim, are also error-prone because resolving them requires nuanced domain knowledge. These observations suggest that future work should prioritize adaptive symbol expansion, integration of external medical knowledge bases, and adversarial robustness to better address these difficult cases.

5.4. Limitation and Future Work

Although significant performance gains have been achieved in multimodal health misinformation detection, certain limitations remain. First, the construction of the symbol-sensitive lexicon relies on a hybrid approach of manual curation and data-driven expansion. While this balances coverage and domain relevance, it may lag in updating emerging health terminology or cross-cultural expressions. Second, the Symbol-Aware Text GNN and crossmodal alignment modules require large quantities of labeled multimodal data, which are often scarce in low-resource or minority-language domains. Moreover, the current method primarily focuses on text–image scenarios and is less effective for dynamic modalities such as short videos and audio. The dataset is also limited in language coverage, whereas misinformation is a global phenomenon spreading across diverse languages and scripts, often involving code-switching and culturally specific expressions. In addition, fairness and bias remain concerns, as predictions may inadvertently reflect demographic disparities. From a deployment perspective, practical challenges persist, including the need for transparent auditing workflows and seamless integration with platform moderation pipelines.
Future work will address these issues along several directions. To reduce reliance on manually curated lexicons, retrieval-augmented generation (RAG) will be explored to dynamically query verified health knowledge bases, while self-supervised contrastive learning on large-scale unlabeled corpora will be used to capture semantic drifts. For modality expansion, the framework will be extended to integrate video and audio using temporal attention mechanisms, enabling the detection of inconsistencies between visual streams and accompanying narration. To improve interpretability, attention heatmaps and attribution methods will be employed to highlight specific words or image regions that drive predictions. For multilingual generalization, we plan to curate a multilingual dataset spanning both high-traffic and low-resource languages and extend the lexicon via cross-lingual embeddings and expert-guided alignment of culturally relevant terms (e.g., “sugar-free,” “detox,” “natural”). Finally, fairness and trustworthiness will be emphasized through systematic bias evaluation, user-centric explanations, and deployment pipelines that balance automated detection with human oversight.

6. Conclusions

In the current landscape of health information dissemination, the trend toward multimodal presentations of false and misleading content has become increasingly evident, posing a serious threat to public health cognition and decision making, thereby necessitating the development of detection methods with robust crossmodal recognition capabilities. To address this challenge, a dual-stream Transformer-based multimodal health misinformation detection framework has been proposed, which integrates symbol drift detection, a Symbol-Aware Text GNN, and a crossmodal alignment fusion module, enabling joint modeling of implicit misleading symbols, textual structural dependencies, and crossmodal semantic conflicts in the health domain. From an innovative perspective, a domain-specific health symbol-sensitive lexicon was constructed, and a contextual drift measurement was introduced, embedding symbol information as explicit features into the node update process of the graph neural network for the first time. Symbol-aware graph convolution and bidirectional cross-attention mechanisms were designed to assign higher attention weights to high-drift symbols during crossmodal alignment, while the integration of contrastive learning and consistency regularization was found to enhance the model’s discriminative capacity and robustness. The experimental results demonstrate that the proposed method consistently and significantly outperforms state-of-the-art baselines in CTR prediction, multimodal recommendation, and ranking tasks, achieving superior performance across all major evaluation metrics and reflecting a favorable balance between accuracy and recall. The ablation study further validates the critical contributions of symbol drift modeling, graph-structured encoding, and crossmodal alignment mechanisms to performance improvements. This research not only enriches the methodological landscape of multimodal information detection but also offers a deployable and interpretable technical pathway for the governance of health information dissemination.

Author Contributions

Conceptualization, J.W., Z.F., C.J., and Y.Z.; Data curation, M.L.; Funding acquisition, Y.Z.; Methodology, J.W., Z.F., and C.J.; Project administration, Y.Z.; Resources, M.L.; Software, J.W., Z.F., and C.J.; Supervision, Y.Z.; Visualization, M.L.; Writing—original draft, J.W., Z.F., C.J., M.L., and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Algorithm A1: Symbol-Aware Text GNN with hyperparameters
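Because Algorithm A1 is reproduced only as an image in the published layout, the PyTorch sketch below is an illustrative reconstruction rather than the authors’ exact procedure. It assumes a normalized adjacency matrix over the dependency graph, per-token drift intensities in [0, 1], and a gated update in the spirit of Figure 2, with the symbol attention coefficient β scaling drift-weighted messages.

import torch
import torch.nn as nn

class SymbolAwareTextGNNLayer(nn.Module):
    """Illustrative drift-gated graph layer (a sketch, not the published code)."""

    def __init__(self, dim: int, beta: float = 1.0):
        super().__init__()
        self.beta = beta
        self.msg = nn.Linear(dim, dim)       # message transform for neighbor tokens
        self.gate = nn.Linear(2 * dim, dim)  # fuses original and aggregated views

    def forward(self, x: torch.Tensor, adj: torch.Tensor,
                drift: torch.Tensor) -> torch.Tensor:
        # Amplify messages originating from high-drift tokens: weight = 1 + beta * drift.
        weights = 1.0 + self.beta * drift.unsqueeze(-1)   # [N, 1]
        messages = adj @ (weights * self.msg(x))          # neighborhood aggregation
        gate = torch.sigmoid(self.gate(torch.cat([x, messages], dim=-1)))
        return gate * messages + (1.0 - gate) * x         # gated residual update

# Shape-only usage example with random tensors (no real data).
num_tokens, dim = 6, 32
layer = SymbolAwareTextGNNLayer(dim, beta=1.0)
out = layer(torch.randn(num_tokens, dim), torch.eye(num_tokens), torch.rand(num_tokens))
print(out.shape)  # torch.Size([6, 32])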
Algorithm A2: Drift-aware bidirectional cross-attention and contrastive alignment
Input: Text token matrix T ∈ ℝ^(L_t × d); image region matrix R ∈ ℝ^(L_v × d); global vectors g_text, g_img ∈ ℝ^d; drift score D_sym ≥ 0; projection heads P_q, P_k, P_v; fusion MLP; temperature τ; drift-aware attention bias γ; contrastive drift weight λ
Output: Fused vector h_fuse; losses L_cls, L_align, L_cons
1. Shared projections:
   Q_t ← P_q(T), K_v ← P_k(R), V_v ← P_v(R);  Q_v ← P_q(R), K_t ← P_k(T), V_t ← P_v(T)
2. Drift-aware cross-attention (text → image):
   A_t→v ← softmax(Q_t K_vᵀ / √d + γ·D_sym·1);  Z_t→v ← A_t→v V_v
3. Drift-aware cross-attention (image → text):
   A_v→t ← softmax(Q_v K_tᵀ / √d + γ·D_sym·1);  Z_v→t ← A_v→t V_t
4. Pooling and fusion:
   z_1 ← mean(Z_t→v);  z_2 ← mean(Z_v→t);  h_fuse ← MLP([g_text ∥ g_img ∥ z_1 ∥ z_2 ∥ D_sym])
5. Losses:
   L_cls ← CE(classifier(h_fuse), y)
   if in-batch contrastive then u ← W_u g_text; v ← W_v g_img;
     L_align ← −log( exp((uᵀv + λ·D_sym)/τ) / Σ_v′ exp(uᵀv′/τ) ) − log( exp((uᵀv + λ·D_sym)/τ) / Σ_u′ exp(u′ᵀv/τ) )
   L_cons ← ‖g_text − W z_1‖₂² + ‖g_img − W z_2‖₂²
6. return h_fuse, L_cls, L_align, L_cons
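For readers who prefer an executable form, the following PyTorch sketch mirrors Algorithm A2 under simplifying assumptions (single-head attention, one scalar drift score per sample, shared projection heads, and the drift boost also entering the contrastive denominator). It is a minimal illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DriftAwareCrossAttentionFusion(nn.Module):
    """Sketch of Algorithm A2: drift-aware bidirectional cross-attention with
    classification, contrastive-alignment, and consistency losses."""

    def __init__(self, d: int, num_classes: int = 3,
                 gamma: float = 1.0, lam: float = 1.0, tau: float = 0.07):
        super().__init__()
        self.gamma, self.lam, self.tau = gamma, lam, tau
        self.P_q, self.P_k, self.P_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.W_u, self.W_v, self.W_z = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.fuse = nn.Sequential(nn.Linear(4 * d + 1, d), nn.ReLU(),
                                  nn.Linear(d, num_classes))

    def attend(self, q, k, v, drift):
        # Scaled dot-product attention with an additive drift bias gamma * D_sym.
        logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        logits = logits + self.gamma * drift.view(-1, 1, 1)
        return torch.softmax(logits, dim=-1) @ v

    def forward(self, T, R, g_text, g_img, drift, labels):
        # Drift-aware cross-attention in both directions.
        Z_tv = self.attend(self.P_q(T), self.P_k(R), self.P_v(R), drift)  # text -> image
        Z_vt = self.attend(self.P_q(R), self.P_k(T), self.P_v(T), drift)  # image -> text
        z1, z2 = Z_tv.mean(dim=1), Z_vt.mean(dim=1)

        # Fusion and classification loss.
        h_fuse = torch.cat([g_text, g_img, z1, z2, drift.unsqueeze(-1)], dim=-1)
        logits = self.fuse(h_fuse)
        loss_cls = F.cross_entropy(logits, labels)

        # In-batch contrastive alignment with drift-boosted positive pairs
        # (simplification: the boost also appears in the softmax denominator).
        u = F.normalize(self.W_u(g_text), dim=-1)
        v = F.normalize(self.W_v(g_img), dim=-1)
        sim = u @ v.t() / self.tau
        boost = torch.diag(self.lam * drift / self.tau)
        targets = torch.arange(u.size(0), device=u.device)
        loss_align = (F.cross_entropy(sim + boost, targets) +
                      F.cross_entropy((sim + boost).t(), targets))

        # Consistency between global vectors and pooled cross-attended views.
        loss_cons = ((g_text - self.W_z(z1)).pow(2).sum(-1) +
                     (g_img - self.W_z(z2)).pow(2).sum(-1)).mean()
        return h_fuse, loss_cls, loss_align, loss_cons

# Shape-only usage example.
B, L_t, L_v, d = 4, 16, 9, 32
model = DriftAwareCrossAttentionFusion(d)
outputs = model(torch.randn(B, L_t, d), torch.randn(B, L_v, d),
                torch.randn(B, d), torch.randn(B, d),
                torch.rand(B), torch.randint(0, 3, (B,)))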

References

  1. Chen, A.; Wei, Y.; Le, H.; Zhang, Y. Learning by teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education. Br. J. Educ. Technol. 2024; early view. [Google Scholar]
  2. Chen, J.; Sun, B.; Li, H.; Lu, H.; Hua, X.S. Deep ctr prediction in display advertising. In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 811–820. [Google Scholar]
  3. Zhao, W.; Zhang, J.; Xie, D.; Qian, Y.; Jia, R.; Li, P. AIBox: CTR prediction model training on a single node. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 319–328. [Google Scholar]
  4. Wang, Q.; Liu, F.; Zhao, X.; Tan, Q. A CTR prediction model based on session interest. PLoS ONE 2022, 17, e0273048. [Google Scholar] [CrossRef]
  5. Yin, H.; Cui, B.; Chen, L.; Hu, Z.; Zhou, X. Dynamic user modeling in social media systems. ACM Trans. Inf. Syst. TOIS 2015, 33, 1–44. [Google Scholar] [CrossRef]
  6. Zhou, C.; Bai, J.; Song, J.; Liu, X.; Zhao, Z.; Chen, X.; Gao, J. Atrank: An attention-based user behavior modeling framework for recommendation. In Proceedings of the AAAI conference on artificial intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  7. Lan, L.; Geng, Y. Accurate and interpretable factorization machines. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4139–4146. [Google Scholar]
  8. Wu, S.; Li, Z.; Su, Y.; Cui, Z.; Zhang, X.; Wang, L. GraphFM: Graph factorization machines for feature interaction modeling. arXiv 2021, arXiv:2105.11866. [Google Scholar] [CrossRef]
  9. Yi, Y.; Zhou, Y.; Wang, T.; Zhou, J. Advances in Video Emotion Recognition: Challenges and Trends. Sensors 2025, 25, 3615. [Google Scholar] [CrossRef] [PubMed]
  10. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Multimodal fusion for audio-image and video action recognition. Neural Comput. Appl. 2024, 36, 5499–5513. [Google Scholar] [CrossRef]
  11. Kraprayoon, J.; Pham, A.; Tsai, T.J. Improving the robustness of DTW to global time warping conditions in audio synchronization. Appl. Sci. 2024, 14, 1459. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Wu, J.; Sun, L.; Yang, G. Contrastive Learning-Based Cross-Modal Fusion for Product Form Imagery Recognition: A Case Study on New Energy Vehicle Front-End Design. Sustainability 2025, 17, 4432. [Google Scholar] [CrossRef]
  13. Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; pp. 1–8. [Google Scholar]
  14. Bai, J.; Geng, X.; Deng, J.; Xia, Z.; Jiang, H.; Yan, G.; Liang, J. A comprehensive survey on advertising click-through rate prediction algorithm. Knowl. Eng. Rev. 2025, 40, e3. [Google Scholar] [CrossRef]
  15. Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911. [Google Scholar] [CrossRef]
  16. Zhang, W.; Han, Y.; Yi, B.; Zhang, Z. Click-through rate prediction model integrating user interest and multi-head attention mechanism. J. Big Data 2023, 10, 11. [Google Scholar] [CrossRef]
  17. He, L.; Chen, H.; Wang, D.; Jameel, S.; Yu, P.; Xu, G. Click-through rate prediction with multi-modal hypergraphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, QLD, Australia, 1–5 November 2021; pp. 690–699. [Google Scholar]
  18. Deng, J.; Shen, D.; Wang, S.; Wu, X.; Yang, F.; Zhou, G.; Meng, G. ContentCTR: Frame-level live streaming click-through rate prediction with multimodal transformer. arXiv 2023, arXiv:2306.14392. [Google Scholar]
  19. Deng, K.; Woodland, P.C. Multi-head Temporal Latent Attention. arXiv 2025, arXiv:2505.13544. [Google Scholar] [CrossRef]
  20. Blondel, M.; Fujino, A.; Ueda, N.; Ishihata, M. Higher-order factorization machines. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  21. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247. [Google Scholar]
  22. Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1059–1068. [Google Scholar]
  23. de Souza Pereira Moreira, G.; Rabhi, S.; Lee, J.M.; Ak, R.; Oldridge, E. Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 143–153. [Google Scholar]
  24. Yang, Y.; Zhang, L.; Liu, J. Temporal user interest modeling for online advertising using Bi-LSTM network improved by an updated version of Parrot Optimizer. Sci. Rep. 2025, 15, 18858. [Google Scholar] [CrossRef] [PubMed]
  25. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445. [Google Scholar]
  26. Li, T.; Yang, X.; Ke, Y.; Wang, B.; Liu, Y.; Xu, J. Alleviating the inconsistency of multimodal data in cross-modal retrieval. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 4643–4656. [Google Scholar]
  27. Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  28. Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J. Vision-language pre-training: Basics, recent advances, and future trends. Comput. Graph. Vis. 2022, 14, 163–352. [Google Scholar]
  29. Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; Kiela, D. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15638–15650. [Google Scholar]
  30. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473. [Google Scholar]
  31. Li, M.; Xu, R.; Wang, S.; Zhou, L.; Lin, X.; Zhu, C.; Zeng, M.; Ji, H.; Chang, S.F. Clip-event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16420–16429. [Google Scholar]
  32. Woo, J.; Chen, H. Epidemic model for information diffusion in web forums: Experiments in marketing exchange and political dialog. SpringerPlus 2016, 5, 66. [Google Scholar] [CrossRef]
  33. Xu, Z.; Qian, M. Predicting popularity of viral content in social media through a temporal-spatial cascade convolutional learning framework. Mathematics 2023, 11, 3059. [Google Scholar] [CrossRef]
  34. Nguyen, P.T.; Huynh, V.D.B.; Vo, K.D.; Phan, P.T.; Le, D.N. Deep Learning based Optimal Multimodal Fusion Framework for Intrusion Detection Systems for Healthcare Data. Comput. Mater. Contin. 2021, 66, 2555–2571. [Google Scholar] [CrossRef]
  35. Song, Y.; Elkahky, A.M.; He, X. Multi-rate deep learning for temporal recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 909–912. [Google Scholar]
  36. Dean, S.; Dong, E.; Jagadeesan, M.; Leqi, L. Recommender systems as dynamical systems: Interactions with viewers and creators. In Proceedings of the Workshop on Recommendation Ecosystems: Modeling, Optimization and Incentive Design, Vancouver, BC, Canada, 26–27 February 2024. [Google Scholar]
  37. Sangiorgio, E.; Di Marco, N.; Etta, G.; Cinelli, M.; Cerqueti, R.; Quattrociocchi, W. Evaluating the effect of viral posts on social media engagement. Sci. Rep. 2025, 15, 639. [Google Scholar] [CrossRef] [PubMed]
  38. Munusamy, H.; C, C.S. Multimodal attention-based transformer for video captioning. Appl. Intell. 2023, 53, 23349–23368. [Google Scholar] [CrossRef]
  39. Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 5232–5270. [Google Scholar]
  40. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10. [Google Scholar]
  41. Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1930–1939. [Google Scholar]
  42. Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 2022, 508, 293–304. [Google Scholar] [CrossRef]
  43. Zhou, X. Mmrec: Simplifying multimodal recommendation. In Proceedings of the 5th ACM International Conference on Multimedia in Asia Workshops, Tainan, Taiwan, 6–8 December 2023; pp. 1–2. [Google Scholar]
  44. Burges, C.J. From ranknet to lambdarank to lambdamart: An overview. Learning 2010, 11, 81. [Google Scholar]
  45. Lyu, J.; Ling, S.H.; Banerjee, S.; Zheng, J.; Lai, K.L.; Yang, D.; Zheng, Y.P.; Bi, X.; Su, S.; Chamoli, U. Ultrasound volume projection image quality selection by ranking from convolutional RankNet. Comput. Med Imaging Graph. 2021, 89, 101847. [Google Scholar] [CrossRef] [PubMed]
  46. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Figure 1. Overall flowchart of the symbol-sensitive lexicon and drift extraction framework. The pipeline consists of three major stages. Ingestion: Raw transcripts are chunked and embedded and then used to construct a symbol-sensitive lexicon in the form of a knowledge graph, where nodes represent transcript chunks, and edges encode semantic relations. Retrieval: Given a user query, query processing generates embeddings which are scored by an enhanced GNN module with a GAT-based backbone and query-guided pooling. A scoring head produces similarity scores, and relevant subgraphs are ranked accordingly. Generation: The top-K ranked subgraphs are retrieved as context, integrated by the generator, and used to produce the final response. The knowledge base constructed during ingestion supports retrieval, while drift signals are explicitly modeled and exploited during scoring to highlight symbol-level discrepancies. This modular design illustrates how symbolic knowledge, graph modeling, and retrieval-augmented generation interact to detect and interpret drifts in health-related information.
Figure 2. Structural diagram of the Symbol-Aware Text GNN module. The text encoder first produces initial token embeddings, including both the original representation x v o r i g and its symbol-drift–aware variant x v v e r . These embeddings are fused through a gating mechanism parameterized by learnable weights ( W u , U i , W f ) to generate symbol-aware node features ( f v , l v ) . The resulting features are integrated with the graph structure, where semantic relations form the edges between nodes. The graph encoder then propagates contextual information to obtain symbol-enhanced node embeddings h v and aggregated outputs O v . Finally, a discriminator distinguishes positive and negative pairs in the contrastive setting, reinforcing alignment between drift-sensitive and context-consistent representations. This design enables the model to explicitly incorporate symbol drift signals into graph-based textual dependency modeling.
Figure 3. Schematic of the text–image alignment multimodal fusion module. Visual inputs are first encoded by an image encoder f θ and mapped through shared pre-projectors into a common embedding space, while textual inputs are processed by a text encoder f ϕ and projected similarly. Both modalities undergo contrastive learning with intramodal objectives ( L intra ) to preserve modality-specific structure and intermodal objectives ( L inter ) to encourage crossmodal alignment. Momentum-updated encoders (EMA) and target networks are used to stabilize training, with query and key representations (q, g), enabling bidirectional matching between text and image. This design ensures that semantically consistent text–image pairs are aligned, while mismatched or misleading combinations are penalized, thereby enhancing crossmodal robustness for misinformation detection.
Figure 4. Bar chart comparing CTR prediction baselines and proposed dual-stream Transformer across evaluation metrics.
Figure 5. Violin plot comparing CTR prediction baselines and proposed dual-stream Transformer across hit rate evaluation metrics.
Figure 6. Line chart comparing multimodal and ranking baselines with the proposed framework across evaluation metrics.
Figure 7. Case study examples with visual attention heatmaps highlighting critical regions. (Left) A misleading post claiming that a homemade herbal drink can cure COVID-19 has been correctly classified by the model; the attention map emphasizes the herbal mixture, indicating strong symbolic drift. (Right) A vague post suggesting that eating fruits can “naturally boost immunity” represents a challenging case; the visual cues are less informative, and the model exhibits lower prediction confidence. These examples illustrate how the proposed framework identifies crossmodal inconsistencies and provides interpretable evidence for its predictions.
Table 1. Detailed composition of the multimodal fake health information dataset.
Platform | Period | Modality | Authentic | Misleading | False | Total
Weibo | January 2023–December 2024 | Text + Image | 2350 | 2120 | 1980 | 6450
Xiaohongshu | January 2023–December 2024 | Text + Image | 2180 | 2040 | 1950 | 6170
Total | January 2023–December 2024 | Text + Image | 4530 | 4160 | 3930 | 12,620
Table 2. CTR prediction baselines compared with the proposed dual-stream Transformer framework. Best results per column are marked with †.
Model | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate | Inference Latency (ms) | Memory Usage (GB)
Wide&Deep [40] | 0.882 | 0.468 | 0.415 | 0.392 | 0.403 | 0.611 | 12 | 1.2
DeepFM [21] | 0.894 | 0.482 | 0.429 | 0.406 | 0.417 | 0.628 | 18 | 1.5
DIN [22] | 0.903 | 0.501 | 0.442 | 0.418 | 0.430 | 0.644 | 27 | 2.1
Transformer4Rec [23] | 0.914 | 0.523 | 0.459 | 0.434 | 0.446 | 0.667 | 35 | 2.8
Proposed | 0.931 † | 0.558 † | 0.486 † | 0.461 † | 0.473 † | 0.702 † | 42 | 3.4
Table 3. Multimodal and ranking baselines compared with the proposed framework under identical training protocols. Best results per column are marked with †.
Model | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate | Inference Latency (ms) | Memory Usage (GB)
MMoE [41] | 0.902 | 0.497 | 0.438 | 0.414 | 0.426 | 0.639 | 25 | 2.0
CLIP4Rec [42] | 0.909 | 0.515 | 0.452 | 0.427 | 0.439 | 0.654 | 33 | 2.6
MMRec [43] | 0.913 | 0.521 | 0.457 | 0.432 | 0.444 | 0.662 | 36 | 2.9
VideoBERT [30] | 0.907 | 0.509 | 0.447 | 0.422 | 0.434 | 0.650 | 38 | 3.0
LambdaRank [44] | 0.898 | 0.490 | 0.433 | 0.408 | 0.420 | 0.634 | 8 | 0.9
RankNet [45] | 0.891 | 0.481 | 0.426 | 0.401 | 0.413 | 0.627 | 10 | 1.0
LightGBM-Rank [46] | 0.905 | 0.505 | 0.444 | 0.420 | 0.432 | 0.646 | 15 | 1.2
Proposed | 0.931 † | 0.558 † | 0.486 † | 0.461 † | 0.473 † | 0.702 † | 42 | 3.4
Table 4. Comparison of ranking-based baselines and the proposed framework. Best results per column are marked with †.
Model | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate | Inference Latency (ms) | Memory Usage (GB)
LambdaRank [44] | 0.898 | 0.490 | 0.433 | 0.408 | 0.420 | 0.634 | 8 | 0.9
RankNet [45] | 0.891 | 0.481 | 0.426 | 0.401 | 0.413 | 0.627 | 12 | 1.1
LightGBM-Rank [46] | 0.905 | 0.505 | 0.444 | 0.420 | 0.432 | 0.646 | 15 | 1.3
Proposed | 0.931 † | 0.558 † | 0.486 † | 0.461 † | 0.473 † | 0.702 † | 42 | 3.4
Table 5. Ablation study of the proposed framework. “–” denotes the removed component.
Configuration | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate
Full model (Symbol drift + Text GNN + Cross-attn + Contrastive) | 0.931 | 0.558 | 0.486 | 0.461 | 0.473 | 0.702
– Symbol drift module | 0.919 | 0.536 | 0.468 | 0.444 | 0.456 | 0.683
– Text GNN (use pooled BERT text only) | 0.915 | 0.529 | 0.462 | 0.438 | 0.450 | 0.676
– Cross-attention (late concat only) | 0.917 | 0.532 | 0.465 | 0.440 | 0.452 | 0.678
– Contrastive alignment loss | 0.920 | 0.540 | 0.470 | 0.446 | 0.458 | 0.687
– Both drift and contrastive | 0.908 | 0.517 | 0.452 | 0.428 | 0.440 | 0.662
Table 6. Sensitivity analysis of the symbol attention coefficient β in the Text GNN.
β | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate
0.00 | 0.902 | 0.495 | 0.437 | 0.412 | 0.424 | 0.638
0.25 | 0.915 | 0.520 | 0.454 | 0.429 | 0.442 | 0.661
0.50 | 0.923 | 0.539 | 0.471 | 0.445 | 0.458 | 0.683
1.00 | 0.931 † | 0.558 † | 0.486 † | 0.461 † | 0.473 † | 0.702 †
2.00 | 0.918 | 0.528 | 0.462 | 0.436 | 0.449 | 0.672
Note: † indicates the best performance across different β values.
Table 7. Sensitivity analysis of the drift weight λ in the contrastive loss.
λ | AUC | NDCG@10 | Precision@10 | Recall@10 | F1-Score | Hit Rate
0.00 | 0.901 | 0.492 | 0.434 | 0.410 | 0.422 | 0.636
0.25 | 0.914 | 0.518 | 0.452 | 0.428 | 0.440 | 0.659
0.50 | 0.922 | 0.537 | 0.468 | 0.443 | 0.456 | 0.681
1.00 | 0.931 † | 0.558 † | 0.486 † | 0.461 † | 0.473 † | 0.702 †
2.00 | 0.916 | 0.526 | 0.459 | 0.435 | 0.447 | 0.670
Note: † indicates the best performance across different λ values.