Article

Fake News Detection in Short Videos by Integrating Semantic Credibility and Multi-Granularity Contrastive Learning

School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12621; https://doi.org/10.3390/app152312621
Submission received: 10 November 2025 / Revised: 26 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Short videos have become a primary medium for news delivery, but their low cost, rapid diffusion, and multimodal nature make misinformation easier to generate and harder to verify. Existing methods often rely on single-modality cues or shallow cross-modal correlations, making it difficult to distinguish manipulations from benign edits and limiting interpretability. We propose a robust and interpretable framework for fake news detection in short videos. It combines LLM-based video understanding and online search for multi-dimensional credibility assessment, employs RoBERTa and capsule networks for semantic aggregation, and leverages a diffusion model with multi-granularity contrastive learning to enforce cross-modal consistency. A neuro-symbolic rule engine further calibrates predictions with logical constraints to provide traceable rationales. Experiments on the FakeSV dataset demonstrate an accuracy of 89.11% and an F1 score of 89.53%, significantly outperforming mainstream baseline models. This performance surpasses the current state-of-the-art OpEvFake model, which recorded an accuracy of 87.80% and an F1 score of 87.71%, and also substantially outperforms the representative short-video detection method SV-FEND, which achieved an accuracy of 81.69% and an F1 score of 81.78%. The framework shows robustness against emotional manipulation, title–content inconsistency, audio–video desynchronization, and local tampering, while offering explanatory evidence through rule triggers and modality contributions.

1. Introduction

Short videos have become a dominant medium for information dissemination. According to the 2025 TikTok application report [1], by the end of 2024 TikTok had reached 1.6 billion active users, with approximately 100 million new users added during that year. Short videos, marked by brevity, fragmented content, high interactivity, and algorithm-driven recommendations, significantly improve the efficiency of information transmission. At the same time, they also accelerate the spread of misinformation. In this context, fake news, which refers to information that appears in the form of news but does not reflect factual truth and is designed to mislead the public or manipulate opinion for specific political or economic purposes, has become increasingly prominent [2]. On widely accessible short-video platforms, such misleading content can quickly proliferate through algorithmic recommendation systems and repeatedly reinforce incorrect perceptions among viewers as a result of information overload and cognitive bias. The harmful impact of such misinformation is especially evident in the public health field. During the COVID-19 pandemic, for example, various so-called miracle cures were widely circulated, prompting the public to follow them blindly and engage in panic buying. This not only intensified social anxiety but also resulted in the waste of resources and hindered the implementation of public policies [3]. Therefore, the ability to effectively identify false or misleading information in short video news is essential for maintaining a healthy information ecosystem and ensuring social stability.
In traditional text-based news, fake news often spreads through sensationalized headlines, deceptive excerpts, fabricated sources, or emotional language. Common detection strategies include semantic consistency checks, fact-checking, and propagation network analysis, relying on structured textual evidence and pragmatics. However, short-video news typically involves multiple modalities, such as text appearing in titles, subtitles, and comments, as well as audio, visual frames, and user information. Semantic granularity, temporal density, and noise distribution vary significantly between these modalities. Although this multimodal nature provides more evidence for fact checking, it is also exploited by fraudsters to fabricate a “real” narrative through cross-modal manipulation.
The multimodal signals of short video news exhibit inherent asymmetries in terms of semantic levels and time scales. Textual elements often present highly summarized assertions or evaluations that convey conclusive information, whereas visuals and audio typically provide instantaneous perceptual facts and dynamic cues. Fraudsters exploit these modal differences by creating cross-modal mismatches or coordinated manipulations to mislead viewers. Therefore, methods relying on a single modality are insufficient to fully reveal such hidden falsifications [4,5].
Effective multimodal fake news detection requires addressing two core tasks: verifying the global semantic consistency between modalities and detecting local temporal anomalies. The first task evaluates whether textual information semantically aligns with both visual and auditory content, while the second task targets short-term contextual changes that occur as a result of editing, splicing, or dubbing replacement [6,7,8]. Local temporal features, such as motion continuity between frames and the synchronization of lip movements with audio, along with frame-level visual forensics, including local textures and synthesis artifacts, often provide complementary clues. Together, these features form the basic detection capability against forgeries in short videos [9,10]. Building on these insights, researchers have proposed new detection methods but have also encountered emerging challenges.
C1: Limitations of simple multimodal fusion. Many methods extract single-modality features and concatenate them into a joint representation. However, this approach struggles to capture intermodal relationships. For example, deceptive fake news often aligns audio tone with visual scenes to evoke certain emotions (fear, anger, etc.), whereas official news reports tend to narrate calmly and objectively. Simple feature concatenation may overlook such cues.
C2: Challenges in cross-modal enhancement. Another approach is to extract features independently for each modality and use attention mechanisms to allow the modalities to reinforce one another. However, modalities are often misaligned in semantic space, and traditional methods may miss deep signals such as implicit viewpoints, multi-step causal reasoning, or selectively emphasized information. Shallow models struggle to capture these signals, and differences in representation and temporal scale, along with local edits or misalignments in manipulated content, further complicate detection. Cross-modal feature alignment and fine-grained inconsistency detection are therefore essential, yet existing methods mainly focus on image–text correspondence and cannot address this multi-scale problem.
C3: Lack of external verification. Existing paradigms often rely solely on video content, without cross-verifying external information. Relying only on internal clues is prone to model “hallucinations” and biases in data sources, while the reliability of information sources (e.g., differences between authoritative media and individual accounts) is an important clue for truthfulness. However, current methods have not systematically incorporated online search evidence and fact-checking into the detection process.
C4: Heterogeneous signal contributions. Different modality features and external evidence can contribute variably to the final decision depending on context, topic, and scenario. Fixed or uniform fusion strategies overlook these dynamic differences, risking the amplification of noise or the suppression of reliable signals, thereby reducing detection performance.
To address these challenges, we first design a multi-dimensional credibility assessment that extracts implicit viewpoints and incorporates fact-checking, profiling potential risks across nine dimensions such as expertise, title-content consistency, emotional manipulation, editing artifacts, AI-generated traces, and source reliability. This directly tackles the first challenge, which concerns uncovering hidden manipulation tactics. Second, to solve the cross-modal alignment problem and enhance semantic representation, we propose a multi-granularity contrastive learning framework to handle the second challenge. This framework uses a unified InfoNCE loss to pull together positive pairs and push apart negative pairs across multiple abstraction levels, achieving deep alignment and semantic enhancement of cross-modal features. Specifically, it performs contrastive learning at four levels: at the global level, it distinguishes entire true and fake news; at the modal level, it captures consistency across heterogeneous modalities; at the temporal level, it aligns audio and video sequences; and at the spatial level, it localizes forged regions. We also integrate online search for fact-checking as evidence to systematically incorporate the timeliness and objectivity of external sources, thereby addressing the third challenge, which involves the need for real-time and objective verification. Finally, to improve explainability and adapt to different scenarios, we develop a neural-symbolic rule engine to address the fourth challenge. This module uses a set of semantic-driven soft matching and dynamic weighting rules to adjust the influence of various signals. Based on the multi-dimensional credibility assessment, our rule system defines a rule library covering dimensions of scientific validity, factuality, logical soundness, and common sense. 
Each rule has an initial weight, and through soft cosine similarity matching in a shared semantic space, we dynamically adjust their weights to regulate the model’s predictions.
In summary, we propose a short-video fake news detection framework that integrates multi-dimensional credibility assessment and neural-symbolic reasoning. It achieves semantic enhancement within modalities, cross-modal viewpoint interaction, and multi-granularity contrastive learning, while incorporating external fact evidence to bolster decision reliability. We conduct comprehensive evaluations on the FakeSV test set. Under a controlled experimental setup, our framework achieves an accuracy of 89.11% while maintaining manageable computational complexity. The main contributions of this work are:
  • We propose a multi-dimensional credibility assessment framework for detecting hidden manipulation and supporting fact-checking. Our approach incorporates both intrinsic multimodal content such as text, audio, video, user profile information, and comment threads, as well as external fact-checking signals, which together enable the identification of subtle manipulation tactics. The framework evaluates nine specific dimensions: the level of expertise reflected in the content, consistency with physical laws and commonsense knowledge, the likelihood of being generated by AI, the presence of editing artifacts, the degree of alignment between title and content, the extent of emotional bias, the use of misleading cues, the reliability of the source, and the intention underlying information propagation. This fine-grained analysis provides a precise quantification of news credibility. In comparison with traditional single-modality or shallow-feature methods, the proposed framework more effectively detects misleading short videos that appear credible on the surface, thereby enhancing both detection accuracy and robustness.
  • We propose a multi-granularity contrastive learning mechanism, conducting feature comparisons at global, modal, temporal, and spatial levels. This enhances cross-modal consistency and discrimination, making the model more sensitive to cross-modal contradictions and implicit manipulations, and thereby improving its generalization to complex short-video scenarios.
  • We present an explainable detection framework for fake news in short videos. The neural-symbolic rule engine provides logical explanations for model decisions and outputs each video score on the credibility assessment dimensions along with the rule matching status. By combining this engine with the multimodal feature fusion module, we achieve traceable decision-making, enhancing system transparency and trustworthiness.
To facilitate readers’ understanding of the overall structure, the remainder of this paper is organized as follows: Section 2 reviews and analyzes relevant single-modal and multimodal fake news detection studies and datasets, highlighting shortcomings in existing methods regarding cross-modal deep semantic alignment and external fact validation; Section 3 provides foundational knowledge and introduces the core concepts of multidimensional credibility assessment and neuro-symbolic reasoning; Section 4 details the proposed SCMG-FND framework, including implementation specifics of multimodal feature extraction, intramodal semantic enhancement, diffusion-based viewpoint evolution, multi-granularity (global/modal/temporal/spatial) contrastive learning, decision fusion, and interpretability modules; Section 5 describes the dataset and experimental setup, reporting performance comparisons and ablation studies to validate the method’s effectiveness; Section 6 provides an in-depth reflection on the strengths and limitations of this approach, including discussions on its reliance on LLMs and retrieval, computational overhead, and cross-domain generalization capabilities. Section 7 summarizes the contributions of this paper and proposes future research directions. Section 8 supplements ethical considerations, discussing potential misuse risks and corresponding mitigation strategies.

2. Related Work

2.1. Single-Modal Fake News Detection

Early fake news detection research focused mainly on detecting forgery traces in a single channel, typically text, using natural language processing techniques to analyze news content. For example, ref. [11] proposed a dual-branch convolutional network that measures the semantic correlation between news headlines and bodies to detect fake news. Ref. [12] used a capsule network with CNN and pre-trained word embeddings on the ISOT and LIAR datasets, achieving 7.8% and 3.1% improvements respectively. Ref. [13] employed transfer learning to optimize feature extraction on news comments, seamlessly embedding comment features into the fake news detection model. Ref. [14] proposed a temporally evolving graph neural network for the detection of fake news, introducing a time propagation framework that combines structure, semantic, and temporal information. Ref. [15] introduced WELFake, a two-stage baseline combining word embeddings and voting classification, serving as a reproducible benchmark. Ref. [16] proposed a cross-domain fake news detection framework using multimodal data to incorporate domain differences. Ref. [17] studied incorporating user preferences for fake news detection, proposing the UPFD framework that jointly models content and user preference signals via graph neural networks. Experiments on real datasets confirmed its effectiveness. Ref. [18] introduced the Plutchik’s emotion wheel into textual sentiment enhancement tasks, constructing sentiment distribution labels to improve semantic bias detection, which increased model performance by 8%. Ref. [19] designed a hybrid model combining traditional machine learning and deep learning, significantly improving detection while preserving text context features. Ref. [20] proposed a new fake news detection solution emphasizing the importance of sentiment analysis in the detection process. These methods effectively advanced single-modal fake news detection. 
However, as news media have moved beyond text to include visual and auditory content, these methods struggle in multimodal contexts. Therefore, building on single-modal research, multimodal detection strategies are needed to improve accuracy and robustness in complex scenarios.

2.2. Multimodal Fake News Detection

With the widespread popularity of short-video platforms, multimodal fake news often involves inconsistencies between text and images, for example when edited or spliced images are paired with misleading text, alongside video, audio, and other modalities. Purely text-based detection is no longer sufficient, making multimodal fake news detection a major research direction. Multimodal methods aim to fuse information from various sources to achieve more comprehensive and robust detection.
Early approaches focused on text–image fusion. Ref. [21] proposed a Hierarchical Multimodal Contextual Attention Network (HMCAN) that jointly models hierarchical semantic representations of text and multimodal context in a unified deep model. Ref. [22] considered structural features like topics, shares, and likes, and used late fusion to connect different features, improving multimodal fake news detection performance. Ref. [23] developed the FANVM model based on adversarial learning and topic modeling, linearly combining comments, headlines, and video features for fusion. However, these methods rely on shallow feature concatenation, making it difficult to capture complex cross-modal semantic associations in short videos. Ref. [24] applied a co-attention network (MCAN) to deeply fuse text and visual channels, enhancing semantic associations between text and images, which improved the accuracy of image–text fake news detection. Ref. [25] proposed a semantically enhanced multimodal detection method: by leveraging implicit factual knowledge in pre-trained language models and explicit visual entity extraction, it better understood deep semantics of multimodal news. They extracted visual features at different semantic levels and used text-guided attention to model text-vision interactions for better fusion of heterogeneous features. Ref. [26] introduced the SV-FEND framework for short videos, using a cross-modal Transformer to mine text-audio and text-video relationships, but it did not fully parse deep implicit viewpoints in text, which limited its ability to detect “multimodal camouflage” scenarios, such as cases where the content across text, image, and sound appears aligned but the underlying logic is contradictory. Ref. [27] built the OpEvFake model to address the neglect of implicit opinions and opinion evolution in existing methods. 
They designed special prompt templates to extract implicit credibility opinions from video text components, and used diffusion models to promote interactions among multimodal opinions, including those derived from the prompts, thereby achieving multimodal opinion fusion and evolution. Ref. [28] proposed a fine-grained fusion network for cross-modal consistency learning, capturing subtle semantic correlations and consistency across modalities, which improved detection accuracy. In summary, existing methods have made progress in multimodal fake news detection, but they still rely on shallow features, model implicit opinions insufficiently, and have limited ability to recognize complex camouflage scenarios. To tackle these issues, our work proposes a multimodal fusion framework with multi-granularity contrastive learning, enhancing the model’s ability to detect complex forgeries through cross-modal contrast and hierarchical feature modeling.

2.3. Fake News Datasets

The development of fake news detection research has driven the creation of related datasets, which can be categorized into single-modal and multimodal. Single-modal datasets primarily contain text and are used for fact-checking or text-based detection. The LIAR dataset collected about 12,800 public statements by U.S. politicians from 2007 to 2017, serving as a benchmark for automated fact-checking [29]. Ref. [30] provided a manually annotated fake news dataset to support classifier development. Ref. [31] proposed the FA-KES dataset, focusing on fake news related to the Syrian War for machine learning model evaluation. The Twitter rumor dataset by [32] incorporates a temporal dimension, aiding evaluation of model stability over time. In contrast, multimodal datasets include not only text but also images, videos, comments, and metadata, enabling studies of multimodal features. The Fakeddit dataset contains tens of thousands of multimodal fake news samples, offering a rich benchmark [33]. Ref. [34] released the NewsBag dataset, which includes approximately 200,000 real news items and 15,000 fake ones combining both text and images. Ref. [35] introduced the MFND dataset, covering Chinese and English multimodal news with various manipulations, designed for detection and localization research. The MM-COVID dataset covers COVID-19 news in six languages, supporting multilingual and multimodal fake news detection [36]. Recently, ref. [26] constructed the FakeSV dataset of Chinese short news videos, containing numerous samples with video, audio, subtitles, and social context, specifically for multimodal fake news detection on short-video platforms.

3. Preliminaries

3.1. Contrastive Learning

Contrastive learning aims to learn discriminative representations by maximizing the similarity between positive sample pairs and minimizing it between negative pairs. A common formulation is the InfoNCE loss [37,38]. For an anchor sample with feature representation $z_i$ and its positive counterpart $z_i^{+}$, the InfoNCE loss is defined as:
$$\mathcal{L}_{\mathrm{contrast}}^{(g)} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\sum_{j \in B} \exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)},$$
where $z_i$ denotes the representation of the $i$-th anchor, $z_i^{+}$ is its positive counterpart, $\mathrm{sim}(\cdot,\cdot)$ is the normalized cosine similarity, $B$ denotes the current batch, $N$ is the batch size, and $\tau$ is the temperature. Following prior contrastive learning studies that recommend a small but non-zero temperature to balance the gradients contributed by positive and negative pairs [39], we set $\tau = 0.1$. In our multimodal setting, larger values dilute the discrimination between genuine and fake pairs, while smaller values destabilize convergence; we therefore keep $\tau = 0.1$, which delivers the strongest accuracy/F1 trade-off.
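As a concrete illustration, the loss above can be sketched in a few lines of NumPy. Here the in-batch negatives for anchor $i$ are the positives of all other anchors, a common simplification; the exact composition of the batch set $B$ in our framework may differ.

```python
import numpy as np

def info_nce(anchors, positives, tau=0.1):
    """InfoNCE loss over a batch (illustrative sketch).

    anchors, positives: (N, d) feature arrays; row i of `positives`
    is the positive counterpart of row i of `anchors`.
    """
    # Normalize so that the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau  # (N, N); diagonal holds positive pairs
    # Numerically stable log-softmax over each row.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the negative log-likelihood of the positive pairs.
    return -np.mean(np.diag(log_prob))
```

When anchors and positives coincide, the diagonal similarities dominate and the loss approaches zero; for unrelated pairs it approaches $\log N$, matching the intuition that the loss rewards pulling positives together.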
In multimodal short-video settings, contrastive learning can be instantiated at complementary levels by defining task-appropriate positive and negative relations. At the global level we treat samples from the same class as positives and samples from different classes as negatives, which encourages sample-level discrimination. At the modal level we treat different modalities of the same sample as positives and cross-sample pairs as negatives, which enforces cross-modal alignment. At the temporal level we treat audio and video segments aligned at the same time step within a sample as positives and treat other time steps in the same video together with all steps from other videos as negatives, which captures temporal coherence and desynchronization. At the spatial level we treat semantically related or neighboring regions within a frame as positives and unrelated regions within or across frames as negatives, which emphasizes local spatial consistency and forgery localization.
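These pairing schemes reduce to boolean positive masks over a batch or a sequence. The helpers below are an illustrative sketch of the global-level and temporal-level masks under the relations described above; the function names and the exact exclusion of self-pairs are our assumptions, not a specification of the framework.

```python
import numpy as np

def global_positive_mask(labels):
    """Global level: samples of the same class are positives.
    Self-pairs are excluded here (an illustrative choice)."""
    labels = np.asarray(labels)
    mask = labels[:, None] == labels[None, :]
    np.fill_diagonal(mask, False)
    return mask

def temporal_positive_mask(n_steps):
    """Temporal level: audio/video segments at the same time step
    are positives; all other steps act as negatives."""
    return np.eye(n_steps, dtype=bool)
```

The complement of each mask (minus excluded pairs) defines the corresponding negative set fed into the InfoNCE denominator.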

3.2. Multi-Dimensional Credibility Assessment

To avoid ambiguity and ensure consistent terminology, this paper defines “semantic credibility” as: The degree to which video content aligns with real-world knowledge, logical principles, and cross-modal consistency at the semantic level. It reflects how well a short video “makes sense, holds up, and stands the test of logic,” encompassing aspects such as professionalism, factual consistency, logical coherence, source reliability, and cross-modal matching.
In this study, semantic credibility is decomposed into nine quantifiable dimensions. Each dimension is computed jointly by a large language model’s video comprehension capabilities and externally sourced evidence retrieved online. Ultimately, these dimensions serve as explicit semantic signals integrated into the model’s overall decision-making process. They are used to identify implicit falsehood cues such as pseudoscientific narratives, logical contradictions, information gaps, and cross-modal misalignments.
Preliminary experiments showed that directly using a large language model for truth assessment, even with a fixed prompt format, has significant limitations: (1) the analysis tends to be superficial and yields summary conclusions with little deep reasoning; (2) factual inaccuracies due to outdated or limited internal knowledge cause “hallucinations”, failing to effectively identify video news that contradicts objective facts. To overcome these issues, we do not use a single prompt-based approach. Instead, we design a multi-dimensional credibility assessment framework that decomposes the truth assessment task into an ordered set of manageable subtasks, integrating LLM video analysis and online search to build a more reliable multi-dimensional credibility evaluation for video news (see Figure 1).
In this framework, we first employ the latest video understanding model GLM-4V-Flash to generate semantic features from the video modality. Unlike traditional methods that rely on frame-level feature extraction, GLM-4V-Flash can globally parse the entire video’s semantics, accurately capturing cross-frame temporal consistency and intrinsic logical coherence. It can effectively assess title-content consistency, scene transition reasonableness, and narrative structure completeness within a multimodal context. This global understanding fundamentally addresses the limitations of frame-by-frame analysis in modeling semantic continuity and detecting cross-modal consistency, significantly enhancing the video modality’s temporal perception and contextual reasoning in credibility evaluation.
Given the stochastic nature of large language model outputs, we design high-quality prompt templates to guide the model toward stable, requirement-aligned outputs, reducing repetitive API calls. In the online search stage, we implement retry and rate-limit strategies to improve query stability. These efforts provide standardized, high-quality data for model training and joint multi-granularity judgment. The evaluation process has three levels:
Level 1: We use the video analysis model to evaluate the video content along nine dimensions (see Table 1). Each dimension outputs specific analysis and keyword indicators. Keywords are selected from predefined sets for later weight assignment and rule matching.
Level 2: We extract core textual information from videos and integrate Kimi’s online search capabilities for fact-checking. The system verifies whether core claims align with real-world evidence.
Level 3: We treat the search result as an additional “objective factual consistency” dimension, combining it with the nine assessments from Level 1. We then assign weights and compute an overall score to judge the video’s veracity. Before integrating, we validate that all analysis keywords match the predefined sets and correspond to the analysis content, ensuring reliability and consistency.
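A minimal sketch of the Level-3 fusion step might look as follows. The dimension identifiers and the uniform default weights are illustrative assumptions; the deployed system assigns its own weights and performs keyword validation against richer predefined sets.

```python
# The nine Level-1 dimensions (names paraphrased for illustration).
DIMENSIONS = [
    "expertise", "physical_commonsense", "ai_generated",
    "editing_artifacts", "title_content_consistency",
    "emotional_bias", "misleading_cues",
    "source_reliability", "propagation_intent",
]

def fuse_credibility(dim_scores, fact_score, weights=None):
    """Combine the nine Level-1 dimension scores with the Level-2
    fact-check score into one overall credibility score."""
    # Validate that exactly the predefined dimensions are present.
    assert set(dim_scores) == set(DIMENSIONS), "unknown or missing dimension"
    scores = [dim_scores[d] for d in DIMENSIONS] + [fact_score]
    if weights is None:
        # Uniform weights are a placeholder assumption.
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```

The returned scalar is what the decision stage thresholds or feeds into the downstream fusion, alongside the per-dimension scores kept for explainability.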
The multi-dimensional credibility assessment module performs a comprehensive analysis and scoring of the model’s predictions from perspectives of content features, semantic consistency, external knowledge alignment, and uncertainty quantification. This design not only compensates for the insufficiency of single metrics in measuring truthfulness but also provides finer-grained credibility cues for downstream tasks. By weighting and fusing signals from different dimensions, the model can balance accuracy, robustness, and interpretability during detection. In particular, combining external fact verification and a dynamic update mechanism, this module enhances the model’s ability to perceive and adapt to timely information, making results better aligned with real-world needs. Overall, multi-dimensional credibility assessment is not only a key safeguard for result verification but also crucial for improving transparency and user trust.

3.3. Neural-Symbolic Rules

Although multidimensional credibility assessment covers content features and external facts, it lacks structured logical rules to calibrate evaluation results. To fully leverage the effectiveness of multidimensional credibility assessment, enhance the model’s ability to judge complex false content, and improve model interpretability, we introduce a neural-symbolic rule module. The neural-symbolic rule engine converts abstract model predictions into human-understandable reasoning chains—such as “triggering logical error rules thus lowering credibility”—by predefining positive and negative signal rules. This provides quantifiable logical anchors for credibility assessment. Through soft matching and dynamic weight adjustment, this module maps rule texts, input content, and structured model outputs to a unified semantic space, enabling rule-based modulation adaptation across scenarios. Essentially, it facilitates collaborative reasoning between logical rules and deep semantic features.
First, we construct a hierarchical rule base, divided into two categories:
Negative-signal rules: These decrease credibility; their base weights satisfy $w_{\mathrm{base}} < 0$, expressing an inhibitory effect on the probability of truth. Examples include “violates scientific knowledge”, “logical error”, and “lacks factual basis.”
Positive-signal rules: These increase credibility; their base weights satisfy $w_{\mathrm{base}} > 0$ and act as independent confidence bonuses. Examples include “authoritative source confirmed” and “scientific accuracy.”
Concretely, the rule hierarchy and initial weights are provided in Table 2.
We encode rule texts, content to be detected, and LLM outputs into vectors in the same semantic space. The system supports various semantic encoders, such as chatglm-6b and chinese-roberta. We compute the cosine similarity as the match score:
$$\mathrm{sim}(t, r) = \frac{e_t \cdot e_r}{\lVert e_t \rVert \, \lVert e_r \rVert},$$
Instead of hard binary triggers, we employ soft matching: the similarity score acts as a continuous trigger signal that scales each rule’s weight. When a binary triggered/not-triggered decision is needed for statistical purposes, we apply the threshold $\theta = 0.75$. Finally, the rule activation results are applied in two phases. Feature-level adjustment: features related to an activated rule are re-weighted to amplify expressions associated with fake news. Prediction-level adjustment: during classification, each rule’s trigger contributes a bias term to the logit computation, directly influencing the final prediction distribution.
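Putting the soft matching and the prediction-level adjustment together, a minimal sketch could read as follows. The additive form (scaling each rule’s base weight by its similarity score) is our assumption for illustration; the actual engine may use a different modulation.

```python
import numpy as np

def soft_match(rule_emb, content_emb):
    """Cosine similarity between rule and content embeddings."""
    return float(rule_emb @ content_emb /
                 (np.linalg.norm(rule_emb) * np.linalg.norm(content_emb)))

def apply_rules(logit_true, rules, content_emb, theta=0.75):
    """Prediction-level adjustment sketch.

    rules: list of (name, embedding, base_weight) triples, with
    base_weight < 0 for negative-signal and > 0 for positive-signal rules.
    theta is only used to binarize triggers for logging/statistics.
    """
    triggered = []
    for name, rule_emb, w_base in rules:
        s = soft_match(rule_emb, content_emb)
        logit_true += s * w_base      # soft, continuous modulation
        if s >= theta:
            triggered.append(name)    # binarized trigger for the log
    return logit_true, triggered
```

The returned trigger list is exactly what the explainability module can surface as a human-readable rationale (“logical error rule triggered, credibility lowered”).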
Through this lightweight “soft-match + adjustment” design, the neural-symbolic rule module avoids complex dynamic weighting while complementing the multi-dimensional credibility assessment. It provides interpretable logical constraints and external knowledge support. Additionally, by combining semantic matching and fact checking, the module tightly integrates external information with model predictions, enabling quick adaptation to new content such as breaking news scenarios, while maintaining high accuracy and trustworthiness. This mechanism not only enhances robustness and interpretability in complex forgery scenarios but also endows the system with a degree of timeliness, supporting applications in news fact-checking and public opinion monitoring.

4. Methodology

4.1. Overview

Our model consists of seven core components: a multimodal feature extraction module, the multi-dimensional credibility assessment module, the neural-symbolic rule engine, a cross-modal interaction and fusion module, a multi-granularity contrastive learning module, a decision prediction module, and an explainability module.

4.2. Framework Design Principles

Our framework is designed to achieve robust and interpretable multimodal fake news detection. As illustrated in Figure 2, the central idea is to preserve the unique properties of each modality while capturing their complex cross-modal relationships, and to integrate credibility signals from external evidence into the learning process.
In feature extraction, each news sample includes seven basic modalities: video transcript text, title, user profile, user comments, audio signal, key-frame image, and video motion features. Since transcript and title are both text, we perform deep fusion on them via semantic concatenation to form an integrated text feature. This approach preserves the original semantics of all seven modalities while organically aggregating text-based features, forming a structured multimodal representation as input for subsequent fusion.
The multimodal features are first processed by the feature extraction module, and a large language model generates a structured multi-dimensional credibility evaluation. This is submitted to an online search and fact-checking module to retrieve evidence from authoritative, academic, and mainstream sources, encoding these into a fact–evidence vector ( f a c t _ s i g n a l s ). The fact–evidence vector and text embeddings are combined within the credibility assessment, which is initially adjusted by the neural-symbolic rule engine to calibrate node outputs such as the “factuality” score or the “logicality” score. Simultaneously, we enhance intra-modal viewpoints and perform cross-modal interaction and fusion. These are then enhanced by a diffusion model, followed by multi-granularity contrastive learning, yielding the model’s prediction logits. In the decision stage, the neural-symbolic rule engine, using the same semantic similarity and fact signals, applies post-hoc adjustments to the logits, producing the final verdict and explainability log. This approach harnesses rule interpretability alongside data-driven generalization, making the model more robust and traceable when faced with pseudoscientific packaging, headline tampering, or complex causal inversions.
Our model also employs a multimodal fusion structure based on a diffusion model that preserves the unique properties of each modality while capturing their non-linear interrelations. The core mechanisms include: constructing a hierarchical feature enhancement network via the credibility assessment module to implement credibility-weighted optimization of features; and using multi-granularity contrastive learning to build strong cross-modal associations, reinforcing fine-grained semantic alignment between modalities. This dual-layer design of hierarchical feature enhancement and multi-granularity alignment gives the model excellent generalization, especially for deeply disguised forgeries in videos, delivering stable and effective detection performance.

4.3. Multimodal Feature Extraction

To improve the multi-dimensional credibility assessment, we optimize the feature fusion method for RoBERTa. Specifically, we extract hidden states from the last n layers of RoBERTa and assign each layer a learnable weight before a weighted sum, yielding an aggregated semantic representation. Lower layers capture lexical and syntactic information, while higher layers focus on contextual semantics and logical connections. We then combine this weighted text representation with related credibility-dimension features to form a semantic vector used for truthfulness judgment, explicitly embedding it into the multimodal feature space as a core semantic indicator.
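The layer-weighted aggregation can be sketched as follows; this is a minimal NumPy illustration, and the layer count, shapes, and softmax normalization of the learnable weights are assumptions:

```python
import numpy as np

def aggregate_layers(hidden_states, layer_logits):
    """Weighted sum of the last n RoBERTa layers.

    hidden_states: array of shape (n_layers, seq_len, dim) -- hidden
    states of the last n layers (names and shapes are illustrative).
    layer_logits: learnable scalars, one per layer; a softmax turns
    them into normalized layer weights before the weighted sum.
    """
    w = np.exp(layer_logits - np.max(layer_logits))
    w = w / w.sum()                                  # normalized layer weights
    # weight each layer, then sum over the layer axis -> (seq_len, dim)
    return np.tensordot(w, hidden_states, axes=(0, 0))
```

In training, the `layer_logits` would be trainable parameters updated by backpropagation, letting the model learn how much lexical/syntactic (lower-layer) versus contextual (higher-layer) information to keep.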
For other modalities, all text-based features are extracted via pretrained models. For comments, we perform a weighted sum based on like counts. Audio features are extracted via a pretrained VGGish model. Video information is extracted at both the frame and segment levels: for frames, we use a pretrained VGG19 to extract key-frame features; for segments, we take 16-frame clips centered at each time step and use a pretrained C3D model to obtain video motion features. The short-video news feature representation is then:
$E = \{ e_t, e_c, e_u, e_a, e_k, e_v, e_p \}$
Here, $e_t$ denotes the textual features, $e_c$ the comment features, $e_u$ the user features, $e_a$ the audio features, $e_k$ the video frame features, $e_v$ the video motion features, and $e_p$ the semantic vector extracted from the multi-dimensional credibility evaluation, which is used for authenticity judgment.

4.4. Intra-Modal Semantic Enhancement Mechanism

Cross-Modal Feature Alignment. To address feature misalignment, we use a multimodal Transformer architecture for core feature alignment and interaction. After processing through several layers of Multimodal Transformers (MTs), the aligned features for text, the semantic credibility vector, audio, and video can be represented as:
$z_t, z_p, z_a, z_v = \mathrm{MTs}(e_t, e_p, e_a, e_v)$
Taking the modality processing of titles and transcribed text as an example, a single Multimodal Transformer (MT) comprises three Cross-Modal Transformer (CT) units, structured as follows:
$MT_t = CT_{v \to t}(e_t, e_v) \oplus CT_{a \to t}(e_t, e_a) \oplus CT_{p \to t}(e_t, e_p)$
Since the title and the transcribed text both belong to the text modality used in semantic credibility assessment, we concatenate these features and pass them through a fully connected layer to fuse them into a unified text feature:
$z_{tp} = FC(z_t \oplus z_p)$
Here, ⊕ denotes concatenation and F C is a fully connected layer, producing a semantically enriched unified text representation.
Capsule Network for Viewpoint Aggregation. Capsule aggregation is a structured feature modeling approach designed to enhance semantic representations. Unlike traditional methods such as average pooling or self-attention, capsule structures capture relationships between “local semantic units” and “global semantic structures.” For instance, in video-text data, it can identify key viewpoints, core statements, and semantic conflict points while suppressing noise or redundant expressions. Through dynamic routing mechanisms, capsule aggregation integrates cross-modal information more accurately, providing more robust high-level semantic representations for subsequent consistency modeling and false clue detection.
Short-video news often contains scattered claims and viewpoints across modalities. We use a capsule network to aggregate long-range correlated viewpoints within each modality. For each core modality $z^m$ ($m \in \{tp, a, v\}$), we model its internal viewpoints from multiple perspectives with a set of capsules. The capsule generation is:
$Cap_{i,j}^m = w_{i,j}^m \cdot z^m[i, :]$
where $w_{i,j}^m$ denotes trainable parameters, and $Cap_{i,j}^m$ represents the capsule generated from row $i$ of the features $z^m$. Through a dynamic routing mechanism, capsule information is aggregated to obtain view representations across modalities:
$x^m[j, :] = \sum_i Cap_{i,j}^m \cdot r_{i,j}^m$
Here, $r_{i,j}^m$ is the normalized result of the routing coefficient $b_{i,j}^m$, updated iteratively through the following steps:
$b_{i,j}^m \leftarrow b_{i,j}^m + Cap_{i,j}^m \cdot x^m[j, :], \qquad r_{i,j}^m = \dfrac{\exp(b_{i,j}^m)}{\sum_{j'} \exp(b_{i,j'}^m)}$
The routing coefficients are dynamically adjusted based on the similarity between capsules and viewpoints, and after multiple iterations an enhanced unimodal viewpoint representation $x^m$ is generated.
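A minimal sketch of the dynamic routing loop, assuming precomputed capsule predictions and an illustrative three-iteration schedule:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(caps, n_iter=3):
    """Aggregate capsules Cap[i, j, :] into viewpoint vectors x[j, :].

    caps: (n_in, n_out, dim) capsule predictions. Routing coefficients
    r[i, j] are the softmax of logits b[i, j], which are updated by the
    agreement Cap[i, j] . x[j] (shapes and iteration count illustrative).
    """
    n_in, n_out, _ = caps.shape
    b = np.zeros((n_in, n_out))
    for _ in range(n_iter):
        r = softmax(b, axis=1)                     # routing coefficients
        x = np.einsum('ij,ijd->jd', r, caps)       # x[j] = sum_i r[i,j] Cap[i,j]
        b = b + np.einsum('ijd,jd->ij', caps, x)   # agreement update
    return x
```

Capsules that agree with an emerging viewpoint vector receive larger routing coefficients over the iterations, while isolated or noisy capsules are down-weighted.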

4.5. Cross-Modal Viewpoint Interaction Model

Noisy Viewpoint State Evolution. We gradually inject Gaussian noise into the initial viewpoint distribution $x_0^m$, which represents the original state of $x^m$, in order to achieve dynamic updates of the viewpoint state. According to the Markov property [40], the conditional distribution of state $x_k^m$ at step $k$ depends solely on the previous state $x_{k-1}^m$, with its probability density function given by:
$q(x_k^m \mid x_{k-1}^m) = \mathcal{N}\!\left(x_k^m;\ \sqrt{1 - \beta_k^m}\, x_{k-1}^m,\ \beta_k^m I\right)$
$q(x_{1:K}^m \mid x_0^m) = \prod_{k=1}^{K} q(x_k^m \mid x_{k-1}^m)$
where $\beta_k^m$ is the noise weight at step $k$. The sampling process can be expressed as:
$x_k^m = \sqrt{\bar{\alpha}_k^m}\, x_0^m + \sqrt{1 - \bar{\alpha}_k^m}\, \epsilon$
$q(x_k^m \mid x_0^m) = \mathcal{N}\!\left(x_k^m;\ \sqrt{\bar{\alpha}_k^m}\, x_0^m,\ (1 - \bar{\alpha}_k^m) I\right), \qquad \epsilon \sim \mathcal{N}(0, I)$
where $\epsilon$ is standard Gaussian noise, $\bar{\alpha}_k^m = \prod_{i=1}^{k} \alpha_i^m$, and $\alpha_i^m = 1 - \beta_i^m$.
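The closed-form sampling above can be sketched as follows; the beta schedule and the fixed random seed are illustrative:

```python
import numpy as np

def forward_diffuse(x0, betas, k, rng=None):
    """Sample x_k directly from x_0 via the closed form
    x_k = sqrt(alpha_bar_k) * x_0 + sqrt(1 - alpha_bar_k) * eps,
    where alpha_bar_k = prod_{i<=k} (1 - beta_i).
    """
    if rng is None:
        rng = np.random.default_rng(0)       # fixed seed, illustrative
    alphas = 1.0 - np.asarray(betas[:k])
    alpha_bar = np.prod(alphas)              # cumulative signal retention
    eps = rng.standard_normal(x0.shape)      # standard Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, alpha_bar
```

This one-shot formulation avoids simulating the Markov chain step by step, which is what makes the training objective tractable.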
Reverse Denoising and Reconstruction Loss. We first linearly fuse unimodal viewpoints to generate an initial multimodal viewpoint representation.
In the denoising model f ϕ , we use the multimodal viewpoint to guide denoising. Given a noisy viewpoint, the model predicts distributions over the single-modal viewpoints:
$\hat{x}_0^m = f_\phi(x_0^{mt}, x_k^m, K)$
After obtaining the denoised view representations $\hat{x}_0^{tp}, \hat{x}_0^a, \hat{x}_0^v$, the denoising process is constrained by a reconstruction loss, defining the element-wise (pixel-level) loss for each modality as:
$l^m[i, j] = \left(x_0^m[i, j] - \hat{x}_0^m[i, j]\right)^2$
where $x_0^m, \hat{x}_0^m \in \mathbb{R}^{d_1 \times d_2}$ denote the feature matrices and $d_1, d_2$ are the matrix dimensions. The total reconstruction loss is defined as:
$\mathcal{L}_R = \sum_{n=1}^{B} \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \left( l^{tp}[i, j] + l^a[i, j] + l^v[i, j] \right)$

4.6. Multi-Granularity Contrastive Learning Loss Integration

To address the limitation of single-granularity contrast, we design a multi-granularity contrastive learning framework with four tiers: global, modal, temporal, and spatial. This module bridges cross-modal interaction and decision fusion, establishing semantic relations at different abstraction levels to strengthen recognition of subtle multimodal inconsistencies. All contrastive losses use the unified InfoNCE form (Equation (1)). The architecture is as follows:
Global Contrastive Learning. At the sample level, global contrastive learning distinguishes overall semantic differences between real and fake news by maximizing intra-class similarity and minimizing inter-class similarity. The implementation process is as follows:
First, input the preliminarily fused features $F_{fusion} = x_0^{mt}$ generated by the cross-modal interaction module. Next, map the features to the contrastive space via the projection head $g_{global}(\cdot)$ to obtain $Z_{global} = g_{global}(F_{fusion})$. Positive samples: features from other samples within the same category. Negative samples: features from all samples in different categories.
$\mathcal{L}_{global} = -\dfrac{1}{N} \sum_{i=1}^{N} \log \dfrac{\exp\left(\mathrm{sim}(z_i^{global}, z_{i^+}^{global}) / \tau\right)}{\sum_{j \in N(i)} \exp\left(\mathrm{sim}(z_i^{global}, z_j^{global}) / \tau\right)}$
Here, $N(i)$ denotes the negative sample set. The global alignment feature $Z_{global}$ is output, preserving sample-level semantic discriminative information.
Modal Contrastive Learning. This learns semantic consistency and contradictions between text, audio, video, etc., enhancing cross-modal complementarity. Input the enhanced single-modal features $F_{text} = x_0^{tp}$, $F_{audio} = x_0^a$, $F_{video} = x_0^v$ after fusion. We project each: $Z_{text} = g_{text}(F_{text})$, $Z_{audio} = g_{audio}(F_{audio})$, $Z_{video} = g_{video}(F_{video})$. Positive pairs are features from different modalities of the same sample, such as the text and audio of the same news item. Negative pairs are modality features from different samples. The modal contrastive loss is:
$\mathcal{L}_{modal} = \mathcal{L}_{text\text{-}audio} + \mathcal{L}_{text\text{-}video} + \mathcal{L}_{audio\text{-}video}$
where $\mathcal{L}_{text\text{-}audio}$, $\mathcal{L}_{text\text{-}video}$, and $\mathcal{L}_{audio\text{-}video}$ denote the contrastive losses between the text–audio, text–video, and audio–video viewpoints, respectively. Taking the text–audio contrast as an example:
$\mathcal{L}_{text\text{-}audio} = -\dfrac{1}{N} \sum_{i=1}^{N} \log \dfrac{\exp\left(\mathrm{sim}(z_i^{text}, z_i^{audio}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^{text}, z_j^{audio}) / \tau\right)}$
Final output: modality-aligned features $F_{modal} = FC(Z_{text} \oplus Z_{audio} \oplus Z_{video})$. Through backpropagation on the contrastive losses, the concatenated features progressively achieve cross-modal semantic alignment during training, strengthening semantic associations across modalities.
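A pairwise InfoNCE term such as the text–audio contrast can be sketched as follows, assuming L2-normalized projections and in-batch negatives:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE between two modality batches (e.g. text vs. audio).

    z_a, z_b: (N, d) L2-normalized projections; row i of each matrix
    comes from the same sample, so (z_a[i], z_b[i]) is the positive
    pair and all other rows of z_b act as negatives.
    """
    sim = z_a @ z_b.T / tau                            # (N, N) similarities
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # -1/N sum_i log p(i->i)
```

Minimizing this loss pulls matched cross-modal pairs together and pushes mismatched pairs apart, which is exactly the alignment pressure described above.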
Temporal Contrastive Learning. This aligns audio and video features over time to capture temporal semantic consistency or contradiction. First, the audio feature $F_i^{audio}$ and video feature $F_i^{video}$ of the $i$-th sample (i.e., $x_0^a$ and $x_0^v$) in the batch are aligned to the same number of time steps $T$ via the temporal alignment module. We then project each time step: $z_{i,t}^{video} = g_{temporal}(F_{i,t}^{video})$, $z_{i,t}^{audio} = g_{temporal}(F_{i,t}^{audio})$.
Positive samples: audio and video features $(z_{i,t}^{audio}, z_{i,t}^{video})$ at the same time step. Negative samples: other time steps within the same video ($t' \neq t$), and all time steps from other videos in the batch. The bidirectional temporal contrastive loss is defined as:
$\mathcal{L}_{temporal} = \dfrac{1}{2}\left(\mathcal{L}_{video \to audio} + \mathcal{L}_{audio \to video}\right)$
with each direction defined as, for example:
$\mathcal{L}_{video \to audio} = -\dfrac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \dfrac{\exp\left(\mathrm{sim}(z_{i,t}^{audio}, z_{i,t}^{video}) / \tau\right)}{\sum_{j,s} \exp\left(\mathrm{sim}(z_{i,t}^{audio}, z_{j,s}^{video}) / \tau\right)}$
where the index ranges are $i, j \in [1, N]$ and $t, s \in [1, T]$. The final output is the temporally aligned feature $F_{temporal} = FC(z_t^{audio} \oplus z_t^{video})$, enhancing the detection of temporal semantic consistency.
Spatial Contrastive Learning. This analyzes spatial region consistency within video frames, localizing forged areas. For the key frame $x_0^v$ of the $i$-th sample, a $4 \times 4$ grid partition yields $R = 16$ regional features $\{F_{i,r}^{spatial}\}_{r=1}^{16}$, which are projected into the contrastive space:
$z_{i,r}^{spatial} = g_{spatial}(F_{i,r}^{spatial})$
Positive samples: neighborhood regions most semantically relevant to the current region $r$. Negative samples: other regions within the same frame, as well as regions from other frames/videos. The spatial contrastive loss is defined as:
$\mathcal{L}_{spatial} = -\dfrac{1}{16N} \sum_{i=1}^{N} \sum_{r=1}^{16} \log \dfrac{\sum_{r^+ \in N(r)} \exp\left(\mathrm{sim}(z_{i,r}^{spatial}, z_{i,r^+}^{spatial}) / \tau\right)}{\sum_{j,s} \exp\left(\mathrm{sim}(z_{i,r}^{spatial}, z_{j,s}^{spatial}) / \tau\right)}$
The output spatial alignment feature is $F_{spatial} = z^{spatial}$, enabling precise localization of local forgery regions.
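The $4 \times 4$ grid partition can be sketched as follows; average pooling per cell is an assumption here, since the pooling operator for regional features is not specified above:

```python
import numpy as np

def grid_regions(frame_feat, grid=4):
    """Partition a key-frame feature map (H, W, d) into grid x grid
    regional features by average-pooling each cell, yielding the
    R = grid**2 regions fed to the spatial projection head.
    """
    H, W, d = frame_feat.shape
    h, w = H // grid, W // grid
    # split rows/cols into (grid, h) and (grid, w) blocks, then pool
    regions = frame_feat[:h * grid, :w * grid].reshape(grid, h, grid, w, d)
    return regions.mean(axis=(1, 3)).reshape(grid * grid, d)  # (16, d)
```

Each of the 16 region vectors is then projected by $g_{spatial}(\cdot)$ and contrasted against its neighbors and cross-frame regions as defined above.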
Multi-Granularity Feature Fusion and Loss Integration. To avoid scale interference between features of different granularities, each granularity feature undergoes L2 normalization for scale unification prior to concatenation. After concatenation, the features pass through a two-layer MLP projection network with a hidden layer dimension of 256 and ReLU activation, followed by LayerNorm and L2 normalization to ensure consistency in feature distribution:
$F_{norm} = FC(F_{global} \oplus F_{modal} \oplus F_{temporal} \oplus F_{spatial})$
$F_{proj} = \mathrm{MLP}(F_{norm})$
$F_{mgcl} = \mathrm{L2Norm}(\mathrm{LayerNorm}(F_{proj}))$
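The fusion pipeline of per-granularity L2 normalization, concatenation, a two-layer MLP with ReLU, LayerNorm, and a final L2 normalization can be sketched as follows; the weight matrices and dimensions are illustrative:

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # row-wise L2 normalization for scale unification
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_granularities(feats, W1, b1, W2, b2):
    """Fuse global/modal/temporal/spatial features into F_mgcl.

    feats: list of (batch, d_g) arrays, one per granularity; the MLP
    weights W1, b1, W2, b2 stand in for learned parameters.
    """
    x = np.concatenate([l2norm(f) for f in feats], axis=-1)   # concat
    h = np.maximum(x @ W1 + b1, 0.0)                          # hidden, ReLU
    y = h @ W2 + b2
    # LayerNorm over the feature axis, then final L2 normalization
    y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-8)
    return l2norm(y)                                          # F_mgcl
```

The double normalization keeps the fused representation on a common scale so that no single granularity dominates the downstream classifier.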
The overall contrastive loss is the weighted sum of losses across all levels:
$\mathcal{L}_{multi} = \lambda_{global} \mathcal{L}_{global} + \lambda_{modal} \mathcal{L}_{modal} + \lambda_{temporal} \mathcal{L}_{temporal} + \lambda_{spatial} \mathcal{L}_{spatial}$
where $\lambda_{global}$, $\lambda_{modal}$, $\lambda_{temporal}$, and $\lambda_{spatial}$ are the weight coefficients for each contrastive loss.

4.7. Multimodal Decision Fusion

To integrate multimodal information for final classification, we concatenate the multi-granularity contrastive output F mgcl with auxiliary modal features, then apply dynamic rule-based weighting before feeding into a multilayer perceptron (MLP) classifier:
$\hat{y} = \mathrm{MLP}\left(F_{mgcl} \oplus FC(e_u) \oplus FC(e_c) \oplus FC(e_k)\right)$
Based on this, a correction signal derived from the neural-symbolic rules is introduced. Given the semantic similarity $\mathrm{sim}(t, r_i)$ between rule $r_i$ and the input, a summed bias term is added to the logits, which are then converted into the final prediction probability via softmax:
$\hat{y}_{final} = \mathrm{softmax}\left(\hat{y} + \sum_i \omega_i \cdot \mathrm{sim}(t, r_i)\right)$
Here, $\omega_i$ denotes the rule weight, where positive values enhance credibility and negative values suppress it; $\mathrm{sim}(t, r_i)$ is the cosine similarity defined in Section 3.3; and $\hat{y}_{final}$ is the final prediction probability after integrating rule constraints. The model loss function comprises the prediction loss $\mathcal{L}_P$, the reconstruction loss $\mathcal{L}_R$, and the contrastive loss $\mathcal{L}_{multi}$, weighted and summed as follows:
$\mathcal{L}_P = -\sum_{n=1}^{B} \left[ y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n) \right]$
$\mathcal{L} = \mathcal{L}_P + \lambda_R \mathcal{L}_R + \lambda_C \mathcal{L}_{multi}$
where $\lambda_R$ and $\lambda_C$ are the weight coefficients for the reconstruction loss and contrastive loss, respectively, and $B$ is the batch size. By jointly optimizing these three loss types, the model achieves effective fusion of multimodal views, multi-granularity semantic enhancement, and optimized classification decisions.
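The joint objective can be sketched as follows; the $\lambda$ values shown are illustrative defaults, not the paper's tuned coefficients:

```python
import numpy as np

def total_loss(y, y_hat, l_recon, l_multi, lam_r=0.5, lam_c=0.5, eps=1e-12):
    """Joint objective L = L_P + lambda_R * L_R + lambda_C * L_multi,
    where L_P is the binary cross-entropy summed over the batch.
    lam_r / lam_c are illustrative placeholder weights.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # binary cross-entropy over the batch (eps guards log(0))
    l_p = -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return l_p + lam_r * l_recon + lam_c * l_multi
```

In training, `l_recon` and `l_multi` would be the reconstruction and multi-granularity contrastive losses computed earlier in the forward pass.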

4.8. Explainability

Beyond detection, we design a multi-granularity explainability extension module for comprehensive interpretability. This module combines natural-language explanations with structured reasoning to reveal the basis of decisions at multiple levels, enhancing transparency and trust. The overall architecture is shown in Figure 3, with the core consisting of three components: a traditional explainability analysis layer, the neural-symbolic rule engine, and an LLM-enhanced explanation layer.
Traditional Explainability Layer. On top of model predictions that have already been refined through rule adjustments, we generate standard interpretability indicators:
(1) Modal Contributions: Compute the weight of each modality in the final prediction. (2) Feature Importance: Output the importance vector of features within each modality. (3) Attention Maps: Show cross-modal interaction attentions, for example highlighting the attention strengths between text and video or between audio and video. (4) Forgery Region Heatmap: If predicted as fake, highlight spatial regions in frames likely altered.
These outputs provide a quantitative, model-level explanation revealing which modalities and features are salient, but they may still be technical for general users.
LLM-Enhanced Explanation Layer. To bridge the gap to users, we use a large language model to generate layered natural-language explanations. This layer takes as input a unified ExplanationContext data structure containing the model’s prediction probability and confidence, modal contributions and feature importance, attention maps, activated neural-symbolic rules and their details, and multi-dimensional credibility assessment data such as the expertise score, factual consistency, and AI-generation traces, together with relevant metadata. The LLM uses this context to produce human-friendly explanations with a predefined template, providing both a user-facing summary and detailed reasoning. It generates:
(a) Decision Summary: A plain-language summary of the main evidence and reliability behind the decision. (b) Rule Reasoning: A description of the triggered neural-symbolic rules, including their source and count. (c) Confidence Analysis: An explanation of the prediction confidence level and sources of uncertainty (based on feature distributions and modal contributions). (d) Risk Assessment: An assessment of potential risk (based on credibility scores) and suggestions for manual review or follow-up actions. (e) Technical Details: A concise display of modality contribution indices, rule indicators, and processing time.
These outputs provide dual-level explanations for experts and lay users, clearly articulating model decisions. By revealing how titles match content, how modalities contribute, and which rules influenced the decision, our explainability module effectively bridges the “black box” and human understanding, demonstrating its practical utility and reliability.

5. Results

5.1. Experimental Setup

Dataset. We use the publicly available FakeSV Chinese short-video dataset, collected from TikTok and Kuaishou between 2019 and 2022, comprising 5538 news videos. The dataset includes 1827 fake cases derived from 738 independent news events covering diverse forgery techniques and scenarios, 1827 true cases verified by authoritative sources such as Xinhua and CCTV, and 1884 debunking videos that contain official or user clarifications together with the original rumor and refutation evidence. To support multimodal research, it provides a multi-dimensional feature set: the content modality integrates normalized 1080p key frames with invalid frames removed, transcribed audio text, and raw audio; the social modality incorporates user comments weighted by the number of likes and publisher attributes such as follower count and verification status, capturing user interaction and trust signals; and the propagation features record timestamps, repost paths, and geolocation to restore spatiotemporal context. Labels were annotated by a 10-person expert team through double-blind processes with high inter-annotator agreement ( κ = 0.87), ensuring quality and event-level balance. The dataset is split into 70% for training, 20% for validation, and 10% for testing.
Environment Preparation. The experiments in this study were conducted on a high-performance computing server, primarily utilizing NVIDIA A100 data center-grade GPUs (NVIDIA, Santa Clara, CA, USA) for model training and inference to meet the large-scale computational demands of multimodal deep learning models. The model implementation was based on PyTorch 2.1.0 and ran in a CUDA 12.9-accelerated environment. All other dependent software libraries employed common deep learning and multimodal processing tools, including preprocessing modules for video, text, and audio data. The overall environment configuration adheres to mainstream deep learning framework standards, ensuring reproducibility of experimental results and stable replication under identical or comparable GPU conditions.
Parameter Settings. We set the initial learning rate to 0.00005 with a cosine annealing schedule. Batch size is 32, balancing efficiency and convergence. Training runs for up to 60 epochs with early stopping, which halts if validation shows no improvement for 10 consecutive epochs. We use the Adam optimizer with weight decay 0.01, β 1 = 0.9 , and β 2 = 0.999 . To improve efficiency, mixed-precision training is adopted, and gradient clipping with a max norm of 1.0 is applied to prevent explosion. Each experiment is repeated three times under the same conditions, and average performance is reported to reduce randomness and ensure reliability.
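The cosine-annealing schedule and early-stopping criterion described above can be sketched as follows; the minimum learning-rate floor and the choice of validation metric are assumptions:

```python
import math

def cosine_lr(step, total_steps, lr_init=5e-5, lr_min=0.0):
    """Cosine-annealed learning rate starting from lr_init (the 0.00005
    initial rate above); lr_min is an illustrative floor."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * t))

class EarlyStopper:
    """Stop when the validation metric fails to improve for `patience`
    consecutive epochs (patience = 10 in the settings above)."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, -float('inf'), 0

    def step(self, metric):
        if metric > self.best:
            self.best, self.bad = metric, 0   # improvement: reset counter
        else:
            self.bad += 1                     # no improvement this epoch
        return self.bad >= self.patience      # True -> halt training
```

In a PyTorch setup these roles are typically filled by a built-in cosine scheduler and a small callback, with gradient clipping applied separately each step.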

5.2. Research Questions

To systematically evaluate our model’s effectiveness, we investigate the following research questions:
RQ1: Do single modalities each contribute to short-video fake news detection? Does our model significantly outperform single-modal methods?
RQ2: Does our model outperform existing multimodal methods?
RQ3: What is the contribution of each key component in our model?
RQ4: How do different granularity contrastive losses affect detection performance?
RQ5: How is the explainability module manifested in practice?

5.3. Experimental Results and Analysis

RQ1: Single-Modal Feature Contribution. We build six single-modal baselines. For each, we encode the extracted features with a bidirectional LSTM and use a linear layer for binary classification. The evaluation metrics (accuracy, F1, recall, precision) for each modality are shown in Table 3.
These results show that each modality contributes to fake news detection to varying degrees. The comment modality performs worst, while text-based modalities (title/transcript) and user features (profile) perform best. Compared to single-modal baselines, fusing modalities significantly improves performance.
RQ2: Multimodal Fusion Effectiveness. We compare our model with popular multimodal fusion methods to demonstrate its effectiveness in short-video fake news detection. Table 4 shows the accuracy, F1, recall, and precision of our model versus five representative baselines: FANVM [23], MultiEMO [41], SV-FEND [26], SV-FEND-SNEED [42], and OpEvFake [27].
To ensure that the performance improvement is not caused by randomness in parameter initialization or stochastic training variations, we conducted paired t-tests between our proposed SCMG-FND model and the strongest baseline (OpEvFake). Both models were trained and evaluated five times using different random seeds, and the experimental results are reported as mean ± standard deviation. We consider p < 0.01 as statistically significant.
By leveraging all modalities in short-video news, our model surpasses these methods by a large margin, demonstrating the advantage of our joint modeling.
RQ3: Ablation of Core Components. To validate the contribution of each component, we conduct ablation experiments by removing modules from the full model and re-training under the same settings. The results are shown in Table 5.
As shown in Figure 4, the overall accuracy steadily improves as each innovation module is incorporated, indicating the effectiveness of our incremental design strategy. Figure 5 further reports the Precision, Recall, and F1 trends across all ablation settings, showing that the complete model achieves the most balanced and superior performance.
Intra-modal semantic enhancement. Removing the multimodal Transformer, denoted as “w/o Transformer,” and removing the capsule aggregator, denoted as “w/o Capsule,” reduces accuracy from 89.11% to 85.62% and 84.40%, respectively. The Transformer aligns heterogeneous signals such as title, transcript, audio cues, and motion descriptors in a shared latent space so that later modules can compare homogeneous semantics; without it, lexical cues such as sentiment words no longer line up with synchronized audio or frame descriptors, producing noisy gradients. The capsule aggregator captures long-range viewpoint clusters inside each modality and filters out isolated artifacts; ablating it weakens the model’s ability to aggregate sparse editing cues that emerge in scattered frames or sentences. Removing the entire enhancement module, referred to as “w/o Enhancement,” removes both alignment and routing, so neither hierarchical semantics nor viewpoint voting can form, which explains the steepest decline to 81.53% accuracy. This broader drop substantiates that enhancement is not a cosmetic block but the foundation that ensures modality-specific evidence is internally coherent before cross-modal fusion.
Credibility prompt module. As illustrated in Figure 6, the radar chart visualizes the multi-dimensional credibility assessment of true and fake news. “w/o Prompt” means removing the structured multi-dimensional credibility assessment. Without this module, accuracy drops significantly to 83.84%. The prompt injects nine-dimension credibility vectors produced by LLM analysis and verified by search, so text/video cues are grounded on external facts such as authoritative sources and physical plausibility. Once the prompt is removed, the backbone can only exploit internal correlations; it cannot distinguish a persuasive but fabricated narration from one backed by external evidence, leading to more false positives on emotionally charged but true clips and more false negatives on camouflaged fakes. Therefore the prompt acts as an evidence bridge that connects multimodal features with factual priors, explaining its larger influence.
Neural-symbolic rules. “w/o Neural-Symbolic” removes the rule engine. Performance drops notably, especially on videos with logical contradictions or missing information, because this module calibrates LLM outputs during credibility assessment and provides logical constraints during classification. Rule activations such as “logical inconsistency” or “authoritative source” reweight logits according to soft cosine similarity, preventing overconfidence when key modalities are noisy. Without the rules, the system cannot down-weight clips that violate commonsense but still align superficially across modalities, and traceability is lost. This demonstrates the rule engine’s critical role in ensuring high accuracy and interpretability by providing deterministic reasoning hooks in addition to learned scores.
To illustrate, Figure 7 presents the trigger frequencies of positive and negative neuro-symbolic rules across different credibility dimensions. We tested ten typical fake-news videos: the model with neural-symbolic rules consistently assigned a higher “fake tendency” score to these videos than the model without rules, confirming the module’s impact.
Cross-modal viewpoint interaction. “w/o OpiEvo” indicates the removal of the cross-modal opinion interaction module, which operates as a diffusion-based viewpoint evolution block within the model. The results in the table show that removing this module weakens the model’s performance, driving accuracy down to 85.49%. This component gradually denoises shared viewpoints across modalities, letting text-specified claims iteratively condition audio/video evidence; without it, the model lacks an iterative reconciliation mechanism, so subtle contradictions—for example, lip-sync shifts that only emerge after multi-step conditioning—remain undetected. While adding structured multi-dimensional credibility evaluation prompts provides some improvement, it still falls short of matching the original model’s capabilities because prompts capture static evidence whereas OpiEvo models dynamic cross-modal interactions. This underscores the critical importance of the cross-modal opinion interaction module.
Multi-granularity contrastive learning. Removing our multi-granularity contrastive module, denoted as “w/o Mgcl,” causes the overall performance to drop to 86.80%, confirming its key role in deep cross-modal interaction and enhancement. Without this module, cross-modal semantic alignment fails, subtle features are underutilized, and global-local coherence is disrupted. Specifically, the global branch keeps representations of real vs. fake news separable even when individual modalities are ambiguous, the modal branch enforces agreement between text sentiment/emotion and audiovisual tone, the temporal branch catches desynchronization, and the spatial branch highlights local manipulations such as pasted regions. Eliminating all four losses simultaneously removes these complementary supervisory signals; the encoder therefore collapses to average pooling and attention heuristics that overlook fine-grained cues. Thus, the contrastive module is vital for capturing nuanced inconsistencies across modalities and stabilizing training against mode collapse.
Temperature Sensitivity Analysis. To justify the choice of τ , we conducted a sensitivity study where τ { 0.05 , 0.10 , 0.20 , 0.30 } under the same training protocol. Table 6 shows that τ = 0.10 achieves the best balance between accuracy and F1 while preserving stable convergence, so we adopt this value in all experiments.
RQ4: Effects of Contrastive Loss Granularity. From RQ3, we saw the overall benefit of multi-granularity contrastive learning. We now examine each contrastive loss. We conduct ablations where we remove one contrastive loss at a time, as shown in Table 7.
Results indicate that temporal and cross-modal contrastive constraints significantly contribute to capturing dynamic forgeries and semantic inconsistencies, while spatial and global constraints enhance robustness to local details and overall consistency. Further analysis reveals that in fine-grained feature extraction, temporal and spatial contrast losses are specifically designed for short videos’ temporal dynamics, such as synchrony between speech rhythm and frame changes, and visual spatial structures, such as logical relationships between objects and backgrounds within scenes. Without these dual-dimensional constraints, the model exhibits drastically reduced sensitivity to spatio-temporal tampering clues, including lip-sync discrepancies between speech and video and anomalous shadows in synthetic images, making it difficult to detect deepfakes that rely on such details. Moreover, the synergistic effect between global contrast loss and modal contrast loss ensures the model distinguishes core semantic differences between real and fake news from a holistic perspective while promoting cross-modal information complementarity. Without these losses, the model tends to get stuck in local optima, overemphasizing superficial features of a single modality, such as emotional words in text or low-resolution noise in video, rather than deep logical connections revealed through multimodal fusion, such as contradictions between authoritative source citations and statistical data. This leads to significantly weakened detection capabilities in complex forgery scenarios.
RQ5: Explainability Module Demonstration. We assess the practical utility of the explainability module on two representative samples that are challenging in real-world detection and illustrate the module’s effectiveness. Table 8 presents two typical cases where the module works well, while the subsequent Table 9 summarizes reviewer-suggested failure cases that reveal current limitations.
For the sample assessed as 'true', whose title states that a male driver violently beat a female driver after her illegal lane change, dangerous maneuvering, and insults, the interpretability module demonstrated strong explanatory capability. The model indicated that the video is largely consistent with the title, clearly showing the male driver's violent behavior, which is corroborated by official reports obtained through internet searches, confirming that the title accurately represents the video content. The trust score reached 0.88, and rule triggering was based on the title-content consistency verified through internet checks. Text dominated the modality contributions at 55%, while video and audio contributed 25% and 12%, respectively, indicating that textual information was the core driver of the decision. In contrast, for the sample judged as 'fake', whose title states that a male driver with road rage and violent tendencies beat a female driver because she was driving slowly, the module provided an equally accurate explanation. Internet searches revealed significant inconsistency between the title and the actual events, as the real cause of the incident did not match the title. The model classified this case as fake based on the 'inconsistency between title and content' and 'internet search' rules. The trust score was 0.86, and rule triggering stemmed from the verified inconsistencies. In this case, text, video, and audio contributed 48%, 32%, and 10%, respectively, with all modalities jointly supporting the 'fake' decision.
In both cases, the explainability module accurately articulated the reasons for the model's decisions. Whether verifying title-content consistency through cross-checking or reporting modality contributions, the explanations are clear and traceable, effectively bridging the gap between the model's "black box" and human reasoning. This demonstrates the module's practical usability and reliability in such scenarios.
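The explanation emitted for each sample can be viewed as a small structured record. A minimal sketch, with field names that are illustrative rather than taken from the implementation, populated with the 'fake' case above:

```python
from dataclasses import dataclass

@dataclass
class Explanation:
    """Structured rationale emitted alongside each prediction (field names illustrative)."""
    verdict: str            # "true" or "fake"
    trust_score: float      # confidence in [0, 1]
    rule_triggers: list     # neuro-symbolic rules that fired
    modality_contrib: dict  # per-modality share of the decision

    def summary(self) -> str:
        mods = ", ".join(f"{m}: {p:.0%}" for m, p in self.modality_contrib.items())
        return (f"verdict={self.verdict} (trust {self.trust_score:.2f}); "
                f"rules: {'; '.join(self.rule_triggers)}; contributions: {mods}")

# The 'fake' driver-assault case discussed above, rendered as a record.
exp = Explanation("fake", 0.86,
                  ["title-content inconsistency", "internet search"],
                  {"text": 0.48, "video": 0.32, "audio": 0.10})
```

Keeping the rationale in a fixed schema is what makes the explanations traceable: a reviewer can audit which rules fired and how much each modality contributed without inspecting internal activations.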
For Case 1, the explainability panel reveals that the model over-penalized blurred slideshow photos and synthetic narration as if they were deliberate edits, while the title keywords were not immediately cross-verified with the police announcement that confirmed the student’s accidental fall. Consequently, temporal contrastive cues and rule triggers stayed weak, leading to a conservative verdict that flipped a true video into the fake class. For Case 2, the coordinated visual and textual slogans, together with cheering audio, aligned strongly enough to outweigh sparse search results about the staged patriotic event. Without sufficient external corroboration, the plausibility module treated the staged setting as authentic, so the prediction drifted from fake to true. These cases show that when cross-modal consistency appears high but external evidence is missing or delayed, the current pipeline can still misjudge, underscoring the need for tighter fact-check grounding in low-evidence scenarios.

6. Discussion

Based on the experimental results and design objectives, the proposed SCMG-FND framework has met and exceeded expectations in terms of core performance and mechanism effectiveness. On the FakeSV dataset, SCMG-FND achieves an accuracy of 89.11 ± 0.04% and an F1 score of 89.53 ± 0.05%, outperforming multiple state-of-the-art methods, including MultiEMO, FANVM, SV-FEND, SV-FEND-SNEED, and OpEvFake, in overall capability. A paired t-test between SCMG-FND and the strongest baseline (OpEvFake) across five independent runs confirms that the improvements are statistically significant at p < 0.01, indicating that the gains are stable rather than due to random fluctuations. This performance enhancement stems from the synergistic interaction among multi-dimensional credibility assessment, multi-granularity contrastive learning, and the neuro-symbolic rule engine. These components collectively address the covert propagation characteristics of short-video misinformation, which often appears superficially compliant while containing deeper semantic contradictions.
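The significance check can be reproduced with a standard paired t-test over per-run accuracies. A self-contained sketch; the five run-level numbers are illustrative values consistent with the reported means and standard deviations, not the actual experiment logs:

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic: mean of per-run differences over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative per-run accuracies consistent with the reported
# 89.11 +/- 0.04 (SCMG-FND) and 87.80 +/- 0.05 (OpEvFake).
scmg = [89.07, 89.10, 89.11, 89.13, 89.14]
opev = [87.74, 87.78, 87.80, 87.83, 87.85]

t = paired_t(scmg, opev)
# With n - 1 = 4 degrees of freedom, |t| > 4.604 implies p < 0.01 (two-sided).
```

The pairing matters: because both models are evaluated on the same splits in each run, differencing the runs removes split-level variance and yields a much more sensitive test than comparing the two unpaired means.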
The multi-dimensional credibility assessment module enables the model to detect subtle manipulations, such as linguistic ambiguity and conceptual confusion, that are difficult to identify from internal clues alone. This is achieved by performing fine-grained modeling across nine semantic dimensions. When this module is removed (w/o Prompt), the model's accuracy drops from 89.11% to 83.84%, a decline of 5.27 percentage points, demonstrating the critical role of external evidence verification and fine-grained credibility analysis in detecting covert false content. At the same time, the multi-granularity contrastive learning mechanism constrains cross-modal inconsistencies at the global, modal, temporal, and spatial levels. This design effectively overcomes the limitations of traditional methods that rely solely on feature concatenation or a single attention mechanism. Ablation results show that removing any granularity leads to a performance degradation of 0.33 to 1.64 percentage points. Removing the temporal contrast has the greatest impact on identifying dynamically tampered content, further confirming the importance of cross-modal temporal consistency in the detection process.
Furthermore, the neuro-symbolic rule engine introduces explicit logical constraints while preserving the generalization capability of deep learning models. This enhances the framework's ability to distinguish challenging samples in which the headline and content are consistent in form but inconsistent with factual reality. Ablation experiments indicate that removing this module (w/o Neuro) significantly reduces the model's sensitivity to semantic contradictions and logical flaws, thereby increasing the risk of misclassification. Comparative analyses reveal that FANVM and MultiEMO rely on shallow modality fusion and struggle with cross-modal semantic misalignment. The SV-FEND family depends heavily on internal cues, limiting its ability to detect factual falsehoods. OpEvFake performs well in opinion evolution analysis but lacks comprehensive constraints related to multi-dimensional credibility and logical coherence. The three core mechanisms proposed in this study achieve unified modeling of semantic alignment, factual verification, and logical reasoning, resulting in overall detection performance that surpasses existing models.
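To make the calibration step concrete, the rule engine's logit reweighting can be sketched as follows. The base weights come from Table 2, but the grouping of rules into fake-evidence and real-evidence sets, the soft-match scores, and the scaling factor `alpha` are illustrative assumptions:

```python
# Base weights from Table 2; the negative/positive grouping, the match
# scores, and alpha below are illustrative assumptions.
NEGATIVE = {"logical_inconsistency": 1.0, "factual_contradiction": 1.0,
            "emotional_manipulation": 0.8}
POSITIVE = {"authoritative_source": 0.8, "cross_verification": 0.7}

def reweight_logits(logits, match_scores, alpha=0.5):
    """Shift the (real, fake) logits by soft-matched rule evidence."""
    fake_evid = sum(NEGATIVE[r] * s for r, s in match_scores.items() if r in NEGATIVE)
    real_evid = sum(POSITIVE[r] * s for r, s in match_scores.items() if r in POSITIVE)
    real, fake = logits
    return (real + alpha * real_evid, fake + alpha * fake_evid)

# A strong logical-inconsistency match flips a weakly "real" prediction to "fake".
real, fake = reweight_logits((0.2, 0.1),
                             {"logical_inconsistency": 0.9, "authoritative_source": 0.1})
```

Because the shift is additive on the logits, the rules can override the neural prediction only when their soft-match evidence is strong, which is how the engine adds logical constraints without discarding the learned representation.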
Although SCMG-FND demonstrates strong performance and model interpretability, areas warranting further research remain.
Dependence on LLMs and potential evaluation bias. The framework relies on GLM-4V-Flash for video semantics and Kimi for online evidence retrieval. Differences in LLM reasoning capability and knowledge coverage may introduce variability, especially for domain-specific videos such as medical or legal content. Kimi's retrieval quality also depends on data-source coverage, which can affect assessments involving time-sensitive misinformation. Moreover, in the absence of a unified mechanism to verify LLM outputs, hallucinations or incomplete dimension coverage may occasionally occur. Future work will explore multi-LLM cross-validation and explicit quality metrics to stabilize the credibility assessment.
Computational Complexity and Deployment Constraints. The SCMG-FND framework integrates multimodal encoders, multi-granularity contrastive branches, viewpoint evolution modules, and a neural-symbolic rule engine. This modular architecture naturally increases computational cost. To accurately assess the system’s practical applicability, it is important to distinguish two different types of latency.
The first type is pure model inference latency. This refers solely to the forward pass of the multimodal model, which includes the video and audio encoders, fusion layers, contrastive branches, and the final decision module. On an NVIDIA A100 (80 GB), inference on a short video requires approximately 0.3 s. On other commonly used GPUs such as the V100, RTX 3090, or RTX 4090, the latency typically rises to between 1.4 and 2.8 s, depending on the length and resolution of the video. Although this latency is too high for real-time applications on edge devices, it remains entirely feasible for offline or batch-based moderation tasks.
The second type is the end-to-end latency of the full detection pipeline. This includes multimodal inference along with large language model semantic analysis, prompt-based credibility scoring, external evidence retrieval, and execution of the neural-symbolic rule engine. When cloud-based large language model interfaces are used, network communication and remote inference become the dominant sources of time consumption. Without parallel processing, analyzing a video of roughly 30 s typically requires between 90 and 150 s. Under fully local deployment, where both large language models and retrieval indexes are stored locally, the total processing time can be reduced to approximately 20 to 30 s due to the absence of network overhead.
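The two latency figures can be related by simple stage-level accounting. The per-stage numbers below are hypothetical values chosen to fall inside the ranges reported above, not measured timings:

```python
# Hypothetical per-stage timings (seconds) for one ~30 s video with cloud LLM access.
stages = {
    "multimodal_inference": 0.3,   # forward pass on an A100
    "llm_video_analysis": 55.0,    # remote LLM call (network + remote inference)
    "evidence_retrieval": 50.0,    # online search and evidence collection
    "rule_engine": 1.0,            # neuro-symbolic rule execution
}

model_only = stages["multimodal_inference"]
end_to_end = sum(stages.values())
print(f"model-only: {model_only:.1f} s, end-to-end: {end_to_end:.1f} s")
```

The breakdown makes the deployment trade-off explicit: the neural model itself is cheap, and the end-to-end figure is dominated by the two remote LLM stages, which is exactly what local hosting of the LLM and retrieval index reduces.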
Although the overall latency is higher than that of lightweight baselines, this does not hinder the practical usability of the framework. SCMG-FND is designed for pre-publication moderation on short-video platforms, where videos are analyzed after user upload but before public release. Such workflows do not demand real-time responses. Content moderation pipelines generally include queueing, multi-stage verification mechanisms, and human–AI collaborative auditing. The processing cost of SCMG-FND fits naturally into this operational structure. As a result, despite the increased computational requirements, the framework remains well suited for deployment in real-world short-video moderation systems.
Dataset dependency and limited generalization. Model performance is strongly tied to the characteristics of the FakeSV dataset, which is primarily Chinese and collected before 2023. This limits cross-lingual transferability and reduces coverage of emerging forgery types such as AI-generated avatars, real-time scene manipulation, and multimodal generative forgeries. Additional multilingual, temporally diverse corpora and domain adaptation strategies are required to sustain performance beyond FakeSV’s distribution.
Beyond empirical performance, the SCMG-FND framework proposed in this paper holds significant theoretical and practical implications. The integration of semantic credibility assessment, multi-granularity contrastive learning, and neuro-symbolic logic constraints provides a unified perspective for modeling disinformation signals across semantic, temporal, and logical dimensions, while offering actionable insights for mitigating real-world misinformation. Its explicit credibility dimensions and rule-triggered explanation mechanism empower human reviewers to understand model decision-making processes, enhancing transparency in content moderation workflows. The framework's modularity enables its extension to critical domains such as public health, emergency response, and political communication, scenarios where timely detection of misleading short videos is paramount. These theoretical contributions and practical applications highlight SCMG-FND's broad applicability, demonstrating its potential to advance the development of next-generation trustworthy multimodal disinformation detection systems.

7. Conclusions

We have addressed key challenges in short-video fake news detection by proposing a framework that integrates multi-dimensional credibility assessment and multi-granularity contrastive learning. Leveraging LLM-based video analysis and online search, our model enhances understanding of global video features, enabling detection from basic content extraction to advanced semantic manipulation recognition, while forming a traceable analysis chain. The multi-dimensional credibility assessment significantly boosts detection of covert manipulations. To handle complex intra-modal feature structures and dispersed viewpoints, we use a capsule network with dynamic routing to aggregate long-range semantic dependencies and cross networks for high-order interactions, strengthening the foundation of intra-modal semantics. Our multi-granularity contrastive learning aligns features across global, modal, temporal, and spatial dimensions, effectively capturing inter-modal inconsistencies. Experiments on the FakeSV dataset demonstrate that our method achieves an accuracy of 89.11% and outperforms mainstream models in both accuracy and robustness. The combined use of rule-based reasoning and data-driven learning yields a robust and interpretable system.
Looking ahead, we recognize that the effectiveness of multi-dimensional credibility assessment depends on the inherent reasoning abilities of LLMs, and there is a lack of systematic evaluation mechanisms to constrain or verify their outputs. Future work will focus on developing more comprehensive quality assessment for credibility outputs, and further integrating the capsule network and interaction networks more tightly with multi-granularity contrastive learning. This aims to further improve the model’s ability to detect highly consistent cross-modal forgeries.

8. Ethical Considerations

When using the publicly available FakeSV dataset, we strictly adhere to relevant laws, regulations, and ethical guidelines. We use it solely for academic research and do not disseminate any misleading content contained within.
Considering the potential impact of LLM-generated text analysis, we take precautions to mitigate risks. In our code and publications, we only share prompt templates, not the specific model-generated analyses. This ensures that any LLM output, which might contain hallucinations or sensitive content, is not inadvertently circulated. Our intent is to support fake news detection research without contributing to misinformation or harming public interest.
Beyond ethical concerns regarding data sources and large language model outputs, we further recognize that this research methodology exhibits a classic “dual-use” characteristic: while the model aims to enhance detection capabilities against fake news and deepfakes, its technical approach could be exploited by malicious actors to construct more deceptive forged content. For instance, attackers could exploit key forgery indicators identified by the model—such as cross-modal inconsistencies, temporal step anomalies, or localized forgery traces—to reverse-engineer the generation process. This would enable them to tailor fabricated content to better align with the model’s detection thresholds, thereby enhancing its concealment. To mitigate this risk of reverse abuse, we adopted the following safeguards:
We do not disclose model parameter details, rule weights, or internal thresholds, providing only the overall methodological architecture to avoid offering attackers directly exploitable surfaces;
Modular, replaceable design makes specific components (e.g., neuro-symbolic rules, contrastive learning constraints) difficult to replicate, increasing the cost of reverse engineering;
We emphasize that the model is solely for content authenticity verification, not for generating or enhancing potentially misleading content, and strictly limit functional interfaces during open-source release or publication;
In future work, we plan to introduce adversarial robustness assessments, such as testing whether generative models can circumvent this model’s cross-modal contrastive constraints, to proactively identify potential vulnerabilities and guide safer model design.
We fully recognize that multimodal detection technology carries both societal value and inherent risks. Therefore, we adopt cautious, transparent, and responsible strategies in method design and output presentation to prevent misuse of the technology for creating more subtle forgeries or misleading content.

Author Contributions

Conceptualization, Y.Y., X.S., H.L., B.F. and Y.X.; methodology, Y.Y., X.S. and Y.X.; software, Y.Y. and X.S.; validation, Y.Y., X.S., H.L. and B.F.; formal analysis, Y.Y. and Y.X.; investigation, Y.Y., X.S., H.L. and B.F.; resources, Y.X.; data curation, Y.Y. and X.S.; writing—original draft preparation, Y.Y. and X.S.; writing—review and editing, Y.Y., X.S. and Y.X.; visualization, Y.Y. and H.L.; supervision, Y.X.; project administration, Y.X.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number U24B20147.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The FakeSV dataset used in this study is publicly available. The source code for the SCMG-FND framework is available at https://github.com/yykun-pixel/SCMG-FND, accessed on 25 November 2025.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Business of Apps. TikTok App Report. Available online: https://www.businessofapps.com/data/tiktok-app-report/ (accessed on 26 May 2025).
2. Allcott, H.; Gentzkow, M. Social media and fake news in the 2016 election. J. Econ. Perspect. 2017, 31, 211–236.
3. Leng, Y.; Zhai, Y.; Sun, S.; Wu, Y.; Selzer, J.; Strover, S.; Zhang, H.; Chen, A.; Ding, Y. Misinformation during the COVID-19 outbreak in China: Cultural, social and political entanglements. IEEE Trans. Big Data 2021, 7, 69–80.
4. Shahzad, S.A.; Hashmi, A.; Peng, Y.-T.; Tsao, Y.; Wang, H.-M. AV-Lip-Sync+: Leveraging AV-HuBERT to exploit multimodal inconsistency for video deepfake detection. arXiv 2023, arXiv:2311.02733.
5. Peng, L.; Zhang, Y.; Wang, W. Not all fake news is semantically similar: Contextual semantic representation learning for multimodal fake news detection. Inf. Process. Manag. 2024, 61, 102712.
6. Wang, Y.; Li, X.; Zhang, Y. Audio–visual deepfake detection using articulatory features. Signal Process. Image Commun. 2024, 101, 116–123.
7. Javed, M.; Khan, M.S.; Wang, H.-M. Audio–visual synchronization and lip movement analysis for deepfake detection. J. Vis. Commun. Image Represent. 2025, 77, 103205.
8. Bohacek, M.; Farid, H. Lost in translation: Lip-sync deepfake detection from audio-video mismatch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 17–18 June 2024; pp. 100–108.
9. Liu, W.; Zhang, Y.; Wang, X. Spotting the temporal inconsistency between audio and lip movements for deepfake detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 38, pp. 1234–1242.
10. Peng, L.; Zhang, Y.; Wang, W. Dual emotion based fake news detection: A deep attention mechanism approach. Inf. Process. Manag. 2024, 61, 102813.
11. Guo, L.N.; Huang, J.; Wu, X.C.; Yang, Z.; Liu, W. Fake news detection based on joint-training of dual-branch networks. Comput. Eng. Appl. 2022, 58, 153–161.
12. Goldani, M.H.; Momtazi, S.; Safabakhsh, R. Detecting fake news with capsule neural networks. Appl. Soft Comput. 2021, 101, 106991.
13. Shen, R.L.; Pan, W.M.; Peng, C.; Yin, P.B. Microblog rumor detection method based on multi-task learning. Comput. Eng. Appl. 2021, 57, 192–197.
14. Song, C.G.; Shu, K.; Wu, B. Temporally evolving graph neural network for fake news detection. Inf. Process. Manag. 2021, 58, 102712.
15. Verma, P.K.; Agrawal, P.; Amorim, I.; Prodan, R. WELFake: Word embedding over linguistic features for fake news detection. IEEE Trans. Comput. Soc. Syst. 2021, 8, 881–893.
16. Silva, A.; Luo, L.; Karunasekera, S.; Leckie, C. Embracing domain differences in fake news: Cross-domain fake news detection using multi-modal data. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 557–565.
17. Dou, Y.T.; Shu, K.; Xia, C.Y.; Yu, P.S. User preference-aware fake news detection. In Proceedings of the 44th ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 2051–2055.
18. Zeng, X.Q.; Hua, X.; Liu, P.S.; Zuo, J.; Wang, M. Text sentiment distribution label augmentation method based on Plutchik's wheel of emotions and sentiment lexicon. J. Comput. Res. Dev. 2021, 44, 1080–1094.
19. Kausar, N.; Alikhan, A.; Sattar, M. Towards better representation learning using hybrid deep learning model for fake news detection. Soc. Netw. Anal. Min. 2022, 12, 165.
20. Alonso, M.A.; Vilares, D.; Gómez-Rodríguez, C.; Vilares, J. Sentiment analysis for fake news detection. Electronics 2021, 10, 1348.
21. Qian, S.S.; Wang, J.G.; Hu, J.; Fang, Q.; Xu, C. Hierarchical multi-modal contextual attention network for fake news detection. In Proceedings of the 44th ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 153–162.
22. Yi, S.R.; Soleymani, S.; Arabnia, H.R.; Li, S. Socially aware multimodal deep neural networks for fake news classification. In Proceedings of the IEEE 4th International Conference on Multimedia Information Processing and Retrieval, Tokyo, Japan, 8–10 September 2021; pp. 253–259.
23. Choi, H.; Ko, Y. Using topic modeling and adversarial neural networks for fake news video detection. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, Online, 1–5 November 2021; pp. 2950–2954.
24. Wu, Y.; Zhan, P.W.; Zhang, Y.J.; Wang, L.; Xu, Z. Multimodal fusion with co-attention networks for fake news detection. In Findings of the ACL-IJCNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2560–2569.
25. Qi, P.; Cao, J.; Li, X.; Liu, H.; Sheng, Q.; Mi, X.; He, Q.; Lv, Y.; Guo, C.; Yu, Y. Improving fake news detection by using an entity-enhanced framework to fuse diverse multimodal clues. J. Comput. Res. Dev. 2021, 58, 1456–1465.
26. Qi, P.; Bu, Y.; Cao, J.; Ji, W.; Shui, R.; Xiao, J.; Wang, D.; Chua, T.-S. FakeSV: A multimodal benchmark with rich social context for fake news detection on short video platforms. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 14444–14452.
27. Zong, L.; Zhou, J.; Lin, W.; Liu, X.; Zhang, X.; Xu, B. Unveiling opinion evolution via prompting and diffusion for short video fake news detection. In Findings of the Association for Computational Linguistics: ACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 10817–10826.
28. Li, J.; Bin, Y.; Zou, J.; Zou, J.; Wang, G.; Yang, Y. Cross-modal consistency learning with fine-grained fusion network for multimodal fake news detection. arXiv 2023, arXiv:2311.01807.
29. Wang, W.Y. "Liar, Liar Pants on Fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 422–426.
30. Golbeck, J.; Mauriello, M.; Auxier, B.; Gieringer, C.; Graney, J.; Hoffman, K.M.; Huth, L.; Ma, Z.; Jha, M.; Khan, M.; et al. Fake news vs satire: A dataset and analysis. In Proceedings of the 10th ACM Conference on Web Science (WebSci), Amsterdam, The Netherlands, 20–30 May 2018; pp. 17–21.
31. Abu Salem, F.K.; Al Feel, R.; Elbassuoni, S.; Jaber, M.; Farah, M. FA-KES: A fake news dataset around the Syrian war. In Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM), Munich, Germany, 11–14 June 2019; pp. 573–582.
32. Zubiaga, A.; Kochkina, E.; Liakata, M.; Procter, R.; Lukasik, M. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In Proceedings of the 26th International Conference on Computational Linguistics (COLING), Osaka, Japan, 11–16 December 2016; pp. 2438–2446.
33. Nakamura, K.; Levy, S.; Wang, W.Y. Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, 11–16 May 2020; pp. 755–762.
34. Jindal, S.; Sood, R.; Singh, R.; Vatsa, M.; Chakraborty, T. NewsBag: A benchmark multimodal dataset for fake news detection. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 138–145.
35. Zhu, Y.; Liu, Y.; Zhang, X. MFND dataset and shallow-deep multitask learning. arXiv 2025, arXiv:2505.06796.
36. Li, Y.; Zhang, H.; Zhang, J.; Wang, C. MM-COVID: A multilingual multimodal dataset for COVID-19 fake news detection. arXiv 2020, arXiv:2011.04088.
37. Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 776–794.
38. van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
39. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; Volume 119, pp. 1597–1607.
40. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020.
41. Shi, T.; Huang, S.-L. MultiEMO: An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023; pp. 14752–14766.
42. Qi, P.; Zhao, Y.; Shen, Y.; Ji, W.; Cao, J.; Chua, T.-S. Two heads are better than one: Improving fake news video detection by correlating with neighbors. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 11947–11959.
Figure 1. Schematic Diagram of Multi-Dimensional Credibility Evaluation Construction.
Figure 2. Overview of the framework: SCMG-FND comprises seven collaborating modules—feature extraction, credibility assessment, neuro-symbolic rules, cross-modal fusion, multi-granularity contrastive learning, decision fusion, and explainability. The system extracts features from transcript+title, comments, user profile, audio, key frames, and motion; GLM-4V-Flash with online search builds a nine-dimension credibility profile. The rule engine performs soft matching in a shared semantic space to calibrate features and reweight logits. After alignment by Multimodal Transformers, text–credibility fusion, capsule aggregation, and diffusion-based viewpoint evolution, unified InfoNCE objectives impose global/modal/temporal/spatial contrastive constraints. The contrastive outputs are fused with user/comment/key-frame cues to predict veracity, and explanations report rule triggers and modality contributions.
Figure 3. Overall Architecture of the Interpretability Module.
Figure 4. Progressive Performance Improvement by Adding Innovation Modules.
Figure 5. Precision, Recall, and F1 Trends Across Ablation Settings.
Figure 6. Multi-Dimensional Credibility Assessment Analysis Using Radar Chart.
Figure 7. Comparison of Trigger Frequencies for Positive and Negative Neuro-Symbolic Rules.
Table 1. Nine Dimensions of Credibility Assessment.
| Dimension | Description |
| --- | --- |
| Professionalism | Are professional terms used? Are authoritative sources cited? Are academic terms used correctly and argumentation logically rigorous? Does it conform to domain knowledge? |
| Physical/Common-sense Consistency | Does the content violate natural laws or basic physical/social common sense? |
| AI-Generated Likelihood | Are there unnatural human motions, scene discontinuities, synthesized voice, or abnormal facial expressions indicative of AI generation? |
| Editing Artifacts | Are there unnatural jumps, audio-video desynchronization, repeated frames, or selective editing artifacts? |
| Title-Content Consistency | Compared to the content, does the title exaggerate, take things out of context, or mismatch? |
| Emotional Bias | Is fear, anger, or divisive emotion deliberately evoked through music, tone, or imagery? |
| Misleading Content | Are techniques like cherry-picking, bait-and-switch, or inverted causality used to induce misunderstanding, even if parts are true? |
| Source Reliability | Are information sources clearly cited? Is the publishing account credible? Is it from a mainstream or authoritative platform? |
| Intent to Spread | Are there obvious political or commercial motives or a clear stance, indicating a manipulative agenda rather than objective reporting? |
Table 2. Rule Weight Configuration.
Table 2. Rule Weight Configuration.
| Rule | Weight (w_base) |
|---|---|
| Logical Inconsistency | 1.0 |
| Factual Contradiction | 1.0 |
| Source Unreliability | 0.9 |
| Statistical Anomaly | 0.9 |
| Emotional Manipulation | 0.8 |
| Timeline Inconsistency | 0.8 |
| Authoritative Source | 0.8 |
| Scientific Evidence | 0.7 |
| Cross-Verification | 0.7 |
| Expert Endorsement | 0.7 |
| Official Documentation | 0.6 |
| Peer Review | 0.6 |
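One way the neuro-symbolic engine could use the Table 2 weights is to shift the neural network's fake-probability in logit space by the net weight of triggered rules: negative rules (evidence of fakeness, e.g., Factual Contradiction) push the probability up, positive rules (e.g., Peer Review) push it down. The logit-shift form and the scaling factor alpha below are assumptions for illustration, not the paper's exact calibration.

```python
# Minimal sketch of rule-based calibration with the Table 2 base weights.
# The sigmoid/logit adjustment and alpha are illustrative assumptions.
import math

NEGATIVE_RULES = {  # evidence for fakeness
    "logical_inconsistency": 1.0,
    "factual_contradiction": 1.0,
    "source_unreliability": 0.9,
    "statistical_anomaly": 0.9,
    "emotional_manipulation": 0.8,
    "timeline_inconsistency": 0.8,
}
POSITIVE_RULES = {  # evidence for authenticity
    "authoritative_source": 0.8,
    "scientific_evidence": 0.7,
    "cross_verification": 0.7,
    "expert_endorsement": 0.7,
    "official_documentation": 0.6,
    "peer_review": 0.6,
}

def calibrate(p_fake: float, triggered: set, alpha: float = 0.5) -> float:
    """Shift the neural fake-probability by the net weight of triggered rules."""
    shift = sum(w for r, w in NEGATIVE_RULES.items() if r in triggered)
    shift -= sum(w for r, w in POSITIVE_RULES.items() if r in triggered)
    logit = math.log(p_fake / (1.0 - p_fake)) + alpha * shift
    return 1.0 / (1.0 + math.exp(-logit))
```

With no rules triggered the neural prediction passes through unchanged, which keeps the symbolic layer a calibrator rather than an override; the triggered rule set also doubles as the traceable rationale reported to the user.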
Table 3. Performance Comparison of Single-Modality Features in Fake News Detection.
| Model | Acc (%) | F1 (%) | Rec (%) | Pre (%) |
|---|---|---|---|---|
| Keyframes | 68.62 | 69.94 | 70.20 | 68.63 |
| Video motion | 68.62 | 69.90 | 70.11 | 68.63 |
| Audio | 67.76 | 67.74 | 67.78 | 68.27 |
| User | 78.83 | 78.40 | 80.48 | 79.70 |
| Comments | 63.61 | 63.78 | 65.82 | 65.87 |
| SCMG-FND (ours) | 89.11 | 89.53 | 88.27 | 90.73 |
Table 4. Performance Comparison with Multimodal Methods.
| Model | Acc (%) | F1 (%) | Rec (%) | Pre (%) |
|---|---|---|---|---|
| MultiEMO | 82.05 | 81.87 | 82.30 | 82.58 |
| FANVM | 82.32 | 81.97 | 83.12 | 82.84 |
| SV-FEND-SNEED | 81.67 | 81.03 | 81.65 | 82.66 |
| SV-FEND | 81.69 | 81.78 | 84.63 | 81.92 |
| OpEvFake | 87.80 ± 0.05 | 87.71 ± 0.06 | 87.90 | 88.01 |
| SCMG-FND (ours) | 89.11 ± 0.04 | 89.53 ± 0.05 | 88.27 | 90.73 |

Note. p < 0.01 based on a paired t-test against OpEvFake.
Table 5. Ablation Study Results (Contribution of Different Modules).
| Model | Acc (%) | F1 (%) | Rec (%) | Pre (%) |
|---|---|---|---|---|
| w/o Transformer | 85.62 | 85.94 | 85.20 | 85.63 |
| w/o Capsule | 84.40 | 84.21 | 85.73 | 84.76 |
| w/o Enhance | 81.53 | 81.87 | 82.20 | 82.32 |
| w/o Prompt | 83.84 | 82.79 | 83.12 | 83.27 |
| w/o Neuro | 87.92 | 88.27 | 87.67 | 87.84 |
| w/o OpiEvo | 85.49 | 85.34 | 85.11 | 85.76 |
| w/o Mgcl | 86.80 | 86.76 | 86.15 | 86.33 |
| SCMG-FND (ours) | 89.11 | 89.53 | 88.27 | 90.73 |
Table 6. Sensitivity of SCMG-FND to the temperature coefficient τ .
| τ | Acc (%) | F1 (%) |
|---|---|---|
| 0.05 | 88.20 | 88.37 |
| 0.10 | 89.11 | 89.53 |
| 0.20 | 88.42 | 88.61 |
| 0.30 | 87.95 | 88.05 |
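The temperature τ in Table 6 scales the similarity logits of the contrastive objective; τ = 0.10 gives the best trade-off, with sharper (0.05) or softer (0.30) scaling degrading both metrics. A standard temperature-scaled InfoNCE loss, sketched below in NumPy as an illustration (the paper's multi-granularity objective applies it at several granularities and is not reproduced here), matches the i-th anchor embedding to its i-th positive view:

```python
# Illustrative temperature-scaled InfoNCE loss; tau = 0.10 was the
# best-performing setting in Table 6. Not the paper's actual code.
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, tau: float = 0.10) -> float:
    """anchors, positives: L2-normalized (N, d) embeddings, where
    positives[i] is the matching cross-modal view of anchors[i]."""
    sims = anchors @ positives.T / tau           # (N, N) similarity logits
    sims -= sims.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())      # cross-entropy on the diagonal
```

A smaller τ sharpens the softmax, concentrating the gradient on hard negatives; too small and training becomes unstable, consistent with the dip at τ = 0.05.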
Table 7. Performance Impact of Different Granularities of Contrastive Learning.
| Model | Acc (%) | F1 (%) | Rec (%) | Pre (%) |
|---|---|---|---|---|
| w/o Global | 87.47 | 87.94 | 88.20 | 87.63 |
| w/o Modal | 88.12 | 89.10 | 88.11 | 88.27 |
| w/o Temporal | 87.66 | 88.27 | 87.75 | 88.16 |
| w/o Spatial | 88.78 | 88.19 | 87.68 | 88.43 |
| SCMG-FND (ours) | 89.11 | 89.53 | 88.27 | 90.73 |
Table 8. Case Study Results of Explainability Module.
| Case | Analysis Results |
|---|---|
| True Case | Title: "Male driver violently beat female driver due to lane-change violation." Trust score: 0.88. Modality contributions: Text 55%, Video 25%, Audio 12%. Rule triggers: title-content consistency verified through internet search. |
| Fake Case | Title: "Male driver with road rage beat female driver for slow driving." Trust score: 0.86. Modality contributions: Text 48%, Video 32%, Audio 10%. Rule triggers: title-content inconsistency and internet search verification. |
Table 9. Representative Failure Cases in the Explainability Module.
| Case | Real Event Background | Video Cues | Prediction (Conf.) | Main Misjudgment Reasons |
|---|---|---|---|---|
| Case 1 | Real incident: a 16-year-old student accidentally fell at Taihe No. 2 High School, Anhui; police verified its authenticity. | Blurred slideshow photos, synthetic narration, no official watermark. | true → fake (0.32) | (1) Blurring resembled intentional tampering; (2) static slides failed to trigger temporal contrast, weakening cross-modal checks; (3) title keywords were not promptly cross-verified with authoritative sources. |
| Case 2 | Rumor: an alleged patriotic activity in a Xi'an community, later confirmed to be staged and unrelated to the cited outbreak. | Walls draped with flags, an "epidemic recovery" overlay, patriotic slogans, and cheering audio. | fake → true (0.78) | (1) Visual-text alignment biased the model toward authenticity; (2) search retrieved few reliable sources about the scene, leaving credibility unchecked; (3) emotional rhetoric dominated, and the plausibility module failed to flag the staged setting. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Y.; Shi, X.; Li, H.; Fan, B.; Xu, Y. Fake News Detection in Short Videos by Integrating Semantic Credibility and Multi-Granularity Contrastive Learning. Appl. Sci. 2025, 15, 12621. https://doi.org/10.3390/app152312621
