Article

GC2MFND: Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection

1 School of Transportation, Shandong University of Science and Technology, Qingdao 266590, China
2 School of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
3 Library, Shandong University of Science and Technology, Qingdao 266590, China
* Author to whom correspondence should be addressed.
Entropy 2026, 28(6), 672; https://doi.org/10.3390/e28060672 (registering DOI)
Submission received: 22 April 2026 / Revised: 7 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Multimodal fake news on social media platforms has permeated many fields, and multi-domain fake news detection has garnered significant attention in the academic community. Existing multi-domain methods primarily employ feature fusion techniques based on text–image alignment; they neglect the extraction of conflicting information across modalities and fail to address the domain-dependent nature of cross-modal feature conflicts. To address this, we propose GC2MFND, a Multi-Granularity Conflict and Domain-Guided Calibration model for Multimodal Fake News Detection. The model captures conflicting features through a domain-aware multi-granularity conflict extraction module and mitigates feature suppression through a domain-guided multimodal feature calibration module. Finally, it combines domain-adaptive aggregation with multi-view evidence integration to achieve robust decision-making under supervised contrastive learning constraints. Under known-domain conditions, the experimental results demonstrate that GC2MFND outperforms existing multi-domain baseline methods, achieving accuracies of 95.3%, 95.7%, and 81.2% on the Weibo, Weibo21, and FineFake datasets, respectively, which represent improvements of 1.1%, 1.2%, and 1.4% over the corresponding multi-domain baselines.

1. Introduction

The evolution of social media has expanded the pool of news publishers from professional media organizations to individual users [1]. While this has enriched the diversity and personalization of information dissemination, it has also accelerated the creation and spread of fake news [2]. Today, news content is evolving from single-text formats to multimodal forms that integrate text, images, video, and audio [3,4,5], which has increased the inflammatory and deceptive nature of fake news [6,7,8]. The proliferation of multimodal fake news on social media easily misleads public opinion, erodes social trust, and triggers multiple social crises [9]. Meanwhile, such fake news has already permeated numerous fields [10]. From an information-theoretic perspective, the influx of multi-domain multimodal data has significantly increased the information content and diversity of news. However, this also amplifies the inherent uncertainty, complexity, and information redundancy in the fake news detection task. Traditional manual verification methods struggle to provide efficient and accurate responses in the face of massive multi-domain multimodal news. Consequently, Multimodal Fake News Detection (MFND) in different domains has attracted widespread attention [11,12,13].
Multimodal multi-domain fake news detection methods primarily enhance detection performance by incorporating domain-specific information as auxiliary signals to learn both domain-general and domain-specific knowledge. For example, MMDFND [14] models domain-specific commonalities and peculiarities through an improved Domain Progressive Layered Extraction (DPLE) module, thereby further improving the performance of multimodal, multi-domain fake news detection. DAMMFND [15] accurately extracts domain information through domain decoupling and integrates it into a multi-view decision-making process to quantify the contributions of different modalities in detection. Multi-domain methods have optimized the incorporation of domain information into multi-view decision-making through mechanisms such as feature fusion and expert routing. However, they have overlooked semantic conflicts between features across modalities. Consequently, some studies have begun to focus on inter-modal conflicting features to enhance detection performance. Specifically, methods commonly employ co-attention, similarity, and anti-attention mechanisms to extract conflicting features. Furthermore, MIAN [16] and RaCMC [17] have validated the effectiveness of conflicting features for fake news detection. Although the aforementioned methods have achieved solid performance in multimodal fake news detection, two major issues remain:
(1) Domain dependence of cross-modal feature conflicts: In text–image conflict scenarios, the conflict pattern is not uniform across domains [18] but exhibits significant domain heterogeneity. Figure 1a shows news in the social domain, characterized by local text–image conflicts. In contrast, Figure 1b shows news in the disaster domain, characterized by conflicts between local text features and global image features. Figure 1c illustrates a global conflict at the scene level. Existing methods use a static modeling paradigm, which makes it hard to adaptively extract conflict features based on domain characteristics and therefore fails to fully capture latent contradictions in specific scenarios.
(2) Semantic shifts in cross-domain feature distributions: The semantic distributions of the same modality shift significantly across different domains [19,20], causing a decline in feature discriminability. The same vocabulary and image representations can have fundamentally different discriminative power across domains. As shown in Figure 1d, the keyword “virus” refers to biological pathogens in the medical domain but malicious code in the technology domain. A unified modeling approach that ignores such semantic drift loses domain-specific discriminative information, thus limiting the model’s cross-scenario generalization.
Figure 1. Examples of cross-modal conflict patterns and semantic shifts across different domains from the Weibo21 dataset, where the red boxes indicate conflicting semantics, while the green boxes indicate consistent semantics. (a) Local-Local conflict. (b) Local-Global conflict. (c) Global-Global conflict. (d) Domain semantic shift.
To address these challenges, we propose a Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection model (GC2MFND). Current models focus on enhancing complementary information between text and images. In contrast, GC2MFND treats domain embeddings as the hub of adaptive regulation and focuses on addressing the problem of underutilized modal conflicts. First, the model uses a domain-aware multi-granularity conflict extraction module. By dynamically adjusting the perception weights of local and global perspectives using domain embeddings, this module captures domain-specific conflict signals effectively. To address the issue of semantic drift, we have developed a domain-guided multimodal feature calibration module. By employing intra-modal adaptive calibration and domain-guided gated redundancy removal, we effectively reduce noise while achieving domain-adaptive semantic alignment. The model employs domain-adaptive aggregation, dynamically assigning optimal aggregation weights to conflict features and modalities based on domain characteristics to produce domain-adaptive conflict features and multimodal features. Finally, the model uses a multi-view evidence integration strategy. It fuses calibrated unimodal representations, multi-granularity conflict representations, and global semantics. This enables collaborative decision-making for complex evidence chains under domain-supervised contrastive constraints. Through this approach, GC2MFND mitigates detection biases caused by domain heterogeneity and enhances the utilization of inter-modal conflict features.
The main contributions in this study are summarized as follows:
(1) We propose the GC2MFND model that dynamically extracts modal features based on domain embeddings, integrates conflicting features with enhanced cross-modal representations, and aggregates evidence from multiple perspectives to verify the authenticity of news.
(2) We propose a domain-aware, multi-granularity conflict extraction mechanism to capture cross-modal inconsistencies at three levels: local–local, local–global, and global–global. Additionally, we achieve dynamic feature integration through a domain-adaptive aggregation framework.
(3) We construct a domain-guided feature calibration module to obtain domain-corrected features, employing multi-view evidence integration and domain contrastive learning constraints to form a complete collaborative reasoning evidence chain.

2. Related Works

With the continuous development of the field of news detection [21], related studies have progressively transitioned from early unimodal news detection to multimodal approaches, and with the increasing segmentation of news domains, multi-domain fake news detection has emerged as a focal point of investigation.

2.1. Multimodal Fake News Detection

Multimodal fake news detection generally aims to reduce the semantic gap between text and images, with a primary focus on feature fusion and network architecture design. Early studies [22] treated visual information as a supplement to textual content. For instance, EANN [23] employed a generative adversarial network to map bimodal features into a unified space for simple concatenation. Subsequently, a number of methods [24,25] adopted pre-trained models for feature extraction and performed early multimodal fusion through concatenation or vector operations. Considering the higher-level semantic relationships between images and text, SpotFake [26] used pre-trained models to extract features from images and text and classified fake news by concatenating these features. Later, the Masked Autoencoder (MAE) [27] was proposed; its masked-patch reconstruction pre-training improves local feature extraction and captures subtle local manipulation traces and micro-level semantic anomalies in images more sensitively than conventional CNNs, thereby providing visual support for fine-grained conflict mining. Radford et al. [28] introduced the Contrastive Language-Image Pre-training (CLIP) model, which constructs an aligned text–image shared semantic space through large-scale contrastive learning and enhances global feature extraction. Building on this foundation, Liu et al. [29] proposed the interactive mixture-of-experts framework MIMoE-FND, which explicitly models the degree of semantic alignment and unimodal consistency while employing gating mechanisms for adaptive feature fusion. Since simple fusion methods struggle to identify conflicts such as “text–image irrelevance” or “text–image contradiction,” research focus has shifted toward cross-modal inconsistency mining. CAFE [30] was the first to quantify the degree of modality conflict using KL divergence and dynamically adjust fusion weights accordingly.
RaCMC [17] leveraged knowledge distillation to maximize modal interaction information for detecting anomalous image–text relationships. TLFND [31] extracted text–image conflicts at multiple levels, including local and global levels as well as intra-modal and inter-modal levels, through a three-level feature matching distance mechanism. However, most existing studies adopt a uniform cross-modal interaction mechanism that does not fully account for feature variations across different news domains, thereby limiting model performance in multi-domain scenarios [32].

2.2. Multi-Domain Fake News Detection

Real-world news data spans numerous domains and is highly heterogeneous. Achieving domain-adaptive detection is the goal of multi-domain fake news detection [33,34,35,36]. KATMF [37] was the first to combine multi-domain and multimodal approaches for fake news detection, utilizing adversarial multi-task learning and knowledge-enhanced Transformers to capture differences in the feature distributions of news articles across different domains. EMT [38] improved generalization by extracting both domain-specific and domain-invariant features and by incorporating external knowledge. To address domain distribution shifts, Zhang et al. [39] and Li et al. [40] utilized Bidirectional Encoder Representations from Transformers (BERT) and transfer learning to mitigate cross-domain discrepancies. To model the specific distributions of different domains more precisely, Mixture of Experts (MoE) and graph-structured learning have emerged as mainstream approaches in recent years. To address domain data imbalance, MDFEND [10] employed an MoE architecture and a domain gating mechanism to dynamically integrate expert representations. M3DFEND [41] and MMDFND [14] further enhanced multi-view adaptive aggregation through domain adapters and an improved DPLE module, respectively. Zhao et al. [42] utilized a mixture-of-experts network and a gating mechanism to address feature distribution discrepancies in multi-domain fake news detection. Yuan et al. [43] modeled cross-domain relationships among news events with a graph attention network, using structured information to aid identification. Recently, Lu et al. [15] proposed DAMMFND, which introduced feature decoupling: by separating domain features from semantic features and combining them with a domain-aware decision mechanism, it achieved a deeper analysis of domain heterogeneity. Xu et al. [44] proposed DATTAMM, which employs a domain-aware test-time adaptation mechanism to dynamically adjust model parameters during inference and thereby accommodate the feature distribution of the target domain. However, hard decoupling may disrupt the semantic flow between text and images and thus filter out subtle forgery cues. Therefore, the key to improving multi-domain multimodal fake news detection is to fully leverage cross-modal conflict information and discriminative features, guided by domain-specific knowledge, without compromising core semantic meaning.

3. Methods

In this section, we introduce the proposed GC2MFND, whose overall architecture is shown in Figure 2. Given a news sample containing text, images, and a domain label, our method first performs multi-view feature encoding and domain embedding generation (Section 3.1). The model then comprises three core modules: a module for mining cross-modal conflicts across local-to-global scales (Section 3.2), a module for domain-guided dynamic feature calibration and redundancy removal (Section 3.3), and a domain-adaptive feature aggregation module (Section 3.4). Multi-view evidence integration and the training losses are described in Section 3.5.
Each input multimodal news sample is represented as $N = [T, I, D_d] \in \mathcal{D}$, where $T$, $I$, $D_d$, and $\mathcal{D}$ denote the text content, image content, domain label, and dataset, respectively. The news items in the entire dataset are classified into $k$ domains, with each domain assigned a label $D_d \in \{D_1, D_2, D_3, \ldots, D_k\}$. The objective of multimodal domain-adaptive fake news detection is as follows: given multimodal content comprising text $T$ and image $I$, and using the explicit domain label $D_d$ as prior knowledge, the model determines the authenticity of the news item via a domain-adaptive mechanism. The main symbols used in this method and their meanings are shown in Table 1.

3.1. Multi-Granularity Feature Extraction and Domain Representation Module

To fully mine the discriminative features of each modality and enhance representation capability, we employ a dual-granularity feature encoding module that simultaneously extracts fine-grained local features and coarse-grained global features from the text and image modalities, capturing semantic information at different levels.

3.1.1. Fine-Grained Local Feature Extraction

Given a text $T$, we use the pre-trained BERT model [45] as a text encoder to obtain fine-grained local features of the text, denoted as $T_{local} \in \mathbb{R}^{L \times d_t}$, where $L$ is the length of the text sequence and $d_t$ is the dimension of the text features. Meanwhile, given an image $I$, we use the MAE [27] model to extract patch features as fine-grained local image features, denoted as $I_{local} \in \mathbb{R}^{P \times d_i}$, where $P$ is the number of image patches and $d_i$ is the image feature dimension. To uniformly compress and align text–image features from the high-dimensional pre-training space, we define LocalNET as a local feature adapter, employing a multi-scale one-dimensional convolutional extractor [46]. We can obtain the enhanced local text features $\tilde{T}_{local} = \mathrm{LocalNET}_{text}(T_{local}) \in \mathbb{R}^{L \times d}$ and the enhanced local image features $\tilde{I}_{local} = \mathrm{LocalNET}_{img}(I_{local}) \in \mathbb{R}^{P \times d}$ using the method described above.
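A minimal NumPy sketch of a LocalNET-style adapter makes the multi-scale one-dimensional convolution concrete. The kernel widths (1, 3, 5), the zero "same" padding, the ReLU, the averaging of branches, the feature widths, and the function names `conv1d_same` and `local_net` are illustrative assumptions, not the configuration used for GC2MFND.

```python
import numpy as np


def conv1d_same(x, w):
    """1-D convolution along the sequence axis with zero 'same' padding.

    x: (L, d_in) token or patch features; w: (k, d_in, d_out) kernel with odd k.
    Returns an (L, d_out) array (cross-correlation, as in deep learning libraries).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the sequence axis only
    L = x.shape[0]
    out = np.zeros((L, w.shape[2]))
    for j in range(k):  # accumulate one kernel tap at a time
        out += xp[j:j + L] @ w[j]
    return out


def local_net(x, kernels):
    """Hypothetical LocalNET: average of ReLU(conv_k(x)) over several kernel widths."""
    branches = [np.maximum(conv1d_same(x, w), 0.0) for w in kernels]
    return np.mean(branches, axis=0)


rng = np.random.default_rng(0)
L, d_t, d = 12, 768, 256  # sequence length, BERT width, shared width (assumed values)
T_local = rng.normal(size=(L, d_t))
kernels = [rng.normal(scale=0.02, size=(k, d_t, d)) for k in (1, 3, 5)]
T_tilde_local = local_net(T_local, kernels)
print(T_tilde_local.shape)  # (12, 256)
```

The same function applies to MAE patch features by replacing the sequence length with the number of patches $P$ and $d_t$ with $d_i$.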

3.1.2. Coarse-Grained Global Feature Extraction

Global features are designed to provide coarse-grained, macro-level semantic information. We utilize a pre-trained CLIP model [28] to extract global features for both images and text. The text content is processed by CLIP’s text encoder to obtain global text features $T_{global} \in \mathbb{R}^{d_g}$ that represent the overall semantic meaning. Likewise, the image content is encoded by CLIP’s image encoder to obtain global image features $I_{global} \in \mathbb{R}^{d_g}$ that encapsulate high-level visual semantics, where $d_g$ is the dimension of the global features. To place these global features in the same metric space as the local features above, we define GlobalNET as a global feature adapter, which applies a linear projection followed by layer normalization to map them into the unified $d$-dimensional semantic space. Thus, we obtain the enhanced global text features $\tilde{T}_{global} = \mathrm{GlobalNET}_{text}(T_{global}) \in \mathbb{R}^{d}$ and the enhanced global image features $\tilde{I}_{global} = \mathrm{GlobalNET}_{img}(I_{global}) \in \mathbb{R}^{d}$.
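The GlobalNET adapter can be sketched in NumPy as an affine projection followed by layer normalization. The widths ($d_g = 512$, $d = 256$), the initialization, and the function names are hypothetical; only the projection-plus-normalization structure comes from the description above.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last (feature) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


def global_net(x, W, b, gamma, beta):
    """Hypothetical GlobalNET: linear projection d_g -> d, then LayerNorm."""
    return layer_norm(x @ W + b, gamma, beta)


rng = np.random.default_rng(0)
d_g, d = 512, 256  # CLIP embedding width and shared width (assumed values)
T_global = rng.normal(size=d_g)
W, b = rng.normal(scale=0.02, size=(d_g, d)), np.zeros(d)
T_tilde_global = global_net(T_global, W, b, np.ones(d), np.zeros(d))
print(T_tilde_global.shape)  # (256,)
```

With unit scale and zero shift, the output has approximately zero mean and unit standard deviation across its $d$ channels, matching the scale convention of the normalized local branch.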

3.1.3. Domain Embeddings

Given the substantial statistical heterogeneity in fabrication patterns and content distribution across various news domains, we introduce learnable domain embeddings to enable the model to capture domain-specific characteristics. Domain labels are fed into the embedding layer to produce domain embedding vectors $e_d = E_{dom}[d, :] \in \mathbb{R}^{d_d}$, where $E_{dom} \in \mathbb{R}^{N \times d_d}$ and $N$ denotes the number of domains.

3.2. Domain-Aware Multi-Granularity Conflict Extraction Module

In multimodal fake news detection, semantic conflicts between text and images serve as key clues for identifying fake news. To better capture and utilize conflict features across various domains, we propose a domain-aware, multi-granularity conflict extraction module, as shown in Figure 3. Conflict features are extracted from three perspectives: “local–local,” “local–global,” and “global–global.” The contribution of these features is adaptively adjusted under the influence of the domain embedding e d .
To address the discrepancies in dimensionality and information density between local and global features, we design two asymmetric cross-modal interaction operators [30] to extract conflicting information across modalities at different granularities [18]. For local features, we employ a parameter-free, heuristic element-level operator $F_{seq}$ to amplify anomalous deviation signals. It is defined as:
$$F_{seq}(X, Y) = |X - Y| + X \odot Y$$
where X and Y denote the image and text features, respectively; the absolute difference term quantifies the numerical deviation between the two features, serving to capture fine-grained semantic contradictions; and the product term measures the co-occurrence patterns of the two features in the feature space.
For global features, traditional cosine metrics compress high-dimensional information into a single scalar and thereby lose nonlinear conflict patterns. To mitigate this, a multidimensional heuristic interactive mapping operator $H$ is employed to extract robust macroscopic conflict representations while preserving the modal context. It is defined as:
$$H(X, Y) = \mathrm{SiLU}\Big(\mathrm{BN}\big(W_h\,[\,X \oplus Y \oplus (X - Y) \oplus (X \odot Y)\,]\big)\Big)$$
where $\oplus$ denotes concatenation, $W_h$ is a learnable projection matrix, and BN denotes batch normalization.
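Both conflict operators reduce to a few array operations. The NumPy sketch below assumes that $H$ concatenates $X$, $Y$, $X - Y$, and $X \odot Y$ before the projection, uses training-mode batch normalization without affine parameters, and applies $W_h$ on the right to row vectors; these choices and the names `f_seq` and `h_op` are assumptions for illustration.

```python
import numpy as np


def f_seq(X, Y):
    """Parameter-free element-level conflict operator |X - Y| + X * Y."""
    return np.abs(X - Y) + X * Y


def silu(z):
    return z / (1.0 + np.exp(-z))


def batch_norm(z, eps=1e-5):
    """Training-mode batch normalization over axis 0, without affine parameters."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)


def h_op(X, Y, W_h):
    """Heuristic mapping SiLU(BN([X, Y, X - Y, X * Y] @ W_h)) for a batch of vectors.

    X, Y: (B, d); W_h: (4 * d, d_out). The concatenation layout is an assumption.
    """
    Z = np.concatenate([X, Y, X - Y, X * Y], axis=-1)
    return silu(batch_norm(Z @ W_h))


print(f_seq(np.array([1.0, 2.0]), np.array([1.0, -1.0])))  # [1. 1.]

rng = np.random.default_rng(0)
B, d = 8, 16
X, Y = rng.normal(size=(B, d)), rng.normal(size=(B, d))
F_gg = h_op(X, Y, rng.normal(scale=0.1, size=(4 * d, d)))
print(F_gg.shape)  # (8, 16)
```

Note that $F_{seq}$ is symmetric in its arguments, so the order of text and image inputs does not matter, whereas $H$ keeps the $X - Y$ term and is therefore order-sensitive.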
Local–Local view: The local–local conflict feature aims to capture fine-grained inconsistencies between text words and local image patches. First, the local text features and local image features are $L_2$-normalized (i.e., Euclidean normalization), after which a similarity matrix smoothed by a learnable temperature coefficient $\tau$ is computed:
$$S_{ll} = \frac{\tilde{T}_{local}\,\tilde{I}_{local}^{\top}}{\tau}$$
Then, after applying a Softmax function to $S_{ll} \in \mathbb{R}^{L \times P}$ along the image patch dimension and using the result to weight the local image features, we obtain a text-aligned image sequence $I_a = \mathrm{Softmax}(S_{ll})\,\tilde{I}_{local}$. Based on the aligned features, we employ the lightweight sequence conflict operator to obtain the local–local conflict sequence $C_{ll} = F_{seq}(I_a, \tilde{T}_{local})$. Finally, through a text-mask-aware attention pooling layer, we derive the local conflict features $F_{ll}$:
$$F_{ll} = u_{ll}^{\top} C_{ll}, \quad u_{ll} = \mathrm{softmax}\big(\phi(C_{ll})\,W_u\big)$$
where $W_u$ is a trainable parameter and $\phi(\cdot)$ denotes a non-linear mapping function.
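The local–local view can be sketched end to end for a single sample: $L_2$ normalization, temperature-scaled similarity, row-wise Softmax alignment, the $F_{seq}$ operator, and masked attention pooling. This NumPy sketch assumes a fixed $\tau$ (learnable in the model), realizes $\phi(C_{ll})W_u$ as $\tanh(C_{ll} W_{\phi})\,w_u$, and applies $F_{seq}$ to the adapted text features; the scorer form and all names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def f_seq(X, Y):
    return np.abs(X - Y) + X * Y


def local_local_conflict(T_loc, I_loc, text_mask, W_phi, w_u, tau=0.07):
    """Local-local conflict feature F_ll for one sample.

    T_loc: (L, d) text tokens; I_loc: (P, d) image patches;
    text_mask: (L,) bool, True for real tokens (at least one must be True);
    W_phi: (d, d_h), w_u: (d_h,) form the assumed scorer tanh(C @ W_phi) @ w_u.
    """
    S = l2_normalize(T_loc) @ l2_normalize(I_loc).T / tau  # (L, P) similarity
    I_a = softmax(S, axis=1) @ I_loc  # (L, d) text-aligned image sequence
    C = f_seq(I_a, T_loc)  # (L, d) local-local conflict sequence
    scores = np.tanh(C @ W_phi) @ w_u  # (L,) attention scores
    u = softmax(np.where(text_mask, scores, -np.inf))  # padding gets zero weight
    return u @ C  # (d,) F_ll = u^T C


rng = np.random.default_rng(0)
L, P, d, d_h = 10, 49, 32, 16
T_loc, I_loc = rng.normal(size=(L, d)), rng.normal(size=(P, d))
mask = np.arange(L) < 7  # the last three positions are padding
W_phi, w_u = rng.normal(size=(d, d_h)), rng.normal(size=d_h)
F_ll = local_local_conflict(T_loc, I_loc, mask, W_phi, w_u)
print(F_ll.shape)  # (32,)
```

Because padded positions receive $-\infty$ scores, their pooling weights are exactly zero, so changing the content of padding tokens leaves $F_{ll}$ unchanged.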
Local–Global view: Local–global conflict features capture the semantic discrepancy between fine-grained local elements and the overall global context, addressing the semantic misalignment between local and global elements across different modalities. Specifically, we use broadcasting to expand each global feature vector along the sequence dimension of the other modality’s local features, so that every local element is compared directly with the corresponding cross-modal global feature. We then apply the lightweight sequence conflict operator, which produces the “local text–global image” conflict sequence $C_{lg}^{t} = F_{seq}(\tilde{T}_{local}, \tilde{I}_{global})$ and the “local image–global text” conflict sequence $C_{lg}^{i} = F_{seq}(\tilde{I}_{local}, \tilde{T}_{global})$. Using the attention pooling layer, we obtain the local text–global image conflict features $F_{lg}^{t}$ and the local image–global text conflict features $F_{lg}^{i}$:
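$$F_{lg}^{t} = (u_{lg}^{t})^{\top} C_{lg}^{t}, \quad u_{lg}^{t} = \mathrm{softmax}\big(\phi(C_{lg}^{t})\,W_{lg}^{t}\big)$$
$$F_{lg}^{i} = (u_{lg}^{i})^{\top} C_{lg}^{i}, \quad u_{lg}^{i} = \mathrm{softmax}\big(\phi(C_{lg}^{i})\,W_{lg}^{i}\big)$$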
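In array libraries, the broadcast comparison is implicit: a $(d,)$ global vector is repeated for every row of an $(L, d)$ or $(P, d)$ local sequence. The NumPy sketch below shows both directions, assuming a hypothetical scorer $\tanh(C W_{\phi})\,w$ for the attention pooling and omitting the text mask for brevity; names and sizes are illustrative.

```python
import numpy as np


def f_seq(X, Y):
    return np.abs(X - Y) + X * Y


def attn_pool(C, W_phi, w):
    """Attention pooling u^T C with u = softmax(tanh(C @ W_phi) @ w) (assumed scorer)."""
    s = np.tanh(C @ W_phi) @ w
    u = np.exp(s - s.max())
    return (u / u.sum()) @ C


rng = np.random.default_rng(1)
L, P, d, d_h = 10, 49, 32, 16
T_loc, I_loc = rng.normal(size=(L, d)), rng.normal(size=(P, d))
T_glb, I_glb = rng.normal(size=d), rng.normal(size=d)

# NumPy broadcasting repeats the (d,) global vector across every row of the local sequence.
C_lg_t = f_seq(T_loc, I_glb)  # (L, d): local text vs. global image
C_lg_i = f_seq(I_loc, T_glb)  # (P, d): local image vs. global text
F_lg_t = attn_pool(C_lg_t, rng.normal(size=(d, d_h)), rng.normal(size=d_h))
F_lg_i = attn_pool(C_lg_i, rng.normal(size=(d, d_h)), rng.normal(size=d_h))
print(F_lg_t.shape, F_lg_i.shape)  # (32,) (32,)
```

Broadcasting avoids materializing the tiled global matrix, yet produces exactly the same conflict sequence as explicit tiling.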
Global–Global view: Global–global conflict features capture the overall semantic inconsistency between the text and the image. To mitigate the loss of multidimensional contradictory information and complex nonlinear patterns, we employ the interactive mapping operator to extract a high-dimensional global conflict representation $F_{gg} = H(\tilde{T}_{global}, \tilde{I}_{global})$. This retains a rich representation of global cross-modal conflicts together with the topological structure of the high-dimensional space.
Since multi-granularity conflict patterns vary across domains, we use the domain-adaptive feature aggregation module described in Section 3.4 to fuse these conflict features in a domain-adaptive manner. Through this module, we generate weight vectors $w_{ll}, w_{lg}^{t}, w_{lg}^{i}, w_{gg} \in \mathbb{R}^{d}$ that represent the importance of the LL, LG$^{t}$, LG$^{i}$, and GG conflicts in the current domain, thereby obtaining the domain-aware conflict feature $F_C$:
$$F_C = w_{ll} \odot F_{ll} + w_{lg}^{t} \odot F_{lg}^{t} + w_{lg}^{i} \odot F_{lg}^{i} + w_{gg} \odot F_{gg}$$

3.3. Domain-Guided Multimodal Feature Calibration Module

To mitigate semantic distribution discrepancies across domains and enhance the domain adaptability of features, we propose a domain-guided feature calibration module, whose overall structure is illustrated in Figure 4. This module takes domain embeddings $e_d$ as prior conditioning information and conducts stepwise calibration and enhancement of local text and image features via three steps: Conditional Modulation, Domain-guided Gated Redundancy Removal (DGR), and Global Semantic Compensation.
To achieve domain-aware feature calibration and preserve general semantic information during cross-domain alignment, we introduce a residual-based linear modulation mechanism, termed Res-FiLM. In contrast to direct fusion of domain labels, we use the domain embedding $e_d$ as a prior condition to generate, via independent affine transformations, dynamic scaling factors $s^m$ and offsets $sh^m$ for the current input sample. We then use these generated parameters to adaptively calibrate the local text features $\tilde{T}_{local}$ and local image features $\tilde{I}_{local}$, yielding the modulated features $T_{local}^{mod}$ and $I_{local}^{mod}$. The formula is as follows:
$$s^{m} = \sigma(W_{scale}^{m} e_d) \in \mathbb{R}^{d}, \quad sh^{m} = W_{shift}^{m} e_d \in \mathbb{R}^{d}$$
$$T_{local}^{mod} = \tilde{T}_{local} + \tilde{T}_{local} \odot s^{t} + sh^{t}, \quad I_{local}^{mod} = \tilde{I}_{local} + \tilde{I}_{local} \odot s^{i} + sh^{i}$$
where $\sigma(\cdot)$ represents the sigmoid activation function; $m \in \{t, i\}$ denotes the text ($t$) or image ($i$) modality; $W_{scale}^{m}$ and $W_{shift}^{m}$ are learnable projection matrices for the corresponding modalities; and $\odot$ denotes element-wise multiplication.
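Res-FiLM reduces to a few array operations. This NumPy sketch follows the Res-FiLM equations with bias-free projections; the shapes and the name `res_film` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def res_film(X, e_d, W_scale, W_shift):
    """Residual FiLM: X + X * s + sh, with s and sh generated from the domain embedding.

    X: (L, d) local features; e_d: (d_d,) domain embedding;
    W_scale, W_shift: (d, d_d) modality-specific projections (bias-free, as in the formula).
    """
    s = sigmoid(W_scale @ e_d)  # (d,) scaling factors in (0, 1)
    sh = W_shift @ e_d  # (d,) offsets
    return X + X * s + sh  # broadcast over the L rows; the residual keeps X intact


rng = np.random.default_rng(0)
L, d, d_d = 5, 8, 4
X = rng.normal(size=(L, d))
e_d = rng.normal(size=d_d)
X_mod = res_film(X, e_d, rng.normal(size=(d, d_d)), rng.normal(size=(d, d_d)))
print(X_mod.shape)  # (5, 8)
```

Because $s^m \in (0, 1)$, each channel is multiplied by $1 + s^m$, a factor between 1 and 2, before the offset is added; the residual form can rescale the original features but never cancel them.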
Since the definition of “redundancy” varies between domains, we employ DGR to enhance the discriminative power of text–image features. To obtain cross-modal redundant representations, we first employ the domain embedding $e_d$ to dynamically modulate the query and key matrices:
$$D_Q = \sigma(W_q^{D} e_d), \quad D_K = \sigma(W_k^{D} e_d)$$
Taking text features as an example, we use the modulated matrices to compute the domain-guided cross-modal co-occurrence attention matrix $A_{ti}$, which is then aggregated to form the text redundancy representation $R_{ti}$:
$$A_{ti} = \mathrm{Softmax}\left(\frac{(T_{local}^{mod} \odot D_Q)\,(I_{local}^{mod} \odot D_K)^{\top}}{\sqrt{d}}\right)$$
$$R_{ti} = A_{ti}\, I_{local}^{mod}$$
Similarly, we compute the image attention matrix $A_{it}$ and extract the image-to-text redundant representation $R_{it}$.
Next, we employ nonlinear adaptive subtractive gating to obtain clean features. A learnable gating network produces an adaptive gate $g_t$ that determines how much of the redundant information to filter out; the gated redundancy is then subtracted from the modulated features to obtain the purified local text features $\hat{T}_{local}$:
$$g_t = \sigma\big(W_g^{t}\,[T_{local}^{mod}, R_{ti}]\big), \quad \hat{T}_{local} = T_{local}^{mod} - g_t \odot R_{ti}$$
where $[\cdot, \cdot]$ denotes concatenation. Similarly, we obtain the purified local image features $\hat{I}_{local}$.
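The DGR step for the text branch can be sketched in NumPy as follows. It assumes a single-layer gate $W_g^t$ applied to the concatenation of $T_{local}^{mod}$ and $R_{ti}$ and treats $D_Q$ and $D_K$ as per-channel modulation vectors; the shapes and function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def dgr_text(T_mod, I_mod, e_d, W_q, W_k, W_g):
    """Domain-guided gated redundancy removal for the text branch.

    T_mod: (L, d); I_mod: (P, d); e_d: (d_d,); W_q, W_k: (d, d_d);
    W_g: (2 * d, d) single-layer gate (an assumption; the gate could be deeper).
    """
    d = T_mod.shape[1]
    D_Q = sigmoid(W_q @ e_d)  # (d,) domain-conditioned query modulation
    D_K = sigmoid(W_k @ e_d)  # (d,) domain-conditioned key modulation
    A_ti = softmax_rows((T_mod * D_Q) @ (I_mod * D_K).T / np.sqrt(d))  # (L, P)
    R_ti = A_ti @ I_mod  # (L, d) text redundancy representation
    g_t = sigmoid(np.concatenate([T_mod, R_ti], axis=1) @ W_g)  # (L, d) gate in (0, 1)
    return T_mod - g_t * R_ti  # purified local text features


rng = np.random.default_rng(0)
L, P, d, d_d = 6, 9, 8, 4
T_mod, I_mod, e_d = rng.normal(size=(L, d)), rng.normal(size=(P, d)), rng.normal(size=d_d)
W_q, W_k = rng.normal(size=(d, d_d)), rng.normal(size=(d, d_d))
W_g = rng.normal(scale=0.1, size=(2 * d, d))
T_hat = dgr_text(T_mod, I_mod, e_d, W_q, W_k, W_g)
print(T_hat.shape)  # (6, 8)
```

The image branch can reuse `dgr_text` with the two feature arguments swapped and its own weights. If the image features carry no signal (all zeros), the redundancy term vanishes and the text features pass through unchanged.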
Although subtraction-based decoupling highlights micro-level cues, excessive orthogonalization may weaken macro-level text coherence and global contextual image dependencies. To compensate for this potential loss of semantic information, we introduce the global anchor features $\tilde{T}_{global}$ and $\tilde{I}_{global}$ to restore semantic integrity. This yields domain-adaptive and discriminative calibrated features, $T_{calib}$ for text and $I_{calib}$ for images, which serve as the unimodal features for the subsequent module. A padding mask is applied to the text features so that placeholder tokens are excluded from pooling, as shown in the following formulas:
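$$T_{calib} = \mathrm{SiLU}\Big(\mathrm{BN}\big(W_{out}^{t}\,[\,F_{attn}(\hat{T}_{local}, Mask) \oplus \tilde{T}_{global}\,]\big)\Big)$$
$$I_{calib} = \mathrm{SiLU}\Big(\mathrm{BN}\big(W_{out}^{i}\,[\,F_{attn}(\hat{I}_{local}) \oplus \tilde{I}_{global}\,]\big)\Big)$$
where $F_{attn}(\cdot)$ denotes attention pooling and $\oplus$ denotes concatenation.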
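Global semantic compensation for the text branch can be sketched as masked attention pooling, concatenation with the global anchor, a projection, batch normalization, and SiLU. This NumPy sketch assumes that the anchor is concatenated with the pooled features and that $F_{attn}$ uses a single scoring vector; both choices and all names are illustrative.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def batch_norm(z, eps=1e-5):
    """Training-mode batch normalization over axis 0, without affine parameters."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)


def masked_attn_pool(X, w, mask):
    """X: (B, L, d); w: (d,) scoring vector; mask: (B, L) bool. Returns (B, d)."""
    s = np.where(mask, X @ w, -np.inf)  # padding positions get -inf scores
    s = s - s.max(axis=1, keepdims=True)
    a = np.exp(s)
    a = a / a.sum(axis=1, keepdims=True)
    return np.einsum("bl,bld->bd", a, X)


def calibrate_text(T_hat, mask, T_glb, w_attn, W_out):
    """T_calib = SiLU(BN([pooled, T_glb] @ W_out)); concatenation order is assumed."""
    pooled = masked_attn_pool(T_hat, w_attn, mask)  # (B, d)
    z = np.concatenate([pooled, T_glb], axis=1) @ W_out  # (B, d)
    return silu(batch_norm(z))


rng = np.random.default_rng(0)
B, L, d = 4, 6, 8
T_hat, T_glb = rng.normal(size=(B, L, d)), rng.normal(size=(B, d))
mask = np.ones((B, L), dtype=bool)
mask[:, 4:] = False  # the last two tokens of every sample are padding
w_attn, W_out = rng.normal(size=d), rng.normal(scale=0.1, size=(2 * d, d))
T_calib = calibrate_text(T_hat, mask, T_glb, w_attn, W_out)
print(T_calib.shape)  # (4, 8)
```

The image branch follows the same pattern without a mask, since every patch carries content.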

3.4. Domain-Adaptive Feature Aggregation Module

In this module, we design a dynamic weight generation network to achieve domain-adaptive feature fusion. Given a domain embedding vector $e_d$, the gated network learns the mapping relationship between domain attributes and feature discriminative power through non-linear projection. Specifically, for each feature branch $v \in \{1, 2, \ldots, n\}$, it dynamically generates a corresponding feature-wise weight vector $w_v \in \mathbb{R}^{d}$:
$$w_v = \sigma\big(\mathrm{Linear}_v(e_d)\big)$$
$$F_j = w_1 \odot F_1 + w_2 \odot F_2 + \cdots + w_n \odot F_n$$
This module is used twice in the model. The first use occurs in the multi-granularity conflict extraction stage, where the weights generated as described above dynamically fuse the most domain-representative conflict signals to obtain the conflict feature $F_C$. The second use occurs in the adaptive aggregation of integrated features: the calibrated text features $T_{calib}$, image features $I_{calib}$, and conflict features $F_C$ from the preceding modules are fused in a domain-adaptive manner. This process likewise generates weight vectors $w_t, w_i, w_c \in \mathbb{R}^{d}$ to obtain the information-rich fused multimodal features $F_m$:
$$F_m = w_t \odot T_{calib} + w_i \odot I_{calib} + w_c \odot F_C$$
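The aggregation module can be sketched as a loop over branches with one sigmoid gate per branch. This NumPy sketch shows both uses described above with randomly initialized gates; the function name `domain_adaptive_aggregate`, the helper `make_linears`, and all sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def domain_adaptive_aggregate(features, e_d, weights, biases):
    """Sum over branches of sigmoid(Linear_v(e_d)) * F_v.

    features: list of n arrays of shape (d,); e_d: (d_d,) domain embedding;
    weights: list of n (d, d_d) matrices; biases: list of n (d,) vectors.
    """
    out = np.zeros_like(features[0])
    for F_v, W_v, b_v in zip(features, weights, biases):
        w_v = sigmoid(W_v @ e_d + b_v)  # (d,) feature-wise weight for branch v
        out += w_v * F_v
    return out


rng = np.random.default_rng(0)
d, d_d = 16, 4
e_d = rng.normal(size=d_d)


def make_linears(n):
    """Hypothetical helper: n random projections with zero biases."""
    return ([rng.normal(size=(d, d_d)) for _ in range(n)],
            [np.zeros(d) for _ in range(n)])


# First use: fuse the four conflict views (LL, LG^t, LG^i, GG) into F_C.
conflicts = [rng.normal(size=d) for _ in range(4)]
F_C = domain_adaptive_aggregate(conflicts, e_d, *make_linears(4))
# Second use: fuse T_calib, I_calib, and F_C into F_m with separate gates.
T_calib, I_calib = rng.normal(size=d), rng.normal(size=d)
F_m = domain_adaptive_aggregate([T_calib, I_calib, F_C], e_d, *make_linears(3))
print(F_C.shape, F_m.shape)  # (16,) (16,)
```

Because each gate is an independent sigmoid, the branch weights are not constrained to sum to one; a softmax across branches would be a different design from the one described here.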

3.5. Multi-View Evidence Integration and Loss Functions

After obtaining the calibrated text features $T_{calib}$, image features $I_{calib}$, multi-granularity conflict features $F_C$, and fused multimodal features $F_m$ from the above modules, we adopt a multi-view evidence integration strategy that concatenates these features together with the global anchor features $\tilde{T}_{global}$ and $\tilde{I}_{global}$ and feeds them into a deep fusion network. The network generates the final classification features $F_f$, which are then fed into the classifier:
$$F_f = \mathrm{MLP}\big(\mathrm{concat}(T_{calib}, I_{calib}, \tilde{T}_{global}, \tilde{I}_{global}, F_C, F_m)\big)$$
In multi-domain joint training, fake news exhibits semantic heterogeneity across domains. Traditional supervised contrastive learning [47] does not distinguish domain boundaries and therefore tends to pull all samples with the same label into a single tight cluster in the feature space. To mitigate feature interference between domains, we introduce a domain-aware soft-weighted contrastive loss $\mathcal{L}_{DSC}$ that pulls same-class samples together with full weight when they share a domain and with a reduced weight when they do not, while pushing samples of different classes apart. We define the domain-aware positive sample mask $M_{i,j}^{pos}$ and negative sample mask $M_{i,j}^{neg}$ as follows:
$$M_{i,j}^{pos} = \begin{cases} 1, & y_i = y_j,\ D_i = D_j,\ i \neq j \\ \theta, & y_i = y_j,\ D_i \neq D_j,\ i \neq j \\ 0, & \text{otherwise} \end{cases} \qquad M_{i,j}^{neg} = \begin{cases} 1, & y_i \neq y_j \\ 0, & \text{otherwise} \end{cases}$$
$$\mathcal{L}_{DSC} = -\frac{1}{W_{pos}} \sum_{i=1}^{N} \sum_{j=1}^{N} M_{i,j}^{pos} \log \frac{\exp(F_i \cdot F_j / \tau)}{\sum_{k=1}^{N} M_{i,k}^{all} \exp(F_i \cdot F_k / \tau)}$$
where $\tau$ denotes the temperature coefficient, $\theta \in (0, 1)$ is a hyperparameter controlling the strength of cross-domain positive sample alignment, $W_{pos} = \sum_{i=1}^{N}\sum_{j=1}^{N} M_{i,j}^{pos}$, $M_{i,j}^{all} = M_{i,j}^{pos} + M_{i,j}^{neg}$, and $F_i$ and $F_j$ are the final classification features.
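A NumPy sketch of $\mathcal{L}_{DSC}$ for one batch builds $M^{pos}$, $M^{neg}$, and $M^{all}$ from the label and domain vectors and evaluates the masked log-ratio. L2-normalizing the features beforehand, the defaults $\theta = 0.5$ and $\tau = 0.1$, and the function name are assumptions. The toy batch compares features in which the classes are separated against features in which they are mixed.

```python
import numpy as np


def dsc_loss(F, y, dom, theta=0.5, tau=0.1):
    """Domain-aware soft-weighted supervised contrastive loss for one batch.

    F: (N, d) features (L2-normalized below, an assumption); y: (N,) labels;
    dom: (N,) domain ids; theta in (0, 1) weights cross-domain positives.
    """
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    N = F.shape[0]
    not_self = ~np.eye(N, dtype=bool)
    same_y = y[:, None] == y[None, :]
    same_d = dom[:, None] == dom[None, :]
    M_pos = np.where(same_y & same_d & not_self, 1.0,
                     np.where(same_y & ~same_d & not_self, theta, 0.0))
    M_neg = (~same_y).astype(float)
    M_all = M_pos + M_neg  # self-pairs get weight 0 in the denominator
    logits = F @ F.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # per-row shift cancels in the ratio
    log_denom = np.log((M_all * np.exp(logits)).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    return -(M_pos * log_prob).sum() / M_pos.sum()


y = np.array([0, 0, 1, 1])
dom = np.array([0, 0, 0, 0])
F_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # classes separated
F_bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # classes mixed
print(dsc_loss(F_good, y, dom) < dsc_loss(F_bad, y, dom))  # True
```

Setting $\theta$ closer to 1 recovers ordinary supervised contrastive learning, while smaller values loosen the pull between same-class samples from different domains.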
Since all global classification and auxiliary supervision tasks are binary classification problems, all classifiers in this model uniformly adopt the binary cross-entropy loss with label smoothing, $\mathrm{BCE}_{smooth}(\hat{y}, y)$, which also mitigates the risk of overfitting caused by overconfidence in deep neural networks:
$$\mathrm{BCE}_{smooth}(\hat{y}, y) = -\frac{1}{B} \sum_{i=1}^{B} \left[\tilde{y}_i \log \hat{y}_i + (1 - \tilde{y}_i) \log(1 - \hat{y}_i)\right]$$
where $B$ denotes the batch size, $\hat{y}_i$ represents the predicted probability, $y_i \in \{0, 1\}$ denotes the original true label of the sample, and $\tilde{y}_i = y_i(1 - \epsilon) + \frac{\epsilon}{2}$ denotes the smoothed true label obtained by introducing the smoothing parameter $\epsilon$.
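The label-smoothed BCE can be checked numerically with a short NumPy sketch; the clipping constant and the default $\epsilon = 0.1$ are assumptions. With $\epsilon = 0.1$, a positive label becomes 0.95 and a negative label 0.05, and predicting 0.5 on every sample gives $\ln 2$.

```python
import numpy as np


def bce_smooth(y_hat, y, eps=0.1, clip=1e-7):
    """Binary cross-entropy against smoothed targets y * (1 - eps) + eps / 2."""
    y_tilde = y * (1.0 - eps) + eps / 2.0  # 1 -> 1 - eps/2, 0 -> eps/2
    p = np.clip(y_hat, clip, 1.0 - clip)  # guard against log(0); clip value is assumed
    return -np.mean(y_tilde * np.log(p) + (1.0 - y_tilde) * np.log(1.0 - p))


y = np.array([1.0, 0.0])
print(round(bce_smooth(np.array([0.5, 0.5]), y), 4))  # 0.6931
```

Because the smoothed target for a positive sample is 0.95, a prediction of 0.999 incurs a higher loss than a prediction of 0.95, which is the intended penalty on overconfidence.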
To enhance the discriminative power of each modal feature in fake news detection, we apply independent classification supervision to the conflict features, text features, image features, and the final fused features, calculating their respective losses as $\mathcal{L}_{conflict}$, $\mathcal{L}_t$, $\mathcal{L}_i$, and $\mathcal{L}_{final}$. Consequently, the total loss for GC2MFND is as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{final} + \alpha \cdot \mathcal{L}_{conflict} + \beta \cdot \frac{\mathcal{L}_t + \mathcal{L}_i}{2} + \gamma \cdot \mathcal{L}_{DSC}$$
where $\alpha$, $\beta$, and $\gamma$ are weight coefficients for balancing the losses of different terms.

4. Experiments

In this section, we conduct an empirical evaluation of GC2MFND on three datasets covering news from different domains. The experiments are organized around the following six research questions:
RQ1. Does GC2MFND effectively improve the overall performance of fake news detection?
RQ2. Can GC2MFND improve the detection accuracy for specific types of fake news?
RQ3. Does each component of GC2MFND contribute to improved detection?
RQ4. Is the domain-adaptive fusion mechanism capable of effectively capturing feature distribution discrepancies across different domains?
RQ5. How sensitive is GC2MFND to key hyperparameters, and how robust is it to their settings?
RQ6. Does GC2MFND achieve high computational efficiency in both the training and inference stages?

4.1. Experimental Settings

4.1.1. Datasets

We evaluate GC2MFND on three real-world datasets: Weibo [23], Weibo21 [10], and FineFake [48]. For Weibo, we adopt the same data split and domain categorization as the baseline work [14], dividing the data into training, validation, and test sets at a ratio of 7:1:2 and grouping it into nine domains: finance, healthcare, military, science, politics, disasters, education, entertainment, and society. Weibo21 is a larger multi-domain multimodal dataset covering data up to 2021. Following the partitioning scheme of the benchmark method [14], we split it into training, validation, and test sets at a ratio of 8:1:1. It covers nine domains: finance, health, military, science, politics, international affairs, education, entertainment, and society. Both datasets are in Chinese and were collected from the Weibo platform. FineFake is a larger multi-domain multimodal fake news detection dataset covering data up to 2024. It is split into training, validation, and test sets at a ratio of 6:2:2, contains data from eight news platforms, such as Twitter and Snopes, and covers six domains: politics, entertainment, business, health, society, and conflict. The domain labels in all three datasets were manually annotated. To ensure data quality, we follow the preprocessing steps of previous works [14,48,49,50], which also prevent data leakage between the training and test sets. For a fair comparison, all baseline results are obtained under the same data partitioning and preprocessing. Table 2 lists the data volume of each dataset.
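Ratio-based partitioning can be illustrated as a seeded shuffle followed by two cuts. This is only a sketch: the Weibo and Weibo21 splits follow [14], and the exact procedure (for example, whether splitting is stratified by domain) is not described in this section. `split_by_ratio` is a hypothetical helper.

```python
import random


def split_by_ratio(items, ratios, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n, total = len(shuffled), sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # the remainder goes to the test set


# Ratios stated in the text (train:val:test).
SPLIT_RATIOS = {"Weibo": (7, 1, 2), "Weibo21": (8, 1, 1), "FineFake": (6, 2, 2)}
```

Because the three slices come from a single shuffled list, no sample can appear in more than one split; near-duplicate posts would still require the dataset-specific preprocessing cited above.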

4.1.2. Implementation Details

In the multi-granularity feature extraction phase, BERT, MAE, and CLIP are used to extract image and text features. The parameters of their backbone networks are frozen, and language-specific (Chinese or English) BERT and CLIP models are used for each dataset. Input images are resized to 224 × 224 pixels, local image and text features have a sequence length of 197, and the feature dimensionality is 768. Feature adapters, built from parallel 1D convolutions (kernel sizes 1, 3, and 5) with the SiLU activation function, project the local and global image and text features to a common dimension of 320. The domain embedding dimension is 128. In the attention masking mechanism, the weights of invalid positions are set to the minimum value so that these positions are effectively ignored. For the losses, all classifiers use binary cross-entropy with label smoothing ($\epsilon = 0.1$), and the positive-sample mask in $\mathcal{L}_{\mathrm{DSC}}$ uses $\theta = 0.5$. The weights of the joint loss are set separately for the Chinese and English datasets: for the two Chinese datasets, $\alpha = 0.3$, $\beta = 0.2$, $\gamma = 0.7$, and $\tau = 0.1$; for the English dataset FineFake, $\alpha = 0.1$, $\beta = 0.1$, $\gamma = 0.05$, and $\tau = 0.1$. The model is optimized end to end with Adam [51] at an initial learning rate of $1 \times 10^{-4}$. To prevent exploding gradients, the gradient clipping threshold is set to 1.0; training runs for at most 50 epochs with early stopping. All experiments are run on an NVIDIA GeForce RTX 3090 GPU.
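The adapter and masking steps can be illustrated with a small NumPy sketch. The text does not state how the three convolution branches are combined, how they are initialized, or the exact masking value, so the following choices are assumptions: branch outputs are summed after SiLU, weights are drawn from a small Gaussian, and masked scores are set to the most negative finite float before the softmax. `MultiScaleAdapter` and `masked_softmax` are hypothetical names.

```python
import numpy as np


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def conv1d_same(x, w, b):
    """1D convolution over the sequence axis with zero 'same' padding.
    x: (L, C_in), w: (k, C_in, C_out) with odd k, b: (C_out,)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along the sequence
    length = x.shape[0]
    out = np.tile(b, (length, 1)).astype(float)
    for j in range(k):
        out += xp[j:j + length] @ w[j]  # contribution of kernel tap j
    return out


class MultiScaleAdapter:
    """Parallel kernels of size 1, 3 and 5 mapping 768-d tokens to 320-d.
    Summing the branches is an assumption, not taken from the paper."""

    def __init__(self, d_in=768, d_out=320, kernel_sizes=(1, 3, 5), seed=0):
        rng = np.random.default_rng(seed)
        self.branches = [
            (rng.normal(0.0, 0.02, size=(k, d_in, d_out)), np.zeros(d_out))
            for k in kernel_sizes
        ]

    def __call__(self, x):
        return sum(silu(conv1d_same(x, w, b)) for w, b in self.branches)


def masked_softmax(scores, valid):
    """Softmax in which invalid positions get the minimum float, hence ~zero weight."""
    scores = np.asarray(scores, dtype=float)
    scores = np.where(valid, scores, np.finfo(float).min)
    scores = scores - scores.max()
    weights = np.exp(scores)
    return weights / weights.sum()
```

With kernel sizes 1, 3 and 5, each output token mixes information from at most two neighbours on each side, so the adapter adds local context at several scales while keeping the sequence length of 197 unchanged.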

4.1.3. Baselines

To conduct a comprehensive evaluation of this model, we compare it with unimodal multi-domain, multimodal multi-domain and multimodal single-domain fake news detection methods.
(1) Unimodal Multi-Domain
  • MOSE [52], which employs Long Short-Term Memory (LSTM) networks as the expert components in the MMoE architecture.
  • KATMF [37], which uses adversarial multi-task learning and a Transformer enhanced with an external knowledge base to capture feature differences in multi-domain multimodal news.
  • MDFEND [10], which employs a domain gate to aggregate MoE experts in a weighted manner for multi-domain fake news detection.
(2) Multimodal Multi-Domain
  • M3DFEND [43], which adaptively aggregates semantic, sentiment, and stylistic features via domain adapters and a domain memory bank.
  • MMDFND [14], which uses an improved Progressive Layered Extraction (PLE) architecture to capture domain-shared and domain-specific knowledge for multi-domain multimodal fake news detection.
  • DAMMFND [15], which employs domain decoupling to separate domain information from semantic features and uses a domain-aware multi-view discriminator together with a decision layer to dynamically weight multimodal information.
(3) Multimodal Single-Domain
  • EANN [23], which employs a Generative Adversarial Network (GAN) to learn event-invariant general knowledge.
  • SpotFake [26], which leverages VGG for image feature extraction and BERT for text feature extraction in fake news detection.
  • CAFE [30], which uses cross-modal ambiguity to adaptively aggregate unimodal features and cross-modal correlations.
  • BMR [53], which fuses multi-view features with cross-modal consistency using a weighted scheme.
  • MIAN [16], which extracts intra-modal and inter-modal conflict features via a reverse attention mechanism.
  • MTS [54], which explicitly captures multi-order text–image interactions via Taylor series expansion, reducing model parameters and improving interpretability.

4.2. Overall Performance

To address RQ1 and RQ2, this section compares GC2MFND with the three categories of baselines described above and analyzes the results in terms of both overall performance and per-domain F1 scores. For the Weibo and Weibo21 datasets, the results of existing methods are taken from prior experiments [14] and are marked with an asterisk (*) in Table 3. For the newly introduced FineFake dataset, for which few published results are available, we reproduce each baseline under a standardized experimental setup and report the resulting scores.
As shown in Table 3, GC2MFND achieves the best overall results against representative state-of-the-art multi-domain fake news detection baselines on all three benchmark datasets. On the Weibo dataset, GC2MFND achieves overall F1, Acc, and AUC scores of 0.953, 0.953, and 0.986, respectively, improvements of 1.1%, 1.1%, and 0.4% over the best competing method. On the Weibo21 dataset, it achieves overall F1, Acc, and AUC scores of 0.957, 0.957, and 0.986, respectively, improvements of 1.2%, 1.2%, and 0.3% over the best competing method. On the larger and more diverse FineFake dataset, it achieves overall F1, Acc, and AUC scores of 0.807, 0.812, and 0.890, respectively, improvements of 1.2%, 1.3%, and 0.8% over the best baseline method.
GC2MFND also remains highly competitive in per-domain F1 scores. On Weibo, it achieves the best results in the military, education, society, politics, and health domains and ties with DAMMFND in the science domain, whereas DAMMFND performs better in the finance, entertainment, and disaster domains. On Weibo21, GC2MFND achieves the best results in the science, military, education, politics, finance, entertainment, and international domains but performs slightly worse than some comparison methods in the society and health domains. On FineFake, it achieves the best results in the society, politics, health, and business domains but performs slightly worse in the entertainment and conflict domains. We attribute these per-domain differences primarily to imbalanced sample distributions and domain-specific heterogeneity. First, domains with more samples benefit from stronger supervisory signals, whereas low-resource domains are more prone to training bias, which causes detection performance to fluctuate across domains. Second, differences in topics, semantic expression, and text–image association patterns across domains increase detection difficulty. In particular, FineFake introduces cross-platform heterogeneity, which leads to more pronounced per-domain fluctuations and substantially lower overall performance than on the two Chinese datasets. Nevertheless, GC2MFND outperforms the baselines in most domains and improves overall detection performance by mitigating the effects of domain heterogeneity.
To assess the stability of our method, we repeat the experiments with ten different random seeds, report the mean and standard deviation for GC2MFND and two strong baselines, and test the statistical significance of the improvements over these baselines with a t-test (p < 0.05). The results are shown in Table 4.
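The significance test can be sketched in plain Python. The text does not say which t-test variant is used; this sketch assumes a paired test, pairing runs by random seed, and compares the statistic with the two-sided critical value for 9 degrees of freedom (about 2.262 at the 0.05 level), which corresponds to ten seeds. In practice, `scipy.stats.ttest_rel` returns the p-value of a paired test directly, and `scipy.stats.ttest_ind(..., equal_var=False)` gives Welch's unpaired variant.

```python
import math
import statistics

# Two-sided critical t value at the 0.05 level for df = 9 (ten paired runs).
T_CRIT_DF9 = 2.262


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # uses the sample SD of the differences
    return statistics.mean(diffs) / se, n - 1


def significant(scores_a, scores_b):
    """True if the paired difference is significant at the 0.05 level (ten runs only)."""
    t, df = paired_t(scores_a, scores_b)
    if df != 9:
        raise ValueError("critical value is hard-coded for ten paired runs")
    return abs(t) > T_CRIT_DF9
```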
Table 5 compares GC2MFND with multimodal single-domain detection methods in terms of accuracy and the F1 scores for fake and real news. GC2MFND achieves the best performance on all three datasets. In overall accuracy, it outperforms the best baseline by 1.7%, 1.9%, and 2.4% on Weibo, Weibo21, and FineFake, respectively; for the fake news F1 score, the improvements are 1.9%, 2.0%, and 1.8%, and for the real news F1 score, they are 1.5%, 1.8%, and 2.7%. These results indicate that GC2MFND not only outperforms single-domain multimodal methods in overall classification but also better recognizes both fake and real news samples. Notably, the gains on the Chinese datasets are larger for fake news, whereas the gain on the English dataset is larger for real news. We attribute this pattern primarily to GC2MFND's ability to extract multi-granularity conflict features. In the Chinese datasets, conflicts in fake news are prominent, so the model detects fake news with high accuracy. In the English dataset, by contrast, images are often added to enrich the presentation of multimodal news, so even real news exhibits minor text–image conflicts. By exploiting this richer conflict information, the model better distinguishes real from fake news, thereby reducing false positives on real news.
Figure 5, Figure 6 and Figure 7 show t-SNE visualizations of the sample distributions produced by the model on the Weibo, Weibo21, and FineFake datasets, respectively. Consistent with the baseline experiments, we use perplexity = 40, PCA initialization, and random seed = 3074. In Figure 5a, Figure 6a and Figure 7a, real and fake news samples are intermingled, whereas in Figure 5b, Figure 6b and Figure 7b, real and fake news are well separated, with only a few samples not fully separated. This illustrates the effectiveness of GC2MFND in multimodal fake news classification.

4.3. Ablation Study

To assess the impacts of key components of GC2MFND on detection performance, we construct the following model variants: (1) -w/o Conflict: removal of the domain-aware multi-granularity conflict extraction module; (2) -w/o Calib: removal of the domain-guided multimodal feature calibration module; (3) -w/o Domain: removal of the domain embedding-based feature processing component; (4) -w/o Loss: removal of the contrastive loss and auxiliary loss; and (5) -w/o Smooth: removal of the label smoothing strategy during training.
Table 6 reports the results, using accuracy and F1 score to quantify the contribution of each module. We summarize the main findings as follows:
  • Comparing the first three variants, we observe that GC2MFND -w/o Conflict, GC2MFND -w/o Calib, and GC2MFND -w/o Domain all exhibit performance drops, indicating that conflict feature extraction, text–image feature calibration, and domain information each contribute to the model's performance. Notably, removing the domain labels entirely causes only a slight drop of about 1%, and the model still retains high detection accuracy, indicating reasonable robustness to missing domain labels. From an information-theoretic perspective, conflict features strengthen the dependence between news content and veracity labels, modality calibration reduces redundant information among features, and domain embeddings lower the conditional entropy of the labels across topics; together, these three factors improve the model's discriminative ability under uncertainty.
  • Comparing the full model with GC2MFND -w/o Loss and GC2MFND -w/o Smooth shows that both training strategies contribute: removing either one degrades performance. This indicates that the contrastive and auxiliary losses enhance the discriminative power of the features, while label smoothing keeps the model from becoming overconfident on the training samples and thus improves classification stability.
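The information-theoretic reading of the domain embedding can be made concrete with a standard identity. Writing $Y$ for the veracity label, $X$ for the news content, and $D$ for the domain (notation introduced only for this illustration), conditioning on the domain cannot increase the label's conditional entropy, because conditional mutual information is non-negative:

```latex
H(Y \mid X, D) \;=\; H(Y \mid X) - I(Y; D \mid X) \;\le\; H(Y \mid X)
```

Equality holds exactly when $Y$ and $D$ are conditionally independent given $X$, so in this idealized view a domain signal helps only insofar as it carries label information that the content alone does not; the identity concerns the true distributions, not what a trained network attains.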

4.4. Discussions

4.4.1. Evaluation of Domain-Adaptive Fusion Mechanisms

To further validate the domain-adaptive fusion mechanism, we analyze the routing weights that the model learns dynamically on Weibo, Weibo21, and FineFake, using the sigmoid gate values as a quantitative measure of per-domain feature dependence. The results are shown in Figure 8: Figure 8a shows the sigmoid weights for the conflict types, and Figure 8b shows those for multi-channel feature fusion. The sigmoid outputs are independent gate values rather than a normalized distribution, so they need not sum to 1. Although most values are close to 0.5, their relative ordering across features reflects how strongly the model relies on each feature.
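The difference between independent sigmoid gates and a normalized distribution can be seen in a minimal NumPy sketch of domain-conditioned gating. The gate parameterization (a linear map from the domain embedding to one logit per channel) and the weighted-sum fusion are assumptions for illustration, not the paper's exact module; `domain_gated_fusion` is a hypothetical name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def domain_gated_fusion(features, domain_emb, w, b):
    """features: (K, d) per-channel features; domain_emb: (e,);
    w: (e, K) and b: (K,) are hypothetical gate parameters.
    Each channel receives its own gate in (0, 1); gates are not normalized."""
    gates = sigmoid(domain_emb @ w + b)               # (K,)
    fused = (gates[:, None] * features).sum(axis=0)   # (d,)
    return fused, gates


# Three toy channels (e.g., text, image, conflict) with small domain-dependent logits.
emb = np.ones(2)                                            # toy 2-d domain embedding
w = np.array([[0.05, -0.025, 0.1], [0.05, -0.025, 0.1]])   # logits: 0.1, -0.05, 0.2
fused, gates = domain_gated_fusion(np.eye(3), emb, w, np.zeros(3))
```

Here the gates are about 0.525, 0.488 and 0.550: all near 0.5 and summing to about 1.56, yet their ordering matches the logits, which is the property the analysis of Figure 8 relies on.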
For both the conflict types and the multi-channel features, the sigmoid weights vary with the domain, indicating that the model adaptively adjusts its reliance on each feature according to the domain of the input. On Weibo and Weibo21, the domains exhibit similar modulation patterns: the society and entertainment domains rely relatively more on global semantic conflicts; the science, military, and politics domains assign slightly higher weights to semantic conflicts between local image features and global text features; and the health, education, finance, and disaster domains depend more on semantic conflicts between local text features and global image features. Among the multi-channel features, the conflict-fusion features receive slightly higher weights. On FineFake, the fine-grained local–local conflict features show relatively higher activation in most domains. In the conflict and politics domains, however, the weights of the four conflict types are relatively close, which we attribute primarily to the formal, rigorous writing style characteristic of these two domains. In addition, among the multi-channel features, text features receive relatively higher weights in the politics, business, and conflict domains, while image and conflict features receive slightly lower weights.

4.4.2. Parameter Analysis

We analyze the sensitivity of GC2MFND to the parameters $\alpha$, $\beta$, $\gamma$, and $\tau$ on the Weibo, Weibo21, and FineFake datasets; Figure 9, Figure 10, Figure 11 and Figure 12 present the results for these four key parameters. Overall, performance on the Chinese datasets is more sensitive to parameter variations than on the English dataset. For the conflict feature loss weight $\alpha$, performance peaks at 0.3 on the Chinese datasets; on the English dataset, favorable results are observed within $[0.08, 0.12]$, with 0.1 performing best. For the correction loss weight $\beta$, GC2MFND performs well at 0.2 on the Chinese datasets and at 0.1 on the English dataset.
For the contrastive loss weight $\gamma$, Weibo and Weibo21 perform better at 0.7, while FineFake performs better within $[0.04, 0.06]$, with 0.05 giving the highest accuracy. For the contrastive loss temperature $\tau$, GC2MFND performs consistently at 0.1 on all three datasets, making it a robust choice. Furthermore, across the tested parameter settings, the performance drop on each of the three datasets does not exceed 0.009, and performance remains above the respective baselines, indicating that the model is stable with respect to these hyperparameters. Based on these findings, we set $\alpha = 0.3$, $\beta = 0.2$, $\gamma = 0.7$, and $\tau = 0.1$ for the Chinese datasets and $\alpha = 0.1$, $\beta = 0.1$, $\gamma = 0.05$, and $\tau = 0.1$ for the English dataset.

4.4.3. Computational Cost Analysis

To comprehensively evaluate the computational efficiency of GC2MFND, we compare the average per-epoch training time, testing time, inference time, GPU memory consumption, and number of parameters of several models on Weibo, Weibo21, and FineFake under a unified experimental setting. We compare against the two best-performing baselines, MMDFND and DAMMFND. Since all models share the same pretrained feature extractor, differences in time and parameter count arise solely from their respective downstream network designs.
Table 7 reports the computational overhead. GC2MFND shows a clear advantage in parameter efficiency, with only 6.82 million trainable parameters, considerably fewer than the two baselines. This advantage stems from architectural differences: MMDFND and DAMMFND employ multiple expert networks or domain-aware Transformer decoders, which lead to large parameter counts, whereas GC2MFND uses only lightweight components: gated networks, multi-scale adapters, low-dimensional domain embeddings, conflict extraction and calibration modules built from linear mappings and attention, and an MLP classifier.
On Weibo and Weibo21, GC2MFND shows competitive training efficiency, with per-epoch training times of about 45.27 s and 43.03 s, respectively, approximately 25% faster than DAMMFND. Its testing time is approximately 50% faster than that of MMDFND and comparable to that of DAMMFND. On the larger FineFake dataset, DAMMFND achieves the best training efficiency because it relies solely on discrete domain routing. In contrast, GC2MFND performs fine-grained cross-modal interaction and attention calibration, which add matrix operations for every sample; on a larger dataset this overhead accumulates, resulting in longer training times.
In practical deployment, however, online inference latency matters more. As shown in Figure 13, GC2MFND has the lowest inference time and GPU memory consumption on all three datasets, indicating that its high accuracy does not come at the expense of real-time responsiveness. With few parameters and efficient online inference, it can support the timely detection and blocking of fake news on social media without incurring high computational costs.
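A minimal timing helper along these lines could be used to reproduce an inference-latency comparison. The warm-up runs and the use of `time.perf_counter` are our assumptions about a reasonable protocol, not the paper's stated procedure, and `predict` and `batches` are placeholders. On a GPU, kernels run asynchronously, so the device should be synchronized (for example with `torch.cuda.synchronize()` in PyTorch) before the clock is read; that step is omitted here to keep the sketch framework-free.

```python
import time


def mean_latency(predict, batches, warmup=3):
    """Average wall-clock seconds per batch for a callable `predict`.
    The first `warmup` batches are run once beforehand and not timed, so
    one-off costs (caching, lazy initialization) do not inflate the average."""
    batches = list(batches)
    for batch in batches[:warmup]:
        predict(batch)
    start = time.perf_counter()
    for batch in batches:
        predict(batch)
    return (time.perf_counter() - start) / len(batches)
```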

4.4.4. Case Study and Error Analysis

To evaluate the proposed model qualitatively, we analyze correctly and incorrectly predicted cases, as shown in Figure 14. Case (a) is a correctly identified fake news item: although the text mentioning “Toothpaste” matches certain colors and visual elements in the image, a subtle local conflict exists regarding children’s toothpaste and its hazards. Case (b) is also correctly identified: the phrase “camels begging” conflicts globally with an image showing “a camel being led by a person,” and the phrase “amputated limbs” conflicts locally with an image of “a camel with normal limbs.” In contrast, Case (c) is a misclassified fake news item. There is no explicit visual–textual conflict between the image of a “girl” and the text mentioning “Biden”, yet external knowledge confirms that the child is Biden’s granddaughter, exposing the caption’s false claim about a “young Boy dressed as a girl”. Case (d) is another misclassified example: the text and the image are highly consistent regarding elements such as “bear,” “bipedal stance,” “wrinkled skin,” and “human,” which misleads the model into an incorrect prediction. Identifying it as fake requires external knowledge, namely that sun bears have loose, clothing-like wrinkled skin and an eerily human-like posture.
Based on the case studies above, the multi-granularity conflict features extracted by GC2MFND can effectively capture cross-modal inconsistencies between textual and visual content, thereby facilitating the detection of subtle fake news instances. However, as shown in Cases (c) and (d), when textual and visual information is highly consistent, conflict signals alone may be insufficient to reveal that the news is false, and external factual knowledge is often needed. Therefore, integrating external knowledge may further improve fake news detection performance in complex scenarios.

5. Conclusions

This study proposes GC2MFND, a multimodal fake news detection framework designed for multi-domain scenarios. To address domain heterogeneity, the framework leverages domain embeddings to decouple domain knowledge and integrate it deeply with cross-modal conflict mining, effectively mitigating semantic drift in the features. Specifically, the framework first uses a domain-aware multi-granularity module to extract text–image conflict signals. It then applies a domain-guided feature calibration and redundancy reduction strategy to filter out redundant noise. Finally, a domain-adaptive dynamic aggregation and multi-view integration module performs collaborative decision-making. Extensive experiments on two Chinese datasets (Weibo and Weibo21) and one English dataset (FineFake) demonstrate that GC2MFND achieves consistent improvements over existing multi-domain baseline methods. Ablation studies, mechanism analyses, and case studies further confirm that conflict extraction and feature calibration enhance the discriminative power of the features, while the dynamic aggregation strategy improves the model's adaptability across domains with diverse topics.
Although GC2MFND performs well in multi-domain fake news detection, this study has several limitations. First, the model relies on explicit domain labels, but many social media news items lack predefined domain categories, and manual annotation is costly. Second, the model's robustness in unseen scenarios requires further validation. Third, because it relies solely on news content without external knowledge, the model can miss highly deceptive fake news. Future work will explore cross-domain transfer learning to reduce reliance on target-domain labels, incorporate external knowledge bases to improve detection in unknown domains and of sophisticated samples, and adopt weakly supervised or unsupervised methods to lower manual annotation costs.

Author Contributions

Conceptualization, Y.S., M.Z. and F.Z.; methodology, Y.S.; software, M.Z.; validation, Y.S., M.Z. and F.Z.; formal analysis, Y.S. and F.Z.; resources, Y.S., M.Z. and F.Z.; data curation, M.Z.; writing—original draft preparation, Y.S. and M.Z.; writing—review and editing, Y.S. and F.Z.; visualization, Y.S. and M.Z.; supervision, Y.S.; project administration, Y.S. and F.Z.; funding acquisition, Y.S. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Natural Science Foundation of Shandong Province (Grant No. ZR2021MG021) and the Youth Innovation Technology Project of Higher School in Shandong Province (Grant No. 2021RW030).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The three datasets used in this study are all publicly available and can be obtained from the relevant cited articles. The source code for GC2MFND can be found at https://github.com/ZMingYue-Z/GC2MFND (accessed on 20 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jin, Z.; Cao, J.; Zhang, Y.; Luo, J. News verification by exploiting conflicting social viewpoints in microblogs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2972–2978.
  2. Boudana, S.; Segev, E. Fake news makes the news: Definitions and framing of fake news in mainstream media. J. Pract. 2026, 20, 679–698.
  3. Guo, Q.; Liao, Y.; Li, Z.; Liang, S. Multi-modal representation via contrastive learning with attention bottleneck fusion and attentive statistics features. Entropy 2023, 25, 1421.
  4. Alghamdi, J.; Lin, Y.; Luo, S. Cross-domain fake news detection using a prompt-based approach. Future Internet 2024, 16, 286.
  5. Lu, Y.; Zheng, X.; Chen, H.T. Fake News is Shared by “Them” Not “Us” on Social Media: Perceptual Gaps of Fake News Sharing and Affective Polarization. J. Broadcast. Electron. Media 2026, 70, 264–279.
  6. Su, Y.; Zhao, X. Hierarchical Text-Guided Refinement Network for Multimodal Sentiment Analysis. Entropy 2025, 27, 834.
  7. Hu, X.; Zhang, H. Invariant representation learning in multimedia recommendation with modality alignment and model fusion. Entropy 2025, 27, 56.
  8. Tan, Z.; Zhang, T. Emotion-semantic interaction network for fake news detection: Perspectives on question and non-question comment semantics. Inf. Process. Manag. 2026, 63, 104391.
  9. Olan, F.; Jayawickrama, U.; Arakpogun, E.O.; Suklan, J.; Liu, S. Fake news on social media: The impact on society. Inf. Syst. Front. 2024, 26, 443–458.
  10. Nan, Q.; Cao, J.; Zhu, Y.; Wang, Y.; Li, J. MDFEND: Multi-domain Fake News Detection. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Gold Coast, QLD, Australia, 1–5 November 2021; pp. 3343–3347.
  11. Chen, W.; Dang, Y.; Zhang, X. A Multimodal Semantic-Enhanced Attention Network for Fake News Detection. Entropy 2025, 27, 746.
  12. Deng, B. Exploring Universal Domain Adaptation with CLIP Models: A Calibration Method. Entropy 2025, 27, 1213.
  13. Zhu, J.; Gao, C.; Yin, Z.; Li, X.; Wang, Z.; Kurths, J. Noise-Filtering Enhanced Graph Transformer for Robust Fake News Detection. IEEE Trans. Knowl. Data Eng. 2026, 38, 3778–3791.
  14. Tong, Y.; Lu, W.; Zhao, Z.; Lai, S.; Shi, T. MMDFND: Multi-modal Multi-Domain Fake News Detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1178–1186.
  15. Lu, W.; Tong, Y.; Ye, Z. DAMMFND: Domain-Aware Multimodal Multi-view Fake News Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 559–567.
  16. Zhang, T.; Yu, E.; Shao, Y.; Sun, J. Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, Montreal, QC, Canada, 16–22 August 2025; pp. 7940–7948.
  17. Yu, X.; Sheng, Z.; Lu, W.; Luo, X.; Zhou, J. RaCMC: Residual-aware compensation network with multi-granularity constraints for fake news detection. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 986–994.
  18. Guan, W.; Wen, H.; Song, X.; Yeh, C.H.; Chang, X.; Nie, L. Multimodal Compatibility Modeling via Exploring the Consistent and Complementary Correlations. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2299–2307.
  19. Chen, R.; Rong, Y.; Guo, S.; Han, J.; Sun, F.; Xu, T.; Huang, W. Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation. arXiv 2022, arXiv:2203.07988.
  20. Li, J.; Wang, Z.; Gao, Y.; Hu, X. Exploring High-quality Target Domain Information for Unsupervised Domain Adaptive Semantic Segmentation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5237–5245.
  21. Zhang, X.; Ghorbani, A.A. An overview of online fake news: Characterization, detection, and discussion. Inf. Process. Manag. 2020, 57, 102025.
  22. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
  23. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 849–857.
  24. Shao, Y.; Sun, J.; Zhang, T.; Jiang, Y.; Ma, J.; Li, J. Fake News Detection Based on Multi-Modal Classifier Ensemble. In Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, Newark, NJ, USA, 27–30 June 2022; pp. 78–86.
  25. Kutay, E.; Yener, A. Harnessing the Power of Pre-Trained Models for Efficient Semantic Communication of Text and Images. Entropy 2025, 27, 813.
  26. Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. SpotFake: A Multi-modal Framework for Fake News Detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; pp. 39–47.
  27. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
  29. Liu, Y.; Liu, Y.; Li, Z.; Yao, R.; Zhang, Y.; Wang, D. Modality Interactive Mixture-of-Experts for Fake News Detection. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 5139–5150.
  30. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2897–2905.
  31. Wang, J.; Zheng, J.; Yao, S.; Wang, R.; Du, H. Tlfnd: A multimodal fusion model based on three-level feature matching distance for fake news detection. Entropy 2023, 25, 1533.
  32. Shen, L.; Long, Y.; Cai, X.; Razzak, I.; Chen, G.; Liu, K.; Jameel, S. GAMED: Knowledge Adaptive Multi-Experts Decoupling for Multimodal Fake News Detection. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, Hannover, Germany, 10–14 March 2025; pp. 586–595.
  33. Lu, W.; Li, Y. From Blind Transfer to Wise Selection: Prototype-Driven Neighbor-Domain Adaptation for Fake News Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; pp. 818–826.
  34. Fu, L.; Peng, H.; Liu, S. KG-MFEND: An efficient knowledge graph-based model for multi-domain fake news detection. J. Supercomput. 2023, 79, 18417–18444.
  35. Wang, L.; Li, X.; Zhou, B.; Zhang, Y.; Yuan, J.; Hu, H. Multimodal fusion with LLM content via hierarchical progressive transformer for explainable fake news detection. Inf. Process. Manag. 2026, 63, 104700.
  36. Luo, W.; Yang, Z.; Shang, Y.; Shorfuzzaman, M.; Wu, Y.; Ghoneim, A. Securing Consumer Applications Against AI-Driven Misinformation: A Cross-Domain Multimodal Approach. IEEE Trans. Consum. Electron. 2026, 72, 1574–1583.
  37. Song, C.; Ning, N.; Zhang, Y.; Wu, B. Knowledge augmented transformer for adversarial multidomain multiclassification multimodal fake news detection. Neurocomputing 2021, 462, 88–100.
  38. Bazmi, P.; Asadpour, M.; Shakery, A.; Maazallahi, A. Entity-centric multi-domain transformer for improving generalization in fake news detection. Inf. Process. Manag. 2024, 61, 103807.
  39. Zhang, T.; Wang, D.; Chen, H.; Zeng, Z.; Guo, W.; Miao, C.; Cui, L. BDANN: BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News Detection. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
  40. Li, J.; Feng, X.; Gu, T.; Chang, L. Dual-Teacher De-Biasing Distillation Framework for Multi-Domain Fake News Detection. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 3627–3639. [Google Scholar] [CrossRef]
  41. Zhu, Y.; Sheng, Q.; Cao, J.; Nan, Q.; Shu, K.; Wu, M.; Wang, J.; Zhuang, F. Memory-Guided Multi-View Multi-Domain Fake News Detection. IEEE Trans. Knowl. Data Eng. 2022, 35, 7178–7191. [Google Scholar] [CrossRef]
  42. Zhao, J.; Zhao, Z.; Shi, L.; Kuang, Z.; Liu, Y. Collaborative mixture-of-experts model for multi-domain fake news detection. Electronics 2023, 12, 3440. [Google Scholar] [CrossRef]
  43. Yuan, H.; Zheng, J.; Ye, Q.; Qian, Y.; Zhang, Y. Improving fake news detection with domain-adversarial and graph-attention neural network. Decis. Support Syst. 2021, 151, 113633. [Google Scholar] [CrossRef]
  44. Xu, K.; Wang, S.; Diao, Z. DATTAMM: Domain-Aware Test-Time Adaptation for Multimodal Misinformation Detection. Appl. Sci. 2025, 15, 11832. [Google Scholar] [CrossRef]
  45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  46. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  47. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv 2021, arXiv:2004.11362.
  48. Zhou, Z.; Zhang, X.; Zhang, L.; Liu, J.; Cambria, E.; Li, C. FineFake: A knowledge-enriched dataset for fine-grained multi-domain fake news detection. Inf. Fusion 2026, 132, 104253.
  49. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 2021, 58, 102610.
  50. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 795–816.
  51. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  52. Qin, Z.; Cheng, Y.; Zhao, Z.; Chen, Z.; Metzler, D.; Qin, J. Multitask Mixture of Sequential Experts for User Activity Streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Online, 6–10 July 2020; pp. 3083–3091.
  53. Wu, L.; Liu, P.; Zhang, Y. See how you read? multi-reading habits fusion reasoning for multi-modal fake news detection. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 13736–13744.
  54. Sun, J.; Chen, C.; Hou, C.; Wu, Y.; Yuan, X. Multimodal Taylor Series Network for Misinformation Detection. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2540–2548.
Figure 2. The network architecture of GC2MFND. BERT, MAE, and CLIP are utilized to extract multi-granularity features of multimodal news. Domain-Aware Conflict Extraction is employed to mine subtle visual–textual conflicts. Domain-Guided Feature Calibration enhances visual–textual features via domain information. Domain-Adaptive Feature Aggregation generates weights based on domain embeddings to aggregate multi-channel features. News authenticity is determined by the concatenated features.
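The caption states that domain-adaptive feature aggregation turns domain embeddings into weights for the multi-channel features. The following is a minimal sketch, not the authors' implementation. It assumes that the aggregation weights $w_t$, $w_i$, $w_c$ from Table 1 are a softmax over a linear projection of the domain embedding $e_d$, and that $F_m = w_t T + w_i I + w_c F_C$ over text, image, and conflict channels. The projection matrix `proj`, the toy dimensions, and the function name `domain_adaptive_aggregate` are hypothetical. The final concatenation step is not modeled.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def domain_adaptive_aggregate(
    channels: Sequence[Sequence[float]],
    domain_embedding: Sequence[float],
    proj: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector]:
    """Fuse channel features with weights conditioned on a domain embedding.

    channels: one feature vector per channel (here: text, image, conflict).
    proj: hypothetical (num_channels x embed_dim) matrix mapping e_d to logits.
    Returns (weights, fused), where fused = sum_k weights[k] * channels[k].
    """
    logits = [sum(w * e for w, e in zip(row, domain_embedding)) for row in proj]
    weights = softmax(logits)
    dim = len(channels[0])
    fused = [sum(w * ch[k] for w, ch in zip(weights, channels)) for k in range(dim)]
    return weights, fused


# Toy usage: three 2-d channel features and a one-hot domain embedding.
T, I, F_C = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
e_d = [1.0, 0.0]
proj = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # hypothetical learned weights
weights, F_m = domain_adaptive_aggregate([T, I, F_C], e_d, proj)
print([round(w, 3) for w in weights], [round(x, 3) for x in F_m])
```

Because the weights are produced from $e_d$ alone, posts from the same domain share one channel weighting. This is consistent with the caption's description, though the real module may also condition on the features themselves.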
Figure 3. Architecture of the domain-aware multi-granularity conflict extraction module.
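To make "multi-granularity conflict" concrete, the sketch below scores text–image disagreement at four granularities. These follow one reading of the notation in Table 1: local–local, local text vs. global image, local image vs. global text, and global–global. The sketch assumes that text and image features already lie in a shared embedding space, and it uses 1 − cosine similarity as the conflict measure. This is an illustrative stand-in, not the paper's operators $F_{seq}(\cdot)$ and $H(\cdot)$, which produce feature vectors rather than scalar scores. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def conflict_scores(
    text_tokens: Sequence[Vector],
    image_patches: Sequence[Vector],
    text_global: Vector,
    image_global: Vector,
) -> Dict[str, float]:
    """Conflict = 1 - similarity at four granularities (higher = more conflict)."""
    # local-local: match each text token to its most similar image patch
    best = [max(cosine(t, p) for p in image_patches) for t in text_tokens]
    ll = 1.0 - sum(best) / len(best)
    # local text vs. global image, and local image vs. global text
    lg_t = 1.0 - sum(cosine(t, image_global) for t in text_tokens) / len(text_tokens)
    lg_i = 1.0 - sum(cosine(p, text_global) for p in image_patches) / len(image_patches)
    # global-global: whole-post text vs. whole image
    gg = 1.0 - cosine(text_global, image_global)
    return {"ll": ll, "lg_t": lg_t, "lg_i": lg_i, "gg": gg}


# Toy usage: a text token with no matching patch raises the local-local
# score even though the global features agree (gg stays near zero).
tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [1.0, 0.0]]
print(conflict_scores(tokens, patches, [1.0, 1.0], [1.0, 1.0]))
```

The toy case mirrors the situation in Figure 14(a). There, a mismatch is visible only at the token/patch level. This is why a global-only comparison can miss it.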
Figure 4. Architecture of the domain-guided multimodal feature calibration module.
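Table 1 names two calibration operators: Res-FiLM, with a dynamic scaling factor and offset, and DGR, domain-guided gated redundancy removal. The sketch below shows one common way to write such operators, under explicit assumptions. Residual FiLM is taken as $x + (s \odot x + b)$, with $s$ and $b$ predicted from $e_d$ by hypothetical linear layers. Redundancy is modeled as the component of one modality's feature that lies along the other modality's feature, and it is removed through a sigmoid gate. The exact formulas in GC2MFND may differ, and all weights shown are made up.

```python
import math
from typing import List, Sequence

Vector = List[float]


def linear(weight: Sequence[Sequence[float]], bias: Sequence[float], x: Sequence[float]) -> Vector:
    """y = W x + b for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def res_film(x: Sequence[float], scale: Sequence[float], shift: Sequence[float]) -> Vector:
    """Residual FiLM: x + (scale * x + shift), applied element-wise."""
    return [xi + (s * xi + sh) for xi, s, sh in zip(x, scale, shift)]


def gated_redundancy_removal(x: Sequence[float], other: Sequence[float], gate_logits: Sequence[float]) -> Vector:
    """Subtract a gated share of the part of x that points along `other`."""
    denom = sum(o * o for o in other)
    coef = sum(a * o for a, o in zip(x, other)) / denom if denom else 0.0
    redundant = [coef * o for o in other]  # projection of x onto `other`
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]  # sigmoid in (0, 1)
    return [xi - g * r for xi, g, r in zip(x, gates, redundant)]


# Toy usage with a 2-d domain embedding (all weights hypothetical).
e_d = [1.0, 0.0]
scale = linear([[0.5, 0.0], [0.0, 0.0]], [0.0, 0.0], e_d)   # -> [0.5, 0.0]
shift = linear([[0.0, 0.0], [0.2, 0.0]], [0.0, 0.0], e_d)   # -> [0.0, 0.2]
t_mod = res_film([1.0, 1.0], scale, shift)
gate = linear([[4.0, 0.0], [4.0, 0.0]], [0.0, 0.0], e_d)    # domain-dependent gate logits
t_hat = gated_redundancy_removal(t_mod, [1.0, 0.0], gate)
print(t_mod, [round(v, 3) for v in t_hat])
```

The residual form keeps the identity path. With zero scale and shift, the feature passes through unchanged, so domain conditioning can only adjust the feature rather than replace it.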
Figure 5. t-SNE visualization of the sample distribution on Weibo: (a) original samples; (b) learned features.
Figure 6. t-SNE visualization of the sample distribution on Weibo21: (a) original samples; (b) learned features.
Figure 7. t-SNE visualization of the sample distribution on FineFake: (a) original samples; (b) learned features.
Figure 8. Line charts of (a) the domain-adaptive multi-granularity conflict fusion weights and (b) the multi-channel feature fusion weights on the Weibo, Weibo21, and FineFake datasets.
Figure 9. Performance of GC2MFND under different values of α on (a) Weibo, (b) Weibo21, and (c) FineFake.
Figure 10. Performance of GC2MFND under different values of β on (a) Weibo, (b) Weibo21, and (c) FineFake.
Figure 11. Performance of GC2MFND under different values of γ on (a) Weibo, (b) Weibo21, and (c) FineFake.
Figure 12. Performance of GC2MFND under different values of τ on (a) Weibo, (b) Weibo21, and (c) FineFake.
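Figure 12 varies τ. Assuming that τ is the temperature of the contrastive term, as in supervised contrastive learning [47], the sketch below implements the standard supervised contrastive loss so that the role of τ is visible. Dividing similarities by a smaller τ sharpens the softmax over neighbors. The domain-aware soft weighting that distinguishes $L_{DSC}$ is not reproduced here. The function names and the default τ are illustrative, not values from this paper.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def supcon_loss(features: Sequence[Sequence[float]], labels: Sequence[int], tau: float = 0.1) -> float:
    """Supervised contrastive loss (one view per sample), averaged over anchors.

    For anchor i, positives are other samples with the same label; the
    denominator sums over every sample except i. Similarities are dot
    products of L2-normalized features divided by the temperature tau.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau for a in range(n) if a != i}
        m = max(sims.values())  # log-sum-exp shift for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for tau in (0.05, 0.5):
    print(tau, round(supcon_loss(feats, [0, 0, 1, 1], tau), 4))
```

When labels agree with the geometry, as in the toy clusters above, the loss is small. Relabeling the same points so that positives are far apart increases it, which is the pressure that pulls same-class samples together in the learned t-SNE plots.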
Figure 13. Comparison of inference time and GPU memory consumption across different methods.
Figure 14. Examples from the Weibo21 and FineFake datasets. (a) A local conflict between the text “toothpaste” and an image illustrating the hazards of children’s toothpaste; (b) a global conflict between the text “camels begging” and the image “a camel being led by a person”, together with a local conflict between the text “amputated limbs” and the image “a camel with normal limbs”; (c) consistency between the image and the textual references to “Biden” and “girl”; (d) high consistency between the text and image in elements such as “bear,” “bipedal stance,” “wrinkled skin,” and “human”.
Table 1. Notations and definitions used in this paper.
| Notation | Definition | Notation | Definition |
|---|---|---|---|
| $T_{local}$, $I_{local}$ | Local text/image features | $T_{local}^{mod}$, $I_{local}^{mod}$ | Modulated local text/image features |
| $T_{global}$, $I_{global}$ | Global text/image features | $\hat{T}_{local}$, $\hat{I}_{local}$ | Purified features without redundancy |
| $\tilde{T}_{local}$, $\tilde{I}_{local}$ | Enhanced local text/image features | $T_{calib}$, $I_{calib}$ | Calibrated text/image features |
| $\tilde{T}_{global}$, $\tilde{I}_{global}$ | Enhanced global text/image features | $s_m$, $sh_m$ | Dynamic scaling factor and offset |
| $F_{ll}$, $F_{lg}^{t}$, $F_{lg}^{i}$, $F_{gg}$ | Multi-granularity conflict features | $R_{ti}$, $R_{it}$ | Cross-modal redundant features |
| $w_{ll}$, $w_{lg}^{t}$, $w_{lg}^{i}$, $w_{gg}$ | Conflict feature fusion weights | $w_t$, $w_i$, $w_c$ | Multimodal feature aggregation weights |
| $e_d$ | Domain embedding | $\mathrm{BCE}_{smooth}$ | Smoothed BCE loss |
| $F_C$ | Domain-aware conflict features | $L_{final}$ | Final discriminative loss |
| $F_m$ | Fused multimodal features | $L_{total}$ | Total loss |
| $F_f$ | Final discriminative features | $L_t$ | Text feature classification loss |
| $F_{seq}(\cdot)$ | Local conflict extraction operator | $L_i$ | Image feature classification loss |
| $H(\cdot)$ | Global conflict extraction operator | $L_{DSC}$ | Domain-aware soft-weighted contrastive loss |
| Res-FiLM | Residual Feature-wise Linear Modulation | DGR | Domain-guided Gated Redundancy Removal |
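Table 1 lists a smoothed BCE loss, and Table 6 ablates it (-w/o_Smooth). The following is a minimal sketch of binary cross-entropy with standard symmetric label smoothing. The smoothing strength `eps` and the probability clipping are assumptions, because this section does not give their values. The function name is hypothetical.

```python
import math
from typing import Sequence


def smoothed_bce(probs: Sequence[float], labels: Sequence[int], eps: float = 0.1) -> float:
    """Mean binary cross-entropy with symmetric label smoothing.

    Hard targets 0/1 become eps/2 and 1 - eps/2, which discourages the model
    from pushing predicted probabilities all the way to 0 or 1.
    """
    clip = 1e-7  # keeps log() finite for p == 0 or p == 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, clip), 1.0 - clip)
        target = y * (1.0 - eps) + eps / 2.0
        total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return total / len(probs)


# With eps = 0.1, the loss for label y = 1 is lowest at p = 0.95, not p -> 1.
for p in (0.9, 0.95, 0.999):
    print(p, round(smoothed_bce([p], [1]), 4))
```

With `eps = 0` the function reduces to ordinary BCE. The smoothed version penalizes over-confident predictions, which is the usual motivation for smoothing on noisy news labels.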
Table 2. Statistics of the datasets used in our experiments.
| Datasets | Fake News | Real News | All |
|---|---|---|---|
| Weibo | 4783 | 4745 | 9528 |
| Weibo21 | 4487 | 4640 | 9127 |
| FineFake | 6402 | 10,507 | 16,909 |
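A few lines of arithmetic check the totals in Table 2 and make the class balance explicit. Weibo (about 50.2% fake) and Weibo21 (about 49.2% fake) are close to balanced. FineFake is skewed toward real news, at roughly 38% fake. This skew is worth keeping in mind when comparing accuracy with the per-class F1 scores in Tables 5 and 6. The dictionary and function names are illustrative.

```python
# Class counts copied from Table 2: (fake, real).
COUNTS = {"Weibo": (4783, 4745), "Weibo21": (4487, 4640), "FineFake": (6402, 10507)}


def summarize(counts):
    """Return {dataset: (total, fake_share)} for each dataset."""
    return {name: (fake + real, fake / (fake + real)) for name, (fake, real) in counts.items()}


for name, (total, share) in summarize(COUNTS).items():
    print(f"{name}: {total} samples, {share:.1%} fake")
```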
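Table 3. Comparison between GC2MFND and the latest multi-domain fake news detection methods on Weibo, Weibo21, and FineFake.
| Dataset | Method | Sci. | Mil/Con | Edu. | Soc. | Pol. | Hlth. | Fin. | Ent. | Dis/Int | All: F1 | All: Acc | All: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weibo | MOSE * | 0.793 | 0.738 | 0.834 | 0.912 | 0.764 | 0.859 | 0.791 | 0.844 | 0.883 | 0.890 | 0.890 | 0.954 |
| | KATMF * | 0.831 | 0.908 | 0.924 | 0.895 | 0.823 | 0.898 | 0.903 | 0.904 | 0.894 | 0.929 | 0.930 | 0.969 |
| | MDFEND * | 0.774 | 0.911 | 0.897 | 0.902 | 0.763 | 0.878 | 0.808 | 0.881 | 0.874 | 0.904 | 0.904 | 0.965 |
| | M3DFEND * | 0.792 | 0.903 | 0.923 | 0.912 | 0.765 | 0.863 | 0.899 | 0.899 | 0.876 | 0.928 | 0.928 | 0.969 |
| | MMDFND * | 0.824 | 0.911 | 0.941 | 0.939 | 0.735 | 0.913 | 0.917 | 0.917 | 0.888 | 0.934 | 0.934 | 0.972 |
| | DAMMFND | 0.853 | 0.911 | 0.956 | 0.943 | 0.822 | 0.939 | 0.937 | 0.956 | 0.928 | 0.942 | 0.942 | 0.982 |
| | GC2MFND | 0.853 | 0.956 | 0.972 | 0.957 | 0.823 | 0.948 | 0.917 | 0.940 | 0.889 | 0.953 | 0.953 | 0.986 |
| Weibo21 | MOSE * | 0.850 | 0.885 | 0.881 | 0.872 | 0.880 | 0.917 | 0.867 | 0.891 | 0.867 | 0.893 | 0.894 | 0.954 |
| | KATMF * | 0.914 | 0.928 | 0.913 | 0.895 | 0.902 | 0.914 | 0.871 | 0.937 | 0.898 | 0.923 | 0.928 | 0.975 |
| | MDFEND * | 0.830 | 0.938 | 0.891 | 0.898 | 0.886 | 0.940 | 0.895 | 0.906 | 0.900 | 0.913 | 0.913 | 0.970 |
| | M3DFEND * | 0.829 | 0.950 | 0.899 | 0.908 | 0.882 | 0.946 | 0.900 | 0.931 | 0.889 | 0.921 | 0.921 | 0.975 |
| | MMDFND * | 0.937 | 0.953 | 0.852 | 0.945 | 0.965 | 0.920 | 0.884 | 0.959 | 0.919 | 0.939 | 0.939 | 0.977 |
| | DAMMFND | 0.932 | 0.948 | 0.931 | 0.916 | 0.978 | 0.919 | 0.937 | 0.970 | 0.944 | 0.945 | 0.945 | 0.983 |
| | GC2MFND | 0.988 | 0.998 | 0.965 | 0.928 | 0.982 | 0.936 | 0.941 | 0.980 | 0.945 | 0.957 | 0.957 | 0.986 |
| FineFake | MOSE | - | 0.691 | - | 0.788 | 0.732 | 0.786 | 0.777 | 0.831 | - | 0.773 | 0.775 | 0.851 |
| | KATMF | - | 0.662 | - | 0.778 | 0.740 | 0.813 | 0.803 | 0.824 | - | 0.776 | 0.778 | 0.861 |
| | MDFEND | - | 0.699 | - | 0.781 | 0.742 | 0.796 | 0.810 | 0.830 | - | 0.782 | 0.785 | 0.873 |
| | M3DFEND | - | 0.648 | - | 0.813 | 0.732 | 0.762 | 0.827 | 0.857 | - | 0.781 | 0.781 | 0.873 |
| | MMDFND | - | 0.694 | - | 0.779 | 0.752 | 0.803 | 0.840 | 0.832 | - | 0.784 | 0.789 | 0.875 |
| | DAMMFND | - | 0.713 | - | 0.807 | 0.769 | 0.827 | 0.833 | 0.833 | - | 0.795 | 0.798 | 0.882 |
| | GC2MFND | - | 0.693 | - | 0.808 | 0.785 | 0.830 | 0.842 | 0.788 | - | 0.807 | 0.812 | 0.890 |
Bold: best results, Underline: second best results.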
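The improvements quoted in the abstract (1.1%, 1.2%, and 1.4%) are absolute differences in overall accuracy between GC2MFND and the strongest multi-domain baseline in Table 3, DAMMFND. The check below reproduces them from the table values in percentage points. The names are illustrative.

```python
# Overall accuracy ("All" Acc column) from Table 3.
ACC = {
    "Weibo": {"DAMMFND": 0.942, "GC2MFND": 0.953},
    "Weibo21": {"DAMMFND": 0.945, "GC2MFND": 0.957},
    "FineFake": {"DAMMFND": 0.798, "GC2MFND": 0.812},
}


def gain_points(dataset: str) -> float:
    """Accuracy gain of GC2MFND over DAMMFND in percentage points."""
    row = ACC[dataset]
    return round((row["GC2MFND"] - row["DAMMFND"]) * 100, 1)


for name in ACC:
    print(name, gain_points(name))
```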
Table 4. Comparison of the stability and p-values of GC2MFND with two strong baseline methods across three datasets.
| Datasets | Method | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Weibo | MMDFND | 92.96 ± 0.80 | 92.99 ± 0.80 | 92.97 ± 0.81 | 92.95 ± 0.80 |
| | DAMMFND | 93.27 ± 0.48 | 93.31 ± 0.48 | 93.25 ± 0.49 | 93.26 ± 0.48 |
| | GC2MFND | 94.78 ± 0.26 | 94.78 ± 0.26 | 94.82 ± 0.26 | 94.78 ± 0.26 |
| | p-value (vs. MMDFND) | $2.842 \times 10^{-5}$ | $3.375 \times 10^{-5}$ | $2.741 \times 10^{-5}$ | $2.693 \times 10^{-5}$ |
| | p-value (vs. DAMMFND) | $5.761 \times 10^{-7}$ | $7.672 \times 10^{-7}$ | $4.871 \times 10^{-7}$ | $5.350 \times 10^{-7}$ |
| Weibo21 | MMDFND | 93.19 ± 0.84 | 93.21 ± 0.84 | 93.19 ± 0.85 | 93.19 ± 0.85 |
| | DAMMFND | 94.02 ± 0.87 | 94.03 ± 0.87 | 94.02 ± 0.87 | 94.02 ± 0.87 |
| | GC2MFND | 95.07 ± 0.38 | 95.09 ± 0.38 | 95.08 ± 0.39 | 95.07 ± 0.38 |
| | p-value (vs. MMDFND) | $2.943 \times 10^{-5}$ | $2.943 \times 10^{-5}$ | $2.885 \times 10^{-5}$ | $3.112 \times 10^{-5}$ |
| | p-value (vs. DAMMFND) | $2.023 \times 10^{-4}$ | $1.745 \times 10^{-4}$ | $1.642 \times 10^{-4}$ | $2.023 \times 10^{-4}$ |
| FineFake | MMDFND | 75.29 ± 4.02 | 75.15 ± 4.15 | 74.62 ± 4.18 | 74.73 ± 4.22 |
| | DAMMFND | 79.32 ± 0.94 | 79.27 ± 1.04 | 78.79 ± 0.82 | 78.92 ± 0.88 |
| | GC2MFND | 80.77 ± 0.69 | 80.63 ± 0.79 | 80.41 ± 0.65 | 80.47 ± 0.66 |
| | p-value (vs. MMDFND) | $1.902 \times 10^{-3}$ | $2.411 \times 10^{-3}$ | $1.672 \times 10^{-3}$ | $1.943 \times 10^{-3}$ |
| | p-value (vs. DAMMFND) | $1.173 \times 10^{-3}$ | $4.341 \times 10^{-3}$ | $1.391 \times 10^{-4}$ | $3.834 \times 10^{-4}$ |
Bold: best results. Note: For each dataset, the first p-value row compares GC2MFND with MMDFND and the second with DAMMFND.
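This section does not state which significance test produced the p-values in Table 4, nor how many runs each mean ± standard deviation summarizes. As one plausible reading, the sketch below computes Welch's two-sample t statistic and its Welch–Satterthwaite degrees of freedom directly from the reported summary statistics. The run count `N_RUNS = 5` is a hypothetical placeholder, and the reported spreads are assumed to be sample standard deviations. For both reasons, the resulting values are not expected to match the table.

```python
import math
from typing import Tuple


def welch_t(mean1: float, sd1: float, n1: int, mean2: float, sd2: float, n2: int) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


N_RUNS = 5  # hypothetical: the number of runs is not given in this section
# Weibo accuracy, GC2MFND vs. DAMMFND (Table 4).
t, df = welch_t(94.78, 0.26, N_RUNS, 93.27, 0.48, N_RUNS)
print(f"t = {t:.2f}, df = {df:.1f}")
```

If SciPy is available, `scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=False)` returns the same statistic together with a two-sided p-value.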
Table 5. Performance comparison across different datasets.
| Datasets | Method | Accuracy | F1 (Fake News) | F1 (Real News) |
|---|---|---|---|---|
| Weibo | EANN | 0.827 | 0.829 | 0.825 |
| | SpotFake | 0.892 | 0.932 | 0.739 |
| | CAFE | 0.840 | 0.842 | 0.837 |
| | BMR | 0.918 | 0.914 | 0.904 |
| | MTS | 0.925 | 0.926 | 0.924 |
| | MIAN | 0.936 | 0.935 | 0.937 |
| | GC2MFND | 0.953 | 0.954 | 0.952 |
| Weibo21 | EANN | 0.870 | 0.862 | 0.875 |
| | SpotFake | 0.851 | 0.828 | 0.866 |
| | CAFE | 0.882 | 0.885 | 0.876 |
| | BMR | 0.929 | 0.927 | 0.925 |
| | MTS | 0.928 | 0.929 | 0.927 |
| | MIAN | 0.938 | 0.936 | 0.939 |
| | GC2MFND | 0.957 | 0.956 | 0.957 |
| FineFake | EANN | 0.785 | 0.770 | 0.792 |
| | SpotFake | 0.779 | 0.760 | 0.791 |
| | CAFE | 0.783 | 0.769 | 0.798 |
| | BMR | 0.783 | 0.729 | 0.819 |
| | MIAN | 0.784 | 0.759 | 0.804 |
| | MTS | 0.788 | 0.758 | 0.811 |
| | GC2MFND | 0.812 | 0.776 | 0.838 |
Bold: best results.
Table 6. Ablation study results on different datasets.
| Datasets | Method | Accuracy | F1 (Fake News) | F1 (Real News) |
|---|---|---|---|---|
| Weibo | GC2MFND | 0.953 | 0.954 | 0.952 |
| | -w/o_Conflict | 0.944 | 0.944 | 0.943 |
| | -w/o_Calib | 0.939 | 0.941 | 0.938 |
| | -w/o_Domain | 0.941 | 0.942 | 0.939 |
| | -w/o_Loss | 0.948 | 0.949 | 0.947 |
| | -w/o_Smooth | 0.941 | 0.942 | 0.939 |
| Weibo21 | GC2MFND | 0.957 | 0.956 | 0.957 |
| | -w/o_Conflict | 0.949 | 0.950 | 0.949 |
| | -w/o_Calib | 0.941 | 0.942 | 0.940 |
| | -w/o_Domain | 0.944 | 0.944 | 0.944 |
| | -w/o_Loss | 0.941 | 0.942 | 0.939 |
| | -w/o_Smooth | 0.942 | 0.943 | 0.940 |
| FineFake | GC2MFND | 0.812 | 0.776 | 0.838 |
| | -w/o_Conflict | 0.808 | 0.769 | 0.836 |
| | -w/o_Calib | 0.808 | 0.775 | 0.832 |
| | -w/o_Domain | 0.806 | 0.771 | 0.832 |
| | -w/o_Loss | 0.801 | 0.758 | 0.831 |
| | -w/o_Smooth | 0.804 | 0.779 | 0.824 |
Bold: best results.
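Reading Table 6 as accuracy drops relative to the full model makes the contribution of each component easier to compare. The helper below does this arithmetic from the table values. On Weibo, the largest drop comes from -w/o_Calib (1.4 points). On Weibo21, -w/o_Calib and -w/o_Loss tie at 1.6 points. On FineFake, -w/o_Loss has the largest drop (1.1 points). The names are illustrative.

```python
# Overall accuracy from Table 6 (full model first, then ablated variants).
ABLATION = {
    "Weibo": {"GC2MFND": 0.953, "-w/o_Conflict": 0.944, "-w/o_Calib": 0.939,
              "-w/o_Domain": 0.941, "-w/o_Loss": 0.948, "-w/o_Smooth": 0.941},
    "Weibo21": {"GC2MFND": 0.957, "-w/o_Conflict": 0.949, "-w/o_Calib": 0.941,
                "-w/o_Domain": 0.944, "-w/o_Loss": 0.941, "-w/o_Smooth": 0.942},
    "FineFake": {"GC2MFND": 0.812, "-w/o_Conflict": 0.808, "-w/o_Calib": 0.808,
                 "-w/o_Domain": 0.806, "-w/o_Loss": 0.801, "-w/o_Smooth": 0.804},
}


def accuracy_drops(dataset: str) -> dict:
    """Percentage-point accuracy drop of each ablated variant vs. the full model."""
    row = ABLATION[dataset]
    full = row["GC2MFND"]
    return {name: round((full - acc) * 100, 1) for name, acc in row.items() if name != "GC2MFND"}


for name in ABLATION:
    drops = accuracy_drops(name)
    worst = max(drops.values())
    print(name, drops, "largest:", [k for k, v in drops.items() if v == worst])
```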
Table 7. Computational efficiency comparison of different methods.
| Method | Training Time (s): Weibo | Training Time (s): Weibo21 | Training Time (s): FineFake | Testing Time (s): Weibo | Testing Time (s): Weibo21 | Testing Time (s): FineFake | Parameters (M) |
|---|---|---|---|---|---|---|---|
| MMDFND | 302.95 | 275.44 | 744.87 | 35.59 | 27.10 | 249.86 | 347.82 |
| DAMMFND | 61.93 | 57.61 | 164.96 | 15.97 | 13.98 | 71.44 | 76.62 |
| GC2MFND | 45.27 | 43.03 | 360.43 | 18.13 | 13.96 | 120.31 | 6.82 |
Bold: best results.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, Y.; Zhang, M.; Zhang, F. GC2MFND: Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection. Entropy 2026, 28, 672. https://doi.org/10.3390/e28060672

AMA Style

Sun Y, Zhang M, Zhang F. GC2MFND: Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection. Entropy. 2026; 28(6):672. https://doi.org/10.3390/e28060672

Chicago/Turabian Style

Sun, Yanming, Mingyue Zhang, and Fujun Zhang. 2026. "GC2MFND: Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection" Entropy 28, no. 6: 672. https://doi.org/10.3390/e28060672

APA Style

Sun, Y., Zhang, M., & Zhang, F. (2026). GC2MFND: Multi-Granularity Conflict and Domain-Guided Calibration for Multimodal Fake News Detection. Entropy, 28(6), 672. https://doi.org/10.3390/e28060672

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
