4.2. Overall Performance
To address RQ1 and RQ2, this section compares GC2MFND with the three representative baselines described above and analyzes the results in terms of both overall performance and per-domain F1 scores. For the Weibo and Weibo21 datasets, the existing state-of-the-art results are taken from prior experiments [14] and are marked with an asterisk (*) in Table 3. For the newly introduced FineFake dataset, publicly available results for existing methods are limited, so we reproduce each baseline under a standardized experimental setup and report those results.
As shown in Table 3, GC2MFND is compared with representative state-of-the-art multi-domain fake news detection baselines on three benchmark datasets and achieves the best results on the overall evaluation metrics. On the Weibo dataset, GC2MFND achieves overall F1, Acc, and AUC scores of 0.953, 0.953, and 0.986, respectively, representing improvements of 1.1%, 1.1%, and 0.4% over the best competing method. On the Weibo21 dataset, GC2MFND achieves overall F1, Acc, and AUC scores of 0.957, 0.957, and 0.986, respectively, representing improvements of 1.2%, 1.2%, and 0.3% over the best competing method. On the larger and more diverse FineFake dataset, GC2MFND achieves overall F1, Acc, and AUC scores of 0.807, 0.812, and 0.890, respectively, representing improvements of 1.2%, 1.3%, and 0.8% over the best baseline method.
GC2MFND remains highly competitive in terms of F1 scores for most domains. On the Weibo dataset, GC2MFND achieves the best results in the military, education, society, politics, and health domains and ties with DAMMFND in the science domain, whereas DAMMFND performs better in the finance, entertainment, and disaster domains. On the Weibo21 dataset, GC2MFND achieves the best results in the science, military, education, politics, finance, entertainment, and international domains, but performs slightly worse than some comparison methods in the society and health domains. On the FineFake dataset, GC2MFND achieves the best results in the society, politics, health, and finance domains, but performs slightly worse in the entertainment and conflict domains.
We attribute the performance differences across domains primarily to imbalanced sample distributions and domain-specific heterogeneity. On the one hand, domains with larger sample sizes benefit from stronger supervisory signals, while resource-poor domains are more prone to training biases, which causes detection performance to fluctuate across domains. On the other hand, differences in topic attributes, semantic expressions, and text–image association patterns across domains increase detection difficulty. In particular, the FineFake dataset introduces cross-platform heterogeneity, which leads to more pronounced performance fluctuations and significantly lower overall detection performance than on the two Chinese datasets. Nevertheless, GC2MFND still outperforms the baselines in most domains and improves overall detection performance by mitigating domain heterogeneity.
To validate the stability of our method, we repeat the experiments under ten different random seeds, compute the mean and standard deviation for GC2MFND and two strong baselines, and confirm that the performance improvements over these baselines are statistically significant using a t-test (p < 0.05), as shown in Table 4.
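The significance check described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the experiments: the per-seed scores are invented placeholder numbers, the paired form of the t-test is an assumption (the text only states that a t-test is used), and `paired_t_statistic` is a hypothetical helper. The constant 2.262 is the two-sided 5% critical value of Student's t distribution with 9 degrees of freedom, which applies to ten paired runs.

```python
import math
import statistics

# Hypothetical per-seed F1 scores for ten seeds. These are placeholder
# values for illustration only, NOT results reported in the paper.
ours = [0.953, 0.951, 0.955, 0.952, 0.954, 0.950, 0.953, 0.956, 0.952, 0.954]
baseline = [0.942, 0.940, 0.945, 0.941, 0.943, 0.941, 0.940, 0.944, 0.943, 0.942]


def paired_t_statistic(a, b):
    """Return the paired t statistic for two equal-length score lists."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Two-sided critical value at alpha = 0.05 with 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262

t_stat = paired_t_statistic(ours, baseline)
significant = abs(t_stat) > T_CRITICAL_DF9

print(f"ours:     {statistics.mean(ours):.4f} +/- {statistics.stdev(ours):.4f}")
print(f"baseline: {statistics.mean(baseline):.4f} +/- {statistics.stdev(baseline):.4f}")
print(f"t = {t_stat:.2f}, significant at p < 0.05: {significant}")
```

Pairing by seed is a natural choice when both models are run with the same ten seeds; if the runs are independent, an unpaired test such as Welch's t-test would be the corresponding alternative.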
Table 5 presents the comparison between GC2MFND and multimodal single-domain detection methods, including accuracy and F1 scores for fake and real news. Overall, GC2MFND achieves the best performance on all three datasets. Specifically, in terms of overall accuracy, GC2MFND outperforms the best baseline methods by 1.7%, 1.9%, and 2.4% on the Weibo, Weibo21, and FineFake datasets, respectively. For the fake news F1 score, the improvements are 1.9%, 2.0%, and 1.8% on the respective datasets; for the real news F1 score, the improvements are 1.5%, 1.8%, and 2.7%. These results indicate that GC2MFND not only outperforms single-domain multimodal detection methods in overall classification performance but also recognizes both fake and real news samples more accurately. The Chinese datasets show a greater improvement for fake news, whereas the English dataset shows a greater improvement for real news. These improvements primarily stem from GC2MFND’s ability to effectively extract multi-granularity conflict features. In the Chinese datasets, where conflicts in fake news are prominent, the model achieves high accuracy. In the English dataset in particular, the accompanying images, which are intended to enrich multimodal news presentation, cause even real news to exhibit minor conflicts. By leveraging this rich conflict information, the model distinguishes between real and fake news and thereby reduces false positives for real news.
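The per-class metrics discussed above (overall accuracy plus separate F1 scores for fake and real news) can be computed as in the following sketch. The label convention (1 = fake, 0 = real), the toy label lists, and the helper names `f1_for_class` and `accuracy` are assumptions made for illustration and are not taken from the experiments.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels for illustration only (assumed convention: 1 = fake, 0 = real).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

f1_fake = f1_for_class(y_true, y_pred, positive=1)
f1_real = f1_for_class(y_true, y_pred, positive=0)
acc = accuracy(y_true, y_pred)
print(f"Acc = {acc:.3f}, F1 (fake) = {f1_fake:.3f}, F1 (real) = {f1_real:.3f}")
```

Reporting F1 separately for each class makes it visible whether a gain comes from catching more fake items or from fewer false alarms on real items, which is the distinction the paragraph above draws between the Chinese and English datasets.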
Figure 5, Figure 6 and Figure 7 show the t-SNE visualizations of the sample distributions produced by the model on the Weibo, Weibo21, and FineFake datasets. The t-SNE parameters are set as follows: perplexity = 40, PCA initialization, and random seed = 3074, consistent with the baseline experiments. In Figure 5a, Figure 6a and Figure 7a, real and fake news samples are intermingled, whereas in Figure 5b, Figure 6b and Figure 7b, real and fake news are relatively well separated, with only a few samples not fully separated. This supports the effectiveness of GC2MFND in multimodal fake news classification.
4.4. Discussions
4.4.1. Evaluation of Domain-Adaptive Fusion Mechanisms
To further validate the effectiveness of the domain-adaptive fusion mechanism, we analyze the routing weights dynamically learned by the model on Weibo, Weibo21, and FineFake. We use the sigmoid weights as a quantitative measure of feature dependency across domains. The results are shown in Figure 8.
Figure 8a and Figure 8b show the differences in sigmoid weights for conflict patterns and multi-channel feature fusion, respectively. The sigmoid outputs are independent gate values rather than a normalized distribution. Although most values are close to 0.5, the relative ordering across features reliably reflects the model’s dependency strength.
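The distinction between independent sigmoid gates and a normalized distribution can be made concrete with a small sketch. The gate logits below are invented values for a single hypothetical domain, and the four labels simply name the conflict types discussed in this subsection; nothing here reproduces the model's actual routing weights.

```python
import math


def sigmoid(x):
    """Independent gate: each value lies in (0, 1) on its own."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Normalized alternative: values compete and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical gate logits for one domain (invented for illustration).
gate_logits = {
    "global-global": 0.20,
    "image-local/text-global": -0.05,
    "text-local/image-global": 0.10,
    "local-local": -0.15,
}

gates = {name: sigmoid(z) for name, z in gate_logits.items()}
normalized = dict(zip(gate_logits, softmax(list(gate_logits.values()))))

# Gates need not sum to 1, so only their relative ordering is compared.
ranking = sorted(gates, key=gates.get, reverse=True)

print("sigmoid gates:", {k: round(v, 3) for k, v in gates.items()})
print("sum of gates: ", round(sum(gates.values()), 3))
print("softmax sum:  ", round(sum(normalized.values()), 3))
print("ranking:      ", ranking)
```

Because each gate is computed independently, small logits all map to values near 0.5, which is why the analysis reads the ordering of the weights rather than their absolute magnitudes.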
In the fusion of both conflict patterns and multi-channel features, the sigmoid weights of the different channels vary with the domain, indicating that the model adaptively adjusts its reliance on each feature according to the domain attributes of the input. In the Weibo and Weibo21 datasets, domains exhibit similar modulation patterns: the society and entertainment domains rely relatively more on global semantic conflicts; the science, military, and politics domains show slightly stronger weights for semantic conflicts between image-local and text-global features; and the health, education, finance, and disaster domains tend to depend more on semantic conflicts between text-local and image-global features. Among the multi-channel features, conflict-fusion features receive slightly higher weights. In the FineFake dataset, the sigmoid weights for fine-grained local–local conflict features show relatively higher activation levels across most domains. However, in the conflict and politics domains, the sigmoid weights for the four conflict types are relatively close, which we attribute primarily to the official and rigorous writing styles that characterize these two domains. Additionally, among the multi-channel features, text features have relatively higher sigmoid weights in the politics, business, and conflict domains, while image features and conflict features receive slightly lower weights.
4.4.2. Parameter Analysis
We analyze the sensitivity of the method to four key parameters on the Weibo, Weibo21, and FineFake datasets: the conflict feature loss parameter, the correction loss parameter, the contrastive loss parameter, and the contrastive loss temperature parameter.
Figure 9, Figure 10, Figure 11 and Figure 12 present the experimental results for these four parameters. Overall, the Chinese datasets are more sensitive to parameter variations than the English dataset. For the conflict feature loss parameter, the model peaks at 0.3 on the Chinese datasets; on the English dataset, favorable results are observed over a range of values, with 0.1 yielding the better outcome. For the correction loss parameter, GC2MFND works well at 0.2 on the Chinese datasets and at 0.1 on the English dataset.
Regarding the contrastive loss parameter, the Weibo and Weibo21 datasets perform better at 0.7, whereas the FineFake dataset performs well over a range of values, with 0.05 giving higher accuracy. For the contrastive loss temperature parameter, GC2MFND performs consistently at 0.1 across all three datasets, which makes it a robust choice. Furthermore, the performance drop across all three datasets under the same parameter settings does not exceed 0.009, and performance remains higher than that of the respective baselines, which indicates that the model is stable. Based on these findings, we set the conflict feature loss, correction loss, contrastive loss, and temperature parameters to 0.3, 0.2, 0.7, and 0.1, respectively, for the Chinese datasets, and to 0.1, 0.1, 0.05, and 0.1, respectively, for the English dataset.
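To illustrate why a contrastive temperature parameter affects training, the sketch below evaluates an InfoNCE-style loss at several temperatures. It rests on an assumption: the section does not state the exact form of the contrastive loss, so the InfoNCE formulation, the function name `info_nce`, and the similarity values are all hypothetical.

```python
import math


def info_nce(similarities, positive_index, temperature):
    """InfoNCE-style loss for one anchor.

    `similarities` holds the anchor's similarity to each candidate and
    `positive_index` marks the matching (positive) candidate.
    """
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[positive_index]


# Hypothetical cosine similarities; index 0 is the positive pair.
similarities = [0.9, 0.2, 0.1, -0.3]

loss_by_temperature = {
    tau: info_nce(similarities, positive_index=0, temperature=tau)
    for tau in (0.05, 0.1, 0.5, 1.0)
}

for tau, loss in loss_by_temperature.items():
    print(f"temperature = {tau}: loss = {loss:.4f}")
```

Under this assumed formulation, dividing by a smaller temperature magnifies the gap between the positive and the negatives, so the loss reacts more sharply to similarity differences; this is one reason such a parameter is usually tuned per dataset.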
4.4.3. Computational Cost Analysis
To comprehensively evaluate the computational efficiency of GC2MFND, we compare the average single-round training time, testing time, inference time, GPU memory consumption, and number of parameters across various models on Weibo, Weibo21, and FineFake under a unified experimental setting. We select the two best-performing baselines (MMDFND and DAMMFND) for a fair comparison. Since all models share the same pre-trained feature extractor, differences in time and parameters arise solely from their respective downstream network designs.
Table 7 presents the computational overhead metrics. GC2MFND demonstrates a clear advantage in parameter efficiency, with a trainable parameter count of only 6.82 million, considerably lower than the two baselines. This advantage stems from architectural differences: MMDFND and DAMMFND employ multiple expert networks or domain-aware Transformer decoders, leading to a large parameter count; in contrast, GC2MFND uses only lightweight operators, gated networks, multi-scale adapters, low-dimensional domain embeddings, conflict extraction and calibration modules based on linear mappings and attention, as well as an MLP classifier.
On Weibo and Weibo21, GC2MFND demonstrates competitive training efficiency, with single-round training times of approximately 45.27 s and 43.03 s, respectively, which is approximately 25% faster than DAMMFND. Its testing time is approximately 50% faster than that of MMDFND and comparable to that of DAMMFND. On the larger-scale FineFake dataset, DAMMFND achieves the best training efficiency because it relies solely on discrete-domain routing. In contrast, GC2MFND incorporates fine-grained cross-modal interaction and attention calibration, which increases the computational cost of matrix operations as the dataset size grows and therefore results in longer training times.
However, in practical deployment, the timeliness of online inference is of greater importance. As shown in Figure 13, GC2MFND exhibits the lowest inference time and GPU memory consumption across the three datasets, indicating that its high accuracy does not come at the expense of real-time responsiveness. With a small number of parameters and efficient online inference, it can meet the need for timely detection and blocking of fake news in social media environments without incurring high computational costs.
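Inference time comparisons of this kind depend on how the timing is taken. The following sketch shows one common protocol (warm-up runs followed by repeated timed runs, reporting the median); it is an illustration and not the measurement code behind Figure 13. The helper `measure_latency` and the stand-in workload `dummy_inference` are hypothetical.

```python
import statistics
import time


def measure_latency(fn, warmup=3, repeats=20):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_inference():
    # Stand-in workload; in practice this would be one forward pass
    # of the model on a fixed batch.
    return sum(i * i for i in range(10_000))


latency = measure_latency(dummy_inference)
print(f"median latency: {latency * 1e3:.3f} ms")
```

For GPU models, kernels run asynchronously, so the device typically has to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the measured time can underestimate the real latency.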
4.4.4. Case Study and Error Analysis
To evaluate the proposed model, we qualitatively analyze correctly and incorrectly predicted cases, as shown in Figure 14. Case (a) is a correctly identified fake news item. Although the text mentioning “Toothpaste” matches certain colors and visual elements, a subtle local conflict exists regarding children’s toothpaste and its hazards. Case (b) is also correctly identified: the phrase “camels begging” globally conflicts with an image showing “a camel being led by a person,” and the phrase “amputated limbs” locally conflicts with an image of “a camel with normal limbs.” In contrast, Case (c) is a misclassified fake news item. There is no explicit visual–textual conflict between the image of a “girl” and the text mentioning “Biden”; only external knowledge, which confirms that the child is Biden’s granddaughter, exposes the caption’s false claim about a “young Boy dressed as a girl”. Case (d) is another misclassified example. The text and the image are highly consistent regarding elements such as “bear,” “bipedal stance,” “wrinkled skin,” and “human,” which misleads the model into an incorrect prediction. Correctly identifying this item as fake requires external knowledge, namely that sun bears have loose, clothing-like wrinkles and an eerily human-like posture.
Based on the case studies above, the multi-granularity conflict features extracted by GC2MFND can effectively capture cross-modal inconsistencies between textual and visual content, thereby facilitating the detection of subtle fake news instances. However, as shown in Cases (c) and (d), when textual and visual information is highly consistent, conflict signals alone may be insufficient to reveal that the news is false, and external factual knowledge is often needed. Therefore, integrating external knowledge may further improve fake news detection performance in complex scenarios.