Beyond Spurious Cues: Adaptive Multi-Modal Fusion via Mixture-of-Experts for Robust Sarcasm Detection
Abstract
1. Introduction
- Mainstream models often exploit superficial textual cues (e.g., emojis and hashtags) that are prevalent in noisy datasets. While prior works employ various fusion strategies, their architectures often fail to distinguish these cues from genuine semantic conflicts, inadvertently learning shortcuts that do not generalize to real-world scenarios. This fundamental weakness is exposed when such cues are removed, causing model performance to degrade significantly [10]. For example, as shown in Figure 1, traditional multi-modal models such as HFM [11], Att-BERT [12], CMGCN [13], HKE [14], DynRT [15], and G2SAM [16] perform significantly worse on the cleaned MMSD2.0 dataset [10], with the accuracy of the DynRT model dropping by 23.22% from MMSD [11] to MMSD2.0.
- Lack of dynamic fusion mechanisms: Most current MMSD models adopt static fusion strategies, applying the same integration method to all inputs. This uniformity ignores the varying dominance or noise in different modalities across contexts, resulting in unstable performance.
- MMSD3.0: Trained on noisy MMSD, tested on clean MMSD2.0.
- MMSD4.0: Trained on clean MMSD2.0, tested on noisy MMSD.
- We propose a new systematic evaluation protocol to assess the robustness of multi-modal sarcasm detection models against distributional shifts of spurious cues. We instantiate this protocol with two challenging benchmarks, MMSD3.0 and MMSD4.0, which are designed to rigorously measure a model’s true generalization ability beyond memorizing superficial dataset artifacts.
- We propose MM-MoE, a Mixture-of-Experts framework that integrates diverse expert modules and a dynamic gating mechanism for adaptive semantic fusion. A regularization loss based on negative sampling augmentation is further introduced to enhance semantic alignment and reduce dependence on superficial cues.
- Extensive experiments show that MM-MoE remains effective in the presence of spurious cues in multi-modal sarcasm detection, consistently outperforms state-of-the-art baselines, and generalizes well under shifts in spurious correlations.
2. Related Work
2.1. Multi-Modal Sarcasm Detection
2.2. Mixture-of-Experts (MoE) Architectures
3. Methodology
- Text Self-Attention Experts;
- Image Self-Attention Experts;
- Text-Guided Cross-Modal Experts;
- Image-Guided Cross-Modal Experts.
3.1. Task Definition
3.2. Mixture-of-Experts Module
- Text Self-Attention Experts: These experts focus on capturing the internal contextual dependencies and semantic relationships within the text modality. Their input is the projected text feature T′.
- Image Self-Attention Experts: These experts are used to capture global and local visual cues within the image modality. Their input is the projected image feature I′.
- Text-Guided Cross-Modal Experts: These experts aim to model how textual semantics attend to visual features to identify cross-modal incongruities.
- Image-Guided Cross-Modal Experts: Symmetrical to the previous type, these experts model how visual semantics attend to textual features to enrich their representations (a minimal sketch of all four expert types follows this list).
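To make the four expert types concrete, the following is a minimal PyTorch sketch; the single-layer design, dimensions, and class names are illustrative assumptions rather than the paper’s exact configuration.

```python
# Minimal sketch of the four expert types (assumptions: single attention layer,
# 256-dim projected features, residual + layer norm per expert).
import torch
import torch.nn as nn


class SelfAttentionExpert(nn.Module):
    """Uni-modal expert: self-attention over one modality's token sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        out, _ = self.attn(x, x, x)        # queries, keys, values from the same modality
        return self.norm(x + out)          # residual connection + layer norm


class CrossModalExpert(nn.Module):
    """Cross-modal expert: one modality queries the other (text-guided or image-guided)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query, context, context)  # query attends to the other modality
        return self.norm(query + out)


# Example with projected text features T' (B, Lt, D) and image features I' (B, Li, D).
B, Lt, Li, D = 2, 32, 49, 256
T_proj, I_proj = torch.randn(B, Lt, D), torch.randn(B, Li, D)

text_self   = SelfAttentionExpert(D)(T_proj)       # Text Self-Attention Expert
image_self  = SelfAttentionExpert(D)(I_proj)       # Image Self-Attention Expert
text_cross  = CrossModalExpert(D)(T_proj, I_proj)  # Text-Guided Cross-Modal Expert
image_cross = CrossModalExpert(D)(I_proj, T_proj)  # Image-Guided Cross-Modal Expert
```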
3.3. Dynamic Fusion via Gating and Attention
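The dynamic fusion described in this subsection weights the expert outputs per input; its behavior is analyzed in Sections 4.6–4.11 and Figures 6 and 7. The sketch below shows one plausible gate over pooled expert outputs; conditioning the gate on concatenated text/image features and using a softmax over experts are assumptions, not the paper’s exact formulation.

```python
import torch
import torch.nn as nn


class DynamicGate(nn.Module):
    """Input-conditioned gate that softly weights pooled expert outputs."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        # Assumption: the gate is conditioned on pooled text + image features.
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, t_pool, i_pool, expert_outputs):
        # expert_outputs: (B, num_experts, D) -- one pooled vector per expert
        weights = torch.softmax(self.gate(torch.cat([t_pool, i_pool], dim=-1)), dim=-1)
        fused = torch.einsum("be,bed->bd", weights, expert_outputs)  # weighted sum of experts
        return fused, weights  # weights can be logged for analyses such as Figures 6 and 7b


# Usage with E experts and D = 256 (both illustrative).
B, E, D = 2, 4, 256
t_pool, i_pool = torch.randn(B, D), torch.randn(B, D)
expert_outputs = torch.randn(B, E, D)
fused, weights = DynamicGate(D, E)(t_pool, i_pool, expert_outputs)
```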
3.4. Training Objective with Negative Sampling Augmentation
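The regularization loss introduced here is based on negative sampling augmentation (see Sections 1 and 4.7). Its exact form is not reproduced above; the sketch below shows one plausible variant in which in-batch mismatched image–text pairs act as negatives for a binary alignment head. The `align_head` module and the BCE formulation are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical alignment head over concatenated pooled features (256-dim, as in earlier sketches).
align_head = nn.Sequential(nn.Linear(2 * 256, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()


def augmentation_loss(t_pool: torch.Tensor, i_pool: torch.Tensor) -> torch.Tensor:
    """Alignment regularizer: matched (text, image) pairs are positives;
    rolling the image batch by one position creates mismatched negative pairs."""
    neg_i = torch.roll(i_pool, shifts=1, dims=0)  # every text now sees a different image
    pos_logit = align_head(torch.cat([t_pool, i_pool], dim=-1))
    neg_logit = align_head(torch.cat([t_pool, neg_i], dim=-1))
    logits = torch.cat([pos_logit, neg_logit], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(len(t_pool)), torch.zeros(len(t_pool))]).to(logits)
    return bce(logits, labels)


# Total objective (lambda balances classification and alignment; both are assumptions):
# loss = cross_entropy(class_logits, y) + lam * augmentation_loss(t_pool, i_pool)
```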
4. Experiments
4.1. Datasets
- MMSD is constructed from English tweets, where sarcasm is weakly labeled via hashtags such as #sarcasm. It contains numerous spurious cues, such as emojis and hashtags, which may bias the learning process. The authors provide an official split into training, validation, and test sets.
- MMSD2.0 [10] is a refined version of MMSD in which superficial cues (e.g., emojis, hashtags) are systematically removed, aiming to better expose genuine semantic incongruity between modalities. MMSD2.0 keeps the original MMSD splits unchanged; it only cleans the text and corrects some inappropriate labels, so no data leakage is introduced.
- MMSD3.0: Trained and validated on MMSD (with noise), tested on MMSD2.0 (clean).
- MMSD4.0: Trained and validated on MMSD2.0 (clean), tested on MMSD (noisy). The pairing of sources for both benchmarks is sketched after this list.
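For concreteness, the snippet below wires up the two cross-dataset benchmarks; `load_split` is a hypothetical loader name, not part of any released code.

```python
# Hypothetical wiring of the cross-dataset evaluation protocol described above.
BENCHMARKS = {
    "MMSD3.0": {"train": "MMSD",    "val": "MMSD",    "test": "MMSD2.0"},  # noisy -> clean
    "MMSD4.0": {"train": "MMSD2.0", "val": "MMSD2.0", "test": "MMSD"},     # clean -> noisy
}


def build_benchmark(name, load_split):
    """Return (train, val, test) splits for a benchmark; `load_split` is a placeholder loader."""
    cfg = BENCHMARKS[name]
    return tuple(load_split(cfg[part], part) for part in ("train", "val", "test"))
```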
4.2. Evaluation Metrics
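The result tables report Accuracy, Precision, Recall, and F1. A minimal way to compute them with scikit-learn, assuming the sarcastic class is labeled 1 and treated as the positive class, is:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate(y_true, y_pred):
    """Accuracy plus binary precision/recall/F1 with sarcasm (label 1) as the positive class."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"Acc": acc, "P": p, "R": r, "F1": f1}


print(evaluate([1, 0, 1, 1], [1, 0, 0, 1]))  # toy example
```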
4.3. Baselines
- Text-Only Modality:
  - TextCNN [43]: A Convolutional Neural Network-based model for sentence classification. It utilizes convolutional filters of varying sizes to extract n-gram features, and has shown competitive performance across various text classification tasks.
  - Bi-LSTM [44]: A bidirectional long short-term memory network that captures both past and future dependencies in text sequences. It is widely used to model contextual semantics in sentiment and sarcasm detection.
  - SMSD [45]: The Self-Matching and Low-Rank Bilinear Pooling model, specifically designed for sarcasm detection. It captures intra-sentence incongruities by computing rich word-to-word interactions, and is effective in modeling subtle ironic cues.
  - BERT [46]: A Transformer-based language representation model pre-trained on large corpora. We fine-tune BERT for sarcasm classification tasks, leveraging its strong contextual modeling ability to capture implicit sentiment conflicts within text.
- Image-Only Modality:
  - ResNet [47]: A residual Convolutional Neural Network pre-trained on ImageNet. It is used as a visual backbone to extract deep hierarchical features from input images, enabling sarcasm detection based solely on visual signals.
  - ViT [48]: Vision Transformer, which splits images into patches and applies multi-head self-attention to model long-range dependencies. It provides global visual context, and has demonstrated strong performance in many vision tasks.
- Multi-Modal Fusion Modality:
  - HFM [11]: The Hierarchical Fusion Model, one of the earliest works in multi-modal sarcasm detection. It performs multi-level fusion of text, image, and visual attribute features to improve sarcasm recognition.
  - CMGCN [13]: The Cross-Modal Graph Convolutional Network constructs a graph across textual and visual modalities, capturing intricate sarcasm-related relationships via structured cross-modal interactions.
  - DIP [23]: The Dual Incongruity Perceiving Network introduces a two-branch architecture to simultaneously detect semantic and emotional incongruities between text and image, significantly enhancing multi-modal sarcasm understanding.
  - DynRT [15]: A dynamic routing transformer network that adapts fusion pathways based on input characteristics. It aims to handle semantic mismatches by dynamically routing features through different sub-networks.
  - Multi-View CLIP [10]: A multi-perspective framework based on CLIP that integrates textual, visual, and interactional views. It aggregates complementary sarcastic cues from various perspectives for robust sarcasm detection.
  - DMSD-CL [49]: A framework for debiasing multi-modal sarcasm detection with contrastive learning. It constructs positive samples with dissimilar word biases and negative samples with similar biases via counterfactual data augmentation, enhancing robustness in OOD settings.
  - G2SAM [16]: An inference paradigm leveraging graph-based global semantic awareness for multi-modal sarcasm detection. It is the first work to combine global semantic congruity with label-aware contrastive learning for multi-modal classification.
  - ESAM [25]: The Enhancing Semantic Awareness Model addresses feature shift through a Sentiment Consistency Constraint (SCC) and Automatic Outlier Masking (AOM), thereby strengthening its ability to identify sarcastic sentiment in multi-modal data.
  - GPT-4o [50]: A multi-modal large language model released by OpenAI with strong multi-modal capabilities and state-of-the-art performance across many domains.
  - Qwen [51]: A multi-modal large language model released by Alibaba whose multi-modal processing capabilities rank among the strongest available.
4.4. Experimental Setup
4.5. Hyperparameter Analysis
4.6. Results and Analysis
- Performance on MMSD3.0 (Train on Noisy, Test on Clean): This benchmark is particularly challenging, as it directly assesses a model’s ability to learn genuine semantic incongruity from a training set rife with spurious cues (e.g., hashtags, emojis) and generalize to a clean test set devoid of them. As shown in the table, our model obtains an accuracy of 81.61% and an F1 score of 81.26%. This result significantly surpasses the next best performing model, Multi-View CLIP, which scored an F1 of 78.07%. The improvement of 3.19 percentage points in F1 score is not only substantial in magnitude, but is also statistically significant (p < 0.05, paired t-test; a minimal sketch of this test is given after this list), which strongly highlights our model’s ability to mitigate the influence of spurious correlations. In contrast, several baseline models, such as DynRT, exhibit a dramatic performance collapse on this task with an F1 score of only 55.95%, indicating their heavy reliance on superficial cues and the failure to generalize.
- Performance on MMSD4.0 (Train on Clean, Test on Noisy): This setting evaluates the model’s robustness when faced with distracting spurious cues at test time, after having been trained on a clean dataset. Our model once again demonstrates its superiority, achieving a top accuracy of 87.21% and an F1 score of 87.61%. This performance represents a significant margin over other strong multi-modal methods like DIP (F1 of 84.70%) and Multi-View CLIP (F1 of 84.99%). Crucially, the performance gap between our model and the strongest baseline (Multi-View CLIP) is statistically robust, as confirmed by our significance test (p < 0.05). This success suggests that, by learning from clean data, our model builds a robust understanding of sarcasm that is not easily perturbed by the reintroduction of noisy signals. The dynamic gating mechanism in our framework likely plays a key role in adaptively focusing on meaningful cross-modal interactions while down-weighting potentially misleading uni-modal cues.
- Overall Observations: A key trend observed across nearly all models is that performance on MMSD4.0 is considerably higher than on MMSD3.0. This suggests that generalizing from a noisy training distribution is a fundamentally harder problem than being robust to noise in the test distribution. It underscores the importance of clean training data or, in its absence, models like ours that are explicitly designed to handle spurious correlations. Furthermore, the results consistently show that multi-modal models outperform their uni-modal counterparts (Text-only and Image-only methods), reaffirming the necessity of leveraging information from both modalities for reliable sarcasm detection. In conclusion, the comprehensive results strongly validate the effectiveness, generalization ability, and robustness of our proposed MM-MoE model.
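The significance claims above rely on a paired t-test at p < 0.05. The sketch below shows how such a test can be run with SciPy, assuming F1 scores collected over several matched runs (e.g., random seeds); the listed numbers are placeholders, not reported results.

```python
from scipy.stats import ttest_rel

# Placeholder per-seed F1 scores (NOT the paper's reported numbers).
f1_mm_moe   = [81.3, 81.1, 81.4, 81.0, 81.5]
f1_baseline = [78.1, 77.9, 78.2, 77.8, 78.3]

t_stat, p_value = ttest_rel(f1_mm_moe, f1_baseline)  # paired t-test over matched runs
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```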
4.7. Ablation Study
- The Importance of Core Mechanisms: The two most critical components of our framework are the augmentation loss (Aug) and the dynamic gating network (Gating_network). Removing the augmentation loss (w/o Aug) results in the most substantial performance drop, with the F1 score decreasing by 4.46% on MMSD3.0 and 3.59% on MMSD4.0. This underscores the vital role of the negative sampling strategy in forcing the model to learn genuine cross-modal semantic alignment, which is essential for generalizing beyond spurious cues. Similarly, removing the gating network (w/o Gating_network) also causes a significant decline, proving that the ability to dynamically weight and select experts based on input context is crucial for robust fusion.
- The Necessity of Cross-Modal Interaction: The removal of either the text-guided (w/o Cross_Text) or image-guided (w/o Cross_Image) cross-modal experts leads to a notable drop in performance. This is expected, as sarcasm often arises from the incongruity between modalities. The degradation in performance confirms that explicitly modeling these cross-modal relationships is a cornerstone of effective sarcasm detection. To further emphasize their collective importance, removing all cross-modal experts at once (w/o All_Cross) triggers the most severe performance drop in all of the expert ablations, as detailed in Table 3. This powerfully confirms that the model’s ability to capture inter-modal incongruity is the single most vital element of the framework.
- The Role of Uni-Modal Understanding: The ablation of uni-modal self-attention experts (w/o Text_Self and w/o Image_Self) also negatively impacts the results. This indicates that a thorough understanding of each modality’s internal context is a necessary prerequisite for subsequent, more complex cross-modal reasoning. Interestingly, the performance drop from removing the image self-attention expert (w/o Image_Self) is the least pronounced on the MMSD3.0 task. This suggests that, when trained on the noisy MMSD dataset, where textual cues might be more dominant or misleading, the model learns to rely less heavily on intra-modal visual information for generalization.
4.8. Performance on Standard Benchmarks
- Analysis of MMSD: The MMSD dataset, known to be “noisy” due to its inclusion of numerous spurious cues like hashtags (for example, #sarcasm) and emojis, poses a unique challenge. As observed in Table 4, our MM-MoE model achieves a comprehensive lead on this dataset, with an Accuracy of 91.12% and an F1 score of 90.63%. Both key metrics surpass all existing baselines, demonstrating that, even in an environment with superficial shortcuts, the Mixture-of-Experts architecture and dynamic fusion mechanism of MM-MoE can efficiently learn and integrate multi-modal information to achieve superior sarcasm detection.
- Analysis of MMSD2.0: The MMSD2.0 dataset, a refined version of MMSD with spurious cues removed, provides a more authentic test of a model’s ability to capture semantic incongruity. On this “clean” dataset, our MM-MoE model once again achieves SOTA performance, with an F1 score of 85.87% and an accuracy of 85.89%. It is noteworthy that MM-MoE maintains a clear performance advantage over strong competitors like G2SAM (74.93% F1 score) and ESAM (84.56% F1 score). This strongly indicates that our model’s success is not reliant on spurious cues, but rather on a deeper understanding of the semantic relationships between text and images.
- Observation on Performance Shift from MMSD to MMSD2.0: A critical observation is that nearly all models, including MM-MoE, experience a performance drop when transitioning from MMSD to MMSD2.0. This universally validates the core motivation of our study: many models inadvertently learn to depend on spurious correlations. For instance, the DIP model’s F1 score falls sharply from 86.18% to 77.94%, indicating a strong reliance on such cues. In contrast, while MM-MoE’s performance also naturally declines (F1 from 90.63% to 85.87%), its ability to remain the top-performing model after the removal of spurious cues highlights the superiority of its design. The dynamic gating network and the augmentation loss objective likely contribute to its robustness, enabling it to adaptively focus on genuine cross-modal conflicts when obvious cues are absent. In summary, these in-domain experimental results provide a powerful complement to our cross-domain findings. MM-MoE not only excels at generalization tasks involving distribution shifts (MMSD3.0 & MMSD4.0), but also sets a new performance benchmark in standard in-domain evaluations, comprehensively validating its effectiveness and robustness as an advanced framework for multi-modal sarcasm detection.
4.9. Impact of Expert Number
4.10. Computational Efficiency Analysis and Occam’s Razor
4.11. Qualitative Analysis and Visualization
- Analysis of MMSD3.0 (Train Noisy → Test Clean): When the model is trained on the noisy MMSD dataset, which is rife with spurious correlations (e.g., hashtags, emojis), it learns a “defensive” strategy. The dynamic gate assigns significantly higher weights to uni-modal experts, particularly Image_Self_3 (average weight 0.154), while suppressing the cross-modal experts. This behavior suggests that, during training, the model discovers that relying on the relationship between modalities can be misleading due to noise. Instead, it learns to distrust the cross-modal signals and latches onto more reliable, albeit superficial, patterns within a single modality as predictive shortcuts. This reliance on isolated features is a classic symptom of models exposed to spurious cues, and, while it shows adaptability during training, it also explains why generalizing from a noisy dataset is fundamentally challenging.
- Analysis of MMSD4.0 (Train Clean → Test Noisy): In stark contrast, when MM-MoE is trained on the clean MMSD2.0 dataset, it is forced to learn the genuine semantic and emotional incongruity that defines sarcasm. Having built this robust, “semantically-grounded” understanding, its behavior at test time on noisy data becomes highly insightful. As shown in Figure 6, the model dynamically and significantly increases the weights for the cross-modal experts (Cross_Text and Cross_Image). This strategic shift is crucial: when faced with distracting signals within the text or image, the model intelligently focuses its resources on analyzing the interaction between the modalities. It essentially learns to “cut through the noise” by prioritizing the cross-modal relationship, which it knows is the most reliable path to identifying true sarcastic incongruity. This is not merely a static fusion strategy; it is an active, input-dependent adaptation that serves as direct evidence of the model’s robustness.
- Case Study on Nuanced Sarcasm: To provide a more granular insight beyond average weights, we present a case study on the sarcastic example shown in Figure 7. The input consists of an image of a snowy owl, as shown in Figure 7a and the text: “snowy owl came too close to traffic camera and he is being ticketed for not reading the signs in #french”. Here, the sarcasm arises not from a direct visual-semantic conflict, but from a commonsense violation. Interestingly, the expert weight distribution in Figure 7b reveals that the gating network assigns the highest weights to uni-modal experts (both Image Self-attention and Text Self-attention) and relatively low weights to the cross-modal experts. This suggests a sophisticated reasoning process: the model first dedicates resources to deeply understand the image’s real-world context and the text’s internal absurdity independently. Only after forming strong uni-modal representations can the model detect the higher-level, commonsense incongruity in their combination. This case vividly illustrates our model’s flexibility in handling nuanced sarcasm where robust uni-modal understanding is a prerequisite to revealing the irony.
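The expert-weight analyses above (Figures 6 and 7b) are based on the gate’s per-input weights. A minimal way to collect and average them at test time is sketched below; it assumes the model returns its gate weights alongside the logits, as in the Section 3.3 sketch, and the batch/loader format is illustrative.

```python
import torch


@torch.no_grad()
def average_gate_weights(model, loader, expert_names):
    """Accumulate the gate's softmax weights over a test loader and average per expert."""
    model.eval()
    total, count = None, 0
    for text, image, _ in loader:            # assumed (text, image, label) batches
        _, weights = model(text, image)      # assumed to return (logits, gate weights)
        total = weights.sum(dim=0) if total is None else total + weights.sum(dim=0)
        count += weights.size(0)
    avg = (total / count).tolist()
    return dict(zip(expert_names, avg))      # e.g. {"Image_Self_3": 0.154, ...}
```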
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tiwari, D.; Kanojia, D.; Ray, A.; Nunna, A.; Bhattacharyya, P. Predict and Use: Harnessing Predicted Gaze to Improve Multimodal Sarcasm Detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; pp. 15933–15948. [Google Scholar] [CrossRef]
- Maynard, D.G.; Greenwood, M.A. Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Reykjavik, Iceland, 2014; pp. 4238–4243. [Google Scholar]
- Li, F.; Si, X.; Tang, S.; Wang, D.; Han, K.; Han, B.; Zhou, G.; Song, Y.; Chen, H. Contextual Distillation Model for Diversified Recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 5307–5316. [Google Scholar] [CrossRef]
- Frenda, S. The role of sarcasm in hate speech. A multilingual perspective. In Proceedings of the Doctoral Symposium of the XXXIV International Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Seville, Spain, 18–21 September 2018; Lloret, E., Saquete, E., Martínez-Barco, P., Moreno, I., Eds.; pp. 13–17. [Google Scholar]
- Rothermich, K.; Ogunlana, A.; Jaworska, N. Change in humor and sarcasm use based on anxiety and depression symptom severity during the COVID-19 pandemic. J. Psychiatr. Res. 2021, 140, 95–100. [Google Scholar] [CrossRef]
- Das, R.; Singh, T.D. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
- Alevizopoulou, S.; Koloveas, P.; Tryfonopoulos, C.; Raftopoulou, P. Social Media Monitoring for IoT Cyber-Threats. In Proceedings of the 2021 IEEE International Conference on Cyber Security and Resilience (CSR), Virtual Event, Rhodes, Greece, 26–28 July 2021; pp. 436–441. [Google Scholar] [CrossRef]
- Farabi, S.; Ranasinghe, T.; Kanojia, D.; Kong, Y.; Zampieri, M. A Survey of Multimodal Sarcasm Detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Jeju, Republic of Korea, 3–9 August 2024; Larson, K., Ed.; pp. 8020–8028. [Google Scholar] [CrossRef]
- Dutta, P.; Bhattacharyya, C.K. Multi-Modal Sarcasm Detection in Social Networks: A Comparative Review. In Proceedings of the 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 29–31 March 2022; pp. 207–214. [Google Scholar] [CrossRef]
- Qin, L.; Huang, S.; Chen, Q.; Cai, C.; Zhang, Y.; Liang, B.; Che, W.; Xu, R. MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 10834–10845. [Google Scholar]
- Cai, Y.; Cai, H.; Wan, X. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; pp. 2506–2515. [Google Scholar] [CrossRef]
- Pan, H.; Lin, Z.; Fu, P.; Qi, Y.; Wang, W. Modeling Intra and Inter-modality Incongruity for Multi-Modal Sarcasm Detection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Cohn, T., He, Y., Liu, Y., Eds.; pp. 1383–1392. [Google Scholar] [CrossRef]
- Liang, B.; Lou, C.; Li, X.; Yang, M.; Gui, L.; He, Y.; Pei, W.; Xu, R. Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; pp. 1767–1777. [Google Scholar] [CrossRef]
- Liu, H.; Wang, W.; Li, H. Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; pp. 4995–5006. [Google Scholar] [CrossRef]
- Tian, Y.; Xu, N.; Zhang, R.; Mao, W. Dynamic Routing Transformer Network for Multimodal Sarcasm Detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Volume 1: Long Papers, pp. 2468–2480. [Google Scholar] [CrossRef]
- Wei, Y.; Yuan, S.; Zhou, H.; Wang, L.; Yan, Z.; Yang, R.; Chen, M. G2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 9151–9159. [Google Scholar]
- Khodak, M.; Saunshi, N.; Vodrahalli, K. A Large Self-Annotated Corpus for Sarcasm. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., et al., Eds.; [Google Scholar]
- Oprea, S.; Magdy, W. iSarcasm: A Dataset of Intended Sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; pp. 1279–1289. [Google Scholar] [CrossRef]
- Hazarika, D.; Poria, S.; Gorantla, S.; Cambria, E.; Zimmermann, R.; Mihalcea, R. CASCADE: Contextual Sarcasm Detection in Online Discussion Forums. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Bender, E.M., Derczynski, L., Isabelle, P., Eds.; pp. 1837–1848. [Google Scholar]
- Liu, Y.; Wang, Y.; Sun, A.; Meng, X.; Li, J.; Guo, J. A Dual-Channel Framework for Sarcasm Recognition by Detecting Sentiment Conflict. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; Carpuat, M., de Marneffe, M.C., Meza Ruiz, I.V., Eds.; pp. 1670–1680. [Google Scholar] [CrossRef]
- Schifanella, R.; de Juan, P.; Tetreault, J.; Cao, L. Detecting Sarcasm in Multimodal Social Platforms. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 1136–1145. [Google Scholar] [CrossRef]
- Liang, B.; Lou, C.; Li, X.; Gui, L.; Yang, M.; Xu, R. Multi-Modal Sarcasm Detection with Interactive In-Modal and Cross-Modal Graphs. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 4707–4715. [Google Scholar] [CrossRef]
- Wen, C.; Jia, G.; Yang, J. DIP: Dual Incongruity Perceiving Network for Sarcasm Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 2540–2550. [Google Scholar] [CrossRef]
- Chen, J.; Yu, H.; Huang, S.; Liu, S.; Zhang, L. InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection. arXiv 2024, arXiv:2406.16464. [Google Scholar]
- Yuan, S.; Wei, Y.; Zhou, H.; Xu, Q.; Chen, M.; He, X. Enhancing Semantic Awareness by Sentimental Constraint with Automatic Outlier Masking for Multimodal Sarcasm Detection. IEEE Trans. Multimed. 2025, 27, 5376–5386. [Google Scholar] [CrossRef]
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef]
- Masoudnia, S.; Ebrahimpour, R. Mixture of experts: A literature survey. Artif. Intell. Rev. 2014, 42, 275–293. [Google Scholar] [CrossRef]
- Yuksel, S.E.; Wilson, J.N.; Gader, P.D. Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1177–1193. [Google Scholar] [CrossRef]
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts. arXiv 2024, arXiv:2407.06204. [Google Scholar]
- Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.; Chen, Z.; Le, Q.; Laudon, J. Mixture-of-experts with expert choice routing. Adv. Neural Inf. Process. Syst. 2022, 35, 7103–7114. [Google Scholar]
- Chen, Z.; Deng, Y.; Wu, Y.; Gu, Q.; Li, Y. Towards understanding the mixture-of-experts layer in deep learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23049–23062. [Google Scholar]
- Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Volume 162, pp. 5547–5569. [Google Scholar]
- Lin, B.; Tang, Z.; Ye, Y.; Cui, J.; Zhu, B.; Jin, P.; Huang, J.; Zhang, J.; Pang, Y.; Ning, M.; et al. Moe-llava: Mixture of experts for large vision-language models. arXiv 2024, arXiv:2401.15947. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Gao, Z.F.; Liu, P.; Zhao, W.X.; Lu, Z.Y.; Wen, J.R. Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; Calzolari, N., Huang, C.R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.S., Ryu, P.M., Chen, H.H., Donatelli, L., Ji, H., et al., Eds.; pp. 3263–3273. [Google Scholar]
- Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
- Mustafa, B.; Riquelme, C.; Puigcerver, J.; Jenatton, R.; Houlsby, N. Multimodal contrastive learning with limoe: The language-image mixture of experts. Adv. Neural Inf. Process. Syst. 2022, 35, 9564–9576. [Google Scholar]
- Goyal, A.; Kumar, N.; Guha, T.; Narayanan, S.S. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2822–2826. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, Y.; Li, Z.; Yao, R.; Zhang, Y.; Wang, D. Modality Interactive Mixture-of-Experts for Fake News Detection. arXiv 2025, arXiv:2501.12431. [Google Scholar]
- Jiang, R.; Liu, L.; Chen, C. MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion. arXiv 2024, arXiv:2403.10568. [Google Scholar]
- Xie, Y.; Zhu, Z.; Chen, X.; Chen, Z.; Huang, Z. MoBA: Mixture of Bi-directional Adapter for Multi-modal Sarcasm Detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 4264–4272. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Volume 139, pp. 8748–8763. [Google Scholar]
- Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Erk, K., Smith, N.A., Eds.; Volume 2: Short Papers, pp. 207–212. [Google Scholar] [CrossRef]
- Xiong, T.; Zhang, P.; Zhu, H.; Yang, Y. Sarcasm Detection with Self-matching Networks and Low-rank Bilinear Pooling. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2115–2124. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Volume 1: Long and Short Papers, pp. 4171–4186. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Jia, M.; Xie, C.; Jing, L. Debiasing Multimodal Sarcasm Detection with Contrastive Learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 18354–18362. [Google Scholar] [CrossRef]
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Rasmussen, C.; Ghahramani, Z. Occam’s Razor. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27 November–2 December 2000; Leen, T., Dietterich, T., Tresp, V., Eds.; MIT Press: Cambridge, MA, USA, 2000; Volume 13. [Google Scholar]







Table 1. Statistics of the MMSD and MMSD2.0 datasets.

| Dataset | Label | Train | Val | Test |
|---|---|---|---|---|
| MMSD | Sarcasm | 8642 | 959 | 959 |
| | N-sarcasm | 11,174 | 1451 | 1450 |
| | All | 19,816 | 2410 | 2409 |
| MMSD2.0 | Sarcasm | 9576 | 1042 | 1037 |
| | N-sarcasm | 10,240 | 1368 | 1372 |
| | All | 19,816 | 2410 | 2409 |
Table 2. Main results on the MMSD3.0 and MMSD4.0 benchmarks. The MM-MoE Acc and F1 values are those reported in Section 4.6; its P and R values were not recoverable from the extracted text.

| Model | MMSD3.0 | | | | MMSD4.0 | | | |
|---|---|---|---|---|---|---|---|---|
| | Acc (%) | P (%) | R (%) | F1 (%) | Acc (%) | P (%) | R (%) | F1 (%) |
| Text-modality Methods | | | | | | | | |
| TextCNN | 70.69 | 71.98 | 52.27 | 60.56 | 73.52 | 62.72 | 90.05 | 73.94 |
| Bi-LSTM | 60.91 | 66.04 | 60.56 | 63.18 | 75.51 | 67.05 | 81.19 | 73.45 |
| SMSD | 69.49 | 66.38 | 59.02 | 62.48 | 73.06 | 63.01 | 85.77 | 72.65 |
| BERT | 76.21 | 80.05 | 59.59 | 68.33 | 79.99 | 70.87 | 84.46 | 77.07 |
| Image-modality Methods | | | | | | | | |
| ResNet | 68.83 | 60.58 | 72.34 | 65.94 | 66.21 | 58.61 | 73.19 | 65.09 |
| ViT | 75.63 | 68.43 | 77.21 | 72.56 | 71.56 | 63.71 | 78.88 | 70.49 |
| Multi-modality Methods | | | | | | | | |
| HFM | 69.90 | 70.10 | 52.46 | 60.01 | 72.97 | 65.44 | 78.69 | 71.45 |
| CMGCN | 71.94 | 65.74 | 72.71 | 69.05 | 74.71 | 68.17 | 77.24 | 72.42 |
| DIP | 76.31 | 76.72 | 74.31 | 75.50 | 84.83 | 84.27 | 85.13 | 84.70 |
| DynRT | 57.20 | 62.94 | 50.35 | 55.95 | 70.29 | 70.23 | 71.26 | 70.74 |
| DMSD-CL | 77.96 | 75.10 | 73.00 | 74.03 | 81.94 | 80.61 | 71.95 | 76.03 |
| G2SAM | 74.93 | 69.28 | 75.02 | 72.04 | 82.98 | 77.42 | 80.81 | 79.08 |
| Multi-View CLIP | 80.99 | 77.55 | 78.59 | 78.07 | 83.35 | 86.69 | 83.35 | 84.99 |
| ESAM | 75.97 | 70.56 | 75.80 | 73.08 | 80.95 | 76.15 | 75.91 | 76.03 |
| GPT-4o | 72.69 | 59.49 | 66.75 | 62.91 | 70.61 | 68.67 | 54.47 | 60.75 |
| Qwen | 70.78 | 67.74 | 56.64 | 61.70 | 71.86 | 58.42 | 65.55 | 61.78 |
| MM-MoE | 81.61 | | | 81.26 | 87.21 | | | 87.61 |
Table 3. Ablation study on MMSD3.0 and MMSD4.0 (numeric values were not recoverable from the extracted text).

| Model | MMSD3.0 | | MMSD4.0 | |
|---|---|---|---|---|
| | Acc (%) | F1 (%) | Acc (%) | F1 (%) |
| BASELINE | | | | |
| w/o Text_Self | | | | |
| w/o Image_Self | | | | |
| w/o Cross_Text | | | | |
| w/o Cross_Image | | | | |
| w/o All_Cross | | | | |
| w/o Gating_network | | | | |
| w/o Aug | | | | |
Table 4. Results on the standard MMSD and MMSD2.0 benchmarks. The MM-MoE Acc and F1 values are those reported in Section 4.8; its P and R values were not recoverable from the extracted text.

| Model | MMSD | | | | MMSD2.0 | | | |
|---|---|---|---|---|---|---|---|---|
| | Acc (%) | P (%) | R (%) | F1 (%) | Acc (%) | P (%) | R (%) | F1 (%) |
| GPT-4o | 72.69 | 59.49 | 66.75 | 62.91 | 70.61 | 68.67 | 54.47 | 60.75 |
| Qwen | 70.78 | 67.74 | 56.64 | 61.70 | 71.86 | 58.42 | 65.55 | 61.78 |
| HFM | 83.44 | 76.57 | 84.15 | 80.18 | 70.57 | 64.84 | 69.05 | 66.88 |
| CMGCN | 87.55 | 83.63 | 84.69 | 84.16 | 79.83 | 75.82 | 78.01 | 76.90 |
| DIP | 88.50 | 84.59 | 87.82 | 86.18 | 80.92 | 77.46 | 78.43 | 77.94 |
| Multi-View CLIP | 88.33 | 82.66 | 88.65 | 85.55 | 85.64 | 80.33 | 88.24 | 84.10 |
| DMSD-CL | 88.95 | 84.89 | 87.90 | 86.37 | 81.74 | 82.74 | 80.22 | 81.46 |
| G2SAM | 91.07 | 88.27 | 90.09 | 89.17 | 79.43 | 78.07 | 72.04 | 74.93 |
| ESAM | 90.11 | 86.87 | 89.54 | 88.19 | 85.87 | 83.12 | 86.05 | 84.56 |
| MM-MoE | 91.12 | | | 90.63 | 85.89 | | | 85.87 |