MMKT: Multimodal Sentiment Analysis Model Based on Knowledge-Enhanced and Text-Guided Learning
Abstract
1. Introduction
- Employing Transformer layers to project the initial textual, audio, and visual features into low-dimensional spaces, filtering out redundant information.
- Constructing a sentiment knowledge graph for the textual modality from the SenticNet knowledge base. Word-level sentiment polarities provide explicit sentiment labels that strengthen the model’s understanding of emotional vocabulary, while graph embedding computations yield a global sentiment-knowledge feature that enhances multimodal fusion (a sketch follows this list).
- Introducing a dynamic text-guided learning approach that uses multi-scale textual features to actively suppress redundant or conflicting information in the visual and audio modalities, producing refined cross-modal representations (see the gating sketch after this list).
- Experimental results on the CMU-MOSEI and Twitter2019 datasets demonstrate that MMKT outperforms baseline methods on multimodal sentiment analysis tasks and validate the efficacy of its constituent modules.
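To make the knowledge-enhancement idea concrete, here is a minimal sketch of word-level polarity annotation and a pooled sentiment-knowledge feature. The `POLARITY` dictionary is a hypothetical stand-in for SenticNet lookups, and the polarity-weighted mean is an illustrative substitute for the graph embedding computation of Section 3.5, not the paper’s exact formulation.

```python
import torch

# Hypothetical excerpt of SenticNet-style polarity scores in [-1, 1];
# the actual model queries the SenticNet knowledge base.
POLARITY = {"great": 0.87, "terrible": -0.79, "movie": 0.0}

def knowledge_feature(tokens, word_embeddings):
    """Pool word embeddings weighted by their sentiment polarity.

    tokens: list of str, length seq_len
    word_embeddings: (seq_len, d) tensor aligned with tokens
    """
    weights = torch.tensor([POLARITY.get(t.lower(), 0.0) for t in tokens])
    # Weight each word embedding by its polarity and average, giving a crude
    # global sentiment-knowledge vector for the sentence.
    return (weights.unsqueeze(-1) * word_embeddings).mean(dim=0)

feat = knowledge_feature(["great", "movie"], torch.randn(2, 768))
print(feat.shape)  # torch.Size([768])
```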
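The gating sketch below illustrates the text-guided idea in PyTorch: a sigmoid gate computed from the text representation decides how much of each audio/visual feature dimension to keep. The class name, dimensions, and single-scale gate are assumptions for illustration; the paper’s dynamic text-guided learning (Section 3.4) operates on multi-scale textual features.

```python
import torch
import torch.nn as nn

class TextGuidedGate(nn.Module):
    def __init__(self, d_text: int, d_other: int, d_model: int):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_other = nn.Linear(d_other, d_model)
        # The gate, computed from the concatenated projections, decides how
        # much of each non-text feature dimension to keep.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feat: torch.Tensor, other_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (batch, d_text); other_feat: (batch, d_other)
        t = self.proj_text(text_feat)
        o = self.proj_other(other_feat)
        g = self.gate(torch.cat([t, o], dim=-1))
        # Suppress the dimensions of the audio/visual projection that the
        # text evidence marks as redundant or conflicting.
        return g * o

# Usage: gate a 74-d audio vector with a 768-d BERT sentence embedding.
gate = TextGuidedGate(d_text=768, d_other=74, d_model=128)
refined_audio = gate(torch.randn(8, 768), torch.randn(8, 74))
print(refined_audio.shape)  # torch.Size([8, 128])
```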
2. Related Works
2.1. Multimodal Sentiment Analysis
2.2. Transformer
3. Methodology
3.1. Task Definition
3.2. Overall Architecture
3.3. Multimodal Embedding
3.4. Dynamic Text-Guided Learning
3.5. External Sentiment Knowledge Introduction
3.5.1. Knowledge Graph Definition and Construction
3.5.2. Graph Embedding Computation
3.6. Multimodal Feature Fusion
3.7. Sentiment Prediction
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Experimental Settings
4.3. Comparative Experiments
- Baseline Models on CMU-MOSEI Dataset:
- Baseline Models on Twitter2019 Dataset:
4.4. Ablation Study
5. Conclusions
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Birjali, M.; Kasri, M.; Beni-Hssane, A. A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowl.-Based Syst. 2021, 226, 107134. [Google Scholar] [CrossRef]
- Chaturvedi, I.; Cambria, E.; Welsch, R.E.; Herrera, F. Distinguishing between facts and opinions for sentiment analysis: Survey and challenges. Inf. Fusion 2018, 44, 65–77. [Google Scholar] [CrossRef]
- Liu, B.; Zhang, L. A survey of opinion mining and sentiment analysis. In Mining Text Data; Springer: Boston, MA, USA, 2012; pp. 415–463. [Google Scholar]
- Teijeiro-Mosquera, L.; Biel, J.I.; Alba-Castro, J.L.; Gatica-Perez, D. What your face vlogs about: Expressions of emotion and big-five traits impressions in YouTube. IEEE Trans. Affect. Comput. 2015, 6, 193–205. [Google Scholar] [CrossRef]
- Cambria, E. Affective computing and sentiment analysis. IEEE Intell. Syst. 2016, 31, 102–107. [Google Scholar] [CrossRef]
- Cambria, E.; Poria, S.; Gelbukh, A.; Thelwall, M. Sentiment analysis is a big suitcase. IEEE Intell. Syst. 2017, 32, 74–80. [Google Scholar] [CrossRef]
- Li, H.; Xu, H. Video-based Sentiment Analysis with hvnLBP-TOP Feature and Bi-LSTM. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9963–9964. [Google Scholar]
- Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
- Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2359–2369. [Google Scholar]
- Guo, J.; Tang, J.; Dai, W.; Ding, Y.; Kong, W. Dynamically adjust word representations using unaligned multimodal information. In Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA, 10–14 October 2022; pp. 3394–3402. [Google Scholar]
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. [Google Scholar] [CrossRef]
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intell. Syst. 2016, 31, 82–88. [Google Scholar] [CrossRef]
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
- Shenoy, A.; Sardana, A. Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation. arXiv 2020, arXiv:2002.08267. [Google Scholar]
- Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing emotions in video using multimodal dnn feature fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language, Melbourne, Australia, 20 July 2018; pp. 11–19. [Google Scholar]
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114. [Google Scholar]
- Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6549–6558. [Google Scholar]
- Xu, J.; Huang, F.; Zhang, X.; Wang, S.; Li, Z.; He, Y. Visual-textual sentiment classification with bi-directional multilevel attention networks. Knowl.-Based Syst. 2019, 178, 61–73. [Google Scholar] [CrossRef]
- Harish, A.B.; Sadat, F. Trimodal Attention Module for Multimodal Sentiment Analysis (Student Abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13803–13804. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Volume 12346, pp. 213–229. [Google Scholar]
- Chen, W.; Xing, X.; Xu, X.; Pang, J.; Du, L. Speechformer: A hierarchical efficient framework incorporating the characteristics of speech. arXiv 2022, arXiv:2203.03812. [Google Scholar]
- Liu, Y.; Wang, W.; Feng, C.; Zhang, H.; Chen, Z.; Zhan, Y. Expression snippet transformer for robust video-based facial expression recognition. Pattern Recognit. 2023, 138, 109368. [Google Scholar] [CrossRef]
- Huang, J.; Tao, J.; Liu, B.; Lian, Z.; Niu, M. Multimodal transformer fusion for continuous emotion recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3507–3511. [Google Scholar]
- Liu, Y.; Zhang, H.; Zhan, Y.; Chen, Z.; Yin, G.; Wei, L.; Chen, Z. Noise-resistant multimodal transformer for emotion recognition. arXiv 2023, arXiv:2305.02814. [Google Scholar] [CrossRef]
- Yuan, Z.; Li, W.; Xu, H.; Yu, W. Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4400–4407. [Google Scholar]
- Cai, Y.; Cai, H.; Wan, X. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Volume 4, pp. 2506–2515. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
- Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning Factorized Multimodal Representations. arXiv 2018, arXiv:1806.06176. [Google Scholar]
- Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8992–8999. [Google Scholar]
- Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. arXiv 2021, arXiv:2102.04830. [Google Scholar] [CrossRef]
- Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. arXiv 2021, arXiv:2109.00412. [Google Scholar] [CrossRef]
- Peng, H.; Gu, X.; Li, J.; Wang, Z.; Xu, H. Text-Centric Multimodal Contrastive Learning for Sentiment Analysis. Electronics 2024, 13, 1149. [Google Scholar] [CrossRef]
- Hou, J.; Omar, N.; Tiun, S.; Saad, S.; He, Q. TCHFN: Multimodal sentiment analysis based on text-centric hierarchical fusion network. Knowl.-Based Syst. 2024, 300, 112220. [Google Scholar] [CrossRef]
- Xu, N.; Zeng, Z.; Mao, W. Reasoning with Multimodal Sarcastic Tweets via Modeling Cross-Modality Contrast and Semantic Association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3777–3786. [Google Scholar]
- Pan, H.; Lin, Z.; Fu, P.; Qi, Y.; Wang, W. Modeling Intra and Inter-modality Incongruity for Multi-Modal Sarcasm Detection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1383–1392. [Google Scholar]
Dataset | Training | Valid | Test | Total |
---|---|---|---|---|
CMU-MOSEI | 16,417 | 2,345 | 4,691 | 23,453 |
Twitter2019 | 19,816 | 2,410 | 2,409 | 24,635 |
Parameter | CMU-MOSEI | Twitter2019 |
---|---|---|
batch_size | 64 | 64 |
learning_rate | 1 × 10⁻⁴ | 1 × 10⁻⁴ |
dropout | 0.1 | 0.1 |
orig_d_text | 768 | 768 |
orig_d_audio | 74 | N/A (no audio modality) |
orig_d_vision | 35 | 2048 |
optimizer | Adam | Adam |
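The settings above can be read as two per-dataset configurations, sketched below. The key names mirror the table rows but are otherwise illustrative assumptions rather than an excerpt of the authors’ code.

```python
# Per-dataset training configurations transcribed from the table above.
# "orig_d_audio": None marks the missing audio modality in Twitter2019.
CONFIGS = {
    "CMU-MOSEI": {
        "batch_size": 64, "learning_rate": 1e-4, "dropout": 0.1,
        "orig_d_text": 768, "orig_d_audio": 74, "orig_d_vision": 35,
        "optimizer": "Adam",
    },
    "Twitter2019": {
        "batch_size": 64, "learning_rate": 1e-4, "dropout": 0.1,
        "orig_d_text": 768, "orig_d_audio": None, "orig_d_vision": 2048,
        "optimizer": "Adam",
    },
}
```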
Model | Acc-2 ↑ | F1 ↑ | MAE ↓ | Corr ↑ |
---|---|---|---|---|
TFN [16] | 82.5 | 82.1 | 0.593 | 0.700 |
LMF [29] | 82.0 | 82.1 | 0.623 | 0.677 |
MFM [30] | 84.4 | 84.3 | 0.568 | 0.717 |
ICCN [31] | 84.2 | 84.2 | 0.565 | 0.713 |
MulT [17] | 82.5 | 82.3 | 0.580 | 0.703 |
MISA [8] | 84.23 | 83.97 | 0.568 | 0.724 |
MAG-BERT [9] | 84.82 | 84.71 | 0.543 | 0.755 |
Self-MM [32] | 84.16 | 84.24 | 0.535 | 0.763 |
MMIM [33] | 85.04 | 84.88 | 0.543 | 0.758 |
TCMCL [34] | 85.8 | 85.7 | 0.541 | 0.759 |
TCHFN [35] | 86.27 | 86.48 | 0.538 | 0.770 |
MMKT | 86.43 | 86.56 | 0.529 | 0.772 |
Model | Acc ↑ | Recall ↑ | F1 ↑ |
---|---|---|---|
HPM [28] | 83.44 | 84.15 | 80.18 |
D&R Net [36] | 84.02 | 83.42 | 80.60 |
IIMI-MMSD [37] | 86.05 | 85.08 | 82.92 |
MMKT | 88.13 | 85.79 | 85.05 |
Model | Acc-2 ↑ | F1 ↑ | MAE ↓ | Corr ↑ |
---|---|---|---|---|
MMKT | 86.43 | 86.56 | 0.529 | 0.772 |
MMKT-T | 81.20 | 81.13 | 0.595 | 0.695 |
MMKT-K | 85.51 | 85.64 | 0.566 | 0.771 |
MMKT-KT | 80.95 | 81.06 | 0.618 | 0.702 |
Model | Acc-2 ↑ | F1 ↑ | MAE ↓ | Corr ↑ |
---|---|---|---|---|
T + A + V | 86.43 | 86.56 | 0.529 | 0.772 |
T | 84.53 | 84.62 | 0.556 | 0.751 |
T + A | 85.22 | 85.35 | 0.548 | 0.762 |
T + V | 85.77 | 85.82 | 0.535 | 0.766 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).