A Review of Multimodal Sentiment Analysis in Online Public Opinion Monitoring
Abstract
1. Introduction
2. Review Methodology
2.1. Scope
- the evolutionary trajectory of multimodal fusion techniques, from traditional feature-level approaches to contemporary Transformer-based architectures;
- task-specific methodological deployments across social media monitoring, product/service feedback analysis, and public safety/crisis management;
- comparative performance assessment across English and Chinese benchmark datasets, including CMU-MOSI, CMU-MOSEI, CH-SIMS, and CH-SIMSv2;
- emerging challenges and future research directions arising from LLM integration and domain-specific requirements.
2.2. Strategy
2.3. Contributions
- Synthesize existing methodologies to provide researchers with a detailed understanding of available methods and resources;
- Systematically analyze the evolution of fusion strategies from conventional paradigms to modern transformer-based approaches;
- Evaluate practical applications across key public opinion monitoring scenarios;
- Identify pressing challenges and prospective research avenues in the current technological landscape.
3. Multimodal Sentiment Analysis
3.1. Unimodal Sentiment Analysis
3.1.1. Text Modality
- Data Preprocessing: The raw text undergoes systematic cleaning and normalization, including the removal of noise such as stop words and punctuation, dictionary matching against sentiment lexicons to identify sentiment-bearing terms, and lexical analysis to segment the text into words and phrases for further processing.
- Feature Extraction: At this stage, the processed text is transformed into machine-readable numerical representations using techniques such as Bag-of-Words for frequency-based encoding or word embeddings for semantic vectorization. Simultaneously, sentiment scores and polarity values are computed to quantify the emotional orientation embedded in the textual content.
- Model Training: The extracted features are fed into appropriate learning algorithms for training predictive models. This includes traditional machine learning methods such as Support Vector Machines (SVMs) for classification, as well as advanced deep learning architectures like BERT for contextual language understanding and ResNet for handling complex feature mappings.
- Result Visualization: Finally, the analysis outcomes are presented through intuitive visual formats—such as charts, graphs, or interactive dashboards—to effectively convey sentiment patterns, trends, and insights derived from the model predictions.
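The four stages above can be illustrated end to end with standard tooling. The following minimal sketch uses scikit-learn on a toy corpus with placeholder labels (both are illustrative assumptions, not data from the surveyed systems); it covers preprocessing, feature extraction, and SVM training, with visualization reduced to printing predictions.

```python
# Minimal sketch of the text-modality pipeline: preprocessing,
# feature extraction, model training, and (trivially) result output.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for cleaned social-media posts (hypothetical).
texts = ["the service was wonderful", "terrible delays and rude staff",
         "pretty average overall", "absolutely loved the experience"]
labels = ["pos", "neg", "neu", "pos"]

# Preprocessing + feature extraction: TF-IDF removes English stop words
# and maps each post to a frequency-weighted numerical vector.
# Model training: a linear SVM, as in the classical approaches cited above.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(texts, labels)

# A production system would aggregate predictions into charts or
# dashboards; here we simply print them.
print(model.predict(["staff were rude", "wonderful experience"]))
```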
3.1.2. Visual Modality
3.1.3. Speech Modality
3.2. Multimodal Feature Fusion
3.2.1. Fusion Strategy
- i. Early Fusion (Feature-Level Fusion)
- ii. Late Fusion (Decision-Level Fusion)
- iii. Hybrid Fusion (Mid-Level Fusion)
- iv. Tensor Fusion
- v. Model-Level Fusion
- vi. Transformer-Based Fusion
- vii. Hierarchical Fusion
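To make the contrast between the first two strategies concrete, the following sketch (PyTorch; the batch size, feature dimensions, and binary sentiment head are illustrative assumptions) implements early fusion as feature concatenation before a joint classifier, and late fusion as an average over per-modality decisions.

```python
# Early vs. late fusion over text, visual, and audio feature vectors.
import torch
import torch.nn as nn

d_text, d_vis, d_aud, n_cls = 768, 35, 74, 2   # assumed dimensions
x_t = torch.randn(8, d_text)                   # batch of 8 samples
x_v = torch.randn(8, d_vis)
x_a = torch.randn(8, d_aud)

# Early (feature-level) fusion: concatenate raw features so one joint
# classifier can learn low-level cross-modal correlations.
early_head = nn.Linear(d_text + d_vis + d_aud, n_cls)
early_logits = early_head(torch.cat([x_t, x_v, x_a], dim=-1))

# Late (decision-level) fusion: independent classifiers per modality;
# averaging their logits keeps modalities decoupled but discards
# cross-modal interactions.
heads = [nn.Linear(d, n_cls) for d in (d_text, d_vis, d_aud)]
late_logits = sum(h(x) for h, x in zip(heads, (x_t, x_v, x_a))) / 3

print(early_logits.shape, late_logits.shape)   # both (8, 2)
```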
3.2.2. Current Research Status in Multimodal Sentiment Analysis
4. Online Public Opinion Monitoring
4.1. Theoretical Foundation
4.2. Manual Monitoring Methods
4.3. Machine Learning–Based Methods
4.4. Multimodal Sentiment Analysis in Public Opinion Monitoring
4.4.1. Social Media Monitoring
4.4.2. Product and Service Feedback
4.4.3. Public Safety and Crisis Management
4.5. Performance Comparison of Multimodal Sentiment Analysis Methods
- CMU-MOSI: Developed by Carnegie Mellon University in 2018, this dataset contains 2199 video clips and focuses on sentiment computing and public opinion analysis. It includes text, visual, and audio modalities and is available in English.
- CMU-MOSEI: Also from Carnegie Mellon University, this dataset was created in 2018 and comprises 23,500 video clips. It is used for sentiment computing, public opinion analysis, and human–computer interaction, incorporating text, visual, and audio data in English.
- CH-SIMS: This dataset, developed by Tsinghua University in 2020, contains 2281 video clips. It is utilized for sentiment computing, user behavior analysis, and public opinion analysis, covering text, visual, and audio modalities in Chinese.
- CH-SIMSv2: An extension of CH-SIMS, this dataset was also developed by Tsinghua University and released in 2022. It includes a larger volume of 14,563 video clips and is used for similar applications as CH-SIMS, focusing on text, visual, and audio data in Chinese.
- The LMF [80] model can dynamically and selectively fuse information from language, visual, and audio modalities to capture the interaction relationships among different modalities, achieving multimodal emotion computation and intent recognition.
- The MFN [81] independently models the interactions within each perspective and captures the cross-interactions between different perspectives, while storing and updating this interaction information through a multi-perspective gated memory module to achieve multi-modal and multi-perspective sequence learning.
- MISA [82] decomposes each modality into modality-invariant and modality-specific features, fuses them, and predicts emotional states, reducing the modality gap while lowering model complexity.
- EF-LSTM [83] uses recurrent neural networks and tensor operations to obtain semantic combination relationships at the phrase and sentence levels.
- LF-DNN [84] is a multi-modal, multi-perspective sequence learning method based on late fusion: audio, video, and text features are first encoded by separate subnetworks (including a BLSTM stream) and their outputs are then combined, enabling simultaneous prediction of six types of emotions and their intensities.
- Self-MM [85] automatically generates single-modal labels to jointly train multi-modal and single-modal tasks, effectively capturing the consistency and differences between modalities, and achieving self-supervised multi-task learning without additional manual annotations.
- MMIM [86] maximizes mutual information at the input and fusion levels to reduce the loss of task-related information, using both parametric and non-parametric methods to estimate the lower bound of maximizing mutual information, thereby improving the quality of multi-modal data fusion.
- MFM [87] decomposes multi-modal representations into “cross-modal discriminative factors” and “modality-specific generative factors”, with the former used for task prediction and the latter for data reconstruction and missing modality completion, achieving joint optimization of generation and target discrimination.
- Graph-MFN [64] uses a graph structure to dynamically control the weights of language, visual, and acoustic modalities in real time, explicitly modeling single-, dual-, and triple-modal interactions, achieving more efficient modality fusion.
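Several of the models above (LMF, MFM) build on the tensor fusion idea. The sketch below reproduces its core step: each modality vector is padded with a constant 1 so that a three-way outer product contains every unimodal, bimodal, and trimodal interaction term. Dimensions are illustrative assumptions; LMF's contribution is to factorize the layer applied to this tensor into low-rank, modality-specific factors, which is why its trainable parameter count stays small (≈0.5 M in the tables below).

```python
# Three-way tensor fusion: outer product of 1-padded modality vectors.
import torch

h_t, h_v, h_a = torch.randn(4), torch.randn(3), torch.randn(2)  # assumed sizes
one = torch.ones(1)

z_t = torch.cat([h_t, one])   # (5,)
z_v = torch.cat([h_v, one])   # (4,)
z_a = torch.cat([h_a, one])   # (3,)

# The (5, 4, 3) tensor holds all uni-, bi-, and tri-modal products;
# the padded 1s are what make the lower-order terms appear.
fusion = torch.einsum('i,j,k->ijk', z_t, z_v, z_a)
print(fusion.shape)  # torch.Size([5, 4, 3])
```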
The compared models fall into two groups:
- (1) Those using a BERT model trained in English as the text modality encoder;
- (2) Those using a BERT model trained in Chinese as the text modality encoder.
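In practice, the two settings reduce to swapping the pretrained checkpoint used for text feature extraction. A sketch with the Hugging Face transformers library follows; the two checkpoint names are common public BERT weights, assumed here purely for illustration.

```python
# Extracting sentence-level text features with an English or a
# Chinese BERT encoder (checkpoint names are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

for ckpt, sentence in [("bert-base-uncased", "the response was very fast"),
                       ("bert-base-chinese", "响应速度非常快")]:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    cls_vec = hidden[:, 0]  # [CLS] embedding used as the text feature
    print(ckpt, cls_vec.shape)
```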
- Top-5 Classification Accuracy: This metric evaluates a classification model by checking whether the correct label appears among the top five predictions for each sample. It is given by $\mathrm{Acc}_{\text{top-5}} = \frac{1}{N}\sum_{i=1}^{N} I\big(y_i \in \hat{Y}_i^{(5)}\big)$, where $N$ is the total number of samples and $I(\cdot)$ is an indicator function returning 1 if the ground-truth label $y_i$ is among the five top-ranked predicted sentiment labels $\hat{Y}_i^{(5)}$, and 0 otherwise.
- Mean Absolute Error (MAE): This metric measures the average magnitude of errors between predicted and actual values without considering their direction. It is calculated as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$, where $y_i$ is the true sentiment intensity for the $i$-th sample and $\hat{y}_i$ is the predicted value.
- Correlation Coefficient: This metric quantifies the strength and direction of the linear relationship between predicted and actual sentiment intensities. It is defined as $\mathrm{Corr} = \frac{\sum_{i=1}^{N}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}\,\sqrt{\sum_{i=1}^{N}(\hat{y}_i-\bar{\hat{y}})^2}}$, where $\bar{y}$ and $\bar{\hat{y}}$ are the mean true and predicted sentiment intensities, respectively.
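All three metrics translate directly into NumPy, as in the sketch below; the sentiment intensities and top-5 label lists are toy values for illustration only.

```python
import numpy as np

def top5_accuracy(top5_preds, truths):
    """top5_preds: (N, 5) top-ranked labels per sample; truths: (N,)."""
    return np.mean([int(t in row) for row, t in zip(top5_preds, truths)])

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def corr(y_true, y_pred):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    num = np.sum((y - y.mean()) * (p - p.mean()))
    den = np.sqrt(np.sum((y - y.mean()) ** 2)) * np.sqrt(np.sum((p - p.mean()) ** 2))
    return num / den

y_true = [1.8, -0.6, 0.0, 2.4]   # toy ground-truth intensities
y_pred = [1.5, -0.2, 0.4, 2.0]   # toy predictions
print(mae(y_true, y_pred))       # 0.375
print(round(corr(y_true, y_pred), 3))
print(top5_accuracy(np.array([[0, 1, 2, 3, 4]]), np.array([3])))  # 1.0
```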
5. Conclusions
- (1) Collaborative Representation
- (2) Fine-Grained Sentiment Recognition
- (3) Datasets
- VQA 2.0: Released by Virginia Tech and Georgia Institute of Technology in 2017, this dataset targets emotion classification, product recommendation, and visual question answering, integrating text and visual modalities.
- Twitter 2017: Curated by Fudan University in 2018, this resource facilitates sentiment analysis, user behavior analysis, and cross-lingual sentiment analysis, comprising text and visual data.
- CMU-MOSI: Produced by Carnegie Mellon University in 2018, this dataset serves sentiment analysis and public opinion monitoring, encompassing text, visual, and audio modalities.
- CMU-MOSEI: Also compiled in 2018 by Carnegie Mellon University and the University of Rochester, this collection supports sentiment analysis, public opinion monitoring, and cross-modal representation learning, featuring text, visual, and audio inputs.
- UR-FUNNY: Issued by Carnegie Mellon University in 2019, this dataset is designed for humor detection, multi-modal sentiment analysis, and human-computer interaction, combining text, visual, and acoustic information.
- CH-SIMS: Developed by Tsinghua University in 2020, this resource addresses sentiment analysis, user behavior analysis, and Chinese public opinion monitoring, integrating textual, visual, and auditory channels.
- MUGE: Published in 2022 by Alibaba DAMO Academy, Tsinghua University, and Alibaba Cloud TI Platform, this dataset enables emotion classification, image captioning, text-to-image retrieval, and image generation from textual descriptions, utilizing text and visual modalities.
- Wukong: Released by Huawei Noah’s Ark Lab in 2022, this collection supports image-text retrieval, zero-shot image classification, and Chinese public opinion monitoring, incorporating text and visual data.
- CH-SIMSV2: An expanded version from Tsinghua University published in 2022, this dataset continues to serve sentiment analysis, user behavior analysis, and Chinese public opinion monitoring, featuring text, visual, and audio modalities.
- Touch100k: Introduced in 2024 by Beijing Jiaotong University, Beijing University of Posts and Telecommunications, and Tencent WeChat AI Team, this pioneering dataset focuses on haptic perception, imitation learning, and sentiment analysis, uniquely merging haptic and visual sensory data.
- PanoSent: Developed by the National University of Singapore in 2024, this resource is applied to sentiment analysis, user behavior analysis, and public opinion monitoring, integrating text, visual, and audio modalities.
- SEED-VII: Released in 2024 by Shanghai Jiao Tong University, this specialized dataset facilitates cross-modal analysis, sentiment analysis, brain-computer interface research, and psychological studies, employing EEG and eye-tracking modalities.
- (4) Uncertain Data Processing
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Lu, Q.; Sun, X.; Long, Y.; Gao, Z.; Feng, J.; Sun, T. Sentiment Analysis: Comprehensive Reviews, Recent Advances, and Open Challenges. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15092–15112.
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal Sentiment Analysis: A Systematic Review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges and Future Directions. Inf. Fusion 2023, 91, 424–444.
- Xu, P.; Zhu, X.; Clifton, D.A. Multimodal Learning with Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132.
- He, J.; Zhang, C.; Li, X. Survey of Research on Multimodal Fusion Technology for Deep Learning. Comput. Eng. 2020, 46, 1–11.
- Zhao, H.; Yang, M.; Bai, X.; Liu, H. A Survey on Multimodal Aspect-Based Sentiment Analysis. IEEE Access 2024, 12, 12039–12052.
- Zhang, C.; Tong, X.; Tong, H.; Yang, Y. A Survey of Large Language Models in the Domain of Cybersecurity. Netinfo Secur. 2024, 24, 778.
- Guo, X.; Wushour·Silamu, M.; Tuerhong, G. Survey of Sentiment Analysis Algorithms Based on Multimodal Fusion. Comput. Eng. Appl. 2024, 60, 1–18.
- Liu, X.; Wei, F.; Jiang, W.; Zheng, Q.; Qiao, Y.; Liu, J.; Niu, L.; Chen, Z.; Dong, H. MTR-SAM: Visual Multimodal Text Recognition and Sentiment Analysis in Public Opinion Analysis on the Internet. Appl. Sci. 2023, 13, 7307.
- Hu, M.; Liu, B. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’04, Seattle, WA, USA, 22–25 August 2004; pp. 168–177.
- Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 142–150.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013.
- He, Y.; Sun, S.; Niu, F. A Deep Learning Model Enhanced with Emotion Semantics for Microblog Sentiment Analysis. Chin. J. Comput. 2017, 40, 773–790.
- Jin, D.; Ren, H.; Tang, R. Research on Offensive Language Detection in Social Networks Based on Emotion-Assisted Multi-Task Learning. Netinfo Secur. 2025, 25, 281.
- Li, Y.; Dong, H. Text Sentiment Analysis Based on Feature Fusion of Convolution Neural Network and Bidirectional Long Short-Term Memory Network. J. Comput. Appl. 2018, 38, 3075.
- Han, P.; Sun, J.; Fang, C. Micro-Blog Sentiment Analysis Based on Emotional Fusion and Multi-Dimensional Self-Attention Mechanism. J. Comput. Appl. 2019, 39, 75–78.
- Tamura, H.; Mori, S.; Yamawaki, T. Textural Features Corresponding to Visual Perception. IEEE Trans. Syst. Man Cybern. 1978, 8, 460–473.
- Colombo, C.; Del Bimbo, A.; Pala, P. Semantics in Visual Information Retrieval. IEEE MultiMed. 1999, 6, 38–53.
- Machajdik, J.; Hanbury, A. Affective Image Classification Using Features Inspired by Psychology and Art Theory. In Proceedings of the 18th ACM International Conference on Multimedia, MM’10, Firenze, Italy, 25–29 October 2010; pp. 83–92.
- Borth, D.; Ji, R.; Chen, T.; Breuel, T.; Chang, S.F. Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. In Proceedings of the 21st ACM International Conference on Multimedia, MM’13, Barcelona, Spain, 21–25 October 2013; pp. 223–232.
- Yang, J.; Sun, M.; Sun, X. Learning Visual Sentiment Distributions via Augmented Conditional Probability Neural Network. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2017; Volume 31.
- Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
- Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting Microblog Sentiments via Weakly Supervised Multimodal Deep Learning. IEEE Trans. Multimed. 2018, 20, 997–1007.
- He, Y.; Ding, G. Deep Transfer Learning for Image Emotion Analysis: Reducing Marginal and Joint Distribution Discrepancies Together. Neural Process. Lett. 2020, 51, 2077–2086.
- Zhao, Z.; Liu, Q. Former-DFER: Dynamic Facial Expression Recognition Transformer. In Proceedings of the 29th ACM International Conference on Multimedia, MM’21, Virtual Event, 20–24 October 2021; pp. 1553–1561.
- Lin, Y.; Wei, G. Speech Emotion Recognition Based on HMM and SVM. In Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, China, 18–21 August 2005; Volume 8, pp. 4898–4901.
- Wu, C.H.; Liang, W.B. Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels. IEEE Trans. Affect. Comput. 2011, 2, 10–21.
- Sundberg, J.; Patel, S.; Bjorkner, E.; Scherer, K.R. Interdependencies among Voice Source Parameters in Emotional Speech. IEEE Trans. Affect. Comput. 2011, 2, 162–174.
- Jin, Q.; Li, C.; Chen, S.; Wu, H. Speech Emotion Recognition with Acoustic and Lexical Features. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4749–4753.
- Mencattini, A.; Martinelli, E.; Ringeval, F.; Schuller, B.; Natale, C.D. Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models. IEEE Trans. Affect. Comput. 2017, 8, 314–327.
- Eskimez, S.E.; Duan, Z.; Heinzelman, W. Unsupervised Learning Approach to Feature Analysis for Automatic Speech Emotion Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5099–5103.
- Pourebrahim, Y.; Razzazi, F.; Sameti, H. Semi-Supervised Parallel Shared Encoders for Speech Emotion Recognition. Digit. Signal Process. 2021, 118, 103205.
- Zeng, Z.; Pantic, M.; Roisman, G.I.; Huang, T.S. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 39–58.
- Ren, Z.; Wang, Z.; Ke, Z.; Li, Z.; Wushour, S. Survey of Multimodal Data Fusion. Comput. Eng. Appl. 2021, 57, 49.
- Bayoudh, K. A Survey of Multimodal Hybrid Deep Learning for Computer Vision: Architectures, Applications, Trends, and Challenges. Inf. Fusion 2024, 105, 102217.
- Lueangwitchajaroen, P.; Watcharapinchai, S.; Tepsan, W.; Sooksatra, S. Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7. J. Imaging 2024, 10, 320.
- Zhang, X.; Sun, F.; Feng, L. Multi-View Representations for Fake News Detection. Netinfo Secur. 2024, 24, 438.
- Zhao, X.; Xie, Y.; Wan, Y. Detection and Identification Model of Gambling Websites Based on Multi-Modal Data. Netinfo Secur. 2023, 23, 77.
- Zheng, A.; He, J.; Wang, M.; Li, C.; Luo, B. Category-Wise Fusion and Enhancement Learning for Multimodal Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4416212.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
- Shvetsova, N.; Chen, B.; Rouditchenko, A.; Thomas, S.; Kingsbury, B.; Feris, R.; Harwath, D.; Glass, J.; Kuehne, H. Everything at Once—Multi-modal Fusion Transformer for Video Retrieval. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 19988–19997.
- Xu, H.; Yan, M.; Li, C.; Bi, B.; Huang, S.; Xiao, W.; Huang, F. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 503–513.
- Girdhar, R.; Singh, M.; Ravi, N.; van der Maaten, L.; Joulin, A.; Misra, I. Omnivore: A Single Model for Many Visual Modalities. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16081–16091.
- Tschannen, M.; Mustafa, B.; Houlsby, N. CLIPPO: Image-and-Language Understanding from Pixels Only. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 11006–11017.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763.
- Huan, R.; Zhong, G.; Chen, P.; Liang, R. UniMF: A Unified Multimodal Framework for Multimodal Sentiment Analysis in Missing Modalities and Unaligned Multimodal Sequences. IEEE Trans. Multimed. 2024, 26, 5753–5768.
- Yi, G.; Fan, C.; Tao, J.; Lv, Z.; Wen, Z.; Pei, G.; Li, T. A Two-Stage Stacked Transformer Framework for Multimodal Sentiment Analysis. Intell. Comput. 2024, 3, 0081.
- Peng, C.; Zhang, C.; Xue, X.; Gao, J.; Liang, H.; Niu, Z. Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification. Tsinghua Sci. Technol. 2022, 27, 664–679.
- Zhang, T.; Zhou, G. Text-Image Gated Fusion Mechanism for Multimodal Aspect-Based Sentiment Analysis. Comput. Sci. 2024, 51, 242–249.
- Wang, S.; Cai, G.; Guangrui, L. Aspect-Level Multimodal Co-Attention Graph Convolutional Sentiment Analysis Model. J. Image Graph. 2023, 28, 3838–3854.
- Zhang, L.; Wang, K.; Zichao, P. Target-Oriented Interaction Graph Neural Networks for Multimodal Aspect-Level Sentiment Analysis. Comput. Eng. Appl. 2024, 60, 136.
- Li, J.; Liu, R.; Miao, Q.; Wang, D.; Liu, X. CAETFN: Context Adaptively Enhanced Text-Guided Fusion Network for Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2025, 16, 3122–3138.
- Lin, Z.; Long, Y.; Jiachen, D. A Multimodal Sentiment Recognition Method Based on Multitask Learning. Acta Sci. Nat. Univ. Pekin. 2021, 57, 7.
- Fan, R.; He, T.; Chen, M.; Zhang, M.; Tu, X.; Dong, M. Dual Causes Generation Assisted Model for Multimodal Aspect-Based Sentiment Classification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 9298–9312.
- Tao, J.; Fan, C.; Lian, Z.; Lv, Z.; Ying, S.; Shan, L. Development of Multimodal Sentiment Recognition and Understanding. J. Image Graph. 2024, 29, 1607–1627.
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609.
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
- Pang, N.; Wu, W.; Hu, Y.; Xu, K.; Yin, Q.; Qin, L. Enhancing Multimodal Sentiment Analysis via Learning from Large Language Model. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
- Gu, Q.; Wang, X. Network Public Opinion Analysis: Theory, Technology and Application; Tsinghua University Press: Beijing, China, 2020.
- An, L.; Wu, L. An Integrated Analysis of Topical and Emotional Evolution of Microblog Public Opinions on Public Emergencies. Library Inf. Serv. 2017, 61, 120–129.
- Wang, Z.; Guo, Y.; Fu, J. CLIP-PubOp: A CLIP-Based Multimodal Representation Fusion Method for Public Opinion. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2243–2246.
- Chen, J. Research on Sentiment Analysis of Netizens Based on Fusion of Multi-Modal Hierarchical Features. Master’s Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2022.
- Yang, X.; Feng, S.; Zhang, Y.; Wang, D. Multimodal Sentiment Detection Based on Multi-Channel Graph Neural Networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 328–339.
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages. IEEE Intell. Syst. 2016, 31, 82–88.
- Bagher Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2236–2246.
- Zhou, J.; Zhao, J.; Huang, J.X.; Hu, Q.V.; He, L. MASAD: A Large-Scale Dataset for Multimodal Aspect-Based Sentiment Analysis. Neurocomputing 2021, 455, 47–58.
- Xiang, Y.; Cai, Y.; Guo, J. MSFNet: Modality Smoothing Fusion Network for Multimodal Aspect-Based Sentiment Analysis. Front. Phys. 2023, 11, 1187503.
- Yang, D.; Li, X.; Li, Z.; Zhou, C.; Wang, X.; Chen, F. Prompt Fusion Interaction Transformer for Aspect-Based Multimodal Sentiment Analysis. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
- Hu, R.; Yi, J.; Chen, A.; Chen, L. Multichannel Cross-Modal Fusion Network for Multimodal Sentiment Analysis Considering Language Information Enhancement. IEEE Trans. Ind. Inform. 2024, 20, 9814–9824.
- Xie, Z.; Yang, Y.; Wang, J.; Liu, X.; Li, X. Trustworthy Multimodal Fusion for Sentiment Analysis in Ordinal Sentiment Space. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7657–7670.
- Wang, X.; Lyu, J.; Kim, B.G.; Parameshachari, B.D.; Li, K.; Li, Q. Exploring Multimodal Multiscale Features for Sentiment Analysis Using Fuzzy-Deep Neural Network Learning. IEEE Trans. Fuzzy Syst. 2025, 33, 28–42.
- Du, P. Research and Application of Multimodal Sentiment Analysis Methods in Chinese. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2023.
- Ni, N. Research on Cross-Media Topic Detection and Opinion Analysis. Master’s Thesis, Beijing University of Posts and Telecommunications, Beijing, China, 2019.
- Xu, N.; Mao, W. Multi-Interactive Memory Network for Aspect Based Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2019; Volume 33, pp. 371–378.
- Xue, X.; Zhang, C.; Niu, Z.; Wu, X. Multi-Level Attention Map Network for Multimodal Sentiment Analysis. IEEE Trans. Knowl. Data Eng. 2023, 35, 5105–5118.
- Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks. Nature 2023, 619, 533–538.
- Xu, Y.; Zhu, L.; Huang, B.; Ma, L.; Zhu, L. Public Opinion Analysis Based on EEMD-Transformer Model: Taking COVID-19 Public Opinion as an Example. J. Wuhan Univ. (Nat. Sci. Ed.) 2020, 66, 418–424.
- Mao, H.; Yuan, Z.; Xu, H.; Yu, W.; Liu, Y.; Gao, K. M-SENA: An Integrated Platform for Multimodal Sentiment Analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 204–213.
- Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 3718–3727.
- Liu, Y.; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y.; Cheng, T.; Li, X.; Xu, H.; Gao, K. Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module. In Proceedings of the 2022 International Conference on Multimodal Interaction, ICMI’22, Bengaluru, India, 7–11 November 2022; pp. 247–258.
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1103–1114.
- Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory Fusion Network for Multi-View Sequential Learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 5634–5641.
- Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In MM ’20: Proceedings of the 28th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 1122–1131.
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1631–1642.
- Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia, 20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 11–19.
- Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 10790–10797.
- Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 9180–9192.
- Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning Factorized Multimodal Representations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Coppini, S.; Lucifora, C.; Vicario, C.M.; Gangemi, A. Experiments on Real-Life Emotions Challenge Ekman’s Model. Sci. Rep. 2023, 13, 9511.
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6325–6334.
- Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive Co-Attention Network for Named Entity Recognition in Tweets. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 5674–5681.
- Hasan, M.K.; Rahman, W.; Bagher Zadeh, A.; Zhong, J.; Tanveer, M.I.; Morency, L.P.; Hoque, M.E. UR-FUNNY: A Multimodal Language Dataset for Understanding Humor. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2046–2056.
- Lin, J.; Men, R.; Yang, A.; Zhou, C.; Ding, M.; Zhang, Y.; Wang, P.; Wang, A.; Jiang, L.; Jia, X.; et al. M6: A Chinese Multimodal Pretrainer. arXiv 2021, arXiv:2103.00823.
- Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. Wukong: A 100 Million Large-Scale Chinese Cross-Modal Pre-Training Benchmark. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 26418–26431.
- Cheng, N.; Guan, C.; Gao, J.; Wang, W.; Li, Y.; Meng, F.; Zhou, J.; Fang, B.; Xu, J.; Han, W. Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation. arXiv 2024, arXiv:2406.03813.
- Luo, M.; Fei, H.; Li, B.; Wu, S.; Liu, Q.; Poria, S.; Cambria, E.; Lee, M.L.; Hsu, W. PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-Based Sentiment Analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, MM’2024, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7667–7676.
- Jiang, W.; Liu, X.; Zheng, W.; Lu, B. SEED-VII: A Multimodal Dataset of Six Basic Emotions with Continuous Labels for Emotion Recognition. IEEE Trans. Affect. Comput. 2025, 16, 969–985.
- Wan, J.; Li, X.; Zhao, J.; Li, M.; Deng, Z.; Chen, H. Joint Uncertainty Model and Metric for Robust Feature Selection: A Bi-Level Distribution Consideration and Feature Evaluation Approach. Fuzzy Sets Syst. 2026, 523, 109615.
| Fusion Method | Advantages | Disadvantages |
|---|---|---|
| Early Fusion | Preserves fine-grained cross-modal interactions; enables end-to-end joint optimization; learns low-level feature correlations | High computational complexity; strict temporal/spatial alignment required; vulnerable to noise and missing modalities |
| Late Fusion | Modular design with independent training; high computational efficiency; robust to misalignment | Cannot capture cross-modal interactions; loses cross-modal complementarity; ensemble weights hard to optimize |
| Hybrid Fusion | Balances expressiveness and efficiency; captures mid-level interactions; tolerant of partially missing modalities | Requires careful fusion-layer design; increases model complexity; fusion timing relies on heuristics |
| Tensor Fusion | Models high-order interactions; preserves complete correlations; strong theoretical capacity | Suffers from dimensionality explosion; requires large datasets; poor interpretability |
| Model-Level Fusion | Deep integration with parameter efficiency; enables cross-modal sharing; facilitates transfer learning | Complex architecture design; high coupling reduces flexibility; difficult training convergence |
| Transformer-Based Fusion | Attention learns adaptive weights; captures long-range dependencies; highly scalable and generalizable | Quadratic computational complexity; requires large-scale pretraining; limited interpretability |
| Hierarchical Fusion | Captures multi-scale interactions; combines complementary advantages; strong robustness and adaptability | Complex structure that is hard to train; high computation and memory cost; tedious hyperparameter tuning |
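For the Transformer-based row, a minimal cross-modal attention block is sketched below (assuming PyTorch; batch size, sequence lengths, and dimensions are arbitrary choices for illustration). Text tokens act as queries over audio frames, so the fused text representation is reweighted by acoustic context; stacking such blocks for every modality pair yields the pairwise cross-modal streams typical of this family, with the quadratic attention cost noted in the table applying per pair.

```python
# One cross-modal attention block: text attends to audio.
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
text = torch.randn(8, 20, d_model)    # (batch, text tokens, dim)
audio = torch.randn(8, 50, d_model)   # (batch, audio frames, dim)

cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
fused, attn_weights = cross_attn(query=text, key=audio, value=audio)

print(fused.shape)         # (8, 20, 128): text enriched with audio evidence
print(attn_weights.shape)  # (8, 20, 50): per-token weights over audio frames
```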
| No. | Dataset Name | Application | Modality Types | Data Volume | Language | Year | Institution |
|---|---|---|---|---|---|---|---|
| 1 | CMU-MOSI [63] | Sentiment Computing, Public Opinion Analysis | Text, Visual, Audio | 2199 video clips | English | 2018 | Carnegie Mellon University |
| 2 | CMU-MOSEI [64] | Sentiment Computing, Public Opinion Analysis, Human-Computer Interaction | Text, Visual, Audio | 23,500 video clips | English | 2018 | Carnegie Mellon University |
| 3 | CH-SIMS [78] | Sentiment Computing, User Behavior Analysis, Public Opinion Analysis | Text, Visual, Audio | 2281 video clips | Chinese | 2020 | Tsinghua University |
| 4 | CH-SIMSv2 [79] | Sentiment Computing, User Behavior Analysis, Public Opinion Analysis | Text, Visual, Audio | 14,563 video clips | Chinese | 2022 | Tsinghua University |
| Model Name | Core Idea | Applicable Scenarios | Trainable Parameters |
|---|---|---|---|
| LMF [80] | Dynamic fusion of modalities to capture inter-modal interactions | Multimodal sentiment computing, intent recognition | ≈0.5 M |
| MFN [81] | Multi-perspective sequence learning to fully utilize cross-perspective interaction information | Multi-perspective video analysis, dialogue sentiment recognition | ≈2.2 M |
| MISA [82] | Decompose modalities into invariant and specific features to reduce modal differences | Cross-modal sentiment transfer, low-resource scenarios | ≈104 M |
| EF-LSTM [83] | Use early fusion and model phrase/sentence-level semantic composition | Text-speech sentiment computing, real-time interaction systems | ≈0.89 M |
| LF-DNN [84] | Late fusion of BLSTM-encoded features for joint prediction | Multimodal emotion recognition, human–computer interaction | ≈0.6 M |
| Self-MM [85] | Self-supervised generation of single-modal labels and joint training | Label-scarce scenarios, cross-modal alignment | ≈103 M |
| MMIM [86] | Maximize mutual information between input and fusion layer | Noisy environments, information-missing scenarios | ≈103 M |
| MFM [87] | Decompose and represent cross-modal discriminative factors and modality-specific generative factors | Modality-missing scenarios, data completion | ≈1.41 M |
| Graph-MFN [64] | Use graph structure to dynamically control modality weights and explicitly model modal interactions | Complex multimodal dialogue, sentiment computing | ≈2.11 M |
Results with an English BERT text encoder on CMU-MOSI (Bert_en+MOSI):

| Model Name | Mult_acc_5% | MAE% | Corr% |
|---|---|---|---|
| LMF [80] | 39.65 | 96.81 | 65.02 |
| MFN [81] | 39.21 | 96.69 | 66.14 |
| MISA [82] | 46.99 | 80.91 | 76.60 |
| EF-LSTM [83] | 30.66 | 113.38 | |
| LF-DNN [84] | 38.39 | 96.13 | 65.53 |
| Self-MM [85] | 51.50 | 72.62 | 79.62 |
| MMIM [86] | 51.26 | 74.89 | 77.68 |
| MFM [87] | 39.94 | 93.29 | 65.53 |
| Graph-MFN [64] | 41.45 | 93.50 | 65.89 |
Results with an English BERT text encoder on CMU-MOSEI (Bert_en+MOSEI):

| Model Name | Mult_acc_5% | MAE% | Corr% |
|---|---|---|---|
| LMF [80] | 53.55 | 56.51 | 73.25 |
| MFN [81] | 52.46 | 57.32 | 71.56 |
| MISA [82] | 53.92 | 54.79 | 76.04 |
| EF-LSTM [83] | 51.34 | 59.49 | 68.94 |
| LF-DNN [84] | 53.65 | 55.88 | 73.32 |
| Self-MM [85] | 55.41 | 53.57 | 75.95 |
| MMIM [86] | 51.07 | 58.49 | 71.38 |
| Graph-MFN [64] | 53.18 | 56.74 | 72.60 |
Results with a Chinese BERT text encoder on CH-SIMS (Bert_cn+CH-SIMS):

| Model Name | Mult_acc_5% | MAE% | Corr% |
|---|---|---|---|
| LMF [80] | 36.69 | 44.57 | 56.98 |
| MFN [81] | 38.73 | 44.62 | 56.12 |
| MISA [82] | 37.49 | 44.16 | 57.14 |
| EF-LSTM [83] | 36.40 | 44.94 | 59.20 |
| LF-DNN [84] | 64.62 | 45.25 | 54.58 |
| Self-MM [85] | 42.16 | 41.47 | 59.28 |
Results with a Chinese BERT text encoder on CH-SIMSv2 (Bert_cn+CH-SIMSv2):

| Model Name | Mult_acc_5% | MAE% | Corr% |
|---|---|---|---|
| LMF [80] | 48.87 | 35.66 | 58.32 |
| MFN [81] | 54.52 | 29.79 | 71.99 |
| MISA [82] | 41.52 | 38.70 | 55.33 |
| EF-LSTM [83] | 51.22 | 31.57 | 69.42 |
| LF-DNN [84] | 53.35 | 30.29 | 71.19 |
| Self-MM [85] | 52.35 | 31.63 | 70.76 |
| Graph-MFN [64] | 43.84 | 40.38 | 52.54 |
| No. | Dataset Name | Applications | Modalities | Year | Institution |
|---|---|---|---|---|---|
| 1 | VQA 2.0 [89] | Emotion Classification, Product Recommendation, Visual Question Answering | Text, Visual | 2017 | Virginia Tech, Georgia Institute of Technology |
| 2 | Twitter 2017 [90] | Sentiment Analysis, User Behavior Analysis, Cross-lingual Sentiment Analysis | Text, Visual | 2018 | Fudan University |
| 3 | CMU-MOSI [63] | Sentiment Analysis, Public Opinion Monitoring | Text, Visual, Audio | 2018 | Carnegie Mellon University |
| 4 | CMU-MOSEI [64] | Sentiment Analysis, Public Opinion Monitoring, Cross-modal Representation Learning | Text, Visual, Audio | 2018 | Carnegie Mellon University, University of Rochester |
| 5 | UR-FUNNY [91] | Humor Detection, Multi-modal Sentiment Analysis, Human-computer Interaction | Text, Visual, Audio | 2019 | Carnegie Mellon University |
| 6 | CH-SIMS [78] | Sentiment Analysis, User Behavior Analysis, Chinese Public Opinion Monitoring | Text, Visual, Audio | 2020 | Tsinghua University |
| 7 | MUGE [92] | Emotion Classification, Image Captioning, Text-to-image Retrieval, Text-based Image Generation | Text, Visual | 2022 | Alibaba DAMO Academy, Tsinghua University, Alibaba Cloud TI Platform |
| 8 | Wukong [93] | Image-text Retrieval, Zero-shot Image Classification, Chinese Public Opinion Monitoring | Text, Visual | 2022 | Huawei Noah’s Ark Lab |
| 9 | CH-SIMSv2 [79] | Sentiment Analysis, User Behavior Analysis, Chinese Public Opinion Monitoring | Text, Visual, Audio | 2022 | Tsinghua University |
| 10 | Touch100k [94] | Haptic Perception, Imitation Learning, Sentiment Analysis | Haptic, Visual | 2024 | Beijing Jiaotong University, Beijing University of Posts and Telecommunications, Tencent WeChat AI Team |
| 11 | PanoSent [95] | Sentiment Analysis, User Behavior Analysis, Public Opinion Monitoring | Text, Visual, Audio | 2024 | National University of Singapore |
| 12 | SEED-VII [96] | Cross-modal Analysis, Sentiment Analysis, Brain-computer Interface, Psychology Research | EEG, Eye-tracking | 2024 | Shanghai Jiao Tong University |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.