Review

A Review of Multimodal Sentiment Analysis in Online Public Opinion Monitoring

College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Informatics 2026, 13(1), 10; https://doi.org/10.3390/informatics13010010
Submission received: 28 October 2025 / Revised: 11 January 2026 / Accepted: 12 January 2026 / Published: 14 January 2026

Abstract

With the rapid development of the Internet, online public opinion monitoring has emerged as a crucial task in the information era. Multimodal sentiment analysis, through the integration of multiple modalities such as text, images, and audio, combined with technologies including natural language processing and computer vision, offers novel technical means for online public opinion monitoring. Nevertheless, current research still faces many challenges, such as the scarcity of high-quality datasets, limited model generalization ability, and difficulties with cross-modal feature fusion. This paper reviews the current research progress of multimodal sentiment analysis in online public opinion monitoring, including its development history, key technologies, and application scenarios. Existing problems are analyzed and future research directions are discussed. In particular, we emphasize a fusion-architecture-centric comparison under online public opinion monitoring, and discuss cross-lingual differences that affect multimodal alignment and evaluation.

1. Introduction

In the era of ubiquitous internet connectivity, online platforms have become the primary channels for public opinion expression and information dissemination. The rise of social media platforms such as X (Twitter), Instagram, and TikTok has amplified the scale, speed, and influence of public discourse. Online public opinion monitoring has consequently become essential for government agencies, corporations, and researchers, enabling real-time detection of emerging trends, crises, and sentiment shifts [1].
Multimodal sentiment analysis, a subfield of multimodal learning, integrates text, images, audio, and sometimes video, combining natural language processing (NLP), computer vision (CV), and speech processing to achieve more accurate and nuanced sentiment detection [2]. Its capacity to capture cross-modal cues allows better sentiment inference than unimodal approaches, thereby improving the timeliness and precision of public opinion monitoring.

2. Review Methodology

2.1. Scope

With multimodal sentiment analysis assuming escalating importance in online public opinion monitoring, systematic surveys are essential for consolidating emergent methodologies and forecasting research trajectories. While recent years have witnessed several influential surveys, each exhibits distinct limitations that leave critical gaps unfilled.
Xu et al. [3] recently surveyed Transformer-based multimodal learning through a geometric–topological perspective, analyzing modality-agnostic self-attention mechanisms across diverse applications. However, this breadth spanning vision, language, and beyond yields limited domain-specific insights for sentiment analysis within public opinion contexts. He et al. [4] systematically categorized multimodal fusion architectures into joint, collaborative, and encoder–decoder frameworks, meticulously detailing early, late, and hybrid fusion paradigms. Yet this foundational work predates the large-scale pretrained model era and advanced attention mechanisms that now dominate the field.
Zhao et al. [5] introduced the pioneering survey on Multimodal Aspect-Based Sentiment Analysis, focusing specifically on text–image fusion methods for aspect-level classification. While seminal, their analytical scope remains confined to aspect-based scenarios, neglecting the broader sentiment analysis challenges inherent to public opinion monitoring. Zhang et al. [6] examined large language model applications in cybersecurity, highlighting critical challenges in safety, interpretability, and resource dependency that parallel public opinion system concerns. Nevertheless, their cybersecurity-centric framework does not address multimodal sentiment analysis methodologies themselves.
These surveys collectively reveal a conspicuous void: none provide a comprehensive examination of multimodal sentiment analysis techniques specifically tailored for online public opinion monitoring, particularly within the contemporary landscape shaped by large language models and evolving social media platforms. Notably, despite conducting extensive performance evaluations across state-of-the-art models, existing surveys fail to systematically contextualize these advances within public opinion monitoring frameworks.
The present review establishes its scope across four interconnected dimensions:
  • the evolutionary trajectory of multimodal fusion techniques, from traditional feature-level approaches to contemporary Transformer-based architectures;
  • task-specific methodological deployments across social media monitoring, product/service feedback analysis, and public safety/crisis management;
  • comparative performance assessment across English and Chinese benchmark datasets, including CMU-MOSI, CMU-MOSEI, CH-SIMS, and CH-SIMSv2;
  • emerging challenges and future research directions arising from LLM integration and domain-specific requirements.

2.2. Strategy

The purpose of this systematic review is to chart the evolution and application of multimodal sentiment analysis methodologies within the specific context of online public opinion monitoring. We followed a structured and reproducible search-and-screening procedure, while using a targeted query design to focus on the most relevant and impactful studies in this fast-evolving domain. Our initial search focused on publications that concurrently addressed “multimodal sentiment analysis” and “online public opinion” between 2018 and 2025, scanning six major academic databases—Google Scholar, Web of Science, IEEE, Elsevier, ACM, and CNKI. This precise query yielded 116 promising studies that directly bridged computational methods with public opinion challenges.
Recognizing the interdisciplinary nature of this field, we subsequently broadened our exploration to include individual searches for “multimodal sentiment analysis”, “online public opinion monitoring”, and related fusion methodologies, enabling us to capture foundational techniques and parallel innovations that inform current practice. From this curated collection of literature, we prioritized works demonstrating clear methodological contributions, empirical validation, or practical deployment in real-world monitoring scenarios. Publications lacking substantive AI/ML frameworks, direct relevance to sentiment analysis applications, or sufficient experimental detail were excluded from further consideration. The resulting synthesis draws upon 97 carefully selected references spanning dataset development, feature extraction advances, fusion architecture innovations, and domain-specific implementations across social media, product feedback, and crisis management applications.

2.3. Contributions

To make the lens explicit, we organize and evaluate prior work primarily from a public opinion monitoring perspective, with a focus on how fusion architectures behave across application scenarios and English/Chinese settings. This work aims to
  • Synthesize existing methodologies to provide researchers with a detailed understanding of available methods and resources;
  • Systematically analyze the evolution of fusion strategies from conventional paradigms to modern transformer-based approaches;
  • Evaluate practical applications across key public opinion monitoring scenarios;
  • Identify pressing challenges and prospective research avenues in the current technological landscape.

3. Multimodal Sentiment Analysis

Sentiment analysis is currently a research hotspot at the intersection of computer science, psychology, and the social sciences. Like affective computing and emotion recognition, it uses natural language processing, machine learning, and related techniques to mine and analyze the opinions and topics contained in different modalities of data, and to identify sentiment polarity and orientation [5].
The technical development history of sentiment analysis, as shown in Figure 1, can be roughly divided into three stages:
Initial Exploration (1960–1990): During this foundational period, research predominantly focused on computer-assisted sentiment analysis techniques that relied heavily on manually constructed sentiment lexicons. Scholars developed rule-based approaches using predefined dictionaries of sentiment-laden words and phrases, with computational methods serving primarily as auxiliary tools for linguistic analysis rather than autonomous prediction systems.
Technological Development (1990–2010): This era witnessed the formal conceptualization of sentiment analysis as a distinct research field, accompanied by the emergence of systematic sentiment polarity analysis methodologies. A pivotal advancement was the introduction of the Latent Dirichlet Allocation (LDA) topic model around 2003, which enabled probabilistic modeling of semantic structures in textual data. Statistical machine learning approaches began supplementing purely lexicon-based methods, establishing the technical groundwork for subsequent data-driven paradigms.
Application Maturity (2010–present): The integration of sophisticated machine learning and deep learning architectures has propelled sentiment analysis into a phase of widespread practical deployment. Representative breakthroughs include the Word2Vec model (circa 2013) for dense vector representations, the Global Vectors for Word Representation (GloVe) framework (circa 2014), and the transformative Bidirectional Encoder Representations from Transformers (BERT) model (2018). These innovations have facilitated large-scale implementation across diverse domains, particularly in comprehensive public opinion monitoring, granular product and service feedback analysis, and critical public safety management systems.
Compared with traditional sentiment analysis methods, multimodal sentiment analysis can leverage data from multiple modalities such as text, audio, and images, enabling more comprehensive extraction and judgment of implicit emotional information [7].

3.1. Unimodal Sentiment Analysis

Over its development, unimodal sentiment analysis has achieved significant results in multiple aspects, including multi-dimensional data processing, big data computation, and complementary information across different data types.

3.1.1. Text Modality

As shown in Figure 2, the general process of text sentiment analysis consists of the following steps:
  • Data Collection: This initial phase involves gathering textual data from diverse sources, including manual acquisition through curated datasets and automated web crawling conducted under strict legal and ethical conditions, ensuring compliance with platform policies and data protection regulations.
  • Data Preprocessing: The raw data undergoes systematic cleaning and normalization, which includes removing irrelevant characters such as stop words and punctuation marks, performing dictionary matching to identify sentiment-bearing lexicons, and conducting lexical recognition to parse words and phrases for further analysis.
  • Feature Extraction: At this stage, the processed text is transformed into machine-readable numerical representations using techniques such as Bag-of-Words for frequency-based encoding or word embeddings for semantic vectorization. Simultaneously, sentiment scores and polarity values are computed to quantify the emotional orientation embedded in the textual content.
  • Model Training: The extracted features are fed into appropriate learning algorithms for training predictive models. This includes traditional machine learning methods such as Support Vector Machines (SVMs) for classification, as well as advanced deep learning architectures like BERT for contextual language understanding and ResNet for handling complex feature mappings.
  • Result Visualization: Finally, the analysis outcomes are presented through intuitive visual formats—such as charts, graphs, or interactive dashboards—to effectively convey sentiment patterns, trends, and insights derived from the model predictions.
Early text sentiment analysis methods primarily relied on lexicon-based and rule-based approaches for sentiment recognition and classification [8]. On this basis, Hu et al. [9] used adjectives as prior knowledge to determine sentence sentiment polarity, but this approach was limited in handling non-adjective sentiment expressions and complex contexts. To capture implicit semantic and emotional associations between words, Maas et al. [10] proposed learning word vectors containing both semantic and sentiment information via joint optimization of semantic and sentiment objectives. However, their method lacked adaptability to dynamic semantic changes, limiting performance in complex sentiment analysis tasks.
For simpler text sentiment analysis tasks, the Bag-of-Words (BoW) model is often used, representing source text as a vector of word occurrence counts:
$$\mathrm{BoW}(D) = (w_1, w_2, \ldots, w_n),$$
Another common method is Term Frequency–Inverse Document Frequency (TF-IDF), which evaluates the importance of a word in a document relative to a corpus:
$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t),$$
where $\mathrm{TF}(t, d)$ is the term frequency of term $t$ in document $d$, and $\mathrm{IDF}(t)$ is the inverse document frequency of $t$ across the corpus.
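To make the feature-extraction and model-training steps concrete, the following minimal sketch (assuming the scikit-learn library; the toy reviews and labels are hypothetical) converts raw text into TF-IDF vectors and trains a linear SVM classifier of the kind discussed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with binary sentiment labels (1 = positive, 0 = negative).
train_texts = [
    "the service was excellent and fast",
    "terrible product, very disappointed",
    "absolutely loved the experience",
    "a complete waste of money",
]
train_labels = [1, 0, 1, 0]

# TfidfVectorizer weights each term by its frequency in the document and its
# rarity across the corpus; LinearSVC then learns a linear decision boundary.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["loved the excellent service"]))  # expected: [1]
```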
Mikolov et al. [11] proposed using Word2Vec to compute continuous word vectors from large-scale datasets, enabling quantitative semantic analysis of words. In Word2Vec, the Continuous Bag-of-Words (CBOW) model minimizes the cross-entropy loss of predicting a target word, while the Skip-gram model predicts context words from a given target word. The CBOW objective function is
$$J = \frac{1}{T}\sum_{t=1}^{T} \log P\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right),$$
where $T$ is the total number of words, $k$ is the context window size, $w_t$ is the target word, and $w_{t-k}, \ldots, w_{t+k}$ are its context words.
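As a brief illustration of learning such word vectors in practice, the sketch below uses the open-source gensim library (the toy sentences are hypothetical; `sg=0` selects the CBOW variant corresponding to the objective above):

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus; in practice this would be millions of posts or reviews.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "fantastic"],
    ["the", "plot", "was", "boring"],
]

# sg=0 selects CBOW: the model predicts each target word from its context window.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200)

# Each vocabulary word now has a dense vector usable as a downstream sentiment feature.
print(model.wv["great"].shape)                    # (50,)
print(model.wv.similarity("great", "fantastic"))  # cosine similarity of the two vectors
```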
He et al. [12] proposed a deep learning model enhanced with emotion semantics for microblog sentiment analysis, mapping emojis into an emotional space and combining them with deep models—which are effective when emojis are present but limited when they are absent. Jin et al. [13] treated sentiment data as an auxiliary task within a multi-task learning framework for offensive language detection, reducing reliance on explicit sentiment cues. Li et al. [14] employed Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks for feature extraction and fusion, improving classification accuracy but limiting performance on long-text tasks due to CNN constraints. Han et al. [15] applied multi-dimensional attention mechanisms to capture inter-word dependencies and high-level semantic–emotional information, achieving effective feature extraction but facing high computational complexity, making it unsuitable for large-scale datasets.

3.1.2. Visual Modality

In the visual modality, sentiment analysis was initially applied to image emotion prediction. Tamura et al. [16] computed texture features of images for sentiment analysis, but their method failed to fully exploit multi-scale features in similarity measurement and lacked precision in describing texture elements. Compared to simple surface-feature computation, Colombo et al. [17] applied the Hough transform to compute contour features of image regions, generating slope histograms and establishing a correspondence with emotional features, thus capturing the influence of line orientation on emotion. Machajdik et al. [18] extracted low-level visual features to predict image sentiment, but both methods suffered from dataset sensitivity and limited generalization.
With the advancement of machine learning, many researchers have adopted machine learning-based methods for visual sentiment analysis. Borth et al. [19] proposed a large-scale visual sentiment ontology based on Plutchik’s emotion wheel, building a detector library by detecting adjective–noun pairs related to emotions in images. However, because their sentiment classification was simplified to a three-polarity system, the method struggled with complex emotional information. Yang et al. [20] introduced binary label encoding and label noise augmentation to address the ambiguity in image sentiment, enabling multi-label sentiment recognition.
To address model dependency on high-quality datasets, Zhu et al. [21] leveraged adversarial and cycle-consistency losses to map between unpaired image domains, but their approach was limited in handling geometric transformations. Chen et al. [22] trained models using noisy emoji labels easily obtained from microblogs, alleviating dataset labeling scarcity; however, recognition performance for complex emotions declined compared to specialized visual sentiment models. He et al. [23] improved CNN classification performance on small datasets by reducing marginal and joint distribution discrepancies, though large domain shifts between source and target datasets could introduce noise. Zhao et al. [24] used a convolutional spatial Transformer and a temporal Transformer to learn spatial and temporal features, addressing challenges like occlusion, non-frontal poses, and head movements, but struggled with emotion categories having sparse samples.
The development of visual sentiment analysis parallels that of text sentiment analysis: both aim to achieve efficient and accurate data recognition while increasingly focusing on extracting deeper-level emotional information.

3.1.3. Speech Modality

For basic speech modality sentiment analysis, Lin et al. [25] captured the temporal dynamics of speech using a Mel-frequency sub-band energy difference feature extraction method, achieving high classification accuracy and robustness in gender-independent scenarios.
Benefiting from advancements in machine learning and neural networks, research on acoustic and prosodic features has become a hotspot in speech sentiment analysis. Wu et al. [26] improved sentiment recognition accuracy through acoustic modeling with multiple classifiers combined via a meta-decision tree; however, the method was constrained by pre-defined emotional rules and knowledge bases, making it less effective for ambiguous or personalized emotional expressions. Sunberg et al. [27] applied multiple discriminant analysis and canonical correlation analysis to acoustic parameters, revealing correlations between vocal physiological signals and emotions, but their method was limited by small dataset size and weak acoustic features. Jin et al. [28] extracted low-level acoustic features and emotional vectors for feature representation, but the method relied on acted emotional data, making it less representative of genuine emotional expression and inconsistent in recognizing different emotion categories. Mencattini et al. [29] developed a dynamic cooperative speaker model for continuous emotion estimation in natural speech, but due to subjective labeling and scarce related data, the method—like that of Sunberg et al. [27]—had limited generalizability and stability.
To address dataset scarcity, Eskimez et al. [30] employed variational autoencoders, adversarial autoencoders, and adversarial variational Bayes to learn features from unlabeled speech data, improving sentiment recognition. However, due to emotional diversity, these methods showed inconsistent performance across different emotion categories. Pourebrahim et al. [31] reduced label distribution discrepancies between samples by using parallel shared encoders with a maximum mean discrepancy loss, but their model had high complexity due to the combined use of autoencoders and classification tasks.

3.2. Multimodal Feature Fusion

Multimodal sentiment analysis refers to techniques that use multiple modalities (such as text, images, and audio) to perform sentiment analysis. Compared with unimodal sentiment analysis, it leverages richer data sources to obtain more comprehensive emotional information, thereby improving the accuracy and reliability of sentiment recognition [32].

3.2.1. Fusion Strategy

In multimodal learning, the timing of fusion strategy implementation has a significant impact on the effectiveness of multimodal integration. With the continuous development of machine learning, deep learning models have been introduced into the fusion process to narrow the gap between modalities and enhance feature representation. Since each fusion method has its own advantages and disadvantages, it is often necessary to experiment within training tasks to achieve optimal results [33]. The main fusion strategies are as follows:
  i. Early Fusion (Feature-Level Fusion)
As shown in Figure 3, early fusion refers to combining the features from each modality before decision-making. This approach merges features at the feature level, which can help reduce subsequent processing costs, but requires handling a large volume of heterogeneous feature formats [34].
  ii. Late Fusion (Decision-Level Fusion)
As shown in Figure 4, late fusion integrates the outputs of different modalities after independent decision-making. This allows each modality to use its own optimal classifier, but may incur additional training costs [4].
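To make the contrast between these two strategies concrete, the following minimal NumPy sketch (the feature dimensions and the random linear scorers standing in for trained classifiers are hypothetical) fuses three modality features once at the feature level and once at the decision level:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample features from three modalities (dimensions are arbitrary).
text_feat = rng.normal(size=300)
image_feat = rng.normal(size=512)
audio_feat = rng.normal(size=74)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Early (feature-level) fusion: concatenate features, then apply a single classifier.
joint = np.concatenate([text_feat, image_feat, audio_feat])
w_joint = rng.normal(size=joint.shape[0]) * 0.01   # stand-in for a trained linear model
early_score = sigmoid(w_joint @ joint)

# Late (decision-level) fusion: one classifier per modality, then average the decisions.
w_t, w_v, w_a = (rng.normal(size=d) * 0.01 for d in (300, 512, 74))
late_score = np.mean([sigmoid(w_t @ text_feat),
                      sigmoid(w_v @ image_feat),
                      sigmoid(w_a @ audio_feat)])

print(f"early fusion score: {early_score:.3f}, late fusion score: {late_score:.3f}")
```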
  iii. Hybrid Fusion (Mid-Level Fusion)
As shown in Figure 5, hybrid fusion is performed after feature extraction but before the final decision, allowing the model to capture complementary information between modalities while also leveraging individual modality-specific features [35]. For example, Zhang Xinyou et al. [36] addressed the problem of uncertain information propagation direction in fake news detection by fusing multi-view features from content and news context to generate more comprehensive representations.
  iv. Tensor Fusion
As shown in Figure 6, tensor fusion represents data from different modalities as tensors and fuses them through specific mathematical operations. This can capture intrinsic correlations between modalities but may face challenges in handling high-dimensional data [2].
Zhao Xinhe et al. [37] applied tensor fusion in gambling website detection, aligning textual and visual features to unified dimensions and employing focal loss to enhance classification performance on imbalanced datasets. The tensor fusion operation can be expressed as follows:
$$z_{\mathrm{fusion}} = \mathrm{concat}\left(z_t,\ z_v,\ z_a,\ z_t \otimes z_v,\ z_v \otimes z_a,\ z_a \otimes z_t,\ z_t \otimes z_v \otimes z_a\right),$$
where $z_t$, $z_v$, and $z_a$ denote the feature vectors from the text, visual, and audio modalities, respectively, and $\otimes$ represents the outer product operation.
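A minimal NumPy sketch of this operation (the eight-dimensional toy feature vectors are hypothetical) builds the fused representation from the unimodal, bimodal, and trimodal terms of the equation above:

```python
import numpy as np

rng = np.random.default_rng(0)
z_t, z_v, z_a = rng.normal(size=(3, 8))  # toy text, visual, and audio feature vectors

def outer(*vecs):
    """Outer product of an arbitrary number of vectors, flattened to one dimension."""
    out = vecs[0]
    for v in vecs[1:]:
        out = np.tensordot(out, v, axes=0)
    return out.ravel()

z_fusion = np.concatenate([
    z_t, z_v, z_a,                                       # unimodal terms
    outer(z_t, z_v), outer(z_v, z_a), outer(z_a, z_t),   # bimodal interactions
    outer(z_t, z_v, z_a),                                # trimodal interaction
])
print(z_fusion.shape)  # (3*8 + 3*64 + 512,) = (728,)
```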
  v. Model-Level Fusion
As shown in Figure 7, model-level fusion integrates multimodal data at various stages of model learning, jointly optimizing feature extraction and fusion strategies during training. This requires addressing issues such as balancing the contribution of each modality and handling modality-specific characteristics. Lueangwitchajaroen et al. [35] proposed a multi-layer feature fusion approach based on EfficientNet-B7, integrating spatial and temporal information from RGB video frames at early, middle, and late stages to improve action recognition accuracy. Zheng et al. [38] designed a reinforcement learning strategy leveraging category priors to perform category-wise feature fusion and address data imbalance, thereby reducing reliance on large-scale training data.
  vi. Transformer-Based Fusion
Since Vaswani et al. [39] proposed the Transformer architecture, Transformer-based multimodal fusion has become a research hotspot [3]. For instance, Shvetsova et al. [40] developed a Transformer-based fusion mechanism for zero-shot video retrieval in modality-agnostic environments. Xu et al. [41] built a unified Transformer framework to combine object detection and captioning into pre-training, jointly learning visual representation and cross-modal semantic alignment—though this increases training complexity and imposes high requirements on input image quality.
Researchers have also introduced attention mechanisms into Transformer-based fusion to enhance the learning of key content across modalities. Girdhar et al. [42] combined modalities into spatio-temporal blocks and applied a self-attention-based Transformer for multimodal classification tasks. The attention mechanism can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ denote the query, key, and value matrices, and $d_k$ is the dimensionality of the key vectors.
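For reference, the attention operation above can be written in a few lines. The sketch below (NumPy only, with illustrative shapes) treats, for example, text-token queries attending over audio-frame keys and values, as in cross-modal fusion:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))    # e.g., 4 text-token queries, d_k = 16
K = rng.normal(size=(10, 16))   # e.g., 10 audio-frame keys
V = rng.normal(size=(10, 16))   # corresponding audio-frame values
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```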
Other examples include Tschannen et al. [43], who adapted the CLIP model [44] to process both images and text using ViT (Vision Transformer) and contrastive learning; Huan et al. [45], who used attention-based fusion to complete missing modalities; and Yi et al. [46], who proposed a two-stage stacked Transformer capturing intra-modality communication and inter-fusion representation interactions with an adaptive weight accumulation mechanism. A schematic diagram of the Transformer-Based Fusion Strategy is shown in Figure 8:
  vii. Hierarchical Fusion
As shown in Figure 9, hierarchical fusion integrates multimodal features at multiple abstraction levels, such as low-level perceptual features and high-level semantic features [47]. This approach can better preserve contextual and semantic information, making it effective for handling complex multimodal data.
A generic hierarchical fusion function can be expressed as
$$F = g(f_1, f_2, \ldots, f_m),$$
where $f_1, \ldots, f_m$ are the features extracted from different sources and $g$ is the fusion function.
Given the distinct characteristics of each fusion strategy discussed above, Table 1 provides a systematic comparison of their respective strengths and limitations.

3.2.2. Current Research Status in Multimodal Sentiment Analysis

Recent work includes multimodal aspect-based sentiment analysis (MABSA), multi-task learning frameworks, and the integration of large language models for cross-modal reasoning. MABSA aims to analyze sentiment evaluations of specific aspects within multimodal data. Challenges include handling the complexity of multimodal data, aligning different modalities’ temporal sequences, and improving model generalization and interpretability. For example, Zhang et al. [48] used a gating mechanism to reduce noise interference and enhance image semantic representation via adjective–noun pairs. Wang et al. [49] combined orthogonally constrained self-attention with a gated local cross-modal interaction mechanism to improve MABSA accuracy, but their method suffered from low training efficiency and high sensitivity to hyperparameters. Zhang et al. [50] integrated cross-attention and graph attention networks to improve performance, though the results depended on graph construction parameters. Li et al. [51] used a text-guided fusion approach to reduce redundancy and employed adaptive context enhancement to improve polarity recognition.
In multi-task multimodal sentiment recognition, research focuses on using multi-task learning frameworks to improve performance. Lin et al. [52] proposed a model with shared layers for visual and speech modalities that could jointly learn emotional information.
Multimodal emotion recognition based on deep learning also demonstrates great potential [53]. These models can learn and adapt to specific emotion analysis tasks from a small number of samples, alleviating the problem of scarce data to a certain extent. Moreover, multimodal emotion recognition is increasingly important in the detection of, and intervention for, emotional disorders such as depression, and researchers continue to explore how multimodal data can support more accurate assessment and intervention [54].
Meanwhile, with the development of large language models (LLMs), research on processing multimodal data with such models has been increasing. LLMs enable in-depth mining and analysis of textual, visual, and audio data, as in the multimodal understanding and text generation tasks built on models such as ChatGPT-4, the Qwen-14B series [55], and DeepSeek-R1-Zero [56]. Pang et al. [57] utilized the auxiliary knowledge of multimodal large language models to improve sentiment analysis accuracy and reasoning ability; however, generating this auxiliary knowledge requires additional computational resources, increasing model complexity and training cost.

4. Online Public Opinion Monitoring

Public opinion refers to the collective emotional tendencies and viewpoints widely held by the public regarding a particular issue within a certain social space. With the rapid development of the Internet, the concept of online public opinion has emerged, characterized by its wide reach and high transmission speed. Therefore, online public opinion monitoring is of great significance for the decision-making of governments, enterprises, and various organizations.

4.1. Theoretical Foundation

As one of the main forms of public opinion, online public opinion retains the core characteristics of general public opinion, preserving its essential attributes while manifesting in digital environments. As systematically illustrated in Figure 10, the dissemination process of online public opinion unfolds through three sequential stages: information generation, where original content is initially created and introduced into the digital ecosystem; information diffusion, which involves the propagation and spread of that content across various platforms and networks; and formation of influence, where the cumulative effect of disseminated information shapes public perceptions and generates tangible impacts. This dynamic process fundamentally involves key elements such as information sources that serve as the originators of content, transmission channels that act as conduits for distribution and amplification, and audience responses that reflect the reception, interpretation, and reactive behaviors of end-users, together constituting the interactive framework of online public opinion dissemination.
The formation and evolution of online public opinion are influenced by multiple factors, including social events, media reports, and online user interactions. Thus, evaluating online public opinion requires a comprehensive assessment of its dissemination dynamics, scope of influence, and potential social impact, so as to predict its development trends and possible consequences [58].

4.2. Manual Monitoring Methods

Manual content analysis performed by professionals has always been an indispensable part of public opinion monitoring. From the early 20th century to the mid-20th century, information monitoring mainly relied on reading traditional media and collecting materials via clipping services. With the popularity of radio and television, monitoring expanded to include program listening and random telephone surveys.
In the 1950s, many government agencies and enterprises began using focus groups to conduct qualitative research on public opinion. In the 1980s, the rapid development of computer and mobile communication technologies further improved the informatization of databases and archive management, greatly enhancing the efficiency of data storage, retrieval, and analysis. Questionnaires also evolved from paper-based to electronic formats [59].

4.3. Machine Learning–Based Methods

As depicted in Figure 11, the machine learning-based approach to public opinion monitoring harnesses natural language processing (NLP) and data mining techniques to scrutinize and oversee public sentiment and opinions circulating on the Internet. This method involves several key steps: initially, it gathers data from various online sources; following this, it employs NLP to process and interpret textual content; next, it applies data mining to identify trends and patterns; subsequently, it conducts real-time monitoring to track the evolution of public sentiment; and finally, it predicts trends to inform practical applications and decision-making processes. This comprehensive strategy effectively translates raw data into valuable insights, facilitating a proactive stance on public opinion management.
Based on the CLIP framework, Wang et al. [60] introduced a linear feature fusion layer to significantly improve multimodal representation. However, this method required finding optimal fusion ratios during training and was less adaptable to variations in language and data quality across contexts.
Chen Jie [61] combined the DR-Transformer multimodal fusion mechanism with hierarchical multimodal features for sentiment polarity recognition, mapping relationships between graded features and high-level emotional information while narrowing semantic gaps.
Yang et al. [62] introduced a multi-channel graph neural network to learn global multimodal representations, combined with a multi-head attention mechanism for predicting sentiment from image–text pairs. While effective, their model was relatively complex and had limited training efficiency and generalization.
The general process includes model training, integration and deployment, real-time monitoring, data source processing, trend prediction, and application feedback.

4.4. Multimodal Sentiment Analysis in Public Opinion Monitoring

This emerging field combines multiple data modalities to detect and analyze public sentiment, with applications including the following.

4.4.1. Social Media Monitoring

Multimodal sentiment analysis can be applied to social platforms to analyze user-generated text, images, and videos for sentiment trends. For instance, analyzing tweets with images can yield more accurate sentiment assessments, as text and image content may convey different emotions.
Zadeh et al. developed two benchmark multimodal sentiment corpora:
CMU-MOSI [63]—2199 opinionated video segments with audio;
CMU-MOSEI [64]—over 23,500 sentences from YouTube speakers with audio.
Leveraging MABSA, Zhou et al. [65] used aspect extraction, polarity prediction, and adversarial training to enhance text–image–aspect learning, but their model’s complex interaction mechanisms were sensitive to hyperparameters. Xiang et al. [66] applied feature smoothing and multi-channel attention to bridge semantic gaps across modalities, improving MABSA performance but also requiring complex computations. Yang et al. [67] introduced an image-assisted module with multimodal prompt fusion to improve text–image feature integration, though the method required substantial training data.
To address unbalanced modality proportions in real-world data, Hu et al. [68] enhanced linguistic information while reducing non-linguistic redundancy, slightly lowering performance on non-linguistic features. Xie et al. [69] introduced uncertainty estimation and ordinal regression for dynamic modality quality weighting, improving prediction stability but at the cost of higher computational complexity. Wang et al. [70] incorporated fuzzy deep neural networks for multi-scale emotion uncertainty modeling but faced difficulties in real-time applications.
In Chinese social networks, Du Peipei [71] used topic matching and emoji-masking tasks with gating mechanisms to filter redundant information in Weibo sentiment analysis. Ni Ningning [72] addressed heterogeneous cross-media data using a graph-based cross-media fusion framework with a background-topic model, though performance depended on high-quality graph construction.

4.4.2. Product and Service Feedback

Enterprises can combine text, speech, and image feedback to better assess customer satisfaction. Xu et al. [73] designed a multi-interaction memory network exploiting cross-modal dependencies and built the Multi-ZOL dataset (5288 phone reviews from the ZOL forum). Xue et al. [74] applied co-attention fusion after filtering noise and capturing multi-granular feature correlations, though their model was complex and untested in dynamic scenarios.
In specialized industries, Huawei’s Pangu Model uses an encoder–decoder architecture integrating language and vision for predictive tasks in meteorology, medicine, and water resource management [75].

4.4.3. Public Safety and Crisis Management

Multimodal sentiment analysis can detect emergencies and shifts in collective emotions via video surveillance and social media. Xu Yang et al. [76] combined ensemble empirical mode decomposition (EEMD) with Transformer attention to analyze COVID-19 public opinion heat trends, though performance depended heavily on preprocessing. Liu et al. [8] improved PP-OCR text detection/recognition with a global–local attention mechanism for multimodal sentiment tasks, but the method was dataset- and language-specific.

4.5. Performance Comparison of Multimodal Sentiment Analysis Methods

Based on the MMSA [77] integrated framework, the following models were tested in both Chinese and English settings using the CMU-MOSI [63], CMU-MOSEI [64], CH-SIMS [78], and CH-SIMSv2 [79] datasets: LMF [80], MFN [81], MISA [82], EF-LSTM [83], LF-DNN [84], Self-MM [85], MMIM [86], MFM [87], Graph-MFN [64].
Table 2 outlines key information about four datasets used in multimodal sentiment computing, detailing their applications, modality types, data volume, language, year of creation, and the institutions responsible for their development.
  • CMU-MOSI: Developed by Carnegie Mellon University in 2018, this dataset contains 2199 video clips and focuses on sentiment computing and public opinion analysis. It includes text, visual, and audio modalities and is available in English.
  • CMU-MOSEI: Also from Carnegie Mellon University, this dataset was created in 2018 and comprises 23,500 video clips. It is used for sentiment computing, public opinion analysis, and human–computer interaction, incorporating text, visual, and audio data in English.
  • CH-SIMS: This dataset, developed by Tsinghua University in 2020, contains 2281 video clips. It is utilized for sentiment computing, user behavior analysis, and public opinion analysis, covering text, visual, and audio modalities in Chinese.
  • CH-SIMSv2: An extension of CH-SIMS, this dataset was also developed by Tsinghua University and released in 2022. It includes a larger volume of 14,563 video clips and is used for similar applications as CH-SIMS, focusing on text, visual, and audio data in Chinese.
Each dataset offers a rich resource for researchers and practitioners in the field of sentiment analysis, enabling the development and evaluation of models that can interpret and analyze multimodal data effectively.
Table 3 synthesizes specifications for nine multimodal architectures utilized in sentiment computing, capturing their fundamental principles, deployment contexts, and parameter magnitudes—metrics that determine both intricacy and data-driven adaptation potential.
  • The LMF [80] model can dynamically and selectively fuse information from language, visual, and audio modalities to capture the interaction relationships among different modalities, achieving multimodal emotion computation and intent recognition.
  • The MFN [81] independently models the interactions within each perspective and captures the cross-interactions between different perspectives, while storing and updating this interaction information through a multi-perspective gated memory module to achieve multi-modal and multi-perspective sequence learning.
  • MISA [82] decomposes each modality into modality-invariant and modality-specific features, fuses them, and predicts emotional states, reducing the modality gap while lowering model complexity.
  • EF-LSTM [83] uses recurrent neural networks and tensor operations to obtain semantic combination relationships at the phrase and sentence levels.
  • LF-DNN [84] is a multi-modal, multi-perspective sequence learning method based on early fusion of input-level multi-modal DNN features, using a BLSTM network to jointly process audio, video, and text features, achieving simultaneous prediction of six types of emotions and their intensities.
  • Self-MM [85] automatically generates single-modal labels to jointly train multi-modal and single-modal tasks, effectively capturing the consistency and differences between modalities, and achieving self-supervised multi-task learning without additional manual annotations.
  • MMIM [86] maximizes mutual information at the input and fusion levels to reduce the loss of task-related information, using both parametric and non-parametric methods to estimate the lower bound of maximizing mutual information, thereby improving the quality of multi-modal data fusion.
  • MFM [87] decomposes multi-modal representations into “cross-modal discriminative factors” and “modality-specific generative factors”, with the former used for task prediction and the latter for data reconstruction and missing modality completion, achieving joint optimization of generation and target discrimination.
  • Graph-MFN [64] uses a graph structure to dynamically control the weights of language, visual, and acoustic modalities in real time, explicitly modeling single-, dual-, and triple-modal interactions, achieving more efficient modality fusion.
The experiments were divided into two parts:
(1) those using a BERT model trained on English text as the text-modality encoder;
(2) those using a BERT model trained on Chinese text as the text-modality encoder.
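Runs of this kind can be launched programmatically through the MMSA toolkit; the sketch below is indicative only, since the `MMSA_run` entry point and its arguments are assumed from the toolkit's public documentation and may differ between versions:

```python
# Indicative sketch: the MMSA_run signature is assumed from the MMSA toolkit's
# documentation and may differ between toolkit versions.
from MMSA import MMSA_run

for model_name in ["self_mm", "misa", "lmf", "mfn"]:
    for dataset_name in ["mosi", "mosei", "sims"]:
        # Three runs per configuration, matching the fixed seeds used in this review.
        MMSA_run(model_name, dataset_name, seeds=[1111, 1112, 1113])
```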
Evaluation metrics included five-class classification accuracy (Acc-5), Mean Absolute Error (MAE), and the correlation coefficient. Each result was the average of three runs with random seeds (1111, 1112, 1113).
  • Five-Class Classification Accuracy (Acc-5): Continuous sentiment intensities are discretized into five sentiment classes, and the metric measures the proportion of samples whose predicted class matches the ground-truth class:
    $$\mathrm{Acc}_5 = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(\hat{c}_i = c_i\right),$$
    where $N$ is the total number of samples, $c_i$ and $\hat{c}_i$ are the true and predicted five-class sentiment labels of the $i$-th sample, and $\mathbb{I}$ is the indicator function, returning 1 when the two labels match.
  • Mean Absolute Error (MAE): This metric measures the average magnitude of errors between predicted and actual values without considering their direction. It is calculated using the following formula:
    $$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|,$$
    where $y_i$ is the true sentiment intensity for the $i$-th sample, and $\hat{y}_i$ is the predicted value.
  • Correlation Coefficient: This metric quantifies the strength and direction of the linear relationship between predicted and actual sentiment intensities. It is defined by the following formula:
    $$\mathrm{Corr} = \frac{\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}},$$
    where $\bar{y}$ and $\bar{\hat{y}}$ are the mean true and predicted sentiment intensities, respectively.
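All three metrics can be computed directly from the predicted and true sentiment intensities, as in the following sketch (NumPy only; the five-class bin edges are an assumption for illustration, since the discretization boundaries depend on each dataset's annotation scale):

```python
import numpy as np

def five_class(scores, bins=(-1.8, -0.6, 0.6, 1.8)):
    """Discretize continuous sentiment intensities into five classes.
    The bin edges are illustrative; real boundaries depend on the dataset's scale."""
    return np.digitize(scores, bins)

def evaluate(y_true, y_pred):
    acc5 = np.mean(five_class(y_true) == five_class(y_pred))  # five-class accuracy
    mae = np.mean(np.abs(y_true - y_pred))                    # mean absolute error
    corr = np.corrcoef(y_true, y_pred)[0, 1]                  # Pearson correlation
    return acc5, mae, corr

rng = np.random.default_rng(1111)
y_true = rng.uniform(-3, 3, size=200)               # hypothetical ground-truth intensities
y_pred = y_true + rng.normal(scale=0.8, size=200)   # hypothetical model predictions
print(evaluate(y_true, y_pred))
```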
As shown in Table 4 and Table 5, the performance comparison results for a subset of the above models tested in the English-language setting are as follows:
Based on the MOSI dataset (Table 4), the Self-MM model performs best in multimodal sentiment analysis tasks. Self-MM combines self-supervised unimodal label generation with a multimodal fusion strategy to effectively capture the correlations and complementary information between modalities. Its five-class accuracy is 51.5%, its mean absolute error is 72.62%, and its correlation coefficient is 79.62%. The next best model is MISA, which improves the understanding and fusion of multimodal data by learning modality-invariant and modality-specific representations; it achieves a five-class accuracy of 46.99%, a mean absolute error of 80.91%, and a correlation coefficient of 76.6%. MISA thus trails Self-MM on all three metrics, with a higher mean absolute error and a lower five-class accuracy and correlation coefficient.
Based on the MOSEI dataset (Table 5), the Self-MM model again achieves the highest overall performance. It employs a self-supervised learning approach and multimodal fusion to capture the relationships and complementary information across modalities, attaining a five-class accuracy of 55.41%, an MAE of 53.57%, and a correlation coefficient of 75.95%. It is followed closely by MISA, which enhances the comprehension and integration of multimodal data by learning modality-invariant and modality-specific features, achieving a five-class accuracy of 53.92%, an MAE of 54.79%, and a correlation coefficient of 76.04%. MISA's MAE is slightly higher and its five-class accuracy lower than those of Self-MM, although its correlation coefficient is marginally higher.
As shown in Table 6 and Table 7, the above-mentioned models were tested in the Chinese environment, and the performance comparison results are as follows:
Based on the CH-SIMS dataset (Table 6), the Self-MM model demonstrates remarkable performance in multimodal sentiment analysis tasks, achieving a five-class accuracy of 42.16%, a mean absolute error (MAE) of 41.47%, and a correlation coefficient of 59.28%. The second-best performing model is still MISA.
Based on the CH-SIMSv2 dataset (Table 7), both MFN and LF-DNN models exhibit excellent performance. MFN achieves a five-class accuracy of 54.52%, an MAE as low as 29.79%, and a correlation coefficient as high as 71.99%. LF-DNN attains a five-class accuracy of 53.35%, an MAE of 30.29%, and a correlation coefficient of 71.19%. MFN effectively integrates multimodal information through its modal fusion network and attention mechanisms, while LF-DNN leverages its deep architecture to achieve highly efficient feature extraction.
As shown in Table 8, models with fewer parameters (such as LMF, EF-LSTM, and LF-DNN) perform well on specific datasets. For example, LF-DNN achieves high classification accuracy on both the CH-SIMS and CH-SIMSv2 datasets, indicating that its lightweight structure has good adaptability and efficiency in Chinese contexts. Models with a medium number of parameters (such as MISA and MMIM) perform well on certain datasets; for instance, MISA achieved the second-best performance on the MOSI dataset, demonstrating advantages in handling complex multimodal data. Models with larger numbers of parameters (such as Self-MM and Graph-MFN) generally perform better on large-scale datasets; for example, Self-MM achieved optimal performance on the MOSEI dataset, showing that its complex structure is better equipped to handle large-scale, complex multimodal data.
As shown in Table 4, Table 5, Table 6 and Table 7, model performance exhibits a significant cross-lingual gap, with architectures that excel in Chinese contexts often underperforming in English environments. On the Chinese datasets, MFN and LF-DNN demonstrate superior adaptability: MFN achieves 54.52% accuracy on CH-SIMSv2 (Table 7) and LF-DNN reaches 64.62% on CH-SIMS (Table 6), indicating that attention-heavy fusion mechanisms effectively capture the nuanced interplay between text and culturally specific visual cues. In contrast, Self-MM leads on the English datasets, attaining 55.41% accuracy on MOSEI (Table 5) and 51.50% on MOSI (Table 4), revealing that self-supervised multimodal alignment excels when visual and textual modalities maintain direct semantic correspondence. This discrepancy stems from the semantic ambiguity of visual modalities in Chinese communication, particularly the polysemy of indigenous emojis and sticker derivatives whose meanings shift across subcultural contexts, whereas visual cues in English datasets align more clearly with textual sentiment. Future research should develop cross-lingual alignment frameworks that incorporate culture-aware visual disambiguation modules and meta-learning techniques to dynamically adapt fusion weights across linguistic landscapes.

5. Conclusions

This review does not introduce new models. Instead, it consolidates how multimodal sentiment analysis methods have been adapted for online public opinion monitoring, where data are noisy, modality availability is uneven, and evaluation settings vary across platforms and languages. By using fusion architectures as the main comparative axis and by contrasting representative English and Chinese benchmarks, this work summarizes both established findings and unresolved issues.
Multimodal sentiment analysis has gradually integrated technologies and theories from multiple disciplines. With the rise of large language models, large-scale datasets, and high-performance computing, new challenges have emerged:
(1) Collaborative Representation
Emotional information embedded in different modalities varies in nature, and in practical applications, the proportion of each modality can differ significantly. Effectively integrating multimodal data while eliminating inter-modal discrepancies is key to improving sentiment polarity recognition accuracy.
(2) Fine-Grained Sentiment Recognition
In psychology, there is no universally accepted definition of human emotions, and emotional expression varies widely across contexts [88]. Current multimodal sentiment analysis often focuses on broad categories such as joy, anger, sadness, and annoyance. There is an urgent need for more fine-grained sentiment analysis frameworks. While large language models such as GPT, LLaMA, Qwen, and DeepSeek have partially addressed this challenge in general domains, issues such as long-term dependency and “hallucinations” [6] mean that performance in domain-specific scenarios is still inadequate.
(3) Datasets
As shown in Table 9, most mainstream multimodal sentiment analysis datasets are non-Chinese in origin; Chinese datasets emerged later. In the era of short videos, the number of topics is exploding, but certain new forms of emotional expression in Chinese social media—such as novel gestures, sticker packs, and emoji derivatives—cannot yet be effectively recognized. Additionally, text often dominates in dataset composition.
This table catalogues essential specifications for twelve datasets employed in multimodal sentiment analysis investigations, specifying their use cases, modality configurations, publication years, and the research entities that compiled them.
  • VQA 2.0: Released by Virginia Tech and Georgia Institute of Technology in 2017, this dataset targets emotion classification, product recommendation, and visual question answering, integrating text and visual modalities.
  • Twitter 2017: Curated by Fudan University in 2018, this resource facilitates sentiment analysis, user behavior analysis, and cross-lingual sentiment analysis, comprising text and visual data.
  • CMU-MOSI: Produced by Carnegie Mellon University in 2018, this dataset serves sentiment analysis and public opinion monitoring, encompassing text, visual, and audio modalities.
  • CMU-MOSEI: Also compiled in 2018 by Carnegie Mellon University and the University of Rochester, this collection supports sentiment analysis, public opinion monitoring, and cross-modal representation learning, featuring text, visual, and audio inputs.
  • UR-FUNNY: Issued by Carnegie Mellon University in 2019, this dataset is designed for humor detection, multi-modal sentiment analysis, and human-computer interaction, combining text, visual, and acoustic information.
  • CH-SIMS: Developed by Tsinghua University in 2020, this resource addresses sentiment analysis, user behavior analysis, and Chinese public opinion monitoring, integrating textual, visual, and auditory channels.
  • MUGE: Published in 2022 by Alibaba DAMO Academy, Tsinghua University, and Alibaba Cloud TI Platform, this dataset enables emotion classification, image captioning, text-to-image retrieval, and image generation from textual descriptions, utilizing text and visual modalities.
  • Wukong: Released by Huawei Noah’s Ark Lab in 2022, this collection supports image-text retrieval, zero-shot image classification, and Chinese public opinion monitoring, incorporating text and visual data.
  • CH-SIMSV2: An expanded version from Tsinghua University published in 2022, this dataset continues to serve sentiment analysis, user behavior analysis, and Chinese public opinion monitoring, featuring text, visual, and audio modalities.
  • Touch100k: Introduced in 2024 by Beijing Jiaotong University, Beijing University of Posts and Telecommunications, and Tencent WeChat AI Team, this pioneering dataset focuses on haptic perception, imitation learning, and sentiment analysis, uniquely merging haptic and visual sensory data.
  • PanoSent: Developed by the National University of Singapore in 2024, this resource is applied to sentiment analysis, user behavior analysis, and public opinion monitoring, integrating text, visual, and audio modalities.
  • SEED-VII: Released in 2024 by Shanghai Jiao Tong University, this specialized dataset facilitates cross-modal analysis, sentiment analysis, brain-computer interface research, and psychological studies, employing EEG and eye-tracking modalities.
These datasets jointly constitute a diverse repository for the research community, facilitating the design and assessment of models spanning varied languages, cultural contexts, and sensory modalities.
(4) Uncertain Data Processing
Uncertainty characteristics embedded in real-world data manifest in heterogeneous forms, and in practical applications of multimodal sentiment analysis and public opinion monitoring, the interplay between randomness, fuzziness, and inconsistency can differ significantly across modalities. Effectively modeling uncertain data while eliminating the interference of noisy information is key to improving sentiment recognition robustness and accuracy [97]. Moreover, there is no universally accepted taxonomy for data uncertainty in opinion analysis, and its manifestation varies widely across contexts.
In the era of social media, the volume of multimodal data is exploding, but certain new forms of uncertainty in online public opinion—such as acquisition interference, transmission distortions, and storage inconsistencies—cannot yet be effectively recognized.

Author Contributions

All authors contributed to the study’s conception and design. S.L. and T.L.: conceptualization, investigation, writing, and revision. T.L.: writing, review, and editing. S.L.: supervision. S.L. and T.L.: investigation. The first draft of the manuscript was written by T.L., and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61762085.

Data Availability Statement

The datasets used and analyzed in this study are all publicly available.

Acknowledgments

We sincerely thank all the authors cited in this paper for their valuable contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lu, Q.; Sun, X.; Long, Y.; Gao, Z.; Feng, J.; Sun, T. Sentiment Analysis: Comprehensive Reviews, Recent Advances, and Open Challenges. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15092–15112. [Google Scholar] [CrossRef]
  2. Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal Sentiment Analysis: A Systematic Review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges and Future Directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
  3. Xu, P.; Zhu, X.; Clifton, D.A. Multimodal Learning with Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132. [Google Scholar] [CrossRef]
  4. He, J.; Zhang, C.; Li, X. Survey of Research on Multimodal Fusion Technology for Deep Learning. Comput. Eng. 2020, 46, 1–11. [Google Scholar] [CrossRef]
  5. Zhao, H.; Yang, M.; Bai, X.; Liu, H. A Survey on Multimodal Aspect-Based Sentiment Analysis. IEEE Access 2024, 12, 12039–12052. [Google Scholar] [CrossRef]
  6. Zhang, C.; Tong, X.; Tong, H.; Yang, Y. A Survey of Large Language Models in the Domain of Cybersecurity. Netinfo Secur. 2024, 24, 778. [Google Scholar] [CrossRef]
  7. Guo, X.; Wushour·Silamu, M.; Tuerhong, G. Survey of Sentiment Analysis Algorithms Based on Multimodal Fusion. Comput. Eng. Appl. 2024, 60, 1–18. [Google Scholar] [CrossRef]
  8. Liu, X.; Wei, F.; Jiang, W.; Zheng, Q.; Qiao, Y.; Liu, J.; Niu, L.; Chen, Z.; Dong, H. MTR-SAM: Visual Multimodal Text Recognition and Sentiment Analysis in Public Opinion Analysis on the Internet. Appl. Sci. 2023, 13, 7307. [Google Scholar] [CrossRef]
  9. Hu, M.; Liu, B. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’04, Seattle, WA, USA, 22–25 August 2004; pp. 168–177. [Google Scholar] [CrossRef]
  10. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 142–150. [Google Scholar]
  11. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  12. He, Y.; Sun, S.; Niu, F. A Deep Learning Model Enhanced with Emotion Semantics for Microblog Sentiment Analysis. Chin. J. Comput. 2017, 40, 773–790. [Google Scholar]
  13. Jin, D.; Ren, H.; Tang, R. Research on Offensive Language Detection in Social Networks Based on Emotion-Assisted Multi-Task Learning. Netinfo Secur. 2025, 25, 281. [Google Scholar] [CrossRef]
  14. Li, Y.; Dong, H. Text sentiment analysis based on feature fusion of convolution neural network and bidirectional long short-term memory network. J. Comput. Appl. 2018, 38, 3075. [Google Scholar] [CrossRef]
  15. Han, P.; Sun, J.; Fang, C. Micro-blog sentiment analysis based on emotional fusion and multi-dimensional self-attention mechanism. J. Comput. Appl. 2019, 39, 75–78. [Google Scholar]
  16. Tamura, H.; Mori, S.; Yamawaki, T. Textural Features Corresponding to Visual Perception. IEEE Trans. Syst. Man Cybern. 1978, 8, 460–473. [Google Scholar] [CrossRef]
  17. Colombo, C.; Del Bimbo, A.; Pala, P. Semantics in Visual Information Retrieval. IEEE MultiMed. 1999, 6, 38–53. [Google Scholar] [CrossRef]
  18. Machajdik, J.; Hanbury, A. Affective Image Classification Using Features Inspired by Psychology and Art Theory. In Proceedings of the 18th ACM International Conference on Multimedia, MM’10, Firenze, Italy, 25–29 October 2010; pp. 83–92. [Google Scholar] [CrossRef]
  19. Borth, D.; Ji, R.; Chen, T.; Breuel, T.; Chang, S.F. Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. In Proceedings of the 21st ACM International Conference on Multimedia, MM’13, Barcelona, Spain, 21–25 October 2013; pp. 223–232. [Google Scholar] [CrossRef]
  20. Yang, J.; Sun, M.; Sun, X. Learning Visual Sentiment Distributions via Augmented Conditional Probability Neural Network. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2017; Volume 31. [Google Scholar] [CrossRef]
  21. Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  22. Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting Microblog Sentiments via Weakly Supervised Multimodal Deep Learning. IEEE Trans. Multimed. 2018, 20, 997–1007. [Google Scholar] [CrossRef]
  23. He, Y.; Ding, G. Deep Transfer Learning for Image Emotion Analysis: Reducing Marginal and Joint Distribution Discrepancies Together. Neural Process. Lett. 2020, 51, 2077–2086. [Google Scholar] [CrossRef]
  24. Zhao, Z.; Liu, Q. Former-DFER: Dynamic Facial Expression Recognition Transformer. In Proceedings of the 29th ACM International Conference on Multimedia, MM’21, Virtual Event, 20–24 October 2021; pp. 1553–1561. [Google Scholar] [CrossRef]
  25. Lin, Y.; Wei, G. Speech Emotion Recognition Based on HMM and SVM. In Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, China, 18–21 August 2005; Volume 8, pp. 4898–4901. [Google Scholar] [CrossRef]
  26. Wu, C.H.; Liang, W.B. Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels. IEEE Trans. Affect. Comput. 2011, 2, 10–21. [Google Scholar] [CrossRef]
  27. Sundberg, J.; Patel, S.; Bjorkner, E.; Scherer, K.R. Interdependencies among Voice Source Parameters in Emotional Speech. IEEE Trans. Affect. Comput. 2011, 2, 162–174. [Google Scholar] [CrossRef]
  28. Jin, Q.; Li, C.; Chen, S.; Wu, H. Speech Emotion Recognition with Acoustic and Lexical Features. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4749–4753. [Google Scholar] [CrossRef]
  29. Mencattini, A.; Martinelli, E.; Ringeval, F.; Schuller, B.; Natale, C.D. Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models. IEEE Trans. Affect. Comput. 2017, 8, 314–327. [Google Scholar] [CrossRef]
  30. Eskimez, S.E.; Duan, Z.; Heinzelman, W. Unsupervised Learning Approach to Feature Analysis for Automatic Speech Emotion Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5099–5103. [Google Scholar] [CrossRef]
  31. Pourebrahim, Y.; Razzazi, F.; Sameti, H. Semi-Supervised Parallel Shared Encoders for Speech Emotion Recognition. Digit. Signal Process. 2021, 118, 103205. [Google Scholar] [CrossRef]
  32. Zeng, Z.; Pantic, M.; Roisman, G.I.; Huang, T.S. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 39–58. [Google Scholar] [CrossRef]
  33. Ren, Z.; Wang, Z.; Ke, Z.; Li, Z.; Wushour, S. Survey of Multimodal Data Fusion. Comput. Eng. Appl. 2021, 57, 49. [Google Scholar] [CrossRef]
  34. Bayoudh, K. A Survey of Multimodal Hybrid Deep Learning for Computer Vision: Architectures, Applications, Trends, and Challenges. Inf. Fusion 2024, 105, 102217. [Google Scholar] [CrossRef]
  35. Lueangwitchajaroen, P.; Watcharapinchai, S.; Tepsan, W.; Sooksatra, S. Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7. J. Imaging 2024, 10, 320. [Google Scholar] [CrossRef]
  36. Zhang, X.; Sun, F.; Feng, L. Multi-View Representations for Fake News Detection. Netinfo Secur. 2024, 24, 438. [Google Scholar] [CrossRef]
  37. Zhao, X.; Xie, Y.; Wan, Y. Detection and Identification Model of Gambling Websites Based on Multi-Modal Data. Netinfo Secur. 2023, 23, 77. [Google Scholar] [CrossRef]
  38. Zheng, A.; He, J.; Wang, M.; Li, C.; Luo, B. Category-Wise Fusion and Enhancement Learning for Multimodal Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4416212. [Google Scholar] [CrossRef]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  40. Shvetsova, N.; Chen, B.; Rouditchenko, A.; Thomas, S.; Kingsbury, B.; Feris, R.; Harwath, D.; Glass, J.; Kuehne, H. Everything at Once—Multi-modal Fusion Transformer for Video Retrieval. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 19988–19997. [Google Scholar] [CrossRef]
  41. Xu, H.; Yan, M.; Li, C.; Bi, B.; Huang, S.; Xiao, W.; Huang, F. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 503–513. [Google Scholar] [CrossRef]
  42. Girdhar, R.; Singh, M.; Ravi, N.; van der Maaten, L.; Joulin, A.; Misra, I. Omnivore: A Single Model for Many Visual Modalities. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16081–16091. [Google Scholar] [CrossRef]
  43. Tschannen, M.; Mustafa, B.; Houlsby, N. CLIPPO: Image-and-Language Understanding from Pixels Only. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 11006–11017. [Google Scholar] [CrossRef]
  44. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  45. Huan, R.; Zhong, G.; Chen, P.; Liang, R. UniMF: A Unified Multimodal Framework for Multimodal Sentiment Analysis in Missing Modalities and Unaligned Multimodal Sequences. IEEE Trans. Multimed. 2024, 26, 5753–5768. [Google Scholar] [CrossRef]
  46. Yi, G.; Fan, C.; Tao, J.; Lv, Z.; Wen, Z.; Pei, G.; Li, T. A Two-Stage Stacked Transformer Framework for Multimodal Sentiment Analysis. Intell. Comput. 2024, 3, 0081. [Google Scholar] [CrossRef]
  47. Peng, C.; Zhang, C.; Xue, X.; Gao, J.; Liang, H.; Niu, Z. Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification. Tsinghua Sci. Technol. 2022, 27, 664–679. [Google Scholar] [CrossRef]
  48. Zhang, T.; Zhou, G. Text-Image Gated Fusion Mechanism for Multimodal Aspect-based Sentiment Analysis. Comput. Sci. 2024, 51, 242–249. [Google Scholar] [CrossRef]
  49. Wang, S.; Cai, G.; Guangrui, L. Aspect-level multimodal co-attention graph convolutional sentiment analysis model. J. Image Graph. 2023, 28, 3838–3854. [Google Scholar] [CrossRef]
  50. Zhang, L.; Wang, K.; Zichao, P. Target-Oriented Interaction Graph Neural Networks for Multimodal Aspect-Level Sentiment Analysis. Comput. Eng. Appl. 2024, 60, 136. [Google Scholar] [CrossRef]
  51. Li, J.; Liu, R.; Miao, Q.; Wang, D.; Liu, X. CAETFN: Context Adaptively Enhanced Text-Guided Fusion Network for Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2025, 16, 3122–3138. [Google Scholar] [CrossRef]
  52. Lin, Z.; Long, Y.; Jiachen, D. A Multimodal Sentiment Recognition Method Based on Multitask Learning. Acta Sci. Nat. Univ. Pekin. 2021, 57, 7. [Google Scholar] [CrossRef]
  53. Fan, R.; He, T.; Chen, M.; Zhang, M.; Tu, X.; Dong, M. Dual Causes Generation Assisted Model for Multimodal Aspect-Based Sentiment Classification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 9298–9312. [Google Scholar] [CrossRef]
  54. Tao, J.; Fan, C.; Lian, Z.; Lv, Z.; Ying, S.; Shan, L. Development of multimodal sentiment recognition and understanding. J. Image Graph. 2024, 29, 1607–1627. [Google Scholar] [CrossRef]
  55. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  56. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  57. Pang, N.; Wu, W.; Hu, Y.; Xu, K.; Yin, Q.; Qin, L. Enhancing Multimodal Sentiment Analysis via Learning from Large Language Model. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  58. Gu, Q.; Wang, X. Network Public Opinion Analysis: Theory, Technology and Application; Tsinghua University Press: Beijing, China, 2020. [Google Scholar]
  59. An, L.; Wu, L. An Integrated Analysis of Topical and Emotional Evolution of Microblog Public Opinions on Public Emergencies. Library Inf. Serv. 2017, 61, 120–129. [Google Scholar] [CrossRef]
  60. Wang, Z.; Guo, Y.; Fu, J. CLIP-PubOp: A CLIP-based Multimodal Representation Fusion Method for Public Opinion. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2243–2246. [Google Scholar] [CrossRef]
  61. Chen, J. Research on Sentiment Analysis of Netizens Based on Fusion of Multi-modal Hierarchical Features. Master’s Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2022. [Google Scholar]
  62. Yang, X.; Feng, S.; Zhang, Y.; Wang, D. Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 328–339. [Google Scholar] [CrossRef]
  63. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages. IEEE Intell. Syst. 2016, 31, 82–88. [Google Scholar] [CrossRef]
  64. Bagher Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2236–2246. [Google Scholar] [CrossRef]
  65. Zhou, J.; Zhao, J.; Huang, J.X.; Hu, Q.V.; He, L. MASAD: A Large-Scale Dataset for Multimodal Aspect-Based Sentiment Analysis. Neurocomputing 2021, 455, 47–58. [Google Scholar] [CrossRef]
  66. Xiang, Y.; Cai, Y.; Guo, J. MSFNet: Modality Smoothing Fusion Network for Multimodal Aspect-Based Sentiment Analysis. Front. Phys. 2023, 11, 1187503. [Google Scholar] [CrossRef]
  67. Yang, D.; Li, X.; Li, Z.; Zhou, C.; Wang, X.; Chen, F. Prompt Fusion Interaction Transformer For Aspect-Based Multimodal Sentiment Analysis. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  68. Hu, R.; Yi, J.; Chen, A.; Chen, L. Multichannel Cross-Modal Fusion Network for Multimodal Sentiment Analysis Considering Language Information Enhancement. IEEE Trans. Ind. Inform. 2024, 20, 9814–9824. [Google Scholar] [CrossRef]
  69. Xie, Z.; Yang, Y.; Wang, J.; Liu, X.; Li, X. Trustworthy Multimodal Fusion for Sentiment Analysis in Ordinal Sentiment Space. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7657–7670. [Google Scholar] [CrossRef]
  70. Wang, X.; Lyu, J.; Kim, B.G.; Parameshachari, B.D.; Li, K.; Li, Q. Exploring Multimodal Multiscale Features for Sentiment Analysis Using Fuzzy-Deep Neural Network Learning. IEEE Trans. Fuzzy Syst. 2025, 33, 28–42. [Google Scholar] [CrossRef]
  71. Du, P. Research and Application of Multimodal Sentiment Analysis Methods in Chinese. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2023. [Google Scholar]
  72. Ni, N. Research on Cross-Media Topic Detection and Opinion Analysis. Master’s Thesis, Beijing University of Posts and Telecommunications, Beijing, China, 2019. [Google Scholar]
  73. Xu, N.; Mao, W. Multi-Interactive Memory Network for Aspect Based Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2019; Volume 33, pp. 371–378. [Google Scholar] [CrossRef]
  74. Xue, X.; Zhang, C.; Niu, Z.; Wu, X. Multi-Level Attention Map Network for Multimodal Sentiment Analysis. IEEE Trans. Knowl. Data Eng. 2023, 35, 5105–5118. [Google Scholar] [CrossRef]
  75. Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks. Nature 2023, 619, 533–538. [Google Scholar] [CrossRef]
  76. Xu, Y.; Zhu, L.; Huang, B.; Ma, L.; Zhu, L. Public Opinion Analysis Based on EEMD-Transformer Model: Taking COVID-19 Public Opinion as an Example. J. Wuhan Univ. (Nat. Sci. Ed.) 2020, 66, 418–424. [Google Scholar] [CrossRef]
  77. Mao, H.; Yuan, Z.; Xu, H.; Yu, W.; Liu, Y.; Gao, K. M-SENA: An Integrated Platform for Multimodal Sentiment Analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 204–213. [Google Scholar] [CrossRef]
  78. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 3718–3727. [Google Scholar] [CrossRef]
  79. Liu, Y.; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y.; Cheng, T.; Li, X.; Xu, H.; Gao, K. Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module. In Proceedings of the 2022 International Conference on Multimodal Interaction, ICMI’22, Bengaluru, India, 7–11 November 2022; pp. 247–258. [Google Scholar] [CrossRef]
  80. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1103–1114. [Google Scholar] [CrossRef]
  81. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory Fusion Network for Multi-View Sequential Learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 5634–5641. [Google Scholar] [CrossRef]
  82. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In MM ’20: Proceedings of the 28th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 1122–1131. [Google Scholar] [CrossRef]
  83. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1631–1642. [Google Scholar]
  84. Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia, 20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 11–19. [Google Scholar] [CrossRef]
  85. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 10790–10797. [Google Scholar] [CrossRef]
  86. Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 9180–9192. [Google Scholar] [CrossRef]
  87. Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning Factorized Multimodal Representations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  88. Coppini, S.; Lucifora, C.; Vicario, C.M.; Gangemi, A. Experiments on Real-Life Emotions Challenge Ekman’s Model. Sci. Rep. 2023, 13, 9511. [Google Scholar] [CrossRef]
  89. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6325–6334. [Google Scholar] [CrossRef]
  90. Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive Co-Attention Network for Named Entity Recognition in Tweets. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2018; Volume 32, pp. 5674–5681. [Google Scholar] [CrossRef]
  91. Hasan, M.K.; Rahman, W.; Bagher Zadeh, A.; Zhong, J.; Tanveer, M.I.; Morency, L.P.; Hoque, M.E. UR-FUNNY: A Multimodal Language Dataset for Understanding Humor. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2046–2056. [Google Scholar] [CrossRef]
  92. Lin, J.; Men, R.; Yang, A.; Zhou, C.; Ding, M.; Zhang, Y.; Wang, P.; Wang, A.; Jiang, L.; Jia, X.; et al. M6: A Chinese Multimodal Pretrainer. arXiv 2021, arXiv:2103.00823. [Google Scholar] [CrossRef]
  93. Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 26418–26431. [Google Scholar]
  94. Cheng, N.; Guan, C.; Gao, J.; Wang, W.; Li, Y.; Meng, F.; Zhou, J.; Fang, B.; Xu, J.; Han, W. Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation. arXiv 2024, arXiv:2406.03813. [Google Scholar] [CrossRef]
  95. Luo, M.; Fei, H.; Li, B.; Wu, S.; Liu, Q.; Poria, S.; Cambria, E.; Lee, M.L.; Hsu, W. PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, MM’2024, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7667–7676. [Google Scholar] [CrossRef]
  96. Jiang, W.; Liu, X.; Zheng, W.; Lu, B. SEED-VII: A Multimodal Dataset of Six Basic Emotions with Continuous Labels for Emotion Recognition. IEEE Trans. Affect. Comput. 2025, 16, 969–985. [Google Scholar] [CrossRef]
  97. Wan, J.; Li, X.; Zhao, J.; Li, M.; Deng, Z.; Chen, H. Joint Uncertainty Model and Metric for Robust Feature Selection: A Bi-Level Distribution Consideration and Feature Evaluation Approach. Fuzzy Sets Syst. 2026, 523, 109615. [Google Scholar] [CrossRef]
Figure 1. Development of text sentiment analysis process based on single-modal sentiment analysis.
Figure 2. The process of single-modal sentiment analysis.
Figure 3. A schematic diagram of early fusion.
Figure 4. A schematic diagram of late fusion.
Figure 5. A schematic diagram of the hybrid fusion strategy.
Figure 6. A schematic diagram of the tensor fusion strategy.
Figure 7. A schematic diagram of a model-level fusion strategy.
Figure 8. A schematic diagram of the Transformer-based fusion strategy.
Figure 9. A schematic diagram of the hierarchical fusion strategy.
Figure 10. The process of online public opinion dissemination.
Figure 11. The network public opinion monitoring process based on machine learning.
Table 1. Comparison of multimodal fusion methods.
Early Fusion. Advantages: preserves fine-grained cross-modal interactions; enables end-to-end joint optimization; learns low-level feature correlations. Disadvantages: high computational complexity; strict temporal/spatial alignment required; vulnerable to noise and missing modalities.
Late Fusion. Advantages: modular design with independent training; high computational efficiency; robust to misalignment. Disadvantages: cannot capture modal interactions; loses cross-modal complementarity; struggles to optimize ensemble weights.
Hybrid Fusion. Advantages: balances expression and efficiency; captures mid-level interactions; tolerant to partially missing modalities. Disadvantages: requires careful fusion layer design; increases model complexity; fusion timing relies on heuristics.
Tensor Fusion. Advantages: models high-order interactions; preserves complete correlations; strong theoretical capacity. Disadvantages: suffers from dimensionality explosion; requires large datasets; poor interpretability.
Model-level Fusion. Advantages: deep integration, parameter-efficient; enables cross-modal sharing; facilitates transfer learning. Disadvantages: complex architecture design; high coupling reduces flexibility; difficult training convergence.
Transformer-based Fusion. Advantages: attention learns adaptive weights; captures long-range dependencies; highly scalable and generalizable. Disadvantages: quadratic computational complexity; requires large-scale pretraining; limited interpretability.
Hierarchical Fusion. Advantages: multi-scale interaction capture; combines complementary advantages; strong robustness and adaptability. Disadvantages: complex structure that is hard to train; high computation and memory cost; tedious hyperparameter tuning.
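To make the contrast between the first two rows of Table 1 concrete, the minimal PyTorch sketch below pairs an early-fusion classifier (modality features concatenated before a shared head) with a late-fusion ensemble (independent unimodal heads averaged at the decision level). The feature dimensions, layer sizes, and three-class output are illustrative assumptions rather than implementations drawn from the surveyed literature.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features before a shared classifier (early fusion)."""
    def __init__(self, text_dim, image_dim, audio_dim, num_classes=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, text, image, audio):
        return self.classifier(torch.cat([text, image, audio], dim=-1))

class LateFusion(nn.Module):
    """Independent per-modality classifiers whose predictions are averaged (late fusion)."""
    def __init__(self, text_dim, image_dim, audio_dim, num_classes=3):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Linear(text_dim, num_classes),
            nn.Linear(image_dim, num_classes),
            nn.Linear(audio_dim, num_classes),
        ])

    def forward(self, text, image, audio):
        logits = [head(x) for head, x in zip(self.heads, (text, image, audio))]
        return torch.stack(logits, dim=0).mean(dim=0)

# Toy usage with assumed feature dimensions
t, v, a = torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 74)
print(EarlyFusion(768, 512, 74)(t, v, a).shape)  # torch.Size([2, 3])
print(LateFusion(768, 512, 74)(t, v, a).shape)   # torch.Size([2, 3])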
Table 2. Information of multimodal sentiment computing datasets.
1. CMU-MOSI [63]. Applications: sentiment computing, public opinion analysis. Modalities: text, visual, audio. Data volume: 2199 video clips. Language: English. Year: 2018. Institution: Carnegie Mellon University.
2. CMU-MOSEI [64]. Applications: sentiment computing, public opinion analysis, human-computer interaction. Modalities: text, visual, audio. Data volume: 23,500 video clips. Language: English. Year: 2018. Institution: Carnegie Mellon University.
3. CH-SIMS [78]. Applications: sentiment computing, user behavior analysis, public opinion analysis. Modalities: text, visual, audio. Data volume: 2281 video clips. Language: Chinese. Year: 2020. Institution: Tsinghua University.
4. CH-SIMSv2 [79]. Applications: sentiment computing, user behavior analysis, public opinion analysis. Modalities: text, visual, audio. Data volume: 14,563 video clips. Language: Chinese. Year: 2022. Institution: Tsinghua University.
Table 3. Summary of mainstream multi-modal model information.
LMF [80]. Core idea: dynamic fusion of modalities to capture inter-modal interactions. Applicable scenarios: multimodal sentiment computing, intent recognition. Trainable parameters: ≈0.5 M.
MFN [81]. Core idea: multi-perspective sequence learning to fully utilize cross-perspective interaction information. Applicable scenarios: multi-perspective video analysis, dialogue sentiment recognition. Trainable parameters: ≈2.2 M.
MISA [82]. Core idea: decompose modalities into invariant and specific features to reduce modal differences. Applicable scenarios: cross-modal sentiment transfer, low-resource scenarios. Trainable parameters: ≈104 M.
EF-LSTM [83]. Core idea: use early fusion and model phrase/sentence-level semantic composition. Applicable scenarios: text-speech sentiment computing, real-time interaction systems. Trainable parameters: ≈0.89 M.
LF-DNN [84]. Core idea: joint prediction based on BLSTM-based late fusion. Applicable scenarios: multimodal emotion recognition, human-computer interaction. Trainable parameters: ≈0.6 M.
Self-MM [85]. Core idea: self-supervised generation of single-modal labels and joint training. Applicable scenarios: label-scarce scenarios, cross-modal alignment. Trainable parameters: ≈103 M.
MMIM [86]. Core idea: maximize mutual information between input and fusion layer. Applicable scenarios: noisy environments, information-missing scenarios. Trainable parameters: ≈103 M.
MFM [87]. Core idea: decompose and represent cross-modal discriminative factors and modality-specific generative factors. Applicable scenarios: modality-missing scenarios, data completion. Trainable parameters: ≈1.41 M.
Graph-MFN [64]. Core idea: use graph structure to dynamically control modality weights and explicitly model modal interactions. Applicable scenarios: complex multimodal dialogue, sentiment computing. Trainable parameters: ≈2.11 M.
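The tensor fusion entry in Table 1 (and the schematic in Figure 6) builds a joint representation from outer products of constant-augmented unimodal vectors, as in the Tensor Fusion Network [80]. The sketch below is an illustrative reconstruction of that idea under assumed feature dimensions; it is not the original authors' code.

import torch

def tensor_fusion(text, image, audio):
    """Outer-product fusion in the spirit of [80] (a sketch, not the original implementation).

    Each unimodal vector is augmented with a constant 1 so that unimodal,
    bimodal, and trimodal interaction terms all appear in the fused tensor.
    """
    augment = lambda x: torch.cat([x, torch.ones(x.size(0), 1)], dim=-1)
    t, v, a = augment(text), augment(image), augment(audio)   # (B, d+1) each
    tv = torch.einsum('bi,bj->bij', t, v)                     # text-visual interactions
    tva = torch.einsum('bij,bk->bijk', tv, a)                 # add audio interactions
    return tva.flatten(start_dim=1)                           # flattened fused feature

# Toy usage: the flattened dimension grows multiplicatively (33 * 17 * 9 = 5049 here)
fused = tensor_fusion(torch.randn(2, 32), torch.randn(2, 16), torch.randn(2, 8))
print(fused.shape)  # torch.Size([2, 5049])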
Table 4. Comparison of performance of multimodal sentiment analysis methods in English environments (based on the MOSI dataset). The best-performing results are highlighted in bold. Setting: Bert_en + MOSI.
Model Name       Mult_acc_5 (%)   MAE (%)   Corr (%)
LMF [80]         39.65            96.81     65.02
MFN [81]         39.21            96.69     66.14
MISA [82]        46.99            80.91     76.60
EF-LSTM [83]     30.66            113.38
LF-DNN [84]      38.39            96.13     65.53
Self-MM [85]     51.50            72.62     79.62
MMIM [86]        51.26            74.89     77.68
MFM [87]         39.94            93.29     65.53
Graph-MFN [64]   41.45            93.50     65.89
Table 5. Comparison of performance of multimodal sentiment analysis methods in English environments (based on the MOSEI dataset). The best-performing results are highlighted in bold. Setting: Bert_en + MOSEI.
Model Name       Mult_acc_5 (%)   MAE (%)   Corr (%)
LMF [80]         53.55            56.51     73.25
MFN [81]         52.46            57.32     71.56
MISA [82]        53.92            54.79     76.04
EF-LSTM [83]     51.34            59.49     68.94
LF-DNN [84]      53.65            55.88     73.32
Self-MM [85]     55.41            53.57     75.95
MMIM [86]        51.07            58.49     71.38
Graph-MFN [64]   53.18            56.74     72.60
Table 6. Comparison of performance of multimodal sentiment analysis methods in Chinese environments (based on the CH-SIMS dataset). The best-performing results are highlighted in bold. Setting: Bert_cn + CH-SIMS.
Model Name       Mult_acc_5 (%)   MAE (%)   Corr (%)
LMF [80]         36.69            44.57     56.98
MFN [81]         38.73            44.62     56.12
MISA [82]        37.49            44.16     57.14
EF-LSTM [83]     36.40            44.94     59.20
LF-DNN [84]      64.62            45.25     54.58
Self-MM [85]     42.16            41.47     59.28
Table 7. Comparison of performance of multimodal sentiment analysis methods in Chinese environments (based on the CH-SIMSv2 dataset). The best-performing results are highlighted in bold. Setting: Bert_cn + CH-SIMSv2.
Model Name       Mult_acc_5 (%)   MAE (%)   Corr (%)
LMF [80]         48.87            35.66     58.32
MFN [81]         54.52            29.79     71.99
MISA [82]        41.52            38.70     55.33
EF-LSTM [83]     51.22            31.57     69.42
LF-DNN [84]      53.35            30.29     71.19
Self-MM [85]     52.35            31.63     70.76
Graph-MFN [64]   43.84            40.38     52.54
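Tables 4–7 report five-class accuracy (Mult_acc_5), mean absolute error (MAE), and Pearson correlation (Corr) computed on continuous sentiment scores. The NumPy sketch below shows one plausible way to compute these metrics from predictions in the MOSI/MOSEI range of [−3, 3]; the exact binning convention for Mult_acc_5 is an assumption and may differ from the evaluation scripts used for these tables.

import numpy as np

def mosi_style_metrics(y_true, y_pred):
    """Metrics of the kind reported in Tables 4-7 (conventions assumed, not taken
    from the original evaluation scripts).

    y_true, y_pred: continuous sentiment scores in [-3, 3] (MOSI/MOSEI convention).
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))        # mean absolute error
    corr = np.corrcoef(y_true, y_pred)[0, 1]      # Pearson correlation coefficient
    # Five-class accuracy: round scores and clip to the five bins in [-2, 2].
    bins_true = np.clip(np.round(y_true), -2, 2)
    bins_pred = np.clip(np.round(y_pred), -2, 2)
    acc5 = np.mean(bins_true == bins_pred)
    return {"Mult_acc_5": acc5, "MAE": mae, "Corr": corr}

# Toy usage on three hypothetical samples
print(mosi_style_metrics([1.8, -0.4, 2.6], [1.2, -0.9, 2.1]))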
Table 8. Comparison of trainable parameter quantities for each model.
Model Name       Trainable Parameters
LMF [80]         ≈0.5 M
MFN [81]         ≈2.2 M
MISA [82]        ≈104 M
EF-LSTM [83]     ≈0.89 M
LF-DNN [84]      ≈0.6 M
Self-MM [85]     ≈103 M
MMIM [86]        ≈103 M
MFM [87]         ≈1.41 M
Graph-MFN [64]   ≈2.11 M
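The counts in Table 8 refer to trainable parameters. For any PyTorch model, a comparable figure can be obtained by summing the element counts of tensors that require gradients, as in the short sketch below; the toy network is purely illustrative.

import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Count parameters that require gradients, as in the approximate figures of Table 8."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy example: a small two-layer network (about 0.10 M trainable parameters)
toy = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 3))
print(f"{count_trainable_parameters(toy) / 1e6:.2f} M")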
Table 9. Multimodal datasets for sentiment analysis.
1. VQA 2.0 [89]. Applications: emotion classification, product recommendation, visual question answering. Modalities: text, visual. Year: 2017. Institution: Virginia Tech, Georgia Institute of Technology.
2. Twitter 2017 [90]. Applications: sentiment analysis, user behavior analysis, cross-lingual sentiment analysis. Modalities: text, visual. Year: 2018. Institution: Fudan University.
3. CMU-MOSI [63]. Applications: sentiment analysis, public opinion monitoring. Modalities: text, visual, audio. Year: 2018. Institution: Carnegie Mellon University.
4. CMU-MOSEI [64]. Applications: sentiment analysis, public opinion monitoring, cross-modal representation learning. Modalities: text, visual, audio. Year: 2018. Institution: Carnegie Mellon University, University of Rochester.
5. UR-FUNNY [91]. Applications: humor detection, multi-modal sentiment analysis, human-computer interaction. Modalities: text, visual, audio. Year: 2019. Institution: Carnegie Mellon University.
6. CH-SIMS [78]. Applications: sentiment analysis, user behavior analysis, Chinese public opinion monitoring. Modalities: text, visual, audio. Year: 2020. Institution: Tsinghua University.
7. MUGE [92]. Applications: emotion classification, image captioning, text-to-image retrieval, text-based image generation. Modalities: text, visual. Year: 2022. Institution: Alibaba DAMO Academy, Tsinghua University, Alibaba Cloud TI Platform.
8. Wukong [93]. Applications: image-text retrieval, zero-shot image classification, Chinese public opinion monitoring. Modalities: text, visual. Year: 2022. Institution: Huawei Noah’s Ark Lab.
9. CH-SIMSv2 [79]. Applications: sentiment analysis, user behavior analysis, Chinese public opinion monitoring. Modalities: text, visual, audio. Year: 2022. Institution: Tsinghua University.
10. Touch100k [94]. Applications: haptic perception, imitation learning, sentiment analysis. Modalities: haptic, visual. Year: 2024. Institution: Beijing Jiaotong University, Beijing University of Posts and Telecommunications, Tencent WeChat AI Team.
11. PanoSent [95]. Applications: sentiment analysis, user behavior analysis, public opinion monitoring. Modalities: text, visual, audio. Year: 2024. Institution: National University of Singapore.
12. SEED-VII [96]. Applications: cross-modal analysis, sentiment analysis, brain-computer interface, psychology research. Modalities: EEG, eye-tracking. Year: 2024. Institution: Shanghai Jiao Tong University.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
