Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis

Zhang, Zhenwei; Wu, Wenyan; Yuan, Tao; Feng, Guang

doi:10.3390/app151910825

Open AccessArticle

Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis

¹

School of Education, Guangzhou University, Guangzhou 510006, China

²

School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China

³

School of Automation, Guangdong University of Technology, Guangzhou 510006, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(19), 10825; https://doi.org/10.3390/app151910825

Submission received: 25 August 2025 / Revised: 27 September 2025 / Accepted: 4 October 2025 / Published: 9 October 2025

Download

Browse Figures

Versions Notes

Abstract

Multimodal sentiment analysis aims to utilize multisource information such as text, speech and vision to more comprehensively and accurately identify an individual’s emotional state. However, existing methods still face challenges in practical applications, including modality heterogeneity, insufficient expressive power of non-verbal modalities, and low fusion efficiency. To address these issues, this paper proposes a Modality Enhanced Multimodal Integration Model (MEMMI). First, a modality enhancement module is designed to leverage the semantic guidance capability of the text modality, enhancing the feature representation of non-verbal modalities through a multihead attention mechanism and a dynamic routing strategy. Second, a gated fusion mechanism is introduced to selectively inject speech and visual information into the dominant text modality, enabling robust information completion and noise suppression. Finally, a combined attention fusion module is constructed to synchronously fuse information from all three modalities within a unified architecture, hile a multiscale encoder is used to capture feature representations at different semantic levels. Experimental results on three benchmark datasets—CMU-MOSEI, CMU-MOSI, and CH-SIMS—demonstrate the superiority of the proposed model. On CMU-MOSI, it achieves an Acc-7 of 45.91, with binary accuracy/F1 of 82.86/84.60, MAE of 0.734, and Corr of 0.790, outperforming TFN and MulT by a large margin. On CMU-MOSEI, the model reaches an Acc-7 of 54.17, Acc-2/F1 of 83.69/86.02, MAE of 0.526, and Corr of 0.779, surpassing all baselines, including ALMT. On CH-SIMS, it further achieves 41.88, 66.52, and 77.68 in Acc-5/Acc-3/Acc-2, with F1 of 77.85, MAE of 0.450, and Corr of 0.594, establishing new state-of-the-art performance across datasets. These results confirm that MEMMI achieves state-of-the-art performance across multiple metrics. Furthermore, ablation studies validate the effectiveness of each module in enhancing modality representation and fusion efficiency.

Keywords:

modality enhancement; multimodal sentiment recognition; attention mechanism; gated fusion; multiscale feature encoding

1. Introduction

With the rapid development of artificial intelligence and big data technologies, affective computing has gradually become an important research direction in fields such as human-computer interaction, education, healthcare, and public opinion analysis. Compared to sentiment recognition that relies solely on a single data source, Multimodal Sentiment Analysis (MSA) can simultaneously utilize multisource information from text, speech, and vision to capture more comprehensive and fine-grained emotional features, thereby significantly improving recognition accuracy and robustness. However, inherent differences exist between different modalities, such as heterogeneity in expression, information redundancy and conflict, and common issues like missing modalities and noise interference in real-world applications. These problems severely restrict the performance of multimodal sentiment analysis models.

To alleviate these issues, scholars have recently proposed various cross-modal alignment and fusion methods, including attention-based modality interaction, graph network-based feature modeling, and Transformer-based multimodal pre-training models. These methods have, to some extent, improved the complementarity and fusion effectiveness among modalities. Nevertheless, current research still faces two key challenges: (1) The semantic expressive power of non-verbal modalities (such as speech and vision) is limited, often making the fusion process heavily dependent on the text modality, but lacking an effective enhancement mechanism. (2) Traditional modality fusion methods, which often use simple weighting or concatenation, struggle to suppress noisy information and selectively complete key information, ultimately affecting overall sentiment recognition performance.

To address these challenges, this paper proposes a Modality-Enhanced Multimodal Integration Model (MEMMI). The novelty of MEMMI lies in its ability to achieve modality enhancement and cross-modal fusion within a unified architecture, which directly advances existing multimodal sentiment analysis approaches in the following three aspects:

First, a modality enhancement module is designed that leverages the semantic guidance capability of the text modality through a multihead attention mechanism and a dynamic routing strategy, effectively strengthening the feature representations of speech and visual modalities.

Second, a gated fusion mechanism is proposed, which selectively injects speech and visual features into the dominant text modality, thereby completing missing information while suppressing noise interference, which is a significant improvement compared to prior fusion strategies.

Finally, a combined attention fusion module is constructed and integrated with a multiscale encoder to capture cross-modal semantic interactions at both global and local levels, leading to more robust and efficient sentiment recognition.

Nevertheless, the current evaluation remains limited to English and Chinese datasets, suggesting that cross-cultural validation should be explored in future work. The rest of this paper is organized as follows. Section 2 reviews related work and highlights existing limitations. Section 3 presents the proposed MEMMI model in detail. Section 4 describes the experimental setup, datasets, evaluation metrics, ablation experiment and the experimental results. Section 5 concludes the paper and outlines future research directions.

2. Related Works

Multimodal sentiment analysis research has evolved from early traditional machine learning feature concatenation to complex deep learning-based modality interaction and fusion models, primarily including methods focused on modality fusion and those centered on modality representation. In the area of modality fusion, model architectures are often designed based on attention mechanisms. For example, in 2017, Vaswani et al. proposed the Transformer model based on a self-attention mechanism [1], which abandoned traditional recurrent and convolutional structures and focused on capturing global dependencies between inputs and outputs, fundamentally changing the paradigm of sequence modeling. Building on this, Tsai et al. created the classic MulT model [2], which used an attention mechanism to build a cross-modal attention fusion model. This model’s unique properties allowed it to directly process unaligned multimodal data, significantly improving its performance on such data. This method has been widely adopted by subsequent researchers as a common approach for modality fusion.

Soon after, Rahman et al. utilized BERT features to design the MAG component [3], which used a conditional self-attention mechanism to enable BERT [4] and XLNet [5] to seamlessly adapt to multimodal input during fine-tuning, becoming a new method for modality fusion in the field. Tang et al. proposed a bidirectional dynamic routing mechanism that uses a bidirectional attention mechanism to capture fine-grained multimodal sentiment, enhancing the ability to extract emotional context [6]. Huang et al. introduced a modality binding mechanism and enhanced cross-modal feature interaction with a fine-grained convolutional module, addressing the problem of losing fine-grained modal features during fusion [7]. Kyeonghun et al. proposed the AOBERT model to avoid information loss in traditional multimodal methods by using a single network to process text, visual, and speech modalities simultaneously [8]. Han et al. addressed the issue that existing methods do not fully consider the unequal importance of modalities by proposing a pairwise modality fusion framework and using a gating mechanism to balance the influence of different modalities on sentiment polarity judgment [9]. Cai et al. proposed a unimodal feature extraction network (UFEN) to extract unimodal features with stronger representation capabilities; then introduced multitask fusion network (MTFN) to improve the correlation and fusion effect between multiple modalities. Multilayer feature extraction, attention mechanisms and Transformer are used in the model to mine potential relationships between features [10].

In the area of modality representation, research primarily divides modal information into modality-shared and modality-specific representations. This approach originated from the MISA framework proposed by Hazarika et al. in 2020 [11], which learns modality-invariant and specific representations by mapping each modality’s representation to a shared modality-invariant subspace and a unique modality-specific subspace, enhancing the model’s understanding of different modalities. Subsequently, Wu et al. designed a text-centric shared-private framework for feature fusion to differentiate shared and private features in text, visual, and speech modalities [12]. Xu et al. proposed the SATI multimodal sentiment decoding model based on time-invariant learning to address the issues of excessive noise and information redundancy in the visual modality [13]. This model uses modality-invariant representation to guide interactions between different modalities while maintaining the consistency of time-series features. Lai et al. argued that modality representation methods in multimodal sentiment analysis have not effectively distinguished and extracted shared information between different modalities [14]. They designed a deep modality-shared information learning module to guide the model in learning cross-modal shared features. Wang et al., considering the limitations of modality decomposition and the problem of modality heterogeneity, introduced a policy and critic model from reinforcement learning to dynamically adjust the importance of each modality’s specific representation [15]. Zhou et al., based on the premise of modality decomposition, built a language-focused attractor to enhance the representation of the language modality, attracting complementary information from other modalities and improving overall performance [16].

Despite the significant progress made by the aforementioned methods in multimodal sentiment analysis, two prominent issues remain. On the one hand, the semantic expressive power of non-verbal modalities (such as speech and vision) is limited, often requiring semantic alignment with the text modality, but existing methods lack effective enhancement mechanisms. On the other hand, the fusion process is still insufficient in handling modality redundancy and noise, which can easily lead to interference from irrelevant information and affect sentiment recognition. Based on this, the Modality-Enhanced Multimodal Integration Model (MEMMI) proposed in this paper aims to strengthen the feature representation of non-verbal modalities and achieve noise suppression and information completion through a semantically guided modality enhancement and a gated fusion mechanism, thereby improving the overall sentiment analysis performance.

3. Methods

3.1. Overall Architecture

Figure 1 shows the overall architecture of the model and its workflow. The proposed multimodal integrated fusion attention framework primarily consists of two stages: modality interaction enhancement and combined attention fusion. It is capable of effectively processing information from three modalities: text, speech, and vision. First, for feature extraction, BERT-base-uncased is adopted to obtain contextualized textual embeddings, where each utterance is tokenized and represented as the last hidden state of the [CLS] token. Acoustic features are extracted using COVAREP, including pitch, energy, and spectral descriptors, sampled at 100 Hz. Visual features are obtained with OpenFace 2.0, which provides facial action units, gaze, and head pose information. Next, in the modality enhancement stage, the extracted visual and speech features are processed by their respective modality enhancement modules, followed by a self-attention calculation to obtain enhanced visual and speech representations guided by the text modality. Simultaneously, the text modality first undergoes self-attention calculation and then passes through a gated fusion module to inject non-verbal information for information completion. Finally, in the combined attention fusion stage, the three modalities are summed and fed into the combined attention fusion module. This module, with a total of four layers, also uses a multilayer encoder to extract features of different scales from each modality for modeling. The final results are then passed through a multilayer perceptron to obtain the multimodal representation for the sentiment analysis task.

3.2. Feature Extraction and Symbol Definition

In this task, the input data consists of text, visual, and speech modalities. After extraction by their respective feature extractors, the feature sequences for the three modalities can be represented as a triplet

(X_{t}, X_{v}, X_{a})

, where

X_{t} \in R^{T_{t} \times d_{t}}

,

X_{v} \in R^{T_{v} \times d_{v}}

and

X_{a} \in R^{T_{a} \times d_{a}}

. Here,

T_{m}, m \in \{t, v, a\}

is the sequence length and

d_{m}

is the vector dimension. The predicted sentiment score

\hat{y}

,

\hat{y}

∈[−3, 3] is a discrete value. Values greater than, equal to, and less than 0,represent positive, neutral, and negative sentiment, respectively.

3.3. Modality Interaction Enhancement Stage

3.3.1. Modality Enhancement Module

In multimodal sentiment analysis, traditional methods often treat the initial features of each modality as fixed inputs for fusion, lacking interaction between modalities. If the initial single-modality representations are weak or full of noise, the subsequent fusion performance may be poor. Therefore, this study designs a modality enhancement module that directly addresses the problem of insufficient single-modality expressive power from the outset. The module’s structure is shown in Figure 2, and it leverages the semantic guidance capability of the text modality to enhance the feature representation of non-verbal modalities. The core idea is to use a multihead attention mechanism and a dynamic routing iteration mechanism to extract guidance information from the text modality from different perspectives, dynamically optimizing the non-verbal modal features. Finally, a convolutional fusion is used to generate the final enhanced features. The multihead attention mechanism, by setting multiple independent attention heads, calculates the similarity between modalities, allowing each head to capture the relationship between text and non-verbal modalities from different perspectives. The dynamic routing iteration mechanism involves performing multiple iterative similarity calculations for a single attention head and accumulating them with historical similarity values. This ensures that each iterative update can comprehensively consider previous association information, progressively optimizing the non-verbal features to be more aligned with the semantic content of the text. For example, in a multimodal sentiment analysis task, if the text says “happy,” the enhanced speech features may more prominently highlight the positive tone in the voice.

Taking text-enhanced speech modality as an example, for the

h

-th attention head, assuming

k

iterations are performed, the formulas are as follows:

X_{a_{h}}^{c u r} = X_{a_{h}}^{m} \cdot X_{t}^{T}

(1)

X_{a_{h}}^{s u m} = \sum_{i = 1}^{k - 1} X_{a_{h}}^{i} + X_{a_{h}}^{c u r}

(2)

X_{a_{h}}^{k} = s o f t m a x (X_{a_{h}}^{s u m}) \cdot X_{t}

(3)

For each iteration, the current similarity value between modalities is calculated by Formula (1).

X_{a_{h}}^{m}

is the enhanced speech feature from the previous iteration. The resulting

X_{a_{h}}^{c u r}

will be stored as the historical modality similarity value for the next iteration. When performing the

X_{a_{h}}^{c u r}

iteration, the current calculated modality similarity value needs to be added to the historical modality similarity value.

Then, the softmax function is used to obtain the attention weight, which is multiplied by

X_{t}

to perform a weighted summation, and the result is the enhanced speech feature for the current

k

-th iteration. When

k = 1

there is no historical modality similarity value, so

X_{a_{h}}^{s u m}

is the original speech modality, that is

X_{a_{h}}^{c u r} = X_{a_{h}}^{m} = X_{a_{h}}^{s u m}

.

After all attention heads have been processed, the enhanced speech modality from each head is obtained. These outputs are then concatenated and passed through a one-dimensional convolution for dimensionality reduction while preserving key information. The result after the convolution is the final enhanced speech modality. The formula is as follows:

X_{a}^{enhanced} = C o n v 1 d {(cat (X_{a_{1}}^{enhanced}, X_{a_{2}}^{enhanced}, X_{a_{3}}^{enhanced}, \dots, X_{a_{h}}^{enhanced})^{T})}^{T}

(4)

3.3.2. Gated Fusion Module

To facilitate the interaction between text and non-verbal modalities and ensure the text modality remains dominant, this paper introduces a gated mechanism. This mechanism allows the text modality to integrate information from speech and vision. As shown in Figure 3, visual and speech features are first concatenated with text features. They are then mapped through a linear layer and weighted by the original linear mappings of speech and visual features. These two parts are then summed to represent the joint contribution of speech and vision to the text. The formulas are:

W_{v} = R e L U (W_{h v} (cat (X_{v}, X_{t})))

(5)

W_{a} = R e L U (W_{h a} (c a t (X_{a}, X_{t})))

(6)

H_{m} = W_{v} ⊙ W_{o v} (X_{v}) + W_{a} ⊙ W_{o a} (X_{a})

(7)

Here,

W_{h v}, W_{h a}, W_{o v}, W_{o a}

is a linear layer, and

H_{m}

is the final joint representation. This design allows text features to guide the extraction of visual and speech features, ensuring that the fused features remain consistent with the text semantics.

To balance the contributions of speech and visual features, a dynamic scaling factor

α

is introduced into the gated mechanism. By calculating the L2 norm of the text and fused features and combining it with an adjustable parameter

β

, the module generates a threshold and constrains the scaling factor to be between 0 and 1. The final enhanced feature is a weighted combination of the fused features and the original text features. Finally, the module uses Layer Normalization and Dropout to process the enhanced features, improving model stability and generalization ability and preventing overfitting. The formulas are as follows:

α = \min ((\frac{‖ X_{t} ‖_{2}}{‖ H_{m} ‖_{2} + ϵ}) \times β, 1)

(8)

X_{t}^{output} = α ⊙ H_{m} + X_{t}

(9)

X_{t}^{fusion} = Dropout (LayerNorm (X_{t}^{output}))

(10)

Here,

ϵ

is to prevent division by zero, and

β

is an adjustable hyperparameter that controls the maximum influence of the non-verbal modalities. The final output

X_{t}^{fusion}

is the text feature fused with speech and visual information.

3.4. Combined Attention Fusion Stage

After the modality enhancement is completed, a combined attention fusion module is used to achieve synchronous fusion of the three modality representations. To comprehensively capture the multilevel information of each modality, a scale feature extraction module is designed using an encoder, which can effectively generate multiscale feature representations. This design allows the model to fully focus on and utilize all valuable information within each modality, achieving more efficient multimodal fusion.

3.4.1. Scale Feature Extraction Module

The scale feature extraction module uses Transformer encoder layers to extract modal features. As shown in Figure 4, each layer has three encoders, and the number of layers in the scale feature extraction module is the same as the number of layers in the combined attention fusion module. They are used to encode the text, visual, and speech modalities, respectively. The outputs of the encoders serve as the keys and values in the cross-attention of the combined attention fusion layer. The formulas are:

H_{m}^{i} = {Encoder}_{m} (H_{m}^{i - 1}, θ_{m e})

(11)

Here,

i \in \{1, 2, 3, 4\}

,

m \in \{t, v, a\}

,

{Encoder}_{m}

are the encoders,

θ_{m e}

represents the internal parameters of the encoders,

H_{m}^{i}

is the output of the encoders, and

H_{m}^{0}

are the original inputs of the three modalities. By stacking multiple encoder layers, features of different scales are extracted and modeled in the combined attention fusion, which deepens the interaction between modalities and provides a better solution to the problem of modality heterogeneity.

3.4.2. Combined Attention Fusion Module

To address the issue of low traditional fusion efficiency, this paper designs a combined attention fusion module. This module is capable of synchronously and integrally fusing text, visual, and speech information within a unified architecture. Specifically, the module sums the enhanced outputs of each modality from the modality interaction enhancement stage to form an initial hybrid representation containing information from all three modalities. This hybrid representation then serves as the query for the cross-attention calculation within the combined attention fusion layer. Because it contains integrated information from all three modalities, it can interact with the representations of each individual modality from a global perspective in subsequent attention calculations, ensuring that all modal information is fully considered and avoiding the problem of neglecting some modal information that can occur with traditional step-by-step fusion methods.

The structure of the combined attention fusion module is shown in Figure 4. It is composed of three major stages: modality-specific encoding, cross-modal interaction, and final fusion. In the encoding stage, the module first uses a linear layer to align the three modal features, ensuring they are comparable in the same feature space. Subsequently, in the cross-modal interaction, the three aligned modal features are summed to serve as the initial fused representation, which is used as the query for cross-attention in the combined attention fusion layer. The module adopts a residual connection structure, and feature expression is enhanced through normalization and a feed-forward layer network. The processing results of the three modalities are summed after normalization and then input into a multihead self-attention mechanism for further fusion. Finally, a feed-forward layer generates the input representation for the next layer. The specific formulas are as follows:

I_{m} = Linear (X_{m}^{enhanced})

(12)

Y^{0} = I_{t} + I_{a} + I_{v}

(13)

F_{m}^{i} = CrossAttn (Y^{i - 1}, H_{m}^{i}, H_{m}^{i}, θ_{m}^{i})

(14)

{\hat{F}}_{m} = LayerNorm (FFN (LayerNorm (Y^{i - 1} + F_{m}^{i})))

(15)

Y^{i} = FFN (Multihead (LayerNorm (\sum_{m} {\hat{F}}_{m}), θ))

(16)

Here,

m \in \{t, v, a\}

,

i \in \{1, 2, 3, 4\}

.

θ

represents the internal parameters of each attention module, and the number of multihead attention heads is 8.

I_{m}

are the aligned modal features.

Y^{0}

is the sum of the three modal features.

CrossAttn

represents cross-attention, and

Multihead

represents multihead attention.

F_{m}^{i}

are the results of the i-th layer’s cross-attention calculation, and the summation term

{\hat{F}}_{m}

is the result of the cross-attention calculation followed by two layers of normalization. LayerNorm and FFN represent layer normalization and a feed-forward layer, respectively.

Y^{i}

is the final output of the i-th layer of the combined attention fusion module.

In the cross-attention mechanism, the fused representation

Y^{i - 1}

from the previous layer is used as the query, while the feature representations

H_{m}^{i}

of each modality are used as the keys and values. This design allows the fusion process to adaptively extract relevant information from each modality based on the current fusion state. The multihead self-attention mechanism further enhances the model’s ability to capture complex relationships between different features, allowing the final fused representation to integrate information from each modality more comprehensively. Furthermore, the combination of Layer Normalization and the feed-forward layer (FFN) not only stabilizes the training process but also enhances the model’s expressive power. The residual connection structure ensures the effective transmission of information in deep networks, preventing the vanishing gradient problem. Through this designed combined attention fusion mechanism, the model can fully utilize the complementary information of different modalities while reducing redundancy and noise between them, thereby improving the accuracy and robustness of multimodal sentiment analysis.

3.5. Output and Loss

After the final combined attention fusion block is computed, a MultiLayer Perceptron (MLP) is used to predict the sentiment score. The Mean Absolute Error (MAE) is used as the loss function, with the formula as follows:

\hat{y} = MLP (Mean (Y^{f}))

(17)

L = \frac{1}{N} \sum_{i = 1}^{N} |y_{i} - {\hat{y}}_{i}|

(18)

Here,

y

is the ground truth label from the dataset, and

\hat{y}

is the predicted value of the model. Mean represents the mean operation,

L

which is the model’s loss value.

4. Experiments

4.1. Dataset Preparation

This paper selects the public datasets CMU-MOSEI, CMU-MOSI, and CH-SIMS for experiments. Table 1 shows the partitioning of the training, validation, and test sets for these datasets. The effectiveness of the proposed framework is validated by comparing the scores on the three datasets.

CMU-MOSI is a dataset created by Zadeh et al. in 2016 [17]. MOSI contains 2199 annotated video clips, all of which are self-monologues by speakers to ensure data consistency and controllability. Each video clip in MOSI is annotated with a sentiment intensity score using a [−3, +3] scale, where −3 represents very negative and +3 represents very positive. This annotation method allows researchers to quantify emotional intensity for analysis. This dataset has made a significant contribution to the development of multimodal sentiment analysis and is one of the most commonly used MSA datasets.

CMU-MOSEI, created by Zadeh et al., is another very important dataset in the field of multimodal sentiment analysis [18]. It contains 23,453 annotated video clips from 5000 videos and 1000 different speakers, covering 250 different topics such as product reviews and movie reviews. It also consists of self-monologues by speakers, and the sentiment intensity annotation is the same as in MOSI. Furthermore, its annotation system is more comprehensive. As shown in Figure 5, the CMU-MOSEI annotation system includes two dimensions: sentiment polarity and emotional intensity. Sentiment polarity is represented by a continuous value from −3 to +3, from extremely negative to extremely positive. For better interpretability, these values are grouped into five categories: Negative (−3, −2), Weakly Negative (−1), Neutral (0), Weakly Positive (+1), and Positive (+2, +3). Emotional intensity is annotated for six basic emotions (anger, disgust, fear, happiness, sadness, and surprise), with a focus on gender balance.

CH-SIMS is a dataset created by Yu et al. that focuses on Chinese MSA [19]. It contains 2281 carefully selected Chinese video clips, covering various scenarios and roles. The data source for CH-SIMS is primarily popular Chinese variety shows, which include rich dialogue and interactive scenes with high video quality and rich facial expressions and body language. In terms of modality composition, CH-SIMS is similar to the CMU series datasets, containing video, speech, and text modalities. However, its annotation system is different; for sentiment polarity, it uses a three-category annotation [−1, 1], representing negative, neutral, and positive, respectively. It also provides fine-grained emotional annotations for six basic emotions: happiness, surprise, disgust, fear, sadness, and anger. This allows researchers to not only perform multimodal sentiment analysis but also analyze the visual, speech, or text information in the videos separately, capturing fine-grained emotions in the data.

4.2. Baseline Models

To validate the reliability and effectiveness of the proposed model, this paper selects 10 classic baseline models from the field of multimodal sentiment analysis for comparison. Among these, MISA [11]. represents the modality representation direction, Self-MM [20]. belongs to the self-supervised learning direction, and the rest are modality fusion methods. Specifically, these include: TFN [21], which achieves multimodal interaction representation through tensor outer product; MulT [2], which uses a cross-modal attention mechanism to solve modality misalignment; MAG-BERT [3], which dynamically injects non-verbal features into BERT using a gating mechanism; MISA, which improves robustness by separating modality-invariant and modality-specific spaces; BBFN [9] which uses a bidirectional bimodal fusion to balance performance and complexity; MMIM [22], which enhances modality complementarity based on mutual information maximization; Self-MM, which combines predicting missing modalities with contrastive learning for self-supervised optimization; CubeMLP [23], which proposes a cubic MLP structure for cross-modal interaction; ALMT [24], which focuses on the modality alignment problem; and KuDA [25], which combines knowledge transfer with domain adaptation to improve model generalization ability. DLF [16], is a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. CRNet [26] leverages different acoustic and visual representation subspaces to interact with linguistic modality. TMBL [7], a new modality-binding learning framework, is proposed, and the internal structure of the transformer model is redesigned. These methods cover different approaches to multimodal fusion, providing a comprehensive baseline for the performance evaluation of the proposed model.

4.3. Experimental Environment and Parameters

The experiments were trained on an NVIDIA GeForce RTX 4090 graphics card with 24 GB of memory. The AdamW optimizer was used to optimize the model, with four layers of combined attention fusion and scale feature extraction layers. The initial learning rate was 1 × 10⁻⁴. Detailed parameter settings can be found in Table 2. Hyperparameters and training:batch_size = 64; epochs = 100; initial learning rate = 1 × 10⁻⁴; optimizer = AdamW (weight_decay = 0.01); dropout after fusion/attention layers = 0.3; dropout in classification FC layers = 0.5; warmup for first 5% steps then linear decay. Early stopping is applied with patience = 10 epochs on validation loss; the best validation model is saved.

Compute cost (measured on a single NVIDIA GeForce RTX 4090, 24 GB): full training time ≈ 2.5 hrs (MOSI), 6 hrs (MOSEI), 8 hrs (CH-SIMS); peak GPU memory ≈ 18 GB/20 GB/22 GB, respectively. Model size ≈ 14.2 M parameters; forward FLOPs ≈ 8.5 GFLOPs/sample (batch = 1); inference latency ≈ 12 ms/sample(end-to-end, including feature extraction; core MEMMI module ≈ 6–8 ms).

4.4. Evaluation Metrics

To comprehensively evaluate the model’s performance, this paper uses a variety of regression and classification metrics. In the regression task, Mean Absolute Error (MAE) is used to measure the average deviation between predicted and true values, with a lower value indicating higher prediction accuracy. Pearson Correlation Coefficient (Corr) is used to assess the linear correlation between predicted and true values, reflecting the model’s ability to capture sentiment trends. In the classification task, Binary Accuracy (Acc-2), Three-class Accuracy (Acc-3), Five-class Accuracy (Acc-5), and Seven-class Accuracy (Acc-7) are used to evaluate sentiment polarity or intensity at different granularities. Acc-2 includes two forms, Has0-Acc2 and Non0-Acc2, to adapt to different label partitioning methods. Furthermore, the F1 score is introduced to balance precision and recall, which is particularly suitable for sentiment recognition scenarios with imbalanced class distributions.

4.5. Results Analysis

Table 3, Table 4 and Table 5 show the experimental results of the model on the three datasets. The Acc-2 metric represents Has0-Acc2/Non0-Acc2, and the F1 metric represents Has0-F1/Non0-F1. The data indicate that the proposed model demonstrates a significant advantage. On the MOSI dataset, the proposed model achieves an Acc-7 of 45.91, a significant improvement compared to earlier methods like TFN’s 34.9 and MulT’s 40.0. The binary accuracy and F1 score reach 82.86/84.60 and 82.70/84.56, respectively, with an MAE of 0.73 and a Corr of 0.79. All metrics surpass all baseline models. On the MOSEI dataset, the advantage is even more pronounced, with the proposed model achieving an Acc-7 of 54.17, surpassing the second-best model, ALMT, at 53.72. The Acc-2 and F1 scores reach 83.69/86.02 and 83.22/86.01, respectively, MAE drops to 0.53, and Corr reaches 0.78, also showing the best performance.

Figure 6 shows the Non0-Acc2 scores of the different models. Compared to earlier methods like TFN, MulT, and MAG-BERT, the advantage of the proposed model lies primarily in its enhanced modality interaction capability. While TFN created a unified representation space for multimodal interaction, its static fusion strategy cannot adapt to the changing importance of modalities in different samples. MulT introduced a cross-modal attention mechanism to solve the modality misalignment problem, but its directional cross-attention structure limited comprehensive interaction between modalities. MAG-BERT incorporated non-verbal information through a multimodal adaptive gating mechanism, but its excessive reliance on the text modality led to insufficient robustness. On CH-SIMS, its Corr was only 0.399, while the proposed model achieved 0.594. In contrast, the proposed model, through its modality enhancement and multimodal integrated fusion attention framework, achieves more comprehensive and adaptive modality interaction.

Compared to recent methods, the proposed model also demonstrates advantages. BBFN’s bidirectional bimodal fusion mechanism processes different modalities in parallel, balancing relationship modeling and computational complexity, but struggles to capture the complete relationship among three modalities. Self-MM enhances modality consistency through self-supervised tasks, but its multitask learning framework complicates the training process. The proposed model avoids these limitations by using an iterative self-attention mechanism and a precise gating mechanism to achieve dynamic feature enhancement and noise control, simplifying the training process while improving performance.

CubeMLP’s three-dimensional cubic MLP structure avoids the computational overhead of attention mechanisms, but its fixed projection and transformation methods limit its modeling capability. ALMT focuses on solving the cross-modal alignment problem, but its single strategy struggles to handle the complexity of emotional expression. KuDA combines knowledge transfer with adaptive dominant modality technology, but its reliance on external knowledge increases model complexity. In contrast, the proposed model does not require external knowledge support. By relying solely on its internal modality enhancement and fusion mechanisms, it improves Acc-7 to 45.91 on MOSI, proving the intrinsic effectiveness of the proposed method.

On the CH-SIMS dataset, the proposed model achieves the highest Acc-5 and Acc-3 scores of 41.88 and 66.52, respectively. The Corr metric, in particular, reaches 0.59, which is superior to all comparison models. A comprehensive analysis shows that the proposed model’s advantages mainly stem from the modality enhancement mechanism, the combined attention fusion framework, and the optimized gating mechanism, which effectively improve the performance of multimodal sentiment analysis.

4.6. Ablation Experiment

To further explore the effectiveness and contribution of each module in the model and to prove the rationality of the model design, this paper designs four ablation experiments.

4.6.1. Impact of Different Modules

To verify the effectiveness of each module in the model, the performance change after removing each core module was tested on the MOSI and CH-SIMS datasets. The results are shown in Table 6, where MEB stands for Modality Enhancement Block, GFM for Gated Fusion Module, CAF for Combined Attention Fusion, and SFE for Scale Feature Extraction encoder. The notation w/o indicates the removal of the corresponding module.

The experimental results show that removing the combined attention fusion module has the most significant impact on model performance. Acc-5 on the CH-SIMS dataset decreases by 6.74 percentage points, and Acc-7 on the MOSI dataset decreases by 5.65 percentage points. When the modality enhancement module is removed, the model’s accuracy drops by about 1 percentage point on both datasets, for both visual and audio modalities. Removing the scale feature extraction encoder results in a 2.76 percentage point drop in Acc-5 on CH-SIMS and a 1.75 percentage point drop in Acc-7 on MOSI. These results demonstrate the crucial role of the combined attention fusion mechanism and also validate the importance of modality enhancement and scale feature extraction in improving multimodal sentiment analysis performance.

4.6.2. Impact of Different Modalities

To further investigate the contribution of each modality to the overall performance, this paper attempts to remove each modality to verify the necessity of multimodal fusion and compares the results with the ALMT and CubeMLP models. The results are shown in Table 7. In the table, the proposed model is denoted as MEMMI, and V+A, T+V, T+A represent the removal of text, speech, and visual modalities, respectively.

The modality ablation experimental results show that removing the text modality causes a sudden drop in scores for all three models on both datasets. This clearly demonstrates the crucial role of the text modality in multimodal sentiment analysis. In contrast, the score drops for removing the speech modality and removing the visual modality are not as large, and the degree of impact between the two is relatively close. This indicates that although speech and visual modalities are important, their contributions are relatively balanced. This experiment also compares the impact of removing different modalities on the baseline ALMT and CubeMLP models. Specifically, when the text modality is removed, the performance of the proposed model decreases more slowly than that of ALMT and CubeMLP. Figure 7 shows the trend changes of the Acc-5 and Acc-7 metrics when different modalities are removed. This strongly proves that the proposed modality enhancement mechanism can more effectively utilize the remaining non-verbal information to better maintain model performance when the primary information source is missing.

4.6.3. Impact of Different Fusion Techniques

To analyze the impact of different fusion techniques, this paper compares four multimodal fusion methods and conducts comparison experiments on the MOSI dataset, as shown in Table 8. The data shows that the proposed combined attention fusion module demonstrates the best performance, outperforming other fusion methods on all evaluation metrics. These experimental results fully prove that, compared to BBFN’s pairwise modality fusion, MulT’s three-modal pairwise cross-fusion, and TFN’s simple fusion methods, the proposed combined attention fusion mechanism can more effectively capture the complex interaction relationships among multiple modalities, thereby achieving comprehensively leading performance on all evaluation metrics.

4.6.4. Impact of the Number of Different Scale Feature Extraction Layers

To verify the impact of each layer of scale features on model performance, the ablation experiment designed in this paper was conducted by selectively enabling or disabling different encoder layers. The results are shown in Table 9. When all four encoder layers (L1-L4) are involved in scale feature modeling, the model achieves optimal performance on both datasets for key metrics. Removing any layer or combination of layers typically leads to a decrease in performance metrics to varying degrees. For example, when only L1 and L2 layers are used, the model performance is at its lowest. When only L3 and L4 layers are used, the Acc-7 on MOSI is 44.52 and the Acc-5 on CH-SIMS is 40.88, both of which are significantly lower than the results when all four layers are used. Although for the MOSI MAE metric, some three-layer combinations (e.g., L1, L2, L3) can achieve the same performance as the four-layer combination, there is still a gap in the main classification accuracy metrics Acc-7 and Acc-5. This indicates that while the contribution of different-level features to the final performance may have a slight focus, overall, each scale feature extraction layer makes a positive contribution to the model’s overall performance.

5. Conclusions

This paper addresses the issues of modality heterogeneity, insufficient expressive power of non-verbal modalities, and low fusion efficiency in multimodal sentiment analysis by proposing a Modality Enhanced Multimodal Integration Model (MEMMI). This model achieves modality enhancement and efficient fusion within a unified architecture. It improves the feature representation of non-verbal modalities through a semantically guided modality enhancement module. It also achieves selective injection of speech and visual information through a gated fusion mechanism, which completes information while effectively suppressing noise. Finally, it fully models cross-modal interaction features at global and local semantic levels using a combined attention fusion module and a multiscale feature encoder. Experiments on three mainstream datasets show that MEMMI surpasses existing methods on all metrics, validating its effectiveness and robustness in the multimodal sentiment recognition task. In addition, this paper designs ablation experiments to verify the crucial role of the combined attention fusion module and the effectiveness of the modality enhancement module, further proving the rationality and necessity of the proposed method.

Future work will focus on three aspects: first, exploring more efficient cross-modal pre-training methods to further improve the model’s generalization ability in low-resource scenarios. Second, extending MEMMI to broader multilingual and multicultural datasets, and improving its scalability to real-time applications. And third, we plan to further enhance the explainability and interpretability of the proposed framework. Specifically, we aim to incorporate model-agnostic interpretability techniques and design visualization methods to better illustrate how different modalities and features contribute to the final prediction. Such efforts will not only improve the transparency of the model but also facilitate its adoption in real-world applications.”

Author Contributions

Methodology, Z.Z.; Writing—original draft preparation, Z.Z.; Writing—review and editing, W.W.; Supervision, G.F.; Validation, T.Y.; Data curation, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China [62237001].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.-P.; Hoque, E. Integrating Multimodal Information in Large Pretrained Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763. [Google Scholar]
Tang, J.; Liu, D.; Jin, X.; He, K.; Li, F. BAFN: Bi-Direction Attention Based Fusion Network for Multimodal Sentiment Analysis. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1966–1978. [Google Scholar] [CrossRef]
Huang, J.; Zhou, J.; Tang, Z.; Liu, W.; Han, J. TMBL: Transformer-Based Multimodal Binding Learning Model for Multimodal Sentiment Analysis. Knowl.-Based Syst. 2024, 285, 111346. [Google Scholar] [CrossRef]
Kim, K.; Park, S. AOBERT: All-Modalities-in-One BERT for Multimodal Sentiment Analysis. Inf. Fusion 2023, 92, 37–45. [Google Scholar] [CrossRef]
Han, W.; Chen, H.; Gelbukh, A.; Zadeh, A.; Morency, L.-P.; Poria, S. Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis. In Proceedings of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, 18–22 October 2021. [Google Scholar]
Cai, Y.; Li, X.; Zhang, Y.; Li, J.; Zhu, F.; Rao, L. Multimodal sentiment analysis based on multi-layer feature fusion and multi-task learning. Sci. Rep. 2025, 15, 2126. [Google Scholar] [CrossRef] [PubMed]
Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
Wu, Y.; Lin, Z.; Zhao, Y.; Qin, B.; Zhu, L.-N. A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021. [Google Scholar]
Xu, G.; Xue, J.; Liu, Y.; Sun, H.; Wang, H. Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning. arXiv 2024, arXiv:2409.00143. [Google Scholar]
Lai, S.; Li, J.; Guo, G.; Zhang, Y.; Wang, S. Shared and Private Information Learning in Multimodal Sentiment Analysis with Deep Modal Alignment and Self-Supervised Multi-Task Learning. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024. [Google Scholar]
Wang, S.; Liu, C.; Liu, Q. Multi-Modality Collaborative Learning for Sentiment Analysis. arXiv 2025, arXiv:2501.12424. [Google Scholar] [CrossRef]
Wang, P.; Zhou, Q.; Wu, Y.; Chen, T.; Hu, J. DLF: Disentangled-language-focused multimodal sentiment analysis. Proc. AAAI Conf. Artif. Intell. 2025, 39, 20. [Google Scholar] [CrossRef]
Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259. [Google Scholar] [CrossRef]
Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018. [Google Scholar]
Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv 2017, arXiv:1707.07250. [Google Scholar] [CrossRef]
Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021. [Google Scholar]
Sun, H.; Wang, H.; Liu, J.; He, L.; Liu, L. CubeMLP: An MLP-Based Model for Multimodal Sentiment Analysis and Depression Estimation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3722–3729. [Google Scholar]
Zhang, H.; Wang, Y.; Yin, G.; Yang, H.; Wang, Z.; Li, D. Learning Language-Guided Adaptive Hyper-Modality Representation for Multimodal Sentiment Analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 756–767. [Google Scholar]
Feng, X.; Lin, Y.; He, L.; Liu, J.; Wang, H. Knowledge-Guided Dynamic Modality Attention Fusion Framework for Multimodal Sentiment Analysis. arXiv 2024, arXiv:2410.04491. [Google Scholar] [CrossRef]
Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-space representation interaction network for multimodal sentiment analysis. Knowl.-Based Syst. 2024, 283, 111149. [Google Scholar] [CrossRef]

Figure 1. Modality Enhanced Multimodal Integrated Fusion Attention architecture diagram. Purple modules represent the text modality, where BERT is used for contextual feature extraction. Blue modules represent the speech modality, including speech feature extraction and enhancement. Orange modules represent the visual modality, including visual feature extraction and enhancement. Violet module (Gated Fusion Module) indicates the cross-modal feature interaction, selectively injecting speech and visual information into the text modality. Green module (Scale Feature Extraction) represents multi-scale semantic encoding for capturing features at different granularity levels. Gray blocks represent the Combined Attention Fusion layers that integrate multimodal information, followed by a Mul-tilayer Perceptron (MLP) for final sentiment prediction.

Figure 2. Schematic diagram of the modality enhancement module structure. The colored blocks denote the attention weights and interaction patterns between modalities: Green blocks represent the feature embeddings of the target modality Xt; Blue blocks represent the feature embeddings of the auxiliary modality Xa; Mixed color blocks (yellow–pink grid) in the attention map illustrate the dynamically learned attention weights computed by the Softmax(Xt, Xa) function, indicating the degree of semantic correlation between the two modalities; The iterative loop (×k) denotes the multi-step refinement process, where the enhanced representation Xenhanced is obtained after k iterations.

Figure 3. Schematic diagram of the gated fusion module. The color coding represents the functional role of each component: Blue blocks represent audio modality features (Xa),purple blocks represent visual modality features (Xv), green blocks represent text modality features (Xt), orange blocks indicate linear transformation layers. Cyan block (Gating) denotes the adaptive gating mechanism, gray blocks labeled concat indicate feature concatenation operations, purple blocks labeled L2 and Normalization represent regularization and normalization processes. The arrows show the information flow direction.

Figure 4. Schematic diagram of the combinatorial attention fusion module structure.

Figure 5. The distribution of sentiment and emotions in the CMU-MOSEI.

Figure 6. Bar chart of Acc-2 scores for different models on the dataset. The chart visually compares how high the Acc-2 score is (represented by the height of the bars, corresponding to the “Time” value) for different models across different datasets.

Figure 7. Line chart of model performance trends under different modality combinations.

Table 1. Partitioning of the training set, validation set, and test set of different datasets.

Datasets	Train	Valid	Test	Total	Language
MOSEI	16,326	1871	4659	22,856	English
MOSI	1284	229	686	2199	English
CH-SIMS	1368	456	457	2281	Chinese

Table 2. Experimental parameter settings for different datasets.

Descriptions	MOSEI	MOSI	CH-SIMS
Batch Size	64	64	64
Initial Learning Rate	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁴
Epochs	100	100	100
Optimizer	AdamW	AdamW	AdamW
Head of Enhanced Block	3	3	4
Iteration of Enhanced block	3	3	4
Integrated Fusion Attention Block	4	4	4
Kernel Size of Conv1D	3	3	4

Table 3. The experimental results of the model on MOSI. Bolding of “Ours” in the footer: The bolding here is important because it emphasizes the experimental results of the pro-posed model, which are the focus of this table. It contrasts “Ours” with the results of other models like TFN, MulT, and MAG-BERT.

Models	MOSI
Models	Acc-7↑	Acc-2↑	F1↑	MAE↓	Corr↑
TFN	34.9	-/80.8	-/80.7	0.90	0.70
MulT	40.0	-/82.0	-/82.8	0.87	0.70
MISA	42.3	81.1/83.4	81.7/83.6	0.78	0.76
MAG-BERT	43.62	81.37/82.43	81.75/82.61	0.77	0.78
BBFN	43.88	80.32/82.47	80.21/82.44	0.80	0.74
MMIM	44.35	82.51/83.30	82.38/84.23	0.74	0.78
Self-MM	45.79	82.54/83.77	82.28/83.91	0.75	0.79
CubeMLP	45.80	82.76/82.32	81.77/84.23	0.76	0.77
ALMT	45.79	81.63/83.38	81.57/83.38	0.76	0.77
KuDA	45.08	82.40/83.71	82.48/83.46	0.74	0.78
DLF	47.08	85.06	-/85.06	0.73	0.78
CRNET	47.40	-/86.4	-/86.40	0.71	0.71
TMBL	36.30	82.41/84.29	82.41/84.29	0.87	0.76
Ours	45.91	82.86/84.60	82.70/84.56	0.734	0.79

Table 4. The experimental results of the model on MOSEI. Bolding of “Ours” in the footer: The bolding here is important because it emphasizes the experimental results of the pro-posed model, which are the focus of this table. It contrasts “Ours” with the results of other models like TFN, MulT, and MAG-BERT.

Models	MOSEI
Models	Acc-7↑	Acc-2↑	F1↑	MAE↓	Corr↑
TFN	50.20	-/82.5	-/82.10	0.59	0.70
MulT	51.80	-/82.5	-/82.30	0.58	0.70
MAG-BERT	52.67	82.51/84.82	82.77/84.71	0.54	0.75
MISA	52.2	82.6/84.5	82.8/84.3	0.55	0.75
BBFN	52.88	82.87/85.73	82.13/84.56	0.54	0.76
MMIM	53.24	82.24/85.97	82.66/85.94	0.53	0.77
Self-MM	53.46	82.68/84.96	82.95/84.93	0.52	0.76
CubeMLP	53.35	82.36/85.23	82.61/85.04	0.53	0.76
ALMT	53.72	81.09/85.09	81.65/85.10	0.52	0.77
KuDA	53.23	81.28/85.25	82.74/85.24	0.53	0.77
DLF	53.90	-/85.42	85.27	0.53	0.76
CRNET	53.8	-/86.2	-/86.1	0.54	0.77
TMBL	52.4	84.23/85.84	84.87/85.92	0.55	0.77
Ours	54.17	83.69/86.02	83.22/86.01	0.52	0.78

Table 5. The experimental results of the model on CH-SIMS. Bolding of “Ours” in the footer: The bolding here is important because it emphasizes the experimental results of the pro-posed model, which are the focus of this table. It contrasts “Ours” with the results of other models like TFN, MulT, and MAG-BERT. This table presents the performance of various models on the CH-SIMS dataset for multimodal sentiment analysis. The following is a guide to the column headers and symbols: Models: The names of the different models being compared. Acc-5: Accuracy for 5-class classification. Higher is better. Acc-3: Accuracy for 3-class classification. Higher is better. CH-SIMS: This is the dataset name, and the following columns are specific metrics reported on it. Acc-2: Accuracy for 2-class classification. Higher is better. F1: The F1 score, which is the harmonic mean of precision and recall. Higher is better. MAE↓: Mean Absolute Error. The ↓ symbol indicates that a lower value is better.

Models	CH-SIMS
Models	Acc-5↑	Acc-3↑	Acc-2↑	F1↑	MAE↓	Corr↑
TFN	39.30	65.12	78.38	78.62	0.43	0.59
MulT	37.94	64.77	78.56	79.66	0.45	0.56
MAG-BERT	-	-	74.44	71.75	0.49	0.39
MISA	-	-	76.54	76.59	0.44	0.56
BBFN	40.92	61.5	78.12	77.88	0.43	0.56
Self-MM	41.53	65.47	80.04	80.44	0.42	0.57
CubeMLP	40.67	64.53	77.68	77.59	0.41	0.59
ALMT	39.17	64.11	78.34	78.44	0.41	0.57
KuDA	41.58	65.86	77.9	80.71	0.41	0.58
Ours	41.88	66.52	77.68	77.85	0.45	0.59

Table 6. The impact of removing different modules on model performance.

Methods	CH-SIMS			MOSI
Methods	MAE↓	Corr↑	Acc-5↑	MAE↓	Corr↑	Acc-7↑
Ours	0.45	0.59	41.88	0.73	0.79	45.91
w/o Visual MEB	0.46	0.58	40.85	0.74	0.78	45.08
w/o Audio MEB	0.47	0.57	40.44	0.75	0.77	44.89
w/o GFM	0.46	0.57	41.21	0.74	0.78	45.32
w/o CAF	0.57	0.47	35.14	0.82	0.76	40.26
w/o SFE	0.47	0.57	39.12	0.75	0.77	44.16

Table 7. The impact of removing different modalities on the model. The rows are grouped by the model being tested, and within each group, different experimental settings are compared. V+A, T+V, T+A: These represent the ablated settings where one modality is removed from the input. T stands for Text (language modality).V stands for Vision (video/image modality).A stands for Audio (sound modality).

Methods	CH-SIMS			MOSI
Methods	MAE↓	Corr↑	Acc-5↑	MAE↓	Corr↑	Acc-7↑
MEMMI	0.45	0.59	41.88	0.73	0.79	45.91
V+A	0.60	0.27	30.16	1.26	0.32	26.17
T+V	0.58	0.47	34.46	0.82	0.72	39.17
T+A	0.58	0.47	36.49	0.82	0.73	38.69
ALMT	0.42	0.58	39.17	0.76	0.77	45.79
V+A	0.64	0.16	25.69	1.53	0.24	20.49
T+V	0.59	0.48	32.79	0.84	0.70	38.59
T+A	0.58	0.48	34.68	0.83	0.71	39.48
CubeMLP	0.42	0.59	40.67	0.75	0.77	45.80
V+A	0.62	0.21	27.28	1.39	0.30	23.59
T+V	0.57	0.46	33.95	0.84	0.71	40.67
T+A	0.57	0.39	32.77	0.84	0.72	41.56

Table 8. The impacts of different fusion techniques on the model performance.

Models	MOSI
Models	Acc-7↑	Acc-2↑	F1↑	MAE↓	Corr↑
TFN	34.9	-/80.8	-/80.7	0.90	0.69
MulT	40.0	-/82.0	-/82.8	0.87	0.69
BBFN	43.88	80.32/82.47	80.21/82.44	0.79	0.74
KuDA	45.08	82.40/83.71	82.48/83.46	0.73	0.77
Ours	45.91	82.86/84.60	82.70/84.56	0.73	0.79

Table 9. Impact of Removing Specific Encoder Layers on Modal Performance. The headers are divided into two main groups: the model’s configuration and its performance. Encoder Layer Configuration (L₁ to L₄).L₁, L₂, L₃, L₄: These symbols represent four distinct layers within the model’s encoder. ✓ (Checkmark): Indicates that the specific layer is included in the model for that experiment. Empty Cell: Indicates that the specific layer is removed or disabled in the model for that experiment.

$L_{1}$	$L_{2}$	$L_{3}$	$L_{4}$	MOSI		CH-SIMS
$L_{1}$	$L_{2}$	$L_{3}$	$L_{4}$	Acc-7↑	MAE↓	Acc-5↑	MAE↓
✓	✓			44.77	0.74	40.16	0.46
		✓	✓	44.52	0.74	40.88	0.46
✓	✓	✓		45.78	0.73	41.55	0.45
✓	✓		✓	45.24	0.74	41.11	0.45
✓		✓	✓	45.36	0.74	41.28	0.45
	✓	✓	✓	45.15	0.74	41.54	0.46
✓	✓	✓	✓	45.91	0.73	41.88	0.45

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, Z.; Wu, W.; Yuan, T.; Feng, G. Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis. Appl. Sci. 2025, 15, 10825. https://doi.org/10.3390/app151910825

AMA Style

Zhang Z, Wu W, Yuan T, Feng G. Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis. Applied Sciences. 2025; 15(19):10825. https://doi.org/10.3390/app151910825

Chicago/Turabian Style

Zhang, Zhenwei, Wenyan Wu, Tao Yuan, and Guang Feng. 2025. "Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis" Applied Sciences 15, no. 19: 10825. https://doi.org/10.3390/app151910825

APA Style

Zhang, Z., Wu, W., Yuan, T., & Feng, G. (2025). Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis. Applied Sciences, 15(19), 10825. https://doi.org/10.3390/app151910825

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis

Abstract

1. Introduction

2. Related Works

3. Methods

3.1. Overall Architecture

3.2. Feature Extraction and Symbol Definition

3.3. Modality Interaction Enhancement Stage

3.3.1. Modality Enhancement Module

3.3.2. Gated Fusion Module

3.4. Combined Attention Fusion Stage

3.4.1. Scale Feature Extraction Module

3.4.2. Combined Attention Fusion Module

3.5. Output and Loss

4. Experiments

4.1. Dataset Preparation

4.2. Baseline Models

4.3. Experimental Environment and Parameters

4.4. Evaluation Metrics

4.5. Results Analysis

4.6. Ablation Experiment

4.6.1. Impact of Different Modules

4.6.2. Impact of Different Modalities

4.6.3. Impact of Different Fusion Techniques

4.6.4. Impact of the Number of Different Scale Feature Extraction Layers

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI