Sentiment Analysis Based on Enhanced Feature Decoupling and Multimodal Logical Reasoning

Yang, Hua; Zhao, Ming; Qiu, Yuanhao; Li, Yuanyuan; Guo, Junying; Zhang, Ziran; Chen, Baozhou; He, Mingzhe; Hong, Yu

doi:10.3390/mti10050050

by Hua Yang, Ming Zhao *, Yuanhao Qiu, Yuanyuan Li, Junying Guo, Ziran Zhang, Baozhou Chen *, Mingzhe He and Yu Hong

School of Mathematics and Computer Science, Wuhan Polytechnic University, Wuhan 430023, China

* Authors to whom correspondence should be addressed.

Multimodal Technol. Interact. 2026, 10(5), 50; https://doi.org/10.3390/mti10050050

Submission received: 7 April 2026 / Revised: 26 April 2026 / Accepted: 30 April 2026 / Published: 3 May 2026

Abstract

Despite significant advances, multimodal sentiment analysis still faces critical challenges in modeling complex cross-modal interactions and extracting discriminative sentiment features. To address these limitations, this paper proposes a hierarchical multimodal sentiment analysis framework. Specifically, a cross-modal feature enhancement module is first introduced to capture deep correlations among textual, visual, and acoustic modalities via cross-attention mechanisms, thereby obtaining context-aware fused representations. Subsequently, an attention-gated feature disentanglement approach is employed to effectively separate sentiment-relevant information from content-specific features within the fused representations; an independence loss is further imposed to enforce orthogonality between these two feature subsets, thereby mitigating noise induced by repetitive visual frames and textual stop words. Finally, all disentangled features are integrated to facilitate high-level sentiment reasoning through a multimodal logical inference module, where supervised contrastive loss is incorporated to enhance the discriminability of sentiment expressions. Extensive experiments conducted on two public benchmarks, CMU-MOSI and CMU-MOSEI, demonstrate that the proposed framework achieves improvements of 2–6% across multiple evaluation metrics compared with state-of-the-art methods.

Keywords: sentiment analysis; multimodal learning; feature disentanglement; logical inference; attention mechanism

1. Introduction

In recent years, multimodal sentiment analysis (MSA) has emerged as an interdisciplinary research direction integrating natural language processing, computer vision, and speech processing, primarily focusing on the automatic recognition of sentiment from textual, visual, and acoustic modalities [1,2]. Given the inherently multimodal nature of human emotional expression, MSA has garnered substantial attention from both academia and industry [3], demonstrating significant application value and promising prospects in artificial intelligence, affective computing, and human–computer interaction [4].

Current research methodologies can be broadly categorized into traditional fusion approaches [5] and deep learning-based methods [6]; however, several critical limitations persist. Traditional fusion methods primarily rely on simple feature-level or decision-level fusion strategies, lacking effective modeling capabilities for complex cross-modal interactions and failing to fully exploit complementary information among modalities. Although deep learning-based approaches have achieved remarkable progress in feature representation, they still exhibit deficiencies in capturing fine-grained sentiment features, addressing modal imbalance, and mitigating information redundancy [7]. Furthermore, existing methods generally lack effective logical reasoning mechanisms, resulting in limited capabilities for contextual understanding and inference of complex emotional expressions [8], which constrains their performance in practical application scenarios.

To address these challenges, this paper proposes a novel hierarchical framework based on enhanced feature disentanglement and multimodal logical reasoning. Specifically, a cross-modal feature enhancement module is first introduced to facilitate deep information sharing and mutual learning across different modalities via cross-attention mechanisms, generating context-aware enhanced modality features. Building upon this foundation, an attention-gated feature disentanglement module is designed to address feature purity concerns, effectively separating pure sentiment component features from content component features within the enhanced representations; an independence loss is further incorporated to ensure statistical orthogonality between these two feature subsets. Subsequently, all disentangled modality-specific sentiment and content features are fed into a core multimodal logical reasoning module, which leverages powerful self-attention mechanisms to capture high-order logical relationships among the disentangled features from all modalities, forming a unified global reasoning representation. Finally, supervised contrastive loss is employed to strengthen the inferred sentiment features, enhancing their discriminative capability to directly guide sentiment classification tasks. The main contributions of this work are summarized as follows:

(1): An innovative cross-modal feature enhancement method is proposed, which leverages cross-attention mechanisms to facilitate information complementarity among different modalities, thereby effectively enriching context-aware features required for subsequent processing.
(2): An attention-gated feature disentanglement module is developed to separate sentiment and content information within fused representations; an independence loss is introduced to further enhance the purity of sentiment features and improve model interpretability.
(3): A multimodal logical reasoning module is presented, which employs a Transformer Encoder to perform deep reasoning over disentangled multi-source features, generating a global sentiment-discriminative representation.
(4): Comprehensive experiments are conducted on two public benchmarks, CMU-MOSI and CMU-MOSEI, and the experimental results demonstrate that the proposed model outperforms existing baseline approaches, validating the effectiveness of the proposed methodology.

2. Related Work

Multimodal sentiment analysis methods primarily focus on how to efficiently fuse information derived from heterogeneous modalities. Zadeh et al. [9] concatenated features from individual modalities at the feature level and subsequently fed them into a unified model. Li et al. [10] trained modality-specific models separately and performed integration at the decision level. Baltrušaitis et al. [11] combined the advantages of both approaches to obtain complementary and joint representations through multimodal fusion, thereby enhancing sentiment analysis performance.

In the multimodal domain, although the application of disentangled representation learning remains relatively limited, its potential is substantial [12]. By decomposing the latent representations of data into mutually independent factors with explicit semantic interpretations, both model interpretability and generalization capability can be significantly improved. Han et al. [13] fused and separated pairwise modality representations to exploit both independence and correlation among modalities. Wu et al. [14] proposed a fine-grained video multimodal fusion denoising bottleneck model capable of eliminating noisy and redundant information while capturing salient features from audio and textual inputs. Zhang et al. [15] introduced an adaptive language-guided multimodal transformer that obtains complementary and joint representations through multimodal fusion, consequently improving sentiment analysis performance.

In contrast to the aforementioned approaches, this work integrates cross-modal feature enhancement with attention-gated feature disentanglement to ensure the purity of sentiment features while simultaneously enhancing model robustness against non-sentiment interference. Building upon this foundation, a multimodal logical reasoning module based on the Transformer Encoder is introduced to capture high-order logical relationships, supplemented by supervised contrastive loss to strengthen sentiment representations. This hierarchical optimization achieves an end-to-end solution spanning from feature enhancement and disentanglement to logical reasoning, providing a more comprehensive and interpretable approach to multimodal sentiment analysis.

3. Methods

In this section, the proposed multimodal sentiment analysis framework is elaborated in detail. As illustrated in Figure 1, this framework adopts a hierarchical architecture that achieves precise multimodal sentiment recognition through refined feature processing, deep cross-modal interaction, and high-order logical reasoning. The overall framework comprises four core components: a cross-modal feature enhancement module, an attention-gated feature disentanglement module, a multimodal logical reasoning module, and a sentiment classification module. The design of each module is presented in the following subsections.

3.1. Cross-Modal Feature Enhancement Module

This module consists of three independent enhancement units, one for each of the three modalities: text, vision, and audio. The core of each enhancement unit is the multi-head cross-attention mechanism [16]. Taking the textual modality as an example, the enhancement unit uses the textual features as the Query (Q), while the features of another modality (visual or acoustic) serve as both the Key (K) and the Value (V) [17]. Through this mechanism, the model can dynamically attend to and integrate relevant information from the visual and acoustic modalities while updating the textual features. The same process is applied to the enhancement of the visual and acoustic modalities.

For each modality $m \in \{t, v, a\}$, its initial feature representation is denoted as $F_{m} \in \mathbb{R}^{N \times L_{m} \times E_{m}}$, where $N$ represents the batch size, $L_{m}$ denotes the sequence length of modality $m$, and $E_{m}$ indicates the feature dimension of modality $m$. To ensure compatibility in cross-attention computation, features from all modalities are uniformly projected to the same embedding dimension $E$ prior to entering the enhancement module.

In the attention mechanism, we define the query matrix $Q \in \mathbb{R}^{N \times L_{Q} \times d_{K}}$, key matrix $K \in \mathbb{R}^{N \times L_{K} \times d_{K}}$, and value matrix $V \in \mathbb{R}^{N \times L_{K} \times d_{V}}$. Its output is given by Equation (1):

$$\alpha(Q, K, V) = \sigma\left(\frac{Q K^{T}}{\sqrt{d_{K}}}\right) V \tag{1}$$

where $\sigma(\cdot)$ denotes the softmax function applied over the key dimension, $d_{K}$ denotes the dimension of the query and key vectors, and $\sqrt{d_{K}}$ is a scaling factor used to prevent the attention scores from becoming excessively large, which would lead to vanishing or exploding gradients [18].
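As a concrete illustration of Equation (1), the following NumPy sketch computes scaled dot-product attention for batched inputs. The function names, tensor sizes, and random inputs are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(Q K^T / sqrt(d_K)) V for batched inputs.

    Q: (N, L_Q, d_K), K: (N, L_K, d_K), V: (N, L_K, d_V).
    Returns the attended values (N, L_Q, d_V) and the weights (N, L_Q, L_K).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


# Toy shapes (assumed): 2 samples, 5 query steps, 7 key/value steps.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 8))
K = rng.normal(size=(2, 7, 8))
V = rng.normal(size=(2, 7, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, so every query position receives a convex combination of the value vectors.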

To enhance the model’s ability to focus on information from different modalities, we adopt a multi-head attention mechanism. Each head applies its own linear transformations to the query, key, and value and then performs the attention calculation. Finally, the outputs of all heads are concatenated and linearly projected, as shown in Equations (2) and (3):

$$\mathrm{head}^{(h)} = \alpha^{(h)}\left(V^{(h)}\right) \in \mathbb{R}^{n \times k} \tag{2}$$

$$U^{'} = \left[\mathrm{head}^{(1)} \oplus \mathrm{head}^{(2)} \oplus \dots \oplus \mathrm{head}^{(H)}\right] W^{O} \tag{3}$$

where $W^{O} \in \mathbb{R}^{kH \times d}$ denotes the output weight matrix, and $H$ is the number of attention heads. Taking text enhancement as an example, the initial text feature is denoted as $F_{t}$. To fully integrate information from other modalities, the text feature first interacts with the visual feature to generate an intermediate enhanced feature $F_{t}^{'}$; it then interacts with the acoustic feature, finally outputting the enhanced text feature $Z_{t}$. The calculation formulas are given as follows:

$$F_{t}^{'} = \mathrm{LayerNorm}\left(F_{t} + U_{t}\right) \tag{4}$$

$$Z_{t} = \mathrm{LayerNorm}\left(F_{t}^{'} + U_{F_{t}^{'}}\right) \tag{5}$$

The visual and acoustic enhancement units adopt the same dual cross-attention paradigm, outputting the enhanced visual feature $Z_{v}$ and acoustic feature $Z_{a}$, respectively.
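The two-stage text enhancement of Equations (2)–(5) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not state: all modalities are already projected to a shared width `E` divisible by the number of heads, $U_{t}$ and $U_{F_{t}^{'}}$ are read as multi-head cross-attention outputs, the weights are random rather than trained, and LayerNorm omits its learnable scale and shift. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (learnable scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def multi_head_cross_attention(query, context, params, num_heads):
    """Per-head attention, concatenation, and output projection (Eqs. 2-3)."""
    N, Lq, E = query.shape
    k = E // num_heads  # per-head width

    def split(x):
        # (N, L, E) -> (N, H, L, k)
        return x.reshape(x.shape[0], x.shape[1], num_heads, k).transpose(0, 2, 1, 3)

    Q = split(query @ params["Wq"])
    K = split(context @ params["Wk"])
    V = split(context @ params["Wv"])
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(k)
    heads = softmax(scores) @ V                            # (N, H, Lq, k)
    concat = heads.transpose(0, 2, 1, 3).reshape(N, Lq, E)
    return concat @ params["Wo"]


def init_params(rng, E):
    # Random stand-ins for trained projection matrices.
    return {n: rng.normal(scale=E ** -0.5, size=(E, E)) for n in ("Wq", "Wk", "Wv", "Wo")}


def enhance_text(F_t, F_v, F_a, p_tv, p_ta, num_heads=4):
    U_t = multi_head_cross_attention(F_t, F_v, p_tv, num_heads)          # text attends to vision
    F_t_prime = layer_norm(F_t + U_t)                                    # Eq. (4)
    U_ft = multi_head_cross_attention(F_t_prime, F_a, p_ta, num_heads)   # then to audio
    return layer_norm(F_t_prime + U_ft)                                  # Eq. (5)


rng = np.random.default_rng(0)
E = 32
F_t = rng.normal(size=(2, 10, E))
F_v = rng.normal(size=(2, 20, E))
F_a = rng.normal(size=(2, 15, E))
Z_t = enhance_text(F_t, F_v, F_a, init_params(rng, E), init_params(rng, E))
```

The enhanced text feature keeps the text sequence length, since the text always supplies the queries; only the keys and values change between the two stages.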

3.2. Feature Decoupling Module

This module consists of three parallel decoupling units, corresponding to the text, visual, and acoustic modalities, respectively. At the core of each decoupling unit is an attention decoupler based on attention gating mechanisms.

As illustrated in Figure 2, for the enhanced feature $Z_{m}$ of each modality (i.e., the output from the previous cross-modal feature enhancement module), the decoupler simultaneously generates two sets of attention gating weights [19]: one for guiding the extraction of emotional features, and the other for guiding the extraction of content features. These two gating weights are generated by applying independent linear transformations and Sigmoid activation functions to $Z_{m}$, ensuring their values lie within the range $(0, 1)$ to enable soft selection. Specifically, for modality $m$, the enhanced feature $Z_{m} \in \mathbb{R}^{N \times L_{m} \times E_{m}}$ serves as the input. Each decoupling unit maintains two independent linear projection layers internally, which are responsible for calculating and generating the emotional gating weights and content gating weights.

For modality $m$, the enhanced feature $Z_{m}$ is decoupled into an emotional feature $E_{m}$ and a content feature $C_{m}$ through the following approach. First, we generate gating vectors for the emotional and content components separately. This is achieved by applying linear transformations and Sigmoid activation functions to $Z_{m}$:

$$G_{m}^{E} = \sigma\left(Z_{m} W_{m}^{GE} + b_{m}^{GE}\right) \tag{6}$$

$$G_{m}^{C} = \sigma\left(Z_{m} W_{m}^{GC} + b_{m}^{GC}\right) \tag{7}$$

where $W_{m}^{GE}$, $b_{m}^{GE}$, $W_{m}^{GC}$, and $b_{m}^{GC}$ are learnable weight matrices and bias vectors for generating the emotional gate $G_{m}^{E}$ and content gate $G_{m}^{C}$, respectively. Subsequently, we perform element-wise multiplication between these gating vectors and the original enhanced feature $Z_{m}$ to extract the corresponding decoupled features:

$$E_{m} = Z_{m} \odot G_{m}^{E} \tag{8}$$

$$C_{m} = Z_{m} \odot G_{m}^{C} \tag{9}$$

To enable the subsequent logical reasoning module to process fixed-length features, we apply a pooling operation to these decoupled sequential features, converting them into fixed-size vector representations. In this model, we adopt the first token of the sequence as the pooling strategy to aggregate the decoupled sequential features of each modality into a single vector [20]. After pooling, we obtain six fixed-dimensional vectors. These features serve as the output of this module and are passed to the subsequent logical reasoning module and loss function computation.
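The gating and pooling steps of Equations (6)–(9) can be sketched for a single modality as follows. The weights below are random placeholders rather than trained parameters, and the function and variable names are hypothetical; only the sigmoid gating, element-wise product, and first-token pooling follow the description above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decouple(Z_m, W_ge, b_ge, W_gc, b_gc):
    """Split an enhanced feature (N, L, E) into pooled emotion/content vectors."""
    G_e = sigmoid(Z_m @ W_ge + b_ge)   # Eq. (6): emotional gate in (0, 1)
    G_c = sigmoid(Z_m @ W_gc + b_gc)   # Eq. (7): content gate in (0, 1)
    E_m = Z_m * G_e                    # Eq. (8): element-wise soft selection
    C_m = Z_m * G_c                    # Eq. (9)
    # First-token pooling: keep position 0 of each sequence.
    return E_m[:, 0, :], C_m[:, 0, :], G_e, G_c


rng = np.random.default_rng(0)
N, L, E = 4, 12, 16                    # toy sizes (assumed)
Z_m = rng.normal(size=(N, L, E))
W_ge = rng.normal(scale=E ** -0.5, size=(E, E))
W_gc = rng.normal(scale=E ** -0.5, size=(E, E))
b_ge = np.zeros(E)
b_gc = np.zeros(E)
e_vec, c_vec, G_e, G_c = decouple(Z_m, W_ge, b_ge, W_gc, b_gc)
```

Running this once per modality yields the six pooled vectors (an emotional and a content vector for text, audio, and vision) that the reasoning module consumes.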

3.3. Logical Reasoning Module

In multimodal sentiment analysis tasks, judgments based on a single modality are often susceptible to being misled by superficial information. Furthermore, simply concatenating features from multiple modalities also fails to fully exploit the deep correlations and inferential relationships between them [21,22]. As illustrated in Figure 3, the text “You are really a genius” literally conveys positive sentiment, but when paired with the speaker’s sarcastic expression and impatient speech rate, the actual sentiment becomes negative sarcasm. Similarly, for the utterance “I’m really fine”, the text appears calm on the surface, yet when combined with a painful facial expression and a trembling, downcast tone of voice, the true sentiment tends to be negative.

To this end, this paper designs a logical reasoning module based on the Transformer Encoder. The pooled emotional and content features of each modality are concatenated with a learnable CLS token, and then fed into a multi-layer Transformer Encoder. Through global modeling and reasoning, the output of the CLS token is finally used as the global sentiment discriminative feature.

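The concatenated sequence is:

$$X = [C, T_{e}, T_{c}, A_{e}, A_{c}, V_{e}, V_{c}] \tag{10}$$

where $C$ is the CLS token, and the subscripts $e$ and $c$ mark the pooled emotional and content features of the text ($T$), acoustic ($A$), and visual ($V$) modalities.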

We use the self-attention mechanism in Equation (1) to globally model and reason about the complex relationships between different modalities and their components. Here, $Q$, $K$, and $V$ denote the query, key, and value, respectively, and $d_{K}$ is the feature dimension. Each layer of the Transformer Encoder consists of multi-head self-attention and a feed-forward neural network, equipped with residual connections and layer normalization operations to ensure the effective flow of information and sufficient fusion of features.

$$X^{'} = \mathrm{LayerNorm}\left(X + \mathrm{MHSA}(X)\right) \tag{11}$$

$$X^{''} = \mathrm{LayerNorm}\left(X^{'} + \mathrm{FFN}(X^{'})\right) \tag{12}$$

where $\mathrm{MHSA}(\cdot)$ denotes the multi-head self-attention sublayer, $\mathrm{FFN}(\cdot)$ denotes the feed-forward network, and $X^{''}$ is the output of the layer.

After passing through $L$ layers of the Transformer Encoder, the shape of the output sequence remains $[B, 7, D]$. Finally, we take the first token of the output sequence from the Transformer Encoder (i.e., the CLS token) as the global reasoning feature. This feature integrates information from all modalities and their components, serving as the input for the final sentiment classification.
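A compact NumPy sketch of this reasoning step is given below. To stay short it makes simplifying assumptions that are not part of the paper: single-head self-attention stands in for the multi-head version, LayerNorm has no learnable parameters, the feed-forward width is four times the model width, and all weights are random. Names such as `encoder_layer` and `reason` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def encoder_layer(X, p):
    d = X.shape[-1]
    Q, K, V = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d)) @ V @ p["Wo"]
    X1 = layer_norm(X + attn)                                   # residual + norm (Eq. 11)
    ffn = np.maximum(X1 @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]
    return layer_norm(X1 + ffn)                                 # residual + norm (Eq. 12)


def init_layer(rng, D):
    s = D ** -0.5
    return {
        "Wq": rng.normal(scale=s, size=(D, D)), "Wk": rng.normal(scale=s, size=(D, D)),
        "Wv": rng.normal(scale=s, size=(D, D)), "Wo": rng.normal(scale=s, size=(D, D)),
        "W1": rng.normal(scale=s, size=(D, 4 * D)), "b1": np.zeros(4 * D),
        "W2": rng.normal(scale=s, size=(4 * D, D)), "b2": np.zeros(D),
    }


def reason(pooled, cls_token, layers):
    """pooled: six (B, D) vectors ordered [T_e, T_c, A_e, A_c, V_e, V_c] (Eq. 10)."""
    B = pooled[0].shape[0]
    cls = np.broadcast_to(cls_token, (B, 1, cls_token.shape[-1]))
    X = np.concatenate([cls] + [f[:, None, :] for f in pooled], axis=1)  # (B, 7, D)
    for p in layers:
        X = encoder_layer(X, p)
    return X[:, 0, :], X  # CLS output is the global reasoning feature


rng = np.random.default_rng(0)
B, D = 3, 16
pooled = [rng.normal(size=(B, D)) for _ in range(6)]
cls_token = rng.normal(size=(D,))
layers = [init_layer(rng, D) for _ in range(2)]
global_feat, X_out = reason(pooled, cls_token, layers)
```

Because every token attends to every other token, the CLS position can draw on all six emotional and content vectors at once, which is what allows conflicting cues between modalities to influence the final feature.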

3.4. Loss Function

In multimodal sentiment analysis tasks, relying on a single modality or simple feature concatenation is insufficient. To address this, this paper introduces two additional loss functions on top of the traditional cross-entropy classification loss: the Supervised Contrastive Loss (SC) for emotional components, and the independence loss for emotional and content components. This enables more fine-grained constraint and optimization of multimodal features [23,24].

First, to enhance the discriminability of emotional features, we adopt the Supervised Contrastive Loss. For the pooled emotional component features of each sample, samples of the same category are encouraged to be as close as possible in the feature space, while samples of different categories are pushed as far apart as possible.

$$L_{SC} = \sum_{i \in B} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(z_{i} \cdot z_{p} / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_{i} \cdot z_{a} / \tau\right)} \tag{13}$$

where $P(i)$ denotes the set of positive samples belonging to the same category as sample $i$, $A(i)$ denotes the set of all samples except $i$ itself, and $\tau$ is the temperature coefficient. For the sample set $B$ in a batch, the distance between each anchor feature $z_{i}$ and its positive sample $z_{p}$ of the same category is minimized, while the distance between $z_{i}$ and its negative sample $z_{n}$ of a different category is maximized.
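Equation (13) can be written directly in NumPy, as in the sketch below. Two choices here are assumptions rather than details given in the paper: the features are L2-normalized before the dot product, and anchors with no positive sample in the batch are skipped. The toy embeddings and labels are invented for illustration.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, tau=0.07):
    """Sum over anchors of the mean negative log-probability of their positives."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # assumption: unit-norm features
    sim = z @ z.T / tau
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)                  # A(i): everything except i

    # Stable log of the denominator: log-sum-exp over A(i).
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    positives = (labels[:, None] == labels[None, :]) & not_self   # P(i)
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                                # skip anchors without positives
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_sum[has_pos] / n_pos[has_pos]).sum())


# Two tight clusters; labels either agree with the clusters or cut across them.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = supervised_contrastive_loss(z, [0, 0, 1, 1])
loss_misaligned = supervised_contrastive_loss(z, [0, 1, 0, 1])
```

When the labels agree with the geometry of the embeddings the loss is small, and it grows sharply when same-label samples lie far apart, which is the pressure that pulls same-category emotional features together during training.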

Meanwhile, to further improve the model’s interpretability and feature decoupling ability, we introduce an independence loss between emotional and content components. By minimizing the covariance between emotional features and content features, we encourage the model to learn mutually independent feature representations.

$$L_{indep} = \frac{1}{D} \sum_{d=1}^{D} \mathrm{Cov}\left(Z_{e}^{(d)}, Z_{c}^{(d)}\right)^{2} \tag{14}$$
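A minimal sketch of Equation (14) follows, assuming the covariance is estimated across the batch separately for each of the $D$ feature dimensions and that the unbiased ($N-1$) estimator is used; the paper does not specify either detail. The function name and toy data are hypothetical.

```python
import numpy as np


def independence_loss(Z_e, Z_c):
    """Mean squared per-dimension covariance between (N, D) emotion/content features."""
    Ze = Z_e - Z_e.mean(axis=0, keepdims=True)   # center over the batch
    Zc = Z_c - Z_c.mean(axis=0, keepdims=True)
    cov = (Ze * Zc).sum(axis=0) / (Z_e.shape[0] - 1)   # one covariance per dimension d
    return float(np.mean(cov ** 2))


rng = np.random.default_rng(0)
Z = rng.normal(size=(256, 8))
loss_identical = independence_loss(Z, Z)                            # fully entangled
loss_unrelated = independence_loss(Z, rng.normal(size=(256, 8)))    # unrelated features
loss_constant = independence_loss(Z, np.ones_like(Z))               # constant content
```

Note that zero covariance only rules out linear dependence within each dimension, so this penalty encourages decorrelation rather than full statistical independence.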

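Finally, the main task of the model is sentiment classification, which is optimized using the standard cross-entropy loss $L_{CE}$. The overall training objective combines the three losses:

$$L = L_{CE} + \lambda_{1} L_{SC} + \lambda_{2} L_{indep} \tag{15}$$

where $\lambda_{1}$ and $\lambda_{2}$ are weighting coefficients that balance the contrastive and independence terms.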

Through the joint optimization of multiple loss functions, we not only enhance the discriminability and robustness of emotional features but also strengthen the independence between emotional and content features.

4. Experiments and Results

4.1. Dataset Description

This paper evaluates the performance of our model using two publicly available multimodal sentiment analysis datasets: CMU-MOSI [25] and CMU-MOSEI [26] (Table 1).

The CMU-MOSI dataset consists of 2199 short video clips covering the visual, audio, and language modalities. CMU-MOSEI contains 22,852 annotated video segments (utterances) from 1000 distinct speakers collected from YouTube, spanning 250 topics. Each segment is annotated with a sentiment intensity score ranging from −3 to +3 and with six basic emotion categories, and binary sentiment polarity labels are also provided.

4.2. Experimental Setup

To ensure a fair comparison with the baseline methods, we adopt accuracy (Acc) and F1-score (F1) for evaluation. Specifically, we report binary accuracy (Acc-2) and 7-class accuracy (Acc-7), defined as follows. Acc-2 measures performance on binary sentiment classification under the two standard evaluation protocols in multimodal sentiment analysis: the first, Negative vs. Non-negative, treats neutral and positive samples as a single “non-negative” class and compares them against negative samples; the second, Negative vs. Positive, excludes neutral samples and evaluates only the classification of positive versus negative sentiment. The two values reported in the Acc-2 and F1 columns of Table 2 and Table 3 correspond to these two settings, respectively. Acc-7 measures the classification accuracy on the full 7-point sentiment scale, ranging from −3 (strongly negative) to +3 (strongly positive), reflecting the model’s fine-grained sentiment intensity prediction performance.
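The two Acc-2 protocols and Acc-7 can be made concrete with a short sketch. It assumes the model outputs a continuous sentiment score in roughly [−3, 3] that is thresholded at zero for Acc-2 and rounded to the nearest integer for Acc-7, which is a common convention for these benchmarks but is not spelled out in the text. Function names and the toy scores are illustrative.

```python
import numpy as np


def acc2_negative_vs_nonnegative(pred, true):
    # Neutral (0) is grouped with positive on both sides.
    return float(np.mean((pred >= 0) == (true >= 0)))


def acc2_negative_vs_positive(pred, true):
    # Neutral ground-truth samples are excluded before scoring.
    keep = true != 0
    return float(np.mean((pred[keep] > 0) == (true[keep] > 0)))


def acc7(pred, true):
    # Round to the nearest integer class and clip to the 7-point scale.
    p = np.clip(np.round(pred), -3, 3)
    t = np.clip(np.round(true), -3, 3)
    return float(np.mean(p == t))


true = np.array([-2.0, -1.0, 0.0, 1.0, 3.0])
pred = np.array([-1.2, 0.3, 0.1, 0.8, 2.4])

a2_nonneg = acc2_negative_vs_nonnegative(pred, true)   # 4 of 5 correct -> 0.8
a2_pos = acc2_negative_vs_positive(pred, true)         # 3 of 4 correct -> 0.75
a7 = acc7(pred, true)                                  # 2 of 5 correct -> 0.4
```

The same predictions score differently under the two binary protocols because the neutral sample is counted in the first and dropped in the second, which is why both values are reported side by side.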

The main hyperparameters of the model are configured as follows. The optimizer is Adam, with the learning rate of the main network initialized to 1 × 10⁻⁴ and the learning rate of the BERT component set to 2 × 10⁻⁵. The weight decay values are 1 × 10⁻⁵ for the main network and 1 × 10⁻² for the BERT component, respectively. The batch size is fixed at 32, and the number of training epochs is set to 20. The temperature coefficient for the contrastive loss is set to 0.07.

4.3. Evaluated Models

To comprehensively evaluate the performance of our proposed model, we compare it with several baseline methods for multimodal sentiment analysis, including:

TFN [9]: Captures interactions among unimodal, bimodal, and trimodal data by computing outer product-based multidimensional tensors.
MFN [27]: Establishes a modal interaction model by continuously modeling specific views and cross views, and summarizing their temporal variations using a multi-view gating mechanism.
MulT [28]: Leverages bidirectional cross-modal attention to focus on interactions between multimodal sequences across different time steps, potentially adjusting information flow from one modality to another.
MISA [29]: Projects each modality into two subspaces (modality-invariant and modality-specific) to learn commonalities and characteristics across modalities. Its loss function includes distribution similarity, orthogonality loss, reconstruction loss, and task prediction loss.
Graph [30]: Models unaligned multimodal sequences using graph-based neural models and Capsule Network. Converting sequential data into graphs avoids the gradient exploding or vanishing issues of RNNs.
MTSA [31]: Performs sentiment prediction by translating video and audio modalities into the text modality.
TETFN [32]: Enhances the text modality through interactions among the three modalities to more accurately extract emotional information from non-text modalities.
TMRN [33]: Employs a text-oriented multimodal fusion network, which prioritizes the text modality and obtains higher-quality modality representations by strengthening interactions with the audio and visual modalities.
FRFDIN [34]: Utilizes dynamic routing technology to realize intra-modal feature interaction and learn the inherent information of single modalities. It also learns consistent multimodal information through cross-modal interactions.
CRNet [35]: Projects features from different modalities into modality-invariant and modality-specific subspaces, and improves the quality of features in these two subspaces via a gradient-based feature enhancement mechanism, thereby enhancing the accuracy of multimodal sentiment analysis.
TSPMG [36]: Uses two-stage primary-modality supervision to improve multimodal sentiment analysis by addressing modality heterogeneity.

The results of baseline methods are taken from their original papers or existing benchmark studies, ensuring that all models are evaluated under the same experimental protocols for a fair comparison.

4.4. Results

It can be observed from the experimental results that our proposed method achieves the best performance across all evaluation metrics.

On the CMU-MOSI dataset (Table 2), the MAE is reduced to 0.709, an improvement over the other models: it outperforms the state-of-the-art model TMRN by 0.005 and the latest TSPMG by 0.003. The Corr is improved by 0.002 compared with CRNet. For Acc-2 and F1-score, consistent improvements are observed across both evaluation settings: under the “Negative vs. Non-negative” protocol, our method achieves an Acc-2 of 87.6% and an F1-score of 87.5%; under the “Negative vs. Positive” protocol, it reaches 89.6% Acc-2 and 89.7% F1-score, achieving a 2.51% improvement.

On the CMU-MOSEI dataset (Table 3), the MAE of the proposed model (EFDMR) reaches 0.522, a relative reduction of 2.4% compared with the strongest baseline, FRFDIN. The Corr is improved to 0.794, which is 0.020 higher than that of MTSA (0.774). This indicates that our model can more accurately capture continuous emotional intensity and achieve better alignment with human annotations.

4.5. Ablation Experiment

To verify the contribution of each module of the proposed method, ablation experiments were performed on the CMU-MOSI and CMU-MOSEI datasets.

As can be observed from the ablation study results in Table 4, the complete EFDMR model achieves the lowest MAE and the highest Corr on both the MOSI and MOSEI datasets. Removing the decoupling module leads to a significant increase in MAE by 0.12, indicating that the decoupling module is the most critical component. The Corr is significantly improved to 0.898, which demonstrates that emotion/content aliasing is the most detrimental factor to performance. Removing the cross-modal enhancement also results in a substantial performance degradation, highlighting the importance of cross-modal verification. The degradation is moderate when the reasoning module is omitted, while contrastive learning and independence loss contribute stable but relatively minor gains. Overall, the decoupling module appears to be the most important component.

5. Discussion

We investigate two principal aspects of our findings: whether feature decoupling effectively separates emotional from content information, and how much the logical reasoning module contributes to performance. We address these aspects in the following subsections.

5.1. Feature Decoupling Verification

To verify the effectiveness of the feature decoupling module, we conduct Pearson correlation analysis on the emotional features $(e_{v}, e_{a}, e_{t})$ and content features $(c_{v}, c_{a}, c_{t})$ output by the module, and the results are illustrated in Figure 4.

The correlation coefficients between emotional and content features within the same modality are generally low, where the values of $e_{v}$ vs. $c_{v}$, $e_{a}$ vs. $c_{a}$, and $e_{t}$ vs. $c_{t}$ are 0.06, 0.08, and 0.11, respectively. This phenomenon demonstrates that emotional information is effectively separated from content noise, and the purity of emotional features is well guaranteed. Meanwhile, reasonable correlations exist among cross-modal emotional features (e.g., the correlation coefficients of $e_{v}$ with $e_{a}$ and $e_{t}$ are 0.28 and 0.34) and among cross-modal content features (e.g., the correlation coefficients of $c_{v}$ with $c_{a}$ and $c_{t}$ are 0.23 and 0.33), indicating that the model captures shared emotional and content information across different modalities. In addition, the overall correlation between emotional features and content features of other modalities remains at a low level, which further verifies that the decoupling module avoids the aliasing of emotional and content information in the global feature space. The above results show that the proposed feature decoupling module realizes effective separation between emotional and content features. It provides low-redundancy and high-purity feature inputs for the subsequent logical reasoning module, and serves as a critical guarantee for the performance improvement of our model.
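A correlation analysis of this kind can be sketched with `numpy.corrcoef`. The paper does not say how each feature matrix is reduced to a single correlation value, so the sketch assumes each pooled feature of shape (N, D) is flattened into one long vector before computing pairwise Pearson coefficients. The synthetic features below are invented purely to exercise the function and do not reproduce the reported numbers.

```python
import numpy as np


def feature_correlation_matrix(features):
    """Pairwise Pearson correlations between named (N, D) feature matrices."""
    names = list(features)
    flat = np.stack([features[n].ravel() for n in names])   # (num_features, N * D)
    return names, np.corrcoef(flat)                         # rows are variables


rng = np.random.default_rng(0)
shared = rng.normal(size=(64, 16))   # toy signal shared by the emotional features
features = {
    "e_v": shared + rng.normal(size=(64, 16)),
    "e_a": shared + rng.normal(size=(64, 16)),
    "e_t": shared + rng.normal(size=(64, 16)),
    "c_v": rng.normal(size=(64, 16)),
    "c_a": rng.normal(size=(64, 16)),
    "c_t": rng.normal(size=(64, 16)),
}
names, corr = feature_correlation_matrix(features)
```

In this toy setup the emotional features correlate with one another through the shared signal while staying nearly uncorrelated with the content features, which is the qualitative pattern the analysis looks for.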

5.2. Logical Reasoning Verification

To validate the effectiveness of the logical reasoning module, we construct a baseline model (w/o Reasoning) by removing this module, which directly performs sentiment classification on the decoupled emotional and content features. Comparative experiments are conducted between the baseline and the complete proposed model (Ours), and the quantitative results are presented in Table 5.

Under all modality-missing scenarios, our model achieves stable improvements in terms of Acc-2 and F1-score. In the text-missing (T) scenario, Acc-2 is increased by 2.81% and F1-score is improved by 1.81%. In the vision-missing (V) scenario, Acc-2 and F1-score gain increments of 1.91% and 1.37%, respectively. For the audio-missing (A) scenario, the improvements reach 0.54% in Acc-2 and 0.82% in F1-score. These results demonstrate that the logical reasoning module can effectively capture the deep correlations and reasoning relationships among emotional and content features across different modalities. It alleviates the prediction deviation caused by modality inconsistency and missing information, and enhances the robustness and overall performance in multimodal sentiment analysis. The above analyses support the necessity and effectiveness of the designed reasoning module.

To further analyze the effectiveness of our enhanced feature decoupling and multimodal logical reasoning framework, we visualize the multimodal feature distributions on the CMU-MOSI test set using t-SNE [37]. Figure 5 and Figure 6 present the feature distributions before and after the proposed modeling, respectively.

As shown in Figure 5, the original multimodal features exhibit significant overlap and mixing across sentiment categories (negative, neutral, and positive). Positive and negative samples are not well-separated, and neutral samples are scattered among them, indicating that raw multimodal features lack sufficient discriminative power and are prone to confounding effects across modalities.

In contrast, the features obtained after our proposed modeling (Figure 6) show a substantially improved separation pattern. Samples corresponding to different sentiment polarities form distinct, compact clusters with clear boundaries: negative samples are predominantly clustered on the left, positive samples on the right, and neutral samples are gathered in a separate region. This demonstrates that our enhanced feature decoupling mechanism effectively disentangles modality-specific noise and irrelevant information, while the multimodal logical reasoning module strengthens the discriminative representation of sentiment-related features. The improved inter-class separability directly explains the performance gains observed in our sentiment classification experiments, confirming that the learned features are more discriminative and aligned with sentiment labels.

Notably, the clearer separation also implies that our method alleviates the problem of modality interference, enabling the model to better capture the core sentiment cues from heterogeneous multimodal inputs. This visualization validates the effectiveness of our proposed framework in learning discriminative multimodal representations, which is critical for robust sentiment analysis in real-world scenarios.

6. Conclusions

This paper proposes a sentiment analysis framework based on enhanced feature decoupling and multimodal logical reasoning, which achieves accurate recognition of multimodal emotions through a hierarchical optimization design. The framework innovatively integrates three core modules: cross-modal feature enhancement, attention-gated feature decoupling, and high-order logical reasoning, forming an end-to-end solution. On two public datasets, the proposed method outperforms the state-of-the-art baseline methods in all evaluation metrics. Further ablation experiments verify the significant contribution of each module to the overall performance. Future research directions will explore more complex emotional expression patterns [38] and verify the generalization ability of the method on larger and more diverse datasets [39,40].

Author Contributions

Conceptualization, H.Y. and M.Z.; methodology, H.Y., M.Z. and Y.Q.; software, H.Y.; validation, M.Z. and Y.Q.; formal analysis, H.Y., M.Z., J.G. and Z.Z.; investigation, H.Y. and M.Z.; resources, H.Y.; data curation, H.Y.; writing—original draft preparation, H.Y., B.C. and Y.H.; writing—review and editing, M.Z., Y.Q., Y.L. and M.H.; visualization, H.Y., M.Z. and Y.Q.; supervision, M.Z.; project administration, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number U1833119; the Natural Science Foundation of Hubei Province, grant number 2025AFC122; the Key Scientific and Technological Research Project of Henan Provincial Department of Science and Technology, grant number 252102210130; the Humanities and Social Sciences Research Fund of the Ministry of Education, grant number 22YJAZH038; and the Industry–University Cooperation Educational Project of the Ministry of Education, grant number 231106627155856.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding authors upon reasonable request and with appropriate justification.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Das, R.; Singh, T.D. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Comput. Surv. 2023, 55, 1–38.
2. Yang, X.; Feng, S.; Wang, D.; Zhang, Y.; Poria, S. Few-shot multimodal sentiment analysis based on multimodal probabilistic fusion prompts. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 6045–6053.
3. Zhu, Q.; Jiang, F.; Li, C. Time-varying interval prediction and decision-making for short-term wind power using convolutional gated recurrent unit and multi-objective elephant clan optimization. Energy 2023, 271, 127006.
4. Pei, G.; Li, H.; Lu, Y.; Wang, Y.; Hua, S.; Li, T. Affective computing: Recent advances, challenges, and future trends. Intell. Comput. 2024, 3, 0076.
5. Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: A comparative analysis. Sensors 2023, 23, 2381.
6. Jiang, F.; Zhu, Q.; Yang, J.; Chen, G.; Tian, T. Clustering-based interval prediction of electric load using multi-objective pathfinder algorithm and Elman neural network. Appl. Soft Comput. 2022, 129, 109602.
7. Zhang, Q.; Wei, Y.; Han, Z.; Fu, H.; Peng, X.; Deng, C.; Hu, Q.; Xu, C.; Wen, J.; Hu, D.; et al. Multimodal fusion on low-quality data: A comprehensive survey. arXiv 2024, arXiv:2404.18947.
8. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
9. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250.
10. Li, T.; Zhang, L.; Liu, S.; Shen, S. Multi-modal integrated prediction and decision-making with adaptive interaction modality explorations. arXiv 2024, arXiv:2408.13742.
11. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
12. Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Liu, C.; Xiong, Z.; Zhu, X.X. Decoupling common and unique representations for multimodal self-supervised learning. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 286–303.
13. Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; ACL: Stroudsburg, PA, USA, 2021.
14. Wu, S.X.; Dai, D.M.; Qin, Z.W.; Liu, T.Y.; Lin, B.; Cao, Y.B.; Sui, Z.F. Denoising bottleneck with mutual information maximization for video multimodal fusion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; ACL: Stroudsburg, PA, USA, 2023; pp. 756–767.
15. Zhang, H.; Wang, Y.; Yin, G.; Liu, K.; Liu, Y.; Yu, T. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2023; pp. 2231–2243.
16. Yang, H.; Zhang, S.; Shen, H.; Zhang, G.; Deng, X.; Xiong, J.; Feng, L.; Wang, J.; Zhang, H.; Sheng, S. A multi-layer feature fusion model based on convolution and attention mechanisms for text classification. Appl. Sci. 2023, 13, 8550.
17. Wu, Q.; Wang, M.; Zhou, G.; Ji, W. A study of progressive data flow knowledge tracing based on reconstructed attention mechanism. Neural Comput. Appl. 2025, 37, 7675–7689.
18. Singh, P.; Raman, B. Transformer Architecture: Encoder and Decoder. In The Geometry of Intelligence: Foundations of Transformer Networks in Deep Learning; Springer Nature: Singapore, 2025.
19. Lu, W.; Chen, S.B.; Shu, Q.L.; Tang, J.; Luo, B. Decouplenet: A lightweight backbone network with efficient feature decoupling for remote sensing visual tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4414613.
20. Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; Wang, Y. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 12186–12195.
21. Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42.
22. Li, W.; Li, K.; Chen, S. A multimodal fusion method for semantic segmentation of high-resolution remote sensing images. J. South-Cent. Univ. Natl. (Nat. Sci. Ed.) 2020, 39, 405–412.
23. Hu, Y.; Huang, X.; Wang, X.; Lin, H.; Zhang, R. Transformer-based adaptive contrastive learning for multimodal sentiment analysis. Multimed. Tools Appl. 2025, 84, 1385–1402.
24. Huan, R.; Zhong, G.; Chen, P.; Liang, R. Trisat: Trimodal representation learning for multimodal sentiment analysis. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 4105–4120.
25. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion video. arXiv 2016, arXiv:1606.06259.
26. Zadeh, A.A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); ACL: Stroudsburg, PA, USA, 2018; pp. 2236–2246.
27. Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2018, arXiv:1806.06176.
28. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2019; pp. 6558–6569.
29. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 1122–1131.
30. Wu, J.; Mai, S.; Hu, H. Graph capsule aggregation for unaligned multimodal sequences. In Proceedings of the 2021 International Conference on Multimodal Interaction; ACM: New York, NY, USA, 2021; pp. 521–529.
31. Yang, B.; Shao, B.; Wu, L.; Lin, X. Multimodal sentiment analysis with unidirectional modality translation. Neurocomputing 2022, 467, 130–137.
32. Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit. 2023, 136, 109259.
33. Lei, Y.; Yang, D.; Li, M.; Wang, S.; Chen, J.; Zhang, L. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences. In Proceedings of the CAAI International Conference on Artificial Intelligence; Springer Nature: Singapore, 2023; pp. 189–200.
34. Zeng, Y.; Li, Z.; Chen, Z.; Ma, H. A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Eng. Appl. Artif. Intell. 2024, 127, 107335.
35. Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-space representation interaction network for multimodal sentiment analysis. Knowl. Based Syst. 2024, 283, 111149.
36. Ma, G.; Ren, X.; Jiang, Y.; Guan, H.; Xu, B. From Feature Alignment to Multimodal Fusion: A Two-Stage Primary Modality-Guided Approach for MSA. In Proceedings of the 7th ACM International Conference on Multimedia in Asia; ACM: New York, NY, USA, 2025.
37. Wang, Q.; Xia, W.; Tao, Z.; Gao, Q.; Cao, X. Deep self-supervised t-SNE for multi-modal subspace clustering. In Proceedings of the 29th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2021; pp. 1748–1755.
38. Jia, N.; Zheng, C.; Sun, W. A multimodal emotion recognition model integrating speech, video and MoCAP. Multimed. Tools Appl. 2022, 81, 32265–32286.
39. Kumar, H.; Aruldoss, M.; Wynn, M. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technol. Interact. 2025, 9, 116.
40. Wang, Z.; Luo, Y.; Qiu, R.; Huang, Z.; Baktashmotlagh, M. Learning to diversify for single domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 834–843.

Figure 1. Overall architecture of the EFDMR model.

Figure 2. Feature Decoupling Module.

Figure 3. Inconsistent modal representation.

Figure 4. Pearson correlation analysis.

Figure 5. Original multimodal characterization.

Figure 6. Multimodal characteristics after the proposed modeling.

Table 1. Statistics of the different datasets.

Datasets	Train	Valid	Test	All
MOSI	1284	229	686	2199
MOSEI	16,326	1871	4659	22,856

Table 2. Comparison of experimental results on the CMU-MOSI dataset.

Method	MAE	Corr	Acc-2	Acc-7	F₁
TFN₂₀₁₇	0.901	0.698	-/80.8	34.9	-/80.7
MFN₂₀₁₈	0.877	0.706	-/81.6	35.40	-/81.7
MulT₂₀₁₉	0.871	0.698	-/83.0	40.0	-/82.8
MISA₂₀₂₀	0.783	0.761	81.8/83.4	42.3	81.7/83.6
Graph₂₀₂₁	0.982	0.669	80.18/-	37.46	80.27
MTSA₂₀₂₂	0.696	0.806	-/86.8	46.4	-/86.8
TETFN₂₀₂₃	0.717	0.801	84.05/86.10	-	83.83
TMRN₂₀₂₃	0.704	0.784	83.67/85.67	48.68	85.3/87.5
FRDIN₂₀₂₄	0.682	0.813	85.8/87.4	46.59	83.45/87
CRNet₂₀₂₄	0.712	0.897	-/86.4	47.40	-/86.4
TSPMG₂₀₂₅	0.712	0.799	-/85.98	46.33	-/85.96
EFDMR	0.709	0.898	87.6/89.6	53.36	87.5/89.7

Table 3. Comparison of experimental results on the CMU-MOSEI dataset.

Method	MAE	Corr	Acc-2	Acc-7	F₁
TFN₂₀₁₇	0.593	0.700	-/82.5	50.2	-/82.1
MFN₂₀₁₈	0.717	0.706	-/84.4	51.3	-/84.3
MulT₂₀₁₉	0.580	0.703	-/82.5	51.8	-/82.3
MISA₂₀₂₀	0.555	0.756	83.6/85.5	52.2	83.8/85.3
Graph₂₀₂₁	0.535	0.760	80.19/-	52.4	-/82.4
MTSA₂₀₂₂	0.541	0.774	-/85.1	52.9	-/85.3
TETFN₂₀₂₃	0.551	0.748	84.25/85.18	-	84.18/85.27
TMRN₂₀₂₃	0.535	0.762	83.39/86.19	53.65	83.67/86.08
FRDIN₂₀₂₄	0.525	0.778	83.30/86.30	54.40	83.70/86.20
CRNet₂₀₂₄	0.541	0.771	-/86.20	53.80	-/86.10
TSPMG₂₀₂₅	0.539	0.764	-/85.50	53.13	-/85.45
EFDMR	0.522	0.794	83.0/86.8	54.5	85.5/86.8
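
For readers who wish to reproduce the columns of Tables 2 and 3, the following minimal sketch computes the regression and accuracy metrics from continuous sentiment predictions (F₁ is omitted for brevity). It assumes the convention commonly used on CMU-MOSI and CMU-MOSEI, in which labels lie in [-3, 3] and paired Acc-2 entries of the form a/b denote negative versus non-negative and negative versus positive accuracy, respectively. This convention, the function names, and the toy values are assumptions that are not stated in this section.

```python
import math


def mae(preds, targets):
    """Mean absolute error between predicted and true sentiment scores."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def pearson_corr(preds, targets):
    """Pearson correlation coefficient (the Corr column)."""
    n = len(targets)
    mp, mt = sum(preds) / n, sum(targets) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(preds, targets))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    return cov / (sp * st)


def acc2_neg_nonneg(preds, targets):
    """Binary accuracy, negative (< 0) versus non-negative (>= 0)."""
    hits = sum((p >= 0) == (t >= 0) for p, t in zip(preds, targets))
    return hits / len(targets)


def acc2_neg_pos(preds, targets):
    """Binary accuracy, negative versus positive; samples labelled 0 are dropped."""
    kept = [(p, t) for p, t in zip(preds, targets) if t != 0]
    hits = sum((p > 0) == (t > 0) for p, t in kept)
    return hits / len(kept)


def acc7(preds, targets):
    """Seven-class accuracy after clipping to [-3, 3] and rounding to an integer."""
    def bucket(x):
        return round(max(-3.0, min(3.0, x)))
    hits = sum(bucket(p) == bucket(t) for p, t in zip(preds, targets))
    return hits / len(targets)


# Hypothetical predictions and labels, for illustration only.
preds = [2.1, -0.4, 0.3, -1.8, 0.2]
targets = [2.0, 0.0, 1.0, -2.0, -0.6]

print("MAE  :", mae(preds, targets))
print("Corr :", pearson_corr(preds, targets))
print("Acc-2:", acc2_neg_nonneg(preds, targets), "/", acc2_neg_pos(preds, targets))
print("Acc-7:", acc7(preds, targets))
```

Under this reading, a table entry such as "-/83.0" reports only the negative versus positive accuracy, while "81.8/83.4" reports both variants.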

Table 4. Ablation results on the CMU-MOSI and CMU-MOSEI datasets.

Models	MOSI MAE	MOSI Corr	MOSEI MAE	MOSEI Corr
w/o Decoupling	0.829	0.821	0.536	0.794
w/o cross-modal	0.755	0.874	0.519	0.768
w/o Reasoning	0.762	0.877	0.54	0.772
w/o Contrastive	0.753	0.882	0.523	0.784
w/o Loss of independence	0.781	0.879	0.429	0.778
Ours	0.709	0.898	0.512	0.794
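
The row "w/o Loss of independence" in Table 4 removes the independence loss, which encourages orthogonality between sentiment-relevant and content-specific features. One common way to express such a constraint, given here purely as an illustrative assumption rather than as the exact loss used in this paper, is the mean squared cosine similarity between paired feature vectors, which vanishes when every pair is orthogonal. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)  # small constant guards against zero norms


def independence_penalty(sentiment_feats, content_feats):
    """Mean squared cosine similarity over paired samples.

    Returns 0 when each sentiment vector is orthogonal to its content
    vector and approaches 1 when the two are parallel.
    """
    sims = [cosine(s, c) ** 2 for s, c in zip(sentiment_feats, content_feats)]
    return sum(sims) / len(sims)


# Hypothetical feature pairs: the first is orthogonal, the second is parallel.
orthogonal = independence_penalty([[1.0, 0.0]], [[0.0, 2.0]])
parallel = independence_penalty([[1.0, 1.0]], [[2.0, 2.0]])
print(orthogonal, parallel)
```

Minimizing a penalty of this kind alongside the task loss pushes the two feature subsets toward carrying non-overlapping information, which is the role the ablation attributes to the independence loss.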

Table 5. Performance comparison under different modality missing scenarios (T: text, V: visual, A: acoustic).

Model	Missing	Acc-2	F₁
w/o Reasoning	T	0.7618	0.7736
w/o Reasoning	V	0.7710	0.7688
w/o Reasoning	A	0.7873	0.7846
Ours	T	0.7899	0.7917
Ours	V	0.7901	0.7825
Ours	A	0.7927	0.7928

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
