Sentiment Analysis Based on Enhanced Feature Decoupling and Multimodal Logical Reasoning
Abstract
1. Introduction
- (1) A cross-modal feature enhancement method is proposed that uses cross-attention to let the modalities complement one another, enriching the context-aware features required by subsequent processing.
- (2) An attention-gated feature disentanglement module is developed to separate sentiment information from content information within the fused representations; an independence loss further improves the purity of the sentiment features and the interpretability of the model.
- (3) A multimodal logical reasoning module is presented that applies a Transformer encoder to the disentangled multi-source features, performing deep reasoning over them to generate a global sentiment-discriminative representation.
- (4) Comprehensive experiments on two public benchmarks, CMU-MOSI and CMU-MOSEI, show that the proposed model outperforms existing baseline approaches, validating the proposed methodology.
2. Related Work
3. Methods
3.1. Cross-Modal Feature Enhancement Module
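The Introduction describes this module as using cross-attention so that one modality can draw complementary information from another. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the single attention head, the residual connection, the feature width `d`, and every name (`cross_attention`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(target, source, w_q, w_k, w_v):
    """Enrich `target` (T_t, d) with context attended from `source` (T_s, d)."""
    q = target @ w_q                      # queries come from the target modality
    k = source @ w_k                      # keys come from the source modality
    v = source @ w_v                      # values come from the source modality
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T_t, T_s), rows sum to 1
    # Residual connection (an assumption): keep the original target features.
    return target + weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                     # hypothetical shared feature width
text = rng.normal(size=(5, d))            # 5 text steps
audio = rng.normal(size=(7, d))           # 7 audio steps; lengths need not match
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

text_enhanced, weights = cross_attention(text, audio, w_q, w_k, w_v)
print(text_enhanced.shape, weights.shape)
```

Because the queries come from one modality and the keys and values from another, the enhanced sequence keeps the target's length while each step mixes in a weighted summary of the source.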
3.2. Feature Decoupling Module
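The Introduction states that an attention gate separates sentiment from content information in the fused representation, and that an independence loss purifies the sentiment features. The sketch below shows one plausible reading. All of it is assumed for illustration: the complementary sigmoid gate, the squared-cosine form of the penalty, and the names `gated_decouple` and `independence_loss` are not taken from the paper.

```python
import numpy as np

def gated_decouple(fused, w_gate, b_gate):
    """Split fused features (N, d) into sentiment and content parts with a sigmoid gate."""
    gate = 1.0 / (1.0 + np.exp(-(fused @ w_gate + b_gate)))  # each entry lies in (0, 1)
    sentiment = gate * fused              # share routed to the sentiment branch
    content = (1.0 - gate) * fused        # remainder routed to the content branch
    return sentiment, content

def independence_loss(sentiment, content, eps=1e-8):
    """Assumed penalty: mean squared cosine similarity between paired vectors."""
    dot = np.sum(sentiment * content, axis=-1)
    norms = np.linalg.norm(sentiment, axis=-1) * np.linalg.norm(content, axis=-1)
    return float(np.mean((dot / (norms + eps)) ** 2))

rng = np.random.default_rng(0)
d = 8                                     # hypothetical feature width
fused = rng.normal(size=(4, d))           # 4 fused feature vectors
w_gate = rng.normal(scale=d ** -0.5, size=(d, d))
b_gate = np.zeros(d)

sentiment, content = gated_decouple(fused, w_gate, b_gate)
loss = independence_loss(sentiment, content)
print(f"independence loss: {loss:.4f}")
```

With complementary gates the two parts always sum back to the fused input, so no information is discarded. Under this particular penalty the loss reaches zero only when every gate entry saturates at 0 or 1, which pushes each feature dimension toward exactly one of the two branches.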
3.3. Logical Reasoning Module
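According to the Introduction, this module applies a Transformer encoder to the disentangled multi-source features and produces a global sentiment-discriminative representation. The following is a minimal sketch of a single encoder layer in NumPy. The single head, post-norm layout, layer normalization without learned scale, mean pooling, and the choice of six input tokens are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(tokens, p):
    """One post-norm, single-head Transformer encoder layer over (n_tokens, d)."""
    q, k, v = tokens @ p["w_q"], tokens @ p["w_k"], tokens @ p["w_v"]
    attn = softmax(q @ k.T / np.sqrt(tokens.shape[-1]))   # every token attends to all tokens
    x = layer_norm(tokens + attn @ v @ p["w_o"])          # self-attention sub-layer
    ffn = np.maximum(0.0, x @ p["w_1"]) @ p["w_2"]        # ReLU feed-forward sub-layer
    return layer_norm(x + ffn), attn

rng = np.random.default_rng(0)
d, d_ff = 8, 16                            # hypothetical widths
shapes = {"w_q": (d, d), "w_k": (d, d), "w_v": (d, d),
          "w_o": (d, d), "w_1": (d, d_ff), "w_2": (d_ff, d)}
params = {name: rng.normal(scale=0.3, size=shape) for name, shape in shapes.items()}

# Hypothetical input: one sentiment and one content vector per modality (text, audio, video).
tokens = rng.normal(size=(6, d))
encoded, attn = encoder_layer(tokens, params)
global_repr = encoded.mean(axis=0)         # assumed pooling into one global vector
print(encoded.shape, global_repr.shape)
```

Self-attention lets each disentangled feature be re-weighted against all the others before pooling, which is one way to read "reasoning over multi-source features".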
3.4. Loss Function
4. Experiments and Results
4.1. Dataset Description
4.2. Experimental Setup
4.3. Evaluated Models
- TFN [9]: Captures interactions among unimodal, bimodal, and trimodal data by computing outer product-based multidimensional tensors.
- MFN [27]: Establishes a modal interaction model by continuously modeling specific views and cross views, and summarizing their temporal variations using a multi-view gating mechanism.
- MulT [28]: Uses bidirectional cross-modal attention to model interactions between multimodal sequences across different time steps, latently adapting the information flow from one modality to another.
- MISA [29]: Projects each modality into two subspaces (modality-invariant and modality-specific) to learn commonalities and characteristics across modalities. Its loss function includes distribution similarity, orthogonality loss, reconstruction loss, and task prediction loss.
- Graph [30]: Models unaligned multimodal sequences with graph-based neural models and a capsule network. Converting the sequential data into graphs avoids the exploding and vanishing gradient problems of RNNs.
- MTSA [31]: Performs sentiment prediction by translating video and audio modalities into the text modality.
- TETFN [32]: Enhances the text modality through interactions among the three modalities to more accurately extract emotional information from non-text modalities.
- TMRN [33]: Employs a text-oriented multimodal fusion network, which prioritizes the text modality and obtains higher-quality modality representations by strengthening interactions with the audio and visual modalities.
- FRDIN [34]: Uses dynamic routing to realize intra-modal feature interaction and learn the information inherent in each single modality, and learns consistent multimodal information through cross-modal interactions.
- CRNet [35]: Projects features from different modalities into modality-invariant and modality-specific subspaces, and improves the quality of features in these two subspaces via a gradient-based feature enhancement mechanism, thereby enhancing the accuracy of multimodal sentiment analysis.
- TSPMG [36]: Uses two-stage primary-modality supervision to address modality heterogeneity and thereby improve multimodal sentiment analysis.
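
The outer-product fusion mentioned for TFN can be made concrete with a short sketch. It follows the commonly described construction in which a constant 1 is appended to each modality embedding before the three-way outer product, so that the resulting tensor contains unimodal, bimodal, and trimodal terms. The function name `tensor_fusion` and the toy embedding sizes are hypothetical.

```python
import numpy as np

def tensor_fusion(z_text, z_audio, z_video):
    """Three-way outer product of modality embeddings, each extended with a constant 1."""
    t = np.concatenate([z_text, [1.0]])
    a = np.concatenate([z_audio, [1.0]])
    v = np.concatenate([z_video, [1.0]])
    # fused[i, j, k] = t[i] * a[j] * v[k]
    return np.einsum("i,j,k->ijk", t, a, v)

z_text = np.array([0.5, -1.0])            # toy 2-d text embedding
z_audio = np.array([2.0])                 # toy 1-d audio embedding
z_video = np.array([0.1, 0.2, 0.3])       # toy 3-d video embedding

fused = tensor_fusion(z_text, z_audio, z_video)
unimodal_text = fused[:2, -1, -1]         # text entries multiplied by the two appended 1s
bimodal_text_audio = fused[:2, :1, -1]    # text-audio products
trimodal = fused[:2, :1, :3]              # text-audio-video products
print(fused.shape)
```

The fused tensor grows multiplicatively with the embedding sizes, which is the main cost of this style of fusion.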
4.4. Results
4.5. Ablation Experiment
5. Discussion
5.1. Feature Decoupling Verification
5.2. Logical Reasoning Verification
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Das, R.; Singh, T.D. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Comput. Surv. 2023, 55, 1–38.
- Yang, X.; Feng, S.; Wang, D.; Zhang, Y.; Poria, S. Few-shot multimodal sentiment analysis based on multimodal probabilistic fusion prompts. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 6045–6053.
- Zhu, Q.; Jiang, F.; Li, C. Time-varying interval prediction and decision-making for short-term wind power using convolutional gated recurrent unit and multi-objective elephant clan optimization. Energy 2023, 271, 127006.
- Pei, G.; Li, H.; Lu, Y.; Wang, Y.; Hua, S.; Li, T. Affective computing: Recent advances, challenges, and future trends. Intell. Comput. 2024, 3, 0076.
- Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: A comparative analysis. Sensors 2023, 23, 2381.
- Jiang, F.; Zhu, Q.; Yang, J.; Chen, G.; Tian, T. Clustering-based interval prediction of electric load using multi-objective pathfinder algorithm and Elman neural network. Appl. Soft Comput. 2022, 129, 109602.
- Zhang, Q.; Wei, Y.; Han, Z.; Fu, H.; Peng, X.; Deng, C.; Hu, Q.; Xu, C.; Wen, J.; Hu, D.; et al. Multimodal fusion on low-quality data: A comprehensive survey. arXiv 2024, arXiv:2404.18947.
- Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250.
- Li, T.; Zhang, L.; Liu, S.; Shen, S. Multi-modal integrated prediction and decision-making with adaptive interaction modality explorations. arXiv 2024, arXiv:2408.13742.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
- Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Liu, C.; Xiong, Z.; Zhu, X.X. Decoupling common and unique representations for multimodal self-supervised learning. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 286–303.
- Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; ACL: Stroudsburg, PA, USA, 2021.
- Wu, S.X.; Dai, D.M.; Qin, Z.W.; Liu, T.Y.; Lin, B.; Cao, Y.B.; Sui, Z.F. Denoising bottleneck with mutual information maximization for video multimodal fusion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; ACL: Stroudsburg, PA, USA, 2023; pp. 756–767.
- Zhang, H.; Wang, Y.; Yin, G.; Liu, K.; Liu, Y.; Yu, T. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2023; pp. 2231–2243.
- Yang, H.; Zhang, S.; Shen, H.; Zhang, G.; Deng, X.; Xiong, J.; Feng, L.; Wang, J.; Zhang, H.; Sheng, S. A multi-layer feature fusion model based on convolution and attention mechanisms for text classification. Appl. Sci. 2023, 13, 8550.
- Wu, Q.; Wang, M.; Zhou, G.; Ji, W. A study of progressive data flow knowledge tracing based on reconstructed attention mechanism. Neural Comput. Appl. 2025, 37, 7675–7689.
- Singh, P.; Raman, B. Transformer Architecture: Encoder and Decoder. In The Geometry of Intelligence: Foundations of Transformer Networks in Deep Learning; Springer Nature: Singapore, 2025.
- Lu, W.; Chen, S.B.; Shu, Q.L.; Tang, J.; Luo, B. Decouplenet: A lightweight backbone network with efficient feature decoupling for remote sensing visual tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4414613.
- Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; Wang, Y. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 12186–12195.
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42.
- Li, W.; Li, K.; Chen, S. A multimodal fusion method for semantic segmentation of high-resolution remote sensing images. J. South-Cent. Univ. Natl. (Nat. Sci. Ed.) 2020, 39, 405–412.
- Hu, Y.; Huang, X.; Wang, X.; Lin, H.; Zhang, R. Transformer-based adaptive contrastive learning for multimodal sentiment analysis. Multimed. Tools Appl. 2025, 84, 1385–1402.
- Huan, R.; Zhong, G.; Chen, P.; Liang, R. Trisat: Trimodal representation learning for multimodal sentiment analysis. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 4105–4120.
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion video. arXiv 2016, arXiv:1606.06259.
- Zadeh, A.A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); ACL: Stroudsburg, PA, USA, 2018; pp. 2236–2246.
- Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2018, arXiv:1806.06176.
- Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2019; pp. 6558–6569.
- Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 1122–1131.
- Wu, J.; Mai, S.; Hu, H. Graph capsule aggregation for unaligned multimodal sequences. In Proceedings of the 2021 International Conference on Multimodal Interaction; ACM: New York, NY, USA, 2021; pp. 521–529.
- Yang, B.; Shao, B.; Wu, L.; Lin, X. Multimodal sentiment analysis with unidirectional modality translation. Neurocomputing 2022, 467, 130–137.
- Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit. 2023, 136, 109259.
- Lei, Y.; Yang, D.; Li, M.; Wang, S.; Chen, J.; Zhang, L. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences. In Proceedings of the CAAI International Conference on Artificial Intelligence; Springer Nature: Singapore, 2023; pp. 189–200.
- Zeng, Y.; Li, Z.; Chen, Z.; Ma, H. A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Eng. Appl. Artif. Intell. 2024, 127, 107335.
- Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-space representation interaction network for multimodal sentiment analysis. Knowl. Based Syst. 2024, 283, 111149.
- Ma, G.; Ren, X.; Jiang, Y.; Guan, H.; Xu, B. From Feature Alignment to Multimodal Fusion: A Two-Stage Primary Modality-Guided Approach for MSA. In Proceedings of the 7th ACM International Conference on Multimedia in Asia; ACM: New York, NY, USA, 2025.
- Wang, Q.; Xia, W.; Tao, Z.; Gao, Q.; Cao, X. Deep self-supervised t-SNE for multi-modal subspace clustering. In Proceedings of the 29th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2021; pp. 1748–1755.
- Jia, N.; Zheng, C.; Sun, W. A multimodal emotion recognition model integrating speech, video and MoCAP. Multimed. Tools Appl. 2022, 81, 32265–32286.
- Kumar, H.; Aruldoss, M.; Wynn, M. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technol. Interact. 2025, 9, 116.
- Wang, Z.; Luo, Y.; Qiu, R.; Huang, Z.; Baktashmotlagh, M. Learning to diversify for single domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 834–843.






| Datasets | Train | Valid | Test | All |
|---|---|---|---|---|
| MOSI | 1284 | 229 | 686 | 2199 |
| MOSEI | 16,326 | 1871 | 4659 | 22,856 |

| Method | MAE | Corr | Acc-2 | Acc-7 | F1 |
|---|---|---|---|---|---|
| TFN (2017) | 0.901 | 0.698 | -/80.8 | 34.9 | -/80.7 |
| MFN (2018) | 0.877 | 0.706 | -/81.6 | 35.40 | -/81.7 |
| MulT (2019) | 0.871 | 0.698 | -/83.0 | 40.0 | -/82.8 |
| MISA (2020) | 0.783 | 0.761 | 81.8/83.4 | 42.3 | 81.7/83.6 |
| Graph (2021) | 0.982 | 0.669 | 80.18/- | 37.46 | 80.27 |
| MTSA (2022) | 0.696 | 0.806 | -/86.8 | 46.4 | -/86.8 |
| TETFN (2023) | 0.717 | 0.801 | 84.05/86.10 | - | 83.83 |
| TMRN (2023) | 0.704 | 0.784 | 83.67/85.67 | 48.68 | 85.3/87.5 |
| FRDIN (2024) | 0.682 | 0.813 | 85.8/87.4 | 46.59 | 83.45/87 |
| CRNet (2024) | 0.712 | 0.897 | -/86.4 | 47.40 | -/86.4 |
| TSPMG (2025) | 0.712 | 0.799 | -/85.98 | 46.33 | -/85.96 |
| EFDMR | 0.709 | 0.898 | 87.6/89.6 | 53.36 | 87.5/89.7 |

| Method | MAE | Corr | Acc-2 | Acc-7 | F1 |
|---|---|---|---|---|---|
| TFN (2017) | 0.593 | 0.700 | -/82.5 | 50.2 | -/82.1 |
| MFN (2018) | 0.717 | 0.706 | -/84.4 | 51.3 | -/84.3 |
| MulT (2019) | 0.580 | 0.703 | -/82.5 | 51.8 | -/82.3 |
| MISA (2020) | 0.555 | 0.756 | 83.6/85.5 | 52.2 | 83.8/85.3 |
| Graph (2021) | 0.535 | 0.760 | 80.19/- | 52.4 | -/82.4 |
| MTSA (2022) | 0.541 | 0.774 | -/85.1 | 52.9 | -/85.3 |
| TETFN (2023) | 0.551 | 0.748 | 84.25/85.18 | - | 84.18/85.27 |
| TMRN (2023) | 0.535 | 0.762 | 83.39/86.19 | 53.65 | 83.67/86.08 |
| FRDIN (2024) | 0.525 | 0.778 | 83.30/86.30 | 54.40 | 83.70/86.20 |
| CRNet (2024) | 0.541 | 0.771 | -/86.20 | 53.80 | -/86.10 |
| TSPMG (2025) | 0.539 | 0.764 | -/85.50 | 53.13 | -/85.45 |
| EFDMR | 0.522 | 0.794 | 83.0/86.8 | 54.5 | 85.5/86.8 |

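
The result tables report MAE, Corr, Acc-2, Acc-7, and F1, with Acc-2 and F1 often given as a pair "a/b". The source does not define the pair. The sketch below assumes the convention common in CMU-MOSI and CMU-MOSEI work: labels are real-valued sentiment scores in [-3, 3], the first value of a pair is negative versus non-negative, and the second is negative versus positive with neutral samples removed. The function names and the toy data are hypothetical.

```python
import numpy as np

def weighted_f1(true_pos, pred_pos):
    """Support-weighted F1 over the two classes of a pair of boolean arrays."""
    score = 0.0
    for cls in (False, True):
        tp = np.sum((true_pos == cls) & (pred_pos == cls))
        fp = np.sum((true_pos != cls) & (pred_pos == cls))
        fn = np.sum((true_pos == cls) & (pred_pos != cls))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * np.mean(true_pos == cls)    # weight by class frequency in the labels
    return float(score)

def msa_metrics(y_true, y_pred):
    """Regression and classification metrics derived from real-valued sentiment scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0                          # drop neutral labels for the neg/pos setting
    true7 = np.round(np.clip(y_true, -3, 3))       # seven integer classes from -3 to 3
    pred7 = np.round(np.clip(y_pred, -3, 3))
    return {
        "mae": float(np.mean(np.abs(y_true - y_pred))),
        "corr": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "acc7": float(np.mean(true7 == pred7)),
        "acc2_neg_nonneg": float(np.mean((y_true >= 0) == (y_pred >= 0))),
        "acc2_neg_pos": float(np.mean((y_true[nonzero] > 0) == (y_pred[nonzero] > 0))),
        "f1_neg_nonneg": weighted_f1(y_true >= 0, y_pred >= 0),
        "f1_neg_pos": weighted_f1(y_true[nonzero] > 0, y_pred[nonzero] > 0),
    }

# Toy labels and predictions, invented for illustration only.
metrics = msa_metrics([-2.0, -1.0, 0.0, 1.0, 3.0], [-1.6, 0.2, 0.4, 0.8, 2.4])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, a dash on one side of a pair means that the corresponding setting was not reported for that method.
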
| Models | MOSI MAE | MOSI Corr | MOSEI MAE | MOSEI Corr |
|---|---|---|---|---|
| w/o Decoupling | 0.829 | 0.821 | 0.536 | 0.794 |
| w/o Cross-modal | 0.755 | 0.874 | 0.519 | 0.768 |
| w/o Reasoning | 0.762 | 0.877 | 0.54 | 0.772 |
| w/o Contrastive | 0.753 | 0.882 | 0.523 | 0.784 |
| w/o Independence loss | 0.781 | 0.879 | 0.429 | 0.778 |
| Ours | 0.709 | 0.898 | 0.512 | 0.794 |

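
One way to read the ablation table is to compute how far each variant falls from the full model. The short script below does this for the MOSI columns only, using the values copied from the table above. The dictionary keys are shorthand labels for the removed components and are not names used in the paper.

```python
# MOSI (MAE, Corr) per ablated variant, copied from the ablation table.
mosi_ablation = {
    "decoupling": (0.829, 0.821),
    "cross_modal": (0.755, 0.874),
    "reasoning": (0.762, 0.877),
    "contrastive": (0.753, 0.882),
    "independence_loss": (0.781, 0.879),
}
full_mae, full_corr = 0.709, 0.898        # row "Ours"

# Positive MAE change and negative Corr change both mean the variant is worse.
deltas = {
    removed: (round(mae - full_mae, 3), round(corr - full_corr, 3))
    for removed, (mae, corr) in mosi_ablation.items()
}
largest_drop = max(deltas, key=lambda removed: deltas[removed][0])

for removed, (d_mae, d_corr) in deltas.items():
    print(f"w/o {removed:<18} dMAE={d_mae:+.3f}  dCorr={d_corr:+.3f}")
print("largest MAE increase:", largest_drop)
```

On the MOSI columns every ablated variant has a higher MAE and a lower correlation than the full model, and removing the decoupling module costs the most.
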
| Model | Missing | Acc-2 | F1 |
|---|---|---|---|
| w/o Reasoning | T | 0.7618 | 0.7736 |
| w/o Reasoning | V | 0.7710 | 0.7688 |
| w/o Reasoning | A | 0.7873 | 0.7846 |
| Ours | T | 0.7899 | 0.7917 |
| Ours | V | 0.7901 | 0.7825 |
| Ours | A | 0.7927 | 0.7928 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yang, H.; Zhao, M.; Qiu, Y.; Li, Y.; Guo, J.; Zhang, Z.; Chen, B.; He, M.; Hong, Y. Sentiment Analysis Based on Enhanced Feature Decoupling and Multimodal Logical Reasoning. Multimodal Technol. Interact. 2026, 10, 50. https://doi.org/10.3390/mti10050050

