Article

MGMR-Net: Mamba-Guided Multimodal Reconstruction and Fusion Network for Sentiment Analysis with Incomplete Modalities

1 School of Computer Science and Engineering, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau 999078, China
2 College of Artificial Intelligence, Zhongkai University of Agriculture and Engineering, Zhongkai Road 501, Guangzhou 510225, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(15), 3088; https://doi.org/10.3390/electronics14153088
Submission received: 3 July 2025 / Revised: 21 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025
(This article belongs to the Special Issue Application of Data Mining in Decision Support Systems (DSSs))

Abstract

Multimodal sentiment analysis (MSA) faces key challenges such as incomplete modality inputs, long-range temporal dependencies, and suboptimal fusion strategies. To address these, we propose MGMR-Net, a Mamba-guided multimodal reconstruction and fusion network that integrates modality-aware reconstruction with text-centric fusion within an efficient state-space modeling framework. MGMR-Net consists of two core components: the Mamba-collaborative fusion module, which utilizes a two-stage selective state-space mechanism for fine-grained cross-modal alignment and hierarchical temporal integration, and the Mamba-enhanced reconstruction module, which employs continuous-time recurrence and dynamic gating to accurately recover corrupted or missing modality features. The entire network is jointly optimized via a unified multi-task loss, enabling simultaneous learning of discriminative features for sentiment prediction and reconstructive features for modality recovery. Extensive experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that MGMR-Net consistently outperforms several baseline methods under both complete and missing modality settings, achieving superior accuracy, robustness, and generalization.

1. Introduction

Multimodal sentiment analysis (MSA), which integrates information from textual, visual, and acoustic modalities, aims to achieve a comprehensive and accurate understanding of human emotions [1,2,3,4,5,6]. The availability of benchmark datasets such as CMU-MOSI [7], CMU-MOSEI [8], and CH-SIMS [9] has significantly propelled research in this field, encouraging the development of advanced cross-modal fusion techniques. Despite these advances, several critical challenges remain, limiting the robustness and generalizability of MSA systems in real-world applications.

One of the greatest challenges is the prevalence of incomplete modalities. Incomplete data caused by sensor failures, signal loss, or annotation omissions is common in practical scenarios [10,11,12,13]. To address this, Transformer-based models have been employed to reconstruct incomplete modalities. For example, TFR-Net [10] uses an encoder–decoder structure with inter- and intra-modal attention to restore missing features and enhance robustness. However, Transformers suffer from quadratic computational complexity $O(L^2)$, where $L$ denotes the input sequence length, making them inefficient for processing long sequences [14]. Moreover, many existing approaches treat modality reconstruction and fusion as separate tasks, resulting in semantic misalignment and suboptimal fusion performance. Efforts such as MFMB-Net [15] attempt to capture emotional cues at multiple granularities using macro- and micro-fusion branches, but limited interaction between these branches hinders the exploitation of hierarchical semantics.

Accurately modeling long-range temporal dependencies is also crucial, as emotions evolve over time, and traditional Transformers are constrained in this regard. Compared to Transformer architectures that rely on global self-attention and suffer from quadratic time complexity $O(L^2)$, Mamba exhibits linear time complexity $O(L)$, making it significantly more efficient for processing long sequences [16]. Moreover, Mamba leverages a state-space model with selective memory updates, introducing a strong inductive bias for sequence modeling and facilitating more effective long-range temporal reasoning. These characteristics are particularly advantageous in multimodal sentiment analysis, where emotion evolves gradually and data may be incomplete or noisy. In this work, we incorporate Mamba into both the alignment and reconstruction modules to improve robustness and enhance semantic consistency across modalities.
Several models have addressed modality-specific challenges. MISA [17] improves cross-modal alignment by disentangling modality-invariant and modality-specific representations but sacrifices fine-grained emotional nuances. SELF_MM [18] enhances unimodal learning through pseudo-labeling but lacks coherent multimodal integration. TETFN [19] employs text-centric attention and pre-trained visual Transformers but still struggles with long-range temporal alignment. ALMT [20] mitigates modality conflicts via a language-guided hyper-modality framework, achieving competitive results. However, its Transformer-based architecture involves numerous parameters, limiting its performance on smaller datasets, particularly in fine-grained regression tasks. Discrepancies in modality representations and semantic inconsistency further hinder model generalization. Many models rely on decoupled designs for reconstruction and fusion, resulting in semantic disjunctions that degrade both missing modality recovery and downstream predictions. LNLN [21] improves robustness under missing conditions by using language as the dominant modality via a correction and alignment mechanism. However, its generalization under real-world conditions remains constrained, and hyperparameter tuning for loss balancing remains nontrivial.
To address these limitations, we propose MGMR-Net (Mamba-guided multimodal reconstruction and fusion network), a unified framework that integrates text-centric cross-modal alignment, Mamba-enhanced modality reconstruction, and joint optimization for robust multimodal sentiment analysis. The main contributions of this work are summarized as follows:
  • We propose a text-centric two-stage collaborative fusion framework based on the Mamba architecture, which first performs language-guided cross-modal alignment via multi-layer bi-directional Mamba modules, and then conducts efficient multimodal integration using a time-prioritized Mamba-based fusion mechanism. This design significantly enhances the representational capacity and inference accuracy in multimodal sentiment analysis.
  • We further introduce a Mamba-enhanced modality reconstruction module, which integrates stacked Mamba layers with gated fusion to recover corrupted or incomplete features. This design restores temporal and semantic consistency within each modality, yielding more robust representations for downstream alignment and fusion.
  • We design a joint optimization strategy that couples sentiment prediction with modality reconstruction via a unified loss, promoting robust and generalizable representations under incomplete multimodal conditions.
Extensive experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate that MGMR-Net consistently outperforms state-of-the-art methods across various levels of modality completeness and cultural diversity. Notably, it achieves superior generalization in missing-modality and cross-cultural scenarios, demonstrating its practical applicability to real-world multimodal sentiment analysis.

2. Related Work

2.1. Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) has achieved significant progress in recent years, driven by the need to integrate and align heterogeneous information from diverse modalities—namely text, audio, and visual signals—for improved sentiment prediction. Early approaches primarily adopted straightforward feature concatenation, wherein unimodal features were directly merged into a joint representation before model training. A representative example is the tensor fusion network (TFN) [22], which employed tensor decomposition to capture inter-modal interactions, leading to improved sentiment classification. However, such concatenation-based strategies often fail to effectively model the intricate and dynamic interdependencies among modalities, thereby limiting their ability to capture subtle and context-dependent emotional expressions in real-world scenarios.
To address these shortcomings, more advanced architectures such as the multimodal Transformer (MulT) [14] have been proposed. MulT utilizes cross-modal attention mechanisms to dynamically align and fuse features from different modalities, while self-attention enables it to capture long-range temporal dependencies and cross-modal correlations. This facilitates a richer contextual understanding of emotional content. Nonetheless, Transformer-based models encounter scalability issues due to their quadratic time complexity with respect to sequence length, which poses challenges when processing long video or audio streams—key sources for modeling evolving emotional states. Moreover, these models are susceptible to incomplete or noisy modality inputs (e.g., sensor failures or background noise), resulting in degraded performance and limited robustness.
Recent efforts have attempted to mitigate these issues. For instance, MMIM [23] introduces a hierarchical mutual information maximization framework that enhances MSA by maximizing mutual information not only between unimodal inputs but also between the fused representation and its corresponding unimodal features. This design helps preserve task-relevant information during fusion. However, these methods may still suppress modality-specific nuances, which are crucial for capturing fine-grained emotional cues—especially in culturally diverse contexts where expression styles differ significantly.
To further improve robustness and modality alignment, CENet [24] proposes a cross-modal enhancement network that enriches textual representations by integrating long-range visual and acoustic emotional cues into a pre-trained language model. Additionally, it employs a feature transformation strategy to reduce distributional discrepancies among modalities, facilitating more effective fusion. TeFNA [25] introduces a text-centered fusion framework that leverages cross-modal attention to align unaligned inputs and incorporates a text-centered aligned fusion (TCA) strategy to preserve modality-specific characteristics while maximizing mutual information for task-relevant emotional signal retention.
Despite these advancements, several key challenges remain. Current MSA models still struggle to robustly handle incomplete modalities, efficiently process long sequential data, and balance the fusion of modality-specific and modality-invariant information. Furthermore, the generalizability of existing models across cultural domains is limited, as most datasets predominantly reflect Western emotional expression patterns. This hampers their effectiveness in recognizing subtler and culturally nuanced emotional expressions. Addressing these challenges is essential for developing more resilient and universally applicable MSA systems.

2.2. State-Space Models and Mamba

State-space models (SSMs) have recently garnered significant attention as powerful frameworks for modeling sequential data, especially in tasks involving intricate temporal dependencies [26]. By efficiently capturing both short- and long-range temporal patterns, SSMs have achieved promising results across a variety of domains, including time-series forecasting, speech recognition, and, more recently, multimodal sentiment analysis [27,28,29]. Fundamentally, SSMs model observed data as noisy emissions from latent dynamic processes that evolve over time via recursive state transition equations.
In the context of multimodal learning, SSMs offer a principled way to model modality-specific temporal structures and align them across different sources of information. Their capacity to capture long-range dependencies is especially beneficial for analyzing emotion progression in multimodal sentiment analysis, where cues from text, audio, and video may manifest asynchronously or over different time scales.
A notable advancement in this field is the Mamba architecture [30], which extends conventional SSMs through a bi-directional design—referred to as Bi-Mamba—to facilitate global context modeling. In contrast to Transformer-based architectures that rely on self-attention mechanisms with quadratic complexity, Bi-Mamba achieves linear time complexity, making it more scalable for long sequences. Within the Mamba framework, modality-specific convolutional layers are first used to extract local features. These features are then fed into the Bi-Mamba module, which captures multi-scale temporal interactions across modalities. This pipeline is particularly well-suited for multimodal tasks, as it enables the model to effectively synchronize temporally dispersed signals and build holistic cross-modal representations.
Despite these advantages, applying SSMs and the Mamba architecture to multimodal sentiment analysis remains challenging [31]. A primary issue lies in their limited compatibility with existing multimodal fusion strategies, particularly under incomplete modality conditions. The absence of one or more modalities can severely degrade model performance, as the remaining modalities often lack sufficient complementary information.
To address this, recent approaches have begun incorporating closed-loop mechanisms to reconstruct and refine incomplete modalities. For instance, our proposed MGMR-Net extends the Mamba architecture with a progressive reconstruction loop that iteratively restores the missing modality features based on observed inputs and previously predicted representations. This recursive feedback not only strengthens temporal coherence within each modality but also fosters better cross-modal alignment, thereby improving model robustness in real-world MSA settings.

2.3. Handling Incomplete Modalities

Incomplete modalities pose a significant challenge in multimodal sentiment analysis (MSA), as real-world data often suffer from incomplete or unavailable modalities due to sensor failures, transmission errors, or environmental constraints [32,33]. Early solutions typically ignored incomplete modalities or employed matrix completion techniques for estimation, but these approaches often led to suboptimal performance due to oversimplified assumptions and limited modeling capacity.
With the advent of deep learning, more advanced strategies have emerged, broadly categorized into generative methods and joint learning methods. Generative approaches focus on imputing incomplete modalities by synthesizing plausible data that approximates the distribution of the missing modality. Techniques such as variational autoencoders (VAEs) and cascaded residual autoencoders have been employed to infer latent representations and reconstruct missing signals [34]. Some studies further adopt adversarial learning frameworks to conditionally generate realistic modality views based on available inputs. In contrast, joint learning methods seek to leverage the correlations among present modalities to infer shared representations, enabling the prediction or reconstruction of missing features through encoder–decoder or Transformer-based architectures.
Beyond this general classification, recent research has introduced more fine-grained categories for handling incomplete modalities, including GAN-based [35,36], correlation-based [37], cycle-consistency-based [38], and encoder-based approaches. GAN-based methods utilize adversarial training to generate absent modality features but may neglect fine-grained correlations with observed modalities. Correlation-based techniques explicitly model statistical dependencies between modalities, though they often struggle when multiple modalities are simultaneously missing. Cycle-consistency methods enforce bi-directional reconstruction to preserve shared semantics, yet they frequently assume static missing patterns. Encoder-based strategies leverage autoencoders or Transformer encoders to reconstruct missing information, offering flexibility but sometimes lacking the ability to capture shared semantic components across modalities.
In summary, despite considerable progress, effectively handling incomplete modalities in MSA remains an open problem. Many existing methods fail to model complex cross-modal dependencies or suffer from degraded performance under random missing patterns. To address these limitations, we propose MGMR-Net, which incorporates a cycle-consistent Mamba-enhanced reconstruction module based on selective state-space modeling. By leveraging Mamba’s continuous-time recurrence and dynamic gating mechanisms, the module captures long-range dependencies while preserving semantic coherence across modalities. The cycle-consistency constraint further ensures accurate reconstruction by enforcing alignment between observed and reconstructed modalities. Coupled with a unified multi-task loss, MGMR-Net achieves robust performance in scenarios with randomly incomplete modalities, balancing discriminative sentiment prediction with faithful modality recovery.

3. Methodology

In this section, we formalize the problem and present the proposed MGMR-Net model. The task is to develop a multimodal sentiment analysis system that operates on input sequences extracted from the same video segment, denoted as $I = \{I_t, I_a, I_v\}$. Here, $I_m \in \mathbb{R}^{l_m \times d_m}$ represents the raw input sequence of modality $m \in \{t, a, v\}$, where $l_m$ denotes the temporal length and $d_m$ is the feature dimension for the text, audio, and visual modalities, respectively. The model is parameterized by $\theta$ and defined as $M(\theta; I)$, which aims to predict the sentiment intensity $\hat{y} \in \mathbb{R}$. In real-world applications, modality sequences may be partially or entirely missing due to sensor failures, noise, or transmission errors. To address this, the model receives as input a set of incomplete but pre-extracted modality features, denoted as $X_m \in \mathbb{R}^{l_m \times d_m}$. These features are derived from the raw inputs $I_m$ via a modality-specific feature extractor, and they serve as the effective inputs to the model: $\{X_t, X_a, X_v\}$. The model is trained to perform robust sentiment prediction even under missing-modality conditions. During training, the original complete modality features $\{I_t, I_a, I_v\}$ and their corresponding missing-position masks are used as auxiliary supervision to guide representation learning. This strategy improves the model's robustness and generalization in scenarios with incomplete multimodal information.

3.1. Overall Architecture

Our model, MGMR-Net, illustrated in Figure 1, is specifically designed to tackle key challenges in multimodal sentiment analysis (MSA), including incomplete modalities, long-sequence processing, and efficient modality fusion. This section provides a detailed overview of the main components of MGMR-Net, which comprise a unimodal encoder, a Mamba-collaborative fusion module, and a Mamba-enhanced reconstruction module.

3.2. Unimodal Encoder

To preserve the sequential structure and model intra-modal temporal dependencies, we apply a unidirectional LSTM to each non-text modality independently:
$$X_a^{\mathrm{lstm}} = \mathrm{sLSTM}(X_a; \theta_a), \qquad X_v^{\mathrm{lstm}} = \mathrm{sLSTM}(X_v; \theta_v)$$
where $\theta_a$ and $\theta_v$ denote the trainable parameters of the LSTM networks, and $X_a^{\mathrm{lstm}}$, $X_v^{\mathrm{lstm}}$ represent the encoded audio and visual sequences, respectively.
For the textual modality, tokenized utterances—including special tokens such as [CLS] and [SEP]—are processed by a 12-layer BERT model [39,40]. The contextual embedding of the [CLS] token from the final layer is used as the sentence-level representation:
$$X_t^{\mathrm{bert}} = \mathrm{BERT}(X_t; \theta_t)$$
where $\theta_t$ represents the parameters of the BERT model.
To unify the feature dimensions and capture localized temporal correlations, the encoded sequences from all modalities are passed through independent 1D convolutional layers:
$$X_m = \mathrm{Conv1D}\big(X_m^{w}; \mathrm{kernel}_m\big), \qquad (m, w) \in \{(a, \mathrm{lstm}), (v, \mathrm{lstm}), (t, \mathrm{bert})\}$$
where $X_m$ denotes the refined unimodal feature sequence and $\mathrm{kernel}_m$ is the modality-specific convolution kernel size. This unified encoding strategy ensures that each modality is temporally contextualized and projected into a shared latent space, facilitating effective downstream cross-modal fusion.
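To make the encoding pipeline concrete, the following PyTorch sketch mirrors the three steps above: LSTM encoding for the audio and visual streams, pre-computed BERT embeddings for text, and modality-specific 1D convolutions into a shared space. The class name, feature dimensions, and kernel size are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Sketch of Section 3.2: sLSTM for audio/visual, precomputed BERT embeddings for text, Conv1D per modality."""

    def __init__(self, d_audio=74, d_visual=35, d_text=768, d_model=128, kernel=3):
        super().__init__()
        # Unidirectional LSTMs preserve intra-modal temporal order for the non-text modalities.
        self.lstm_a = nn.LSTM(d_audio, d_audio, batch_first=True)
        self.lstm_v = nn.LSTM(d_visual, d_visual, batch_first=True)
        # Modality-specific 1D convolutions project every stream into a shared d_model-dimensional space.
        self.conv_a = nn.Conv1d(d_audio, d_model, kernel, padding=kernel // 2)
        self.conv_v = nn.Conv1d(d_visual, d_model, kernel, padding=kernel // 2)
        self.conv_t = nn.Conv1d(d_text, d_model, kernel, padding=kernel // 2)

    def forward(self, x_a, x_v, x_t_bert):
        # x_a: (B, T_a, d_audio), x_v: (B, T_v, d_visual), x_t_bert: (B, T_t, d_text) from BERT
        h_a, _ = self.lstm_a(x_a)
        h_v, _ = self.lstm_v(x_v)
        proj = lambda conv, h: conv(h.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (B, C, T)
        return proj(self.conv_a, h_a), proj(self.conv_v, h_v), proj(self.conv_t, x_t_bert)
```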

3.3. Mamba-Collaborative Fusion Module

3.3.1. Text-Centric Two-Stage Mamba Modeling

To enhance cross-modal representation learning in multimodal sentiment analysis, we propose a text-centric two-stage framework built upon the Mamba architecture. Mamba is an advanced deep sequence modeling approach based on the selective state-space model (SSM), which replaces traditional self-attention with a dynamic recurrence mechanism. This allows for linear-time sequence processing, effectively overcoming the quadratic complexity limitations of Transformers. The core state-space formulation of Mamba is based on the classical state-space model:
$$h(t+1) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D_{\mathrm{out}}\,x(t)$$
where $h(t)$ is the hidden state, $x(t)$ is the input vector, and $A$, $B$, $C$, $D_{\mathrm{out}}$ are learnable parameters. Note that in the Mamba architecture, the term $D_{\mathrm{out}}\,x(t)$ is omitted, as the input effect is incorporated through the recurrent dynamics.
During inference, Mamba employs a simplified recurrence for real-time sequence modeling,
$$h(t+1) = \lambda \odot h(t) + \gamma \odot x(t), \qquad \lambda = \exp(D_{\mathrm{decay}}), \qquad \gamma = B$$
where $\odot$ denotes the Hadamard product (element-wise multiplication), which computes the product between corresponding elements of two vectors of the same dimensionality. $\lambda, \gamma \in \mathbb{R}^d$ are learnable gating vectors that enable low-latency real-time sequence updates, and $D_{\mathrm{decay}} \in \mathbb{R}^d$ is a learnable vector that controls the element-wise decay rate of the gating mechanism; $d$ denotes the feature dimension of the input and hidden vectors. This decay vector is conceptually and functionally distinct from the output projection matrix $D_{\mathrm{out}}$ in the classical SSM.
We define the $n$-th Mamba layer as a function $\mathrm{MambaLayer}^{(n)}(\cdot)$, which maps an input sequence from the previous layer, $X^{(n-1)} = [x_1^{(n-1)}, \ldots, x_T^{(n-1)}]$, to an output sequence $X^{(n)} = [x_1^{(n)}, \ldots, x_T^{(n)}]$:
$$X^{(n)} = \mathrm{MambaLayer}^{(n)}\big(X^{(n-1)}\big).$$
At each time step $t$, the hidden state $h_t^{(n)}$ and output $x_t^{(n)}$ are computed as follows:
$$h_t^{(n)} = \lambda^{(n)} \odot h_{t-1}^{(n)} + \gamma^{(n)} \odot x_t^{(n-1)},$$
$$x_t^{(n)} = \phi\big(W^{(n)} h_t^{(n)} + b^{(n)}\big),$$
where $\lambda^{(n)}, \gamma^{(n)} \in \mathbb{R}^d$ are learnable modulation vectors, $W^{(n)} \in \mathbb{R}^{d \times d}$ and $b^{(n)} \in \mathbb{R}^d$ are the linear transformation weights and biases, and $\phi(\cdot)$ is a nonlinear activation function, such as ReLU.
This structure allows each Mamba layer to model long-range temporal dependencies efficiently with linear complexity. As the foundational building block in our text-centric architecture, stacked Mamba layers empower subsequent cross-modal alignment and fusion stages by capturing temporally rich and semantically consistent features anchored on the textual modality.
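The sketch below illustrates the simplified per-layer recurrence described above: an element-wise gated scan followed by a linear projection and nonlinearity. It is a didactic stand-in for a real Mamba layer (the official implementation uses input-dependent parameters and a parallel scan), and the parameterization of $\lambda$ via a softplus-based decay, kept in (0, 1] for stability, is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaLayerSketch(nn.Module):
    """Didactic stand-in for MambaLayer^(n): h_t = lam * h_{t-1} + gamma * x_t, out = phi(W h_t + b)."""

    def __init__(self, d: int):
        super().__init__()
        # D_decay in the text; lam = exp(-softplus(D_decay)) keeps the decay in (0, 1] (stability assumption).
        self.log_decay = nn.Parameter(torch.zeros(d))
        self.gamma = nn.Parameter(torch.ones(d))   # gamma: element-wise input gate
        self.proj = nn.Linear(d, d)                # W^(n), b^(n)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d) -> (B, T, d); sequential scan, linear in T
        B, T, d = x.shape
        lam = torch.exp(-F.softplus(self.log_decay))
        h = x.new_zeros(B, d)
        outs = []
        for t in range(T):
            h = lam * h + self.gamma * x[:, t]     # gated recurrence (Hadamard products)
            outs.append(torch.relu(self.proj(h)))  # phi(W h_t + b) with phi = ReLU
        return torch.stack(outs, dim=1)
```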
Stage 1: Cross-Modal Alignment via Multi-Layer Collaborative Bi-Mamba
To fully leverage the dominant role of the language modality in multimodal sentiment analysis, as illustrated in Figure 2, we propose a multi-layer collaborative Bi-Mamba architecture to dynamically align textual representations with both audio and visual modalities. Specifically, we align the text modality with both the audio and visual modalities through multiple layers of bi-directional Mamba modules, thus enhancing interactions and understanding between modalities.
First, for each modality, we use modality-specific encoders to convert the raw features into unified-dimensional temporal feature sequences, as described in the Unimodal Encoder section. For modality $m \in \{t, a, v\}$, the generated feature sequence is represented as
$$X_m = \big[x_m^1, x_m^2, \ldots, x_m^T\big],$$
where $x_m^t \in \mathbb{R}^d$ represents the feature vector of modality $m$ at time step $t$.
Next, for the text–audio and text–video branches, we use stacked bi-directional Mamba modules for forward and backward temporal modeling. In each layer $n$, the forward pass for modality $m$ is computed recursively via MambaLayer: the forward pass starts from the modality feature sequence $X_m^t$, and, at each layer, the output of the previous layer is used as the input to the current layer. The forward recursion is given by
$$H_{m,f}^{t,(0)} = X_m^t, \qquad H_{m,f}^{t,(n)} = \mathrm{MambaLayer}^{(n)}\big(H_{m,f}^{t,(n-1)}\big), \qquad n = 1, \ldots, N,$$
where $H_{m,f}^{t,(0)}$ is the initial input and $H_{m,f}^{t,(n)}$ is the hidden state at layer $n$. In each layer, the hidden state is updated based on the previous layer's output and the current input.
For the backward pass, we first apply temporal flipping to the input sequence and then execute the same computation as in the forward pass. Specifically, for time step t, the backward recursion is given by
$$H_{m,b}^{t,(0)} = \mathrm{Flip}\big(X_m^t\big), \qquad H_{m,b}^{t,(n)} = \mathrm{MambaLayer}^{(n)}\big(H_{m,b}^{t,(n-1)}\big), \qquad n = 1, \ldots, N.$$
The temporal flip operation restores the reversed sequence to the correct temporal order, and the final backward hidden state sequence $H_{m,b}^{t,(N)}$ is generated.
After completing both the forward and backward passes, the bi-directional output for modality m at time step t is obtained by fusing the forward and backward outputs:
$$C_m^t = \tfrac{1}{2}\big(H_{m,f}^{t,(N)} + H_{m,b}^{t,(N)}\big).$$
Here, $C_m^t$ is the bi-directional aligned feature for modality $m$ at time step $t$, where the forward and backward outputs are averaged to preserve their respective temporal information and features.
For the text modality, since it participates in both the text–audio and text–video branches, its final aligned feature $C_t^t$ is obtained by averaging the outputs from the two branches:
$$C_t^t = \tfrac{1}{2}\big(C_t^{(a),t} + C_t^{(v),t}\big).$$
In general, the resulting aligned features $C_t^t$, $C_a^t$, $C_v^t$ not only preserve the temporal characteristics of each modality but also improve cross-modal interaction and understanding, providing a solid foundation for subsequent multimodal fusion and sentiment prediction.
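A minimal sketch of the Stage-1 procedure is given below, reusing the MambaLayerSketch class from the previous sketch as a stand-in for MambaLayer. The forward/backward scans, temporal flip, and the averaging of the two branch outputs for text follow the equations above; the branch structure, shared layers per branch, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiMambaBranch(nn.Module):
    """Stacked forward and backward Mamba scans whose outputs are averaged per time step."""

    def __init__(self, d: int = 128, num_layers: int = 2):
        super().__init__()
        self.fwd = nn.ModuleList([MambaLayerSketch(d) for _ in range(num_layers)])
        self.bwd = nn.ModuleList([MambaLayerSketch(d) for _ in range(num_layers)])

    @staticmethod
    def _run(layers, x):
        for layer in layers:          # layer-by-layer recursion H^(n) = MambaLayer^(n)(H^(n-1))
            x = layer(x)
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h_f = self._run(self.fwd, x)                                                  # forward scan
        h_b = torch.flip(self._run(self.bwd, torch.flip(x, dims=[1])), dims=[1])      # backward scan, order restored
        return 0.5 * (h_f + h_b)                                                      # C_m^t = (H_f + H_b) / 2

# Text participates in both branches, so its aligned feature averages the two branch outputs.
branch_ta, branch_tv = BiMambaBranch(), BiMambaBranch()
x_t, x_a, x_v = (torch.randn(4, 50, 128) for _ in range(3))
C_a, C_t_a = branch_ta(x_a), branch_ta(x_t)
C_v, C_t_v = branch_tv(x_v), branch_tv(x_t)
C_t = 0.5 * (C_t_a + C_t_v)
```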
Stage 2: Mamba-Based Multimodal Fusion
Following cross-modal alignment in Stage 1, we obtain synchronized and refined feature sequences for the audio, visual, and text modalities, denoted as $\{C_a^t\}_{t=1}^{T}$, $\{C_v^t\}_{t=1}^{T}$, and $\{C_t^t\}_{t=1}^{T}$, respectively. These modality-specific aligned features are subsequently integrated via a multimodal fusion module based on the Mamba architecture, which is specifically designed to model expressive temporal dynamics and enable effective cross-modal interactions.
To preserve temporal causality while maintaining computational efficiency, we adopt Mamba’s time-priority selective scanning mechanism, which achieves linear time complexity in sequence length, as opposed to the quadratic complexity of traditional self-attention. At each time step, the features from the three modalities are interleaved to form a unified multimodal sequence.
$$X_{\mathrm{mm}} = \big[C_a^1; C_v^1; C_t^1;\; C_a^2; C_v^2; C_t^2;\; \ldots;\; C_a^T; C_v^T; C_t^T\big] \in \mathbb{R}^{3T \times d}.$$
This concatenated sequence $X_{\mathrm{mm}}$ is processed by a stack of $N$ Mamba layers:
$$F_{\mathrm{mm}}^{(0)} = X_{\mathrm{mm}}, \qquad F_{\mathrm{mm}}^{(n)} = \mathrm{MambaLayer}^{(n)}\big(F_{\mathrm{mm}}^{(n-1)}\big), \qquad n = 1, \ldots, N,$$
producing the following final fused multimodal representation:
$$F = F_{\mathrm{mm}}^{(N)} \in \mathbb{R}^{3T \times d}.$$
To aggregate temporal information into a fixed-size representation for sentiment prediction, we apply temporal max pooling across all $3T$ tokens:
$$F_{\mathrm{final}} = \max_{t = 1, \ldots, 3T} F[t] \in \mathbb{R}^{d},$$
where the max operation is performed element-wise along the temporal dimension.
The pooled vector $F_{\mathrm{final}}$ is then fed into a fully connected layer to generate the final sentiment prediction:
$$\hat{y} = \mathrm{FC}\big(F_{\mathrm{final}}\big) \in \mathbb{R}^{c},$$
where $c$ denotes the number of sentiment categories.
This fusion design effectively leverages Mamba’s linear temporal modeling to maintain causality and computational efficiency, while the interleaved token arrangement promotes fine-grained cross-modal interactions. Consequently, the hierarchical processing yields temporally coherent and semantically enriched multimodal representations, thereby enhancing robustness and accuracy in sentiment inference.
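The sketch below shows how the Stage-2 fusion can be realized: per-time-step interleaving of the aligned audio, visual, and text tokens into a 3T-length sequence, stacked Mamba layers (again using MambaLayerSketch as a stand-in), element-wise temporal max pooling, and a fully connected prediction head. The class name and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MambaFusion(nn.Module):
    """Stage-2 sketch: interleave aligned tokens, run stacked Mamba layers, max-pool, predict."""

    def __init__(self, d: int = 128, num_layers: int = 2, num_out: int = 1):
        super().__init__()
        self.layers = nn.ModuleList([MambaLayerSketch(d) for _ in range(num_layers)])
        self.head = nn.Linear(d, num_out)

    def forward(self, C_a, C_v, C_t):
        B, T, d = C_a.shape
        # Interleave per time step: [C_a^1, C_v^1, C_t^1, C_a^2, ...] -> (B, 3T, d)
        x = torch.stack([C_a, C_v, C_t], dim=2).reshape(B, 3 * T, d)
        for layer in self.layers:                 # F^(n) = MambaLayer^(n)(F^(n-1))
            x = layer(x)
        pooled = x.max(dim=1).values              # element-wise temporal max pooling over the 3T tokens
        return self.head(pooled)                  # y_hat = FC(F_final)
```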

3.4. Mamba-Enhanced Reconstruction Module

In this module, we perform the reconstruction of the feature sequence of each modality using the Mamba architecture, with the goal of recovering corrupted or incomplete modality features. Let the encoded feature sequence of each modality be denoted as $X_m \in \mathbb{R}^{T \times d_m}$, where $m \in \{t, a, v\}$ represents the text, audio, and visual modalities, respectively. Specifically, $X_m = [x_m^1, x_m^2, \ldots, x_m^T]$, where $x_m^t \in \mathbb{R}^{d_m}$ represents the feature of modality $m$ at time step $t$. These features are extracted through unimodal encoders designed to capture modality-specific characteristics.
We apply modality-specific reconstruction pipelines composed of stacked Mamba layers to produce the refined set of representations. This process is defined as
$$\hat{X}_m^{(0)} = X_m, \qquad \hat{X}_m^{(n)} = \mathrm{MambaLayer}^{(n)}\big(\hat{X}_m^{(n-1)}\big), \qquad n = 1, \ldots, N.$$
After passing through the $N$ stacked layers, we obtain the final output feature at each time step. Collecting these outputs across time steps $t$ yields the reconstructed feature sequence $\hat{X}_m$:
$$\hat{X}_m = \big[x_m^{1,(N)}, x_m^{2,(N)}, \ldots, x_m^{T,(N)}\big],$$
where $x_m^{t,(N)}$ is the output feature after the $N$ stacked layers, representing the unimodal feature at time step $t$.
To adaptively fuse the original features with the reconstructed features, we introduce a gating mechanism $\Gamma_m$, which controls the fusion ratio through learned gating parameters:
$$\Gamma_m = \sigma\big(W_g \cdot \mathrm{LayerNorm}(X_m)\big)$$
Then, the fused features are obtained as
$$X_m' = \Gamma_m \odot X_m + \big(1 - \Gamma_m\big) \odot \hat{X}_m$$
where $W_g \in \mathbb{R}^{d_m}$ is the learnable gating parameter controlling the fusion ratio between the original and reconstructed features, and $X_m'$ denotes the fused feature.
Finally, to ensure dimensional consistency across modalities, we apply feature normalization and projection to the fused feature, resulting in a unified feature-space representation:
$$\tilde{X}_m = W_{\mathrm{proj}} \cdot \mathrm{ReLU}\big(\mathrm{LayerNorm}(X_m')\big)$$
where $W_{\mathrm{proj}}$ is the learnable projection matrix, ensuring that the fused feature has consistent dimensionality across modalities.
This reconstruction module seamlessly integrates stacked MambaLayer blocks with state-space modeling and gated fusion mechanisms. It effectively restores the temporal dynamics and semantic coherence of each modality, providing robust and enriched representations for the subsequent cross-modal alignment and fusion stages.
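For clarity, the following sketch implements the reconstruction pipeline for a single modality: stacked Mamba layers (MambaLayerSketch as a stand-in) produce the reconstructed sequence, a sigmoid gate blends it with the original features, and a LayerNorm–ReLU–projection maps the result into the shared space. Layer counts and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MambaReconstruction(nn.Module):
    """Single-modality sketch: stacked Mamba layers -> gated blend with the input -> projection."""

    def __init__(self, d_in: int, d_out: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([MambaLayerSketch(d_in) for _ in range(num_layers)])
        self.gate_norm = nn.LayerNorm(d_in)
        self.gate = nn.Linear(d_in, d_in)          # W_g
        self.out_norm = nn.LayerNorm(d_in)
        self.proj = nn.Linear(d_in, d_out)         # W_proj

    def forward(self, x: torch.Tensor):
        x_hat = x
        for layer in self.layers:                                  # X_hat^(n) = MambaLayer^(n)(X_hat^(n-1))
            x_hat = layer(x_hat)
        gamma = torch.sigmoid(self.gate(self.gate_norm(x)))        # Gamma_m = sigma(W_g · LayerNorm(X_m))
        fused = gamma * x + (1.0 - gamma) * x_hat                  # gated blend of original and reconstruction
        out = self.proj(torch.relu(self.out_norm(fused)))          # unified-dimensional representation
        return out, x_hat                                          # x_hat also feeds the reconstruction loss
```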

3.5. Model Optimization

MGMR-Net employs a unified multi-task learning framework that jointly optimizes sentiment intensity regression and modality reconstruction objectives. Unlike prior works, which optimize prediction and feature recovery separately, our framework explicitly couples these tasks via a composite loss function, enabling synergistic learning of both discriminative and reconstructive representations.
For sentiment intensity prediction, the mean absolute error (MAE) loss is employed, as follows:
$$\mathcal{L}_{\mathrm{task}} = \frac{1}{N} \sum_{i=1}^{N} \big|\hat{y}_i - y_i\big|,$$
where $\hat{y}_i$ and $y_i$ denote the predicted and ground-truth sentiment intensities for the $i$-th sample, respectively.
To ensure accurate recovery of corrupted modality features, the reconstruction loss is applied to the outputs of the Mamba-enhanced reconstruction module, with modality-specific masking focused on the corrupted segments:
$$\mathcal{L}_{\mathrm{rec}}^{(m)} = \sum_{i=1}^{l_m} \begin{cases} 0.5\,(x_i - y_i)^2, & |x_i - y_i| < 1, \\ |x_i - y_i| - 0.5, & \text{otherwise}, \end{cases}$$
where $x = \hat{X}_m \odot \mathrm{Mask}$ denotes the reconstructed features and $y$ the corresponding original (complete) modality features at the masked positions, with $\mathrm{Mask}$ being a binary mask isolating the corrupted parts of modality $m$.
The final loss integrates the task and reconstruction objectives across all modalities as
$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \sum_{m \in \{t, a, v\}} \alpha_m\, \mathcal{L}_{\mathrm{rec}}^{(m)},$$
where $\alpha_m$ are modality-specific weights balancing the reconstruction loss contributions.
This joint optimization scheme encourages the model to learn representations that are both semantically discriminative for sentiment regression and robustly reconstruct incomplete modalities. The explicit gradient coupling between prediction and reconstruction facilitates improved generalization and robustness for real-world multimodal sentiment analysis with incomplete data.
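A compact sketch of the joint objective is shown below: the MAE task loss plus a masked Smooth-L1 (Huber) reconstruction term per modality, weighted by α_m. The masking granularity and reduction convention are assumptions consistent with the equations above, not the exact training script.

```python
import torch
import torch.nn.functional as F

def joint_loss(y_pred, y_true, recon, target, masks, alphas):
    """y_pred/y_true: (B,) intensities; recon/target: dict of (B, T, d) per modality;
    masks: dict of (B, T) binary masks (1 = corrupted); alphas: dict of scalar weights."""
    task = torch.mean(torch.abs(y_pred - y_true))                        # L_task: mean absolute error
    rec = 0.0
    for m in recon:                                                      # m in {t, a, v}
        diff = (recon[m] - target[m]) * masks[m].unsqueeze(-1)           # keep only corrupted positions
        rec = rec + alphas[m] * F.smooth_l1_loss(                        # Smooth-L1 matches the piecewise Huber form
            diff, torch.zeros_like(diff), reduction="sum")
    return task + rec
```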

4. Experiments

In this section, we introduce the datasets, evaluation metrics, feature extraction, baselines, and implementation details.

4.1. Datasets and Metrics

We evaluate the proposed model on three widely used benchmarks for multimodal sentiment analysis (MSA): CMU-MOSI [7], CMU-MOSEI [8], and SIMS [9]. All experiments are conducted under the unaligned modality setting, which better reflects real-world conditions by removing artificially imposed synchronization across modalities. To ensure fair comparison with prior work, we use the officially released pre-extracted features for all datasets. We adopt a comprehensive set of evaluation metrics. For CMU-MOSI and CMU-MOSEI, we report five-class (Acc-5) and seven-class (Acc-7) classification accuracies; for SIMS, three-class (Acc-3) and five-class (Acc-5) accuracies are reported. In addition, we evaluate all datasets using binary accuracy (Acc-2), mean absolute error (MAE), Pearson’s correlation coefficient (Corr), and F1-score (F1). For CMU-MOSI and CMU-MOSEI, Acc-2 and F1 are reported under two binary configurations: negative vs. positive (left of “/”) and negative vs. non-negative (right of “/”). Except for the MAE, where lower values indicate better performance, higher values on all other metrics reflect improved model effectiveness. Dataset statistics are summarized in Table 1.
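For reference, the snippet below sketches the two Acc-2 conventions reported for CMU-MOSI and CMU-MOSEI; the thresholding rules follow common practice in the MMSA toolchain and are stated here as assumptions rather than the exact evaluation script.

```python
import numpy as np

def acc2_both(y_pred: np.ndarray, y_true: np.ndarray):
    # Negative vs. positive: samples with a true label of exactly 0 are excluded.
    nonzero = y_true != 0
    acc_neg_pos = float(np.mean((y_pred[nonzero] > 0) == (y_true[nonzero] > 0)))
    # Negative vs. non-negative: all samples kept, zero counts as non-negative.
    acc_neg_nonneg = float(np.mean((y_pred >= 0) == (y_true >= 0)))
    return acc_neg_pos, acc_neg_nonneg
```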

4.2. Feature Extraction

To ensure consistent evaluation and reproducibility across benchmarks, we utilize the officially released pre-processed multimodal features for all datasets, following the standardized MMSA protocol proposed by Mao et al. [41]. Each modality—textual, acoustic, and visual—is encoded using well-established pre-trained models or signal processing toolkits, thereby avoiding additional learning bias during feature extraction.
For the textual modality, utterance-level representations are obtained using pre-trained BERT encoders. Specifically, we adopt Bert-base-uncased [40] for the English-language datasets (CMU-MOSI and CMU-MOSEI) and Bert-base-Chinese [41] for the Mandarin-language SIMS dataset. All embeddings are 768-dimensional. To ensure computational efficiency while preserving semantic coverage, input sequences are truncated or padded to a fixed length of 50 tokens for MOSI and MOSEI and 39 tokens for SIMS.
Acoustic features are extracted using established audio processing libraries specific to each dataset. For CMU-MOSI and CMU-MOSEI, we employ the COVAREP toolkit [42] to extract low-level descriptors such as pitch, glottal source parameters, and cepstral coefficients, resulting in 5- and 74-dimensional features, respectively, with frame counts of 375 and 500. For SIMS, 33-dimensional features are extracted over 400 frames using Librosa [43].
Visual representations are obtained from facial expression and movement analysis tools. For CMU-MOSI and CMU-MOSEI, we use Facet [44,45] to extract 20 and 35 attributes per frame, including action units and head pose estimations, standardized to 500 frames per video. For SIMS, high-dimensional (709D) visual features are extracted over 55 frames using OpenFace 2.0 [46].
Each dataset provides continuous sentiment annotations on different scales. CMU-MOSI and CMU-MOSEI adopt a seven-point scale ranging from $-3$ (strongly negative) to $+3$ (strongly positive), while SIMS uses a normalized interval $[-1, 1]$ to represent sentiment polarity. These labels are used as regression targets in sentiment prediction tasks.

4.3. Baselines

To comprehensively evaluate the effectiveness of our proposed model, we benchmark it against a wide spectrum of MSA methods, ranging from early disentangled and mutual-information-based models to recent advances in self-supervised learning, cross-modal enhancement, and Transformer-based fusion.
MISA [17] is a flexible MSA framework that disentangles representations into modality-invariant and modality-specific subspaces, thereby reducing modality gaps while preserving unique modality characteristics. It leverages multiple loss functions to guide representation learning and adopts a simple yet effective fusion strategy, demonstrating strong performance on sentiment and humor recognition tasks.
SELF_MM [18] is a self-supervised MSA model that generates unimodal pseudo-labels to supervise modality-specific representation learning. It captures both cross-modal consistency and modality-specific variation without manual annotations. A momentum-based label refinement and dynamic weight adjustment further enhance its robustness, especially for samples with high modality discrepancy.
MMIM [23] adopts a hierarchical mutual information maximization strategy that preserves task-relevant information across modalities and between the input and fused representations. It combines neural parametric models with non-parametric Gaussian mixture models to estimate mutual information, yielding improved performance on standard MSA benchmarks.
CENET [24] enhances text representations by integrating emotional cues from visual and acoustic modalities into a pre-trained language model. It employs a feature transformation mechanism that converts nonverbal features into token-like indices, reducing cross-modal distributional differences and improving fusion efficiency.
TETFN [19] is a text-enhanced Transformer fusion network that incorporates text-oriented multi-head attention and cross-modal mappings to integrate sentiment-related cues from all modalities while retaining modality-specific predictions. It also utilizes a Vision Transformer backbone to extract both global and local visual features, achieving competitive results on multiple MSA benchmarks.
TFR-Net [10] is a Transformer-based encoder–decoder architecture designed for unaligned multimodal inputs with random incomplete modalities. It reconstructs incomplete features via inter- and intra-modal attention mechanisms and employs reconstruction losses to produce semantically consistent representations, ensuring robustness under various missing conditions.
ALMT [20] introduces an adaptive hyper-modality learning (AHL) module to mitigate the effects of redundant and conflicting information in nonverbal modalities. By leveraging multi-scale language features, ALMT generates a sentiment-relevant and noise-suppressed hyper-modality representation, enabling effective and robust multimodal fusion.
LNLN [21] emphasizes the dominant role of the language modality in sentiment analysis and improves its reliability through a dominant modality correction (DMC) module and a dominant modality-based multimodal learning (DMML) module. This framework enhances model robustness under scenarios involving missing or noisy nonverbal modalities.

4.4. Implementation Details

We implement our model using the PyTorch 2.3.1 framework and conduct all experiments on a single NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), which provides sufficient computational capacity for large-scale multimodal training. To simulate incomplete modality scenarios, we adopt a random masking strategy during training, where independent temporal masks are applied to each modality. For audio and video, masked segments are replaced with zero vectors, effectively introducing white Gaussian noise into the feature space. For text, masked tokens are substituted with a special token to simulate missing semantic information [17]. The model is optimized using the Adam optimizer. To ensure fair and consistent evaluation, we perform five independent runs for each predefined missing rate $r \in \{0.0, 0.1, \ldots, 0.9\}$, in increments of 0.1. For example, at $r = 0.5$, 50% of the input information in each modality is randomly masked during testing. We report the average performance across all missing rates to comprehensively evaluate the model's robustness under varying degrees of modality incompleteness. Additional implementation details, including the learning rate and batch size, are summarized in Table 2.
The hyperparameter settings in Table 2 are empirically selected based on the characteristics of each dataset. Specifically, we adjust the Mamba depth configuration to align with the temporal complexity of different modalities. For example, CMU-MOSEI contains longer and more diverse utterances than CMU-MOSI, which motivates the use of deeper Mamba blocks ({2,2}) to better capture long-range dependencies. In contrast, CH-SIMS, with a shorter average sequence length and more balanced modality contributions, benefits from a moderate depth setting ({1,2}) to avoid overfitting. Other parameters, such as the sequence length L and batch size, are chosen based on GPU memory constraints and validation performance during initial experiments. These tailored configurations help ensure optimal learning dynamics for each dataset.
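The following sketch illustrates one way to implement the per-modality random temporal masking described above; the masking granularity (frame level) and the fill value are assumptions for demonstration rather than the authors' exact protocol.

```python
import torch

def random_temporal_mask(features: torch.Tensor, r: float, fill_value: float = 0.0):
    """features: (B, T, d) pre-extracted modality features; r: missing rate in [0, 1].
    Returns the corrupted features and the binary mask (1 = masked position)."""
    B, T, _ = features.shape
    mask = (torch.rand(B, T, device=features.device) < r).float()
    corrupted = features * (1.0 - mask).unsqueeze(-1) + fill_value * mask.unsqueeze(-1)
    return corrupted, mask
```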

5. Results and Analysis

5.1. Overall Results

MGMR-Net achieves consistently strong performance across three widely used MSA benchmarks—CMU-MOSI, CMU-MOSEI, and CH-SIMS—demonstrating its effectiveness in cross-modal alignment, modality reconstruction, and temporal modeling.
On CMU-MOSI (Table 3), MGMR-Net outperforms all baselines in key metrics, achieving the highest Acc-2 (73.50%), F1 (73.53%), and Pearson’s correlation coefficient (0.534), surpassing robust models such as LNLN and ALMT. This improvement stems from the proposed Mamba-collaborative fusion module, which enables effective alignment and fusion of multimodal sequences within a language-centric state-space modeling framework.
On CMU-MOSEI (Table 4), MGMR-Net maintains top-tier performance with Acc-2 reaching 77.42%, F1 at 78.21%, and MAE being reduced to 0.671. These results demonstrate the model’s strong generalization across diverse and large-scale datasets. The Mamba-enhanced reconstruction module further contributes by restoring incomplete modality information through continuous-time dynamics and adaptive feature fusion.
On CH-SIMS (Table 5), MGMR-Net achieves the best Acc-2 (73.65%) and a competitive F1-score (78.91%), closely matching LNLN’s 79.43%. These results reflect the model’s cross-lingual adaptability and robustness in the face of semantic and cultural variation. However, slight gaps in certain metrics suggest potential for further improvement in handling ambiguous or conflicting sentiment cues.
Additionally, as the proportion of incomplete modalities increases in Figure 3, representative baseline models exhibit a clear decline in performance, highlighting their sensitivity to incomplete modality inputs. In contrast, MGMR-Net maintains a relatively stable performance, with only marginal degradation in both the F1-score and MAE across varying missing rates. This resilience indicates that MGMR-Net is less susceptible to modality incompleteness, largely due to its ability to capture cross-modal dependencies and suppress noise from missing segments. Such robustness renders it particularly well-suited for real-world multimodal sentiment analysis scenarios, where incomplete or noisy data are common.
In summary, MGMR-Net successfully integrates multimodal temporal modeling, semantic alignment, and robustness enhancement in a unified architecture, achieving excellent performance, broad applicability, and strong generalization capabilities.

5.2. Ablation Study

To evaluate the contribution of each core component in MGMR-Net, we conduct an ablation study on the CMU-MOSI dataset by selectively removing the following modules: (1) the cross-modal alignment module via multi-layer collaborative Bi-Mamba (denoted as S-1), (2) the Mamba-based multimodal fusion stream (S-2), and (3) the Mamba-enhanced reconstruction module (Recon). The results are summarized in Table 6.
The complete MGMR-Net achieves the best overall performance, with Acc-2 scores of 73.50%/72.37% (negative vs. positive/negative vs. non-negative), F1-scores of 73.53%/72.39%, MAE of 1.038, and Pearson’s correlation coefficient of 0.534. These results demonstrate the effectiveness of the multi-branch architecture and stage-wise modeling for robust multimodal sentiment prediction.
Effect of removing S-1: Removing the S-1 module causes the largest drop in classification accuracy, with Acc-2 decreasing to 68.28%/67.96% and F1-scores to 70.32%/70.01%, representing relative declines of over 5.2% and 3.2%, respectively. MAE increases to 1.120, and correlation falls to 0.498, underscoring the critical role of alignment in maintaining temporal and semantic consistency across modalities.
Effect of removing S-2: Omission of the S-2 fusion stream leads to a moderate performance decline: Acc-2 drops to 71.20%/71.83%, F1 to 72.34%/71.78%, MAE rises to 1.073, and correlation slightly decreases to 0.516. This confirms that S-2 enhances global cross-modal interactions and enriches semantic representations.
Effect of removing Recon: Removing the reconstruction module significantly impairs performance, particularly in F1-scores, which decrease to 68.52%/67.59%, the lowest among all variants. Although Acc-2 remains close to the full model (73.10%/72.28%), MAE increases to 1.087 and correlation drops to 0.508, highlighting the importance of reconstruction for handling incomplete modalities and refining feature representations.
Overall, the ablation study confirms that each module contributes meaningfully to MGMR-Net’s performance. S-1 has the greatest impact on classification accuracy, while the reconstruction module is vital for robustness and precision in F1 and MAE metrics. Although S-2 plays a relatively smaller role individually, it provides complementary global semantic context. These findings validate the effectiveness of MGMR-Net’s multi-stage, multi-branch design for robust multimodal sentiment analysis.

5.3. Model Complexity

As shown in Table 7, MGMR-Net demonstrates competitive inference efficiency under both complete and incomplete modality settings, with runtimes of 3.65 s and 3.93 s, respectively. The slight increase at a 0.5 missing rate highlights the model's robustness and computational efficiency in handling incomplete data. This advantage stems from the adoption of the Mamba architecture, which replaces traditional self-attention with a selective state-space model (SSM), reducing computational complexity from quadratic $O(N^2)$ to linear $O(N)$ relative to the sequence length. As a result, MGMR-Net is well-suited for long-range temporal modeling in multimodal sentiment analysis, enabling efficient cross-modal interaction and reconstruction with minimal inference overhead.

5.4. Case Study

To further illustrate the strengths and limitations of MGMR-Net, we present a qualitative analysis of three representative samples from the MOSI dataset (Figure 4), covering scenarios of subtle sentiment cues, modality consistency, and severe modality missingness.
Case 1 involves a challenging instance where the textual modality conveys neutral sentiment, while the visual modality reveals subtle negative facial expressions. MGMR-Net correctly predicts a negative label, benefiting from the Mamba-collaborative fusion module that enables fine-grained cross-modal alignment and temporal gating. In contrast, ALMT, which heavily relies on textual dominance, predicts a neutral label, and LNLN is misled by audio noise, producing an incorrect positive classification.
Case 2 depicts a sample with consistent negative sentiment across all three modalities: explicit negative language, low-pitched and flat acoustic tone, and clear negative facial cues. MGMR-Net, LNLN, and ALMT all successfully predict the negative label. The strong modality agreement renders this sample relatively straightforward. Although MGMR-Net does not exhibit a significant advantage here, its hierarchical fusion ensures robust multimodal integration. Similarly, LNLN and ALMT perform reliably under well-aligned and conflict-free modality inputs.
Case 3 represents a failure case characterized by severe modality missingness. The visual modality is heavily corrupted, and the acoustic signal lacks prosodic information, containing only rhythmic patterns. All models fail to predict correctly. Despite MGMR-Net’s Mamba-enhanced reconstruction module, extreme information sparsity limits reconstruction effectiveness, revealing a common limitation among current models when faced with severely incomplete inputs.
These examples suggest that MGMR-Net excels in capturing subtle or noisy sentiment cues through dynamic fusion and reconstruction. Nonetheless, like other models, it struggles with highly sparse or conflicting modality conditions.

5.5. Comparative Discussion with Prior Models

To better position MGMR-Net within the multimodal sentiment analysis (MSA) landscape, we provide a comparative discussion with several representative models, including MISA, SELF_MM, TETFN, ALMT, and LNLN.
MISA and SELF_MM emphasize modality-specific representation learning through disentanglement or pseudo-label generation. While they demonstrate strong performance when all modalities are present, they exhibit notable performance degradation in the presence of missing inputs due to the absence of explicit cross-modal recovery mechanisms. For example, on the CMU-MOSI dataset, MISA achieves an Acc-2 of 71.49%, while MGMR-Net surpasses this with 73.50%. Furthermore, with increasing missing rates, SELF_MM shows a significant decline in the F1-score, as illustrated in Figure 3, while MGMR-Net maintains a relatively stable performance. These results highlight the effectiveness of the proposed Mamba-enhanced reconstruction module in addressing incomplete modality scenarios.
TETFN enhances textual representations via cross-modal attention and visual Transformers but lacks advanced temporal modeling capabilities. This limitation becomes evident on the CH-SIMS dataset, where its F1-score reaches only 68.67%, compared to MGMR-Net’s 78.91%. The superior performance of MGMR-Net can be attributed to the integration of Mamba state-space models, which facilitate long-range temporal dependency modeling with linear complexity.
ALMT and LNLN focus on leveraging the language modality as the dominant source of sentiment information, employing mechanisms for noise suppression and modality correction. Although these models perform well under balanced modality conditions, such as LNLN's strong F1-score of 79.43% on CH-SIMS, they are less robust when confronted with substantial modality corruption or inconsistency. MGMR-Net, on the other hand, combines collaborative Bi-Mamba-based alignment with hierarchical fusion and temporal-aware reconstruction, yielding consistently high performance across CMU-MOSI, CMU-MOSEI, and CH-SIMS, as shown in the comprehensive results tables.
In summary, MGMR-Net integrates alignment, fusion, and reconstruction within a unified, text-centric architecture. Experimental evidence demonstrates its superiority in classification accuracy, F1-score, and robustness to incomplete modalities, thus validating the design choices and establishing its competitiveness among state-of-the-art MSA models.

6. Conclusions

This paper presents MGMR-Net, a unified framework that addresses key challenges in multimodal sentiment analysis, including incomplete modality inputs, long-range sequence modeling, and efficient multimodal fusion. The proposed model employs a text-centric two-stage architecture grounded in the Mamba framework, enabling effective cross-modal alignment and hierarchical fusion with linear computational complexity. Additionally, the Mamba-enhanced reconstruction module accurately restores corrupted modality features by capturing both short- and long-term temporal dependencies. A joint optimization strategy integrates sentiment prediction with reconstruction objectives, fostering robust and generalizable representation learning under modality-missing conditions. Extensive experiments demonstrate that MGMR-Net effectively learns temporally coherent and semantically rich multimodal representations, achieving superior generalization in real-world scenarios with incomplete data.

Author Contributions

C.Y.: conceptualization, methodology, software, and writing—original draft preparation. Z.L.: project administration, supervision, and writing—review and editing. T.L.: funding and writing—review and editing. Z.H.: formal analysis and validation. D.Y.: formal analysis and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515011230, the Science and Technology Program of Guangzhou under Grant 2023E04J0037, the Heyuan Social Science and Agriculture Project under Grant 2023015, the Science and Technology Planning Project of Yunfu under Grant 2023020205, the Key Construction Discipline Research Ability Enhancement Project of Guangdong Province under Grant 2022ZDJS022, and the Guangdong Province Science and Technology Innovation Strategy Special Fund (University Student Science and Technology Innovation Cultivation) Project for 2024 under Grant pdjh2024a199.

Data Availability Statement

The datasets used in this article can be obtained at the following URL: https://github.com/thuiar/MMSA (accessed on 30 July 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

References

  1. Gandhi, A.; Adhvaryu, K.; Poria, S. Multimodal Sentiment Analysis: A Systematic Review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges, and Future Directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
  2. Zhu, L.; Zhu, Z.; Zhang, C.; Xu, Y.; Kong, X. Multimodal Sentiment Analysis Based on Fusion Methods: A Survey. Inf. Fusion 2023, 95, 306–325. [Google Scholar] [CrossRef]
  3. Pham, H.; Liang, P.P.; Manzini, T. Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6892–6899. [Google Scholar] [CrossRef]
  4. Zeng, Y.; Mai, S.; Hu, H. Which Is Making the Contribution: Modulating Unimodal and Cross-Modal Dynamics for Multimodal Sentiment Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1262–1274. [Google Scholar] [CrossRef]
  5. Zeng, Y.; Mai, S.; Yan, W. Multimodal Reaction: Information Modulation for Cross-Modal Representation Learning. IEEE Trans. Multimed. 2024, 26, 2178–2191. [Google Scholar] [CrossRef]
  6. Lu, Q.; Sun, X.; Gao, Z. Coordinated-Joint Translation Fusion Framework with Sentiment-Interactive Graph Convolutional Networks for Multimodal Sentiment Analysis. Inf. Process. Manag. 2024, 61, 103538. [Google Scholar] [CrossRef]
  7. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259. [Google Scholar]
  8. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 2236–2246. [Google Scholar] [CrossRef]
  9. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Yang, K. Ch-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 3718–3727. [Google Scholar] [CrossRef]
  10. Yuan, Z.; Li, W.; Xu, H. Transformer-Based Feature Reconstruction Network for Robust Multimodal Sentiment Analysis. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 4400–4407. [Google Scholar] [CrossRef]
  11. Goncalves, L.; Busso, C. Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features. IEEE Trans. Affect. Comput. 2022, 13, 2156–2170. [Google Scholar] [CrossRef]
  12. Yuan, Z.; Liu, Y.; Xu, H. Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis. IEEE Trans. Multimed. 2023, 26, 529–539. [Google Scholar] [CrossRef]
  13. Kang, M.; Zhu, R.; Chen, D.; Xu, Y.; Li, W.; Liu, L. CM-GAN: A Cross-Modal Generative Adversarial Network for Imputing Completely Missing Data in Digital Industry. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 2917–2926. [Google Scholar] [CrossRef] [PubMed]
  14. Tsai, Y.H.H.; Bai, S.; Liang, P.P. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar] [CrossRef]
  15. Tao, C.; Li, J.; Zang, T.; Gao, P. A Multi-Focus-Driven Multi-Branch Network for Robust Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 1547–1555. [Google Scholar] [CrossRef]
  16. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  17. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar] [CrossRef]
  18. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar] [CrossRef]
  19. Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis. Pattern Recognit. 2023, 136, 109259. [Google Scholar] [CrossRef]
  20. Zhang, H.; Wang, Y.; Yin, G.; Liu, K.; Liu, Y.; Yu, T. Learning Language-Guided Adaptive Hyper-Modality Representation for Multimodal Sentiment Analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 756–767. [Google Scholar] [CrossRef]
  21. Zhang, H.; Wang, W.; Yu, T. Towards Robust Multimodal Sentiment Analysis with Incomplete Data. arXiv 2024, arXiv:2409.20012. [Google Scholar]
  22. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114. [Google Scholar] [CrossRef]
  23. Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. arXiv 2021, arXiv:2109.00412. [Google Scholar]
  24. Wang, D.; Liu, S.; Wang, Q.; Tian, Y.; He, L.; Gao, X. Cross-Modal Enhancement Network for Multimodal Sentiment Analysis. IEEE Trans. Multimed. 2022, 25, 4909–4921. [Google Scholar] [CrossRef]
  25. Huang, C.; Zhang, J.; Wu, X.; Wang, Y.; Li, M.; Huang, X. TeFNA: Text-centered Fusion Network with Crossmodal Attention for Multimodal Sentiment Analysis. Knowl.-Based Syst. 2023, 269, 110502. [Google Scholar] [CrossRef]
  26. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585. [Google Scholar]
  27. Qiao, J.; Liao, J.; Li, W.; Zhang, Y.; Guo, Y.; Wen, Y.; Lin, S. Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution. arXiv 2024, arXiv:2410.10140. [Google Scholar]
  28. Jiang, X.; Li, Y.A.; Florea, A.N.; Han, C.; Mesgarani, N. Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis. In Proceedings of the ICASSP 2025—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
  29. Chen, Q.; Tang, Y.; Liu, H. Mamba-Assisted Modality Subspace Complementary Fusion for Multimodal Sentiment Analysis. Pattern Recognit. Lett. 2025, 196, 31–37. [Google Scholar] [CrossRef]
  30. Ye, J.; Zhang, J.; Shan, H. DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection. In Proceedings of the ICASSP 2025—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
  31. Li, Y.; Xing, Y.; Lan, X.; Li, X.; Chen, H.; Jiang, D. AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-Modal Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 24774–24784. [Google Scholar]
  32. Zeng, J.; Liu, T.; Zhou, J. Tag-Assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; ACM: New York, NY, USA, 2022; pp. 1545–1554. [Google Scholar] [CrossRef]
  33. Li, M.; Yang, D.; Lei, Y.; Wang, S.; Wang, S.; Su, L.; Zhang, L. A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; AAAI Press: Palo Alto, CA, USA, 2024; Volume 38, pp. 10074–10082. [Google Scholar] [CrossRef]
  34. Chen, R.; Zhou, W.; Hu, H.; Fei, Z.; Fei, M.; Zhou, H. Disentangled Variational Auto-Encoder for Multimodal Fusion Performance Analysis in Multimodal Sentiment Analysis. Knowl.-Based Syst. 2024, 301, 112372. [Google Scholar] [CrossRef]
  35. Mai, S.; Hu, H.; Xing, S. Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 164–172. [Google Scholar] [CrossRef]
  36. Wu, Z.; Zhang, Q.; Miao, D.; Yi, K.; Fan, W.; Hu, L. HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis. arXiv 2024, arXiv:2404.11938. [Google Scholar]
  37. Li, M.; Yang, D.; Zhao, X.; Wang, S.; Wang, Y.; Yang, K.; Zhang, L. Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE/CVF: Piscataway, NJ, USA, 2024; pp. 12458–12468. [Google Scholar] [CrossRef]
  38. Liang, H.; Xie, W.; He, X.; Song, S.; Shen, L. Circular Decomposition and Cross-Modal Recombination for Multimodal Sentiment Analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 7910–7914. [Google Scholar] [CrossRef]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  40. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
  41. Liu, Y.; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y.; Gao, K. Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module. In Proceedings of the 2022 International Conference on Multimodal Interaction, Bengaluru, India, 7–11 November 2022; pp. 247–258. [Google Scholar] [CrossRef]
  42. Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP—A Collaborative Voice Analysis Repository for Speech Technologies. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 960–964. [Google Scholar] [CrossRef]
  43. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and Music Signal Analysis in Python. In Proceedings of the 14th Python in Science Conference (SciPy 2015), Austin, TX, USA, 6–12 July 2015; pp. 18–24. [Google Scholar] [CrossRef]
  44. Li, S.Z.; Jain, A.K.; Tian, Y.L.; Kanade, T.; Cohn, J.F. Facial Expression Analysis. In Handbook of Face Recognition; Springer: New York, NY, USA, 2005; pp. 247–275. [Google Scholar] [CrossRef]
  45. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
  46. Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.-P. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 59–66. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of MGMR-Net consists primarily of three components: the unimodal encoder, the Mamba-collaborative fusion module, and the Mamba-enhanced reconstruction module.
Figure 2. An illustration of collaborative Bi-Mamba with text and audio/visual inputs. The symbol Flip denotes the temporal flip operation.
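The Flip operation highlighted in the Figure 2 caption corresponds to running one Mamba branch over the sequence in its original temporal order and a second branch over the time-reversed sequence, then flipping the second output back before combining the two. The sketch below is illustrative only; the mamba_ssm dependency, the additive combination of the two directions, and the feature width of 128 are assumptions rather than the exact MGMR-Net design.

```python
# Minimal Bi-Mamba sketch built around the temporal Flip operation of Figure 2.
# Assumptions: the public `mamba_ssm` package, additive fusion of the two
# directions, and d_model = 128; none of these are confirmed by the paper.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (assumed dependency)


class BiMamba(nn.Module):
    """Bidirectional Mamba: a forward branch plus a time-reversed branch."""

    def __init__(self, d_model: int, d_state: int = 12, expand: int = 4):
        super().__init__()
        self.forward_branch = Mamba(d_model=d_model, d_state=d_state, expand=expand)
        self.backward_branch = Mamba(d_model=d_model, d_state=d_state, expand=expand)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        y_fwd = self.forward_branch(x)
        x_flip = torch.flip(x, dims=[1])                            # Flip: reverse the time axis
        y_bwd = torch.flip(self.backward_branch(x_flip), dims=[1])  # flip back to the original order
        return y_fwd + y_bwd                                        # combine both directions


device = "cuda"  # mamba-ssm's selective-scan kernels require a CUDA device
model = BiMamba(d_model=128).to(device)
seq = torch.randn(8, 50, 128, device=device)  # e.g., a text or audio/visual feature sequence
out = model(seq)                              # same shape as the input: (8, 50, 128)
```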
Figure 3. Comparative model performance under increasing modality missing rates on the CMU-MOSI, CMU-MOSEI, and CH-SIMS benchmarks.
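Figure 3 reports robustness as the modality missing rate increases. One common way to realize such a protocol, given here only as a hedged illustration (the paper does not specify its exact corruption scheme, and the helper random_missing is hypothetical), is to zero out a randomly chosen fraction of time steps in each modality sequence:

```python
# Illustrative only: simulate a modality missing rate p by randomly zeroing out
# frames of a feature sequence. The corruption protocol behind Figure 3 is an
# assumption here, not taken from the paper.
import torch


def random_missing(features, missing_rate, seed=None):
    """Zero out a random fraction of time steps.

    features: (batch, length, dim) modality sequence.
    Returns the corrupted sequence and the boolean mask of dropped positions.
    """
    generator = torch.Generator(device=features.device)
    if seed is not None:
        generator.manual_seed(seed)
    drop = torch.rand(features.shape[:2], generator=generator, device=features.device) < missing_rate
    corrupted = features.masked_fill(drop.unsqueeze(-1), 0.0)
    return corrupted, drop


audio = torch.randn(8, 50, 74)  # e.g., COVAREP-style acoustic features
audio_missing, mask = random_missing(audio, missing_rate=0.5, seed=1111)
```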
Figure 4. Case analysis for multimodal language with our MGMR-Net on the CMU-MOSI dataset.
Table 1. Statistics of the three experimental datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS.
Datasets     Train     Valid    Test     Total     Language
CMU-MOSI     1284      229      686      2199      English
CMU-MOSEI    16,326    1871     4659     22,856    English
CH-SIMS      1368      456      457      2281      Chinese
Table 2. Hyper-parameter settings on different datasets.
Descriptions       CMU-MOSI     CMU-MOSEI    CH-SIMS
Batch Size         32           32           16
Length L           50           50           39
Mamba State        12           12           16
Mamba Expansion    4            4            2
Mamba Depth        {1,1}        {2,2}        {1,2}
Attention Head     8            8            8
Learning Rate      1 × 10⁻³     1 × 10⁻³     1 × 10⁻³
Seed               1111         1111         1111
Early Stop         8            8            8
Warm Up
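To relate the Mamba hyper-parameters in Table 2 to an implementation, the sketch below shows one plausible way to instantiate a small stack of Mamba blocks with the listed state size, expansion factor, and depth. The MambaStack wrapper, the hidden width of 128, and the mamba_ssm dependency are assumptions; only the numeric settings come from Table 2.

```python
# Illustrative mapping of Table 2's Mamba hyper-parameters (state size, expansion,
# depth) onto a stack of Mamba blocks. The wrapper class, the hidden width of 128,
# and the `mamba_ssm` dependency are assumptions, not the released MGMR-Net code.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (assumed dependency)


class MambaStack(nn.Module):
    """`depth` pre-norm residual Mamba blocks applied to a (batch, length, d_model) sequence."""

    def __init__(self, d_model: int, d_state: int, expand: int, depth: int):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
        self.blocks = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=d_state, expand=expand) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))  # pre-norm residual update
        return x


# CMU-MOSI setting from Table 2: state = 12, expansion = 4, one block per stage.
device = "cuda"  # mamba-ssm's selective-scan kernels require a CUDA device
stack = MambaStack(d_model=128, d_state=12, expand=4, depth=1).to(device)
out = stack(torch.randn(32, 50, 128, device=device))  # batch 32, sequence length L = 50
```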
Table 3. Overall performance comparison on the CMU-MOSI dataset for the MSA benchmark.
Method      Acc-7    Acc-5    Acc-2          F1             MAE      Corr
MISA        29.85    33.08    71.49/70.33    71.28/70.00    1.085    0.524
Self-MM     29.55    34.67    70.51/69.26    66.60/67.54    1.070    0.512
MMIM        31.30    33.77    69.14/67.06    66.65/64.04    1.077    0.507
CENET       30.38    37.25    71.46/67.73    68.61/64.85    1.080    0.504
TETFN       30.30    34.34    69.76/67.68    65.69/63.29    1.087    0.507
TFR-Net     29.54    34.67    68.15/66.35    67.03/66.35    1.200    0.459
ALMT        30.30    33.42    70.40/68.39    72.57/71.80    1.083    0.492
LNLN        34.26    38.27    72.55/70.94    72.73/71.25    1.046    0.527
MGMR-Net    35.38    39.46    73.50/72.37    73.53/72.39    1.038    0.534
Table 4. Overall performance comparison on the CMU-MOSEI dataset for the MSA benchmark.
Method      Acc-7    Acc-5    Acc-2          F1             MAE      Corr
MISA        40.84    39.39    71.27/75.82    63.85/68.73    0.780    0.503
Self-MM     44.70    45.38    73.89/77.42    68.92/72.31    0.695    0.498
MMIM        40.75    41.74    73.32/75.89    68.72/70.32    0.739    0.489
CENET       47.18    43.01    73.66/77.34    70.04/74.04    0.685    0.535
TETFN       30.40    47.70    69.76/67.68    65.69/63.29    1.087    0.508
TFR-Net     46.83    34.67    73.62/77.23    68.60/71.19    0.697    0.489
ALMT        46.41    41.64    76.64/77.54    77.14/78.03    0.674    0.481
LNLN        45.42    46.17    76.30/78.19    77.77/79.95    0.692    0.530
MGMR-Net    46.10    46.92    77.42/77.83    77.90/78.21    0.671    0.541
Table 5. Overall performance comparison on the CH-SIMS dataset for the MSA benchmark.
Method      Acc-5    Acc-3    Acc-2    F1       MAE      Corr
MISA        31.53    56.87    72.71    66.30    0.539    0.348
Self-MM     32.28    56.75    72.81    68.43    0.508    0.376
MMIM        31.81    52.76    69.86    66.21    0.544    0.339
CENET       22.29    53.17    68.13    57.90    0.589    0.107
TETFN       33.42    56.91    73.58    68.67    0.505    0.387
TFR-Net     26.52    52.89    68.13    58.70    0.661    0.169
ALMT        34.16    56.47    71.85    76.21    0.509    0.372
LNLN        34.64    57.14    72.73    79.43    0.514    0.397
MGMR-Net    34.81    56.72    73.65    78.91    0.511    0.389
Table 6. Ablation study on the CMU-MOSI dataset.
Model                 Acc-2          F1-Score       MAE      Corr
MGMR-Net              73.50/72.37    73.53/72.39    1.038    0.534
MGMR-Net w/o S-1      68.28/67.96    70.32/70.01    1.120    0.498
MGMR-Net w/o S-2      71.20/71.83    72.34/71.78    1.073    0.516
MGMR-Net w/o Recon    73.10/72.28    68.52/67.59    1.087    0.508
Table 7. Performance comparison of model inference time on the CMU-MOSI dataset.
Method      Time (s), Complete Modality    Time (s), Incomplete Modality (0.5 Missing Rate)
MISA        4.02                           4.03
Self_MM     3.79                           4.08
MMIM        4.17                           4.69
ALMT        4.01                           4.56
LNLN        4.06                           4.05
MGMR-Net    3.65                           3.93
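The comparison in Table 7 depends on how inference time is measured. A typical measurement loop, sketched below under the assumption of warm-up iterations and explicit CUDA synchronization (the paper does not describe its timing protocol, and measure_inference_time is a hypothetical helper), looks like this:

```python
# Illustrative timing loop for an inference-time comparison like Table 7.
# The warm-up count and the use of CUDA synchronization are assumptions;
# the paper does not describe its exact measurement protocol.
import time
import torch


@torch.no_grad()
def measure_inference_time(model, batches, warmup=5):
    """Return the seconds spent running `model` over `batches` (a list of input tensors)."""
    model.eval()
    for batch in batches[:warmup]:      # warm up kernels and memory caches
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # make sure warm-up work has finished
    start = time.perf_counter()
    for batch in batches:
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # wait for all queued GPU work before stopping the clock
    return time.perf_counter() - start
```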
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
