Article

MISA-GMC: An Enhanced Multimodal Sentiment Analysis Framework with Gated Fusion and Momentum Contrastive Modality Relationship Modeling

1 Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR 999078, China
2 Peking University School of Nursing, Beijing 100191, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 115; https://doi.org/10.3390/math14010115
Submission received: 3 December 2025 / Revised: 25 December 2025 / Accepted: 26 December 2025 / Published: 28 December 2025
(This article belongs to the Special Issue Applications of Machine Learning and Pattern Recognition)

Abstract

Multimodal sentiment analysis jointly exploits textual, acoustic, and visual signals to recognize human emotions more accurately than unimodal models. However, real-world data often contain noisy or partially missing modalities, and naive fusion may allow unreliable signals to degrade overall performance. To address this, we propose an enhanced framework named MISA-GMC, a lightweight extension of the widely used MISA backbone that explicitly accounts for modality reliability. The core idea is to adaptively reweight modalities at the sample level while regularizing cross-modal representations during training. Specifically, a reliability-aware gated fusion module down-weights unreliable modalities, and two auxiliary training-time regularizers (momentum contrastive learning and a lightweight correlation graph) help stabilize and refine multimodal representations without adding inference-time overhead. Experiments on three benchmark datasets—CMU-MOSI, CMU-MOSEI, and CH-SIMS—demonstrate the effectiveness of MISA-GMC. For instance, on CMU-MOSI, the proposed model improves 7-class accuracy from 43.29 to 45.92, reduces the mean absolute error (MAE) from 0.785 to 0.712, and increases the Pearson correlation coefficient (Corr) from 0.764 to 0.795. This indicates more accurate fine-grained sentiment prediction and better sentiment-intensity estimation. On CMU-MOSEI and CH-SIMS, MISA-GMC also achieves consistent gains over MISA and strong baselines such as LMF, ALMT, and MMIM across both classification and regression metrics. Ablation studies and missing-modality experiments further verify the contribution of each component and the robustness of MISA-GMC under partial-modality settings.

1. Introduction

Multimodal sentiment analysis (MSA) studies how to infer people’s attitudes and emotions by looking at textual, acoustic, and visual signals together [1]. Rather than relying only on what is said, MSA models also take into account how it is spoken and how the speaker looks—spoken content, prosody, and facial expressions—so they capture affective cues that text-only models often miss. Such models are now used in applications including opinion mining, intelligent virtual assistants, and social media analysis [2].
Developing models that can reliably analyze information from multiple modalities in real-world settings remains challenging. Different modalities exhibit markedly different characteristics: textual streams usually provide clear semantic content, whereas acoustic and visual signals are often noisy, weak, or partially corrupted. When all modalities are combined in a simple or static manner, low-quality signals can dominate the fusion process and degrade overall performance [3]. In addition, real-world datasets frequently contain incomplete or missing modalities due to sensor failures, occlusions, or speaker behavior, yet many models are still trained and evaluated under the idealized assumption that all modalities are always present [4]. At the representation level, the model must disentangle shared and modality-specific information and align heterogeneous feature spaces, while avoiding spurious correlations that are unrelated to sentiment labels [5]. These factors make it difficult to design multimodal fusion schemes that are both accurate and robust. Prior work in MSA is often broadly grouped into fusion-centric methods that emphasize cross-modal interaction design and representation-centric methods that learn structured shared/private embeddings, sometimes with information-theoretic or contrastive objectives [6,7,8].
Overall, despite these advances, a common practical bottleneck is that many models implicitly assume modalities are equally reliable and do not explicitly control modality trustworthiness, leading to brittle behavior under noisy or missing modalities.
Among representation-centric approaches, the Modality-Invariant and Modality-Specific Representations (MISA) framework has become particularly influential [9]. MISA decomposes each modality into shared and private components, enforces a clear separation between these subspaces, and encourages the shared components to follow similar distributions across modalities. As a result, it has been widely adopted as a strong baseline for multimodal sentiment analysis [2]. Nevertheless, the original MISA architecture exhibits several limitations when the modality inputs are noisy or of unequal quality. In this paper, we primarily target MISA’s limited robustness under varying modality quality and missing-modality settings, while keeping the extension lightweight. Specifically, MISA does not provide a sample-adaptive mechanism to adjust the contribution of each modality during fusion, making it difficult to down-weight unreliable modalities. Its training process is dominated by supervised losses and relatively simple cross-modal consistency constraints, without leveraging stronger momentum-based contrastive regularization schemes.
To address these limitations, this work builds on MISA and proposes MISA-GMC, a lightweight extension for multimodal sentiment analysis. We retain the original shared–private decomposition and introduce three additional components that are jointly designed to address the above robustness bottleneck. First, a reliability-aware gated fusion module adaptively adjusts the contribution of each modality on a per-sample basis, so that unreliable modalities are explicitly down-weighted during fusion, and it produces intuitive modality-wise gating visualizations. Second, an MOCO-style momentum contrastive learning strategy regularizes cross-modal representations during training without introducing additional inference-time cost; it stabilizes representation learning by leveraging a large set of negatives and reduces spurious correlations that arise from noisy modalities. Third, we design a global modality contrast (GMC) term that models inter-modality dependencies by directly aligning embeddings across all modality pairs in the feature space; it is implemented as pairwise cross-modal contrastive terms, without any explicit graph neural network or message passing. Experiments on public MSA benchmarks show that MISA-GMC consistently outperforms the vanilla MISA baseline and several representative multimodal models on most classification and regression metrics [10]. Furthermore, ablation studies and missing-modality experiments demonstrate that each proposed component yields a measurable performance gain and that the overall framework exhibits improved robustness under various missing-modality scenarios while remaining competitive in terms of efficiency. The three components are complementary: gating handles sample-wise reliability at fusion time, MOCO strengthens cross-sample representation regularization, and GMC ties modalities together through a global pairwise correlation structure [11].
In summary, this work makes the following contributions. First, we design a reliability-aware gated fusion mechanism on top of MISA that adaptively modulates the contribution of each modality on a per-sample basis and provides intuitive modality-wise gating visualizations. Second, we combine an MOCO-style cross-modal contrastive regularization with the GMC term to strengthen global cross-modal interactions during training. The contrastive branch is discarded at inference, and therefore it introduces zero additional inference-time cost. Third, we conduct extensive experiments, including ablation studies, missing-modality simulations, and engineering-cost analysis, to demonstrate that the proposed MISA-GMC framework achieves a favorable trade-off between accuracy, robustness to modality corruption or absence, and practical deployment efficiency.

2. Related Work

2.1. Fusion-Centric Multimodal Sentiment Analysis

Many MSA approaches focus on how to effectively combine heterogeneous modalities. Early tensor-based fusion architectures explicitly model unimodal, bimodal, and trimodal interactions via high-order outer products, achieving strong performance but at the cost of substantial computational and memory overhead [12]. Such high-capacity fusion schemes are often sensitive to modality noise because they do not explicitly control the per-sample contribution of each modality. Subsequent low-rank and bilinear fusion methods factorize these high-order structures into compact components, reducing the number of parameters while preserving most of the cross-modal interaction information [13]. Models with multi-stage fusion and dedicated memory units further refine cross-modal interactions over time, helping capture complex temporal patterns in spoken language scenarios [14,15]. Building on the success of attention mechanisms, cross-modal Transformer-style models use stacked attention layers to directly learn interactions between unaligned language, acoustic, and visual sequences, and have become common baselines for sequence-level multimodal understanding [16,17]. Still, these interaction-heavy designs typically assume full and reliable modalities at both training and inference, and therefore may degrade when some modalities are corrupted or missing [18].
Beyond generic interaction architectures, several studies explicitly examine modality reliability and language-dominant fusion. In such approaches, text is typically treated as the primary modality, while acoustic and visual cues are injected into high-level textual representations through modality adaptation gates or related gating mechanisms [19]. The aim is to emphasize sentiment-relevant acoustic and visual patterns while down-weighting noisy or conflicting signals. Nevertheless, many of these gates are designed under a text-dominant assumption and do not provide a symmetric, sample-adaptive reliability control for all modalities, nor do they explicitly preserve a disentangled shared–private representation structure. More recent work adopts boosting-style or reliability-aware reweighting schemes that estimate the contribution of each modality and adapt fusion weights accordingly, thereby improving robustness when some modalities are weak, corrupted, or partially missing [20,21,22]. Even so, these approaches are often built on relatively heavy backbones or introduce additional inference-time modules, which can be undesirable for deployment-oriented settings.
Overall, these fusion-centric approaches highlight the importance of carefully designed interaction patterns and explicit modeling of modality reliability. However, they are usually built on relatively heavy backbones and often lack an explicit shared–private representation structure, which makes it difficult to disentangle factors that are common across modalities from those that are modality-specific. In addition, most of these methods focus on full-modality settings and provide limited analysis of model behavior under missing-modality or deployment-oriented constraints [23]. In contrast to these limitations, our study keeps the lightweight MISA-style shared–private backbone and introduces a per-sample reliability-aware gating mechanism (without adding inference-time complexity), while systematically evaluating robustness under missing-modality scenarios. These gaps motivate our exploration of a lighter, MISA-based framework with explicit reliability modeling.

2.2. Shared–Private Representation and Contrastive Learning

Beyond fusion architectures, representation-centric approaches aim to construct structured and robust multimodal embeddings [24]. The MISA framework decomposes each modality into shared and private components and imposes constraints to separate common sentiment-relevant factors from modality-specific information [9]. In particular, it enforces independence conditions between shared and private subspaces to reduce interference between them. Subsequent work has introduced co-space interaction networks that allow information to flow between shared and private spaces, as well as hierarchical mutual-information maximization strategies that preserve informative signals at unimodal, bimodal, and trimodal levels during fusion [25,26]. These studies collectively suggest that explicitly structuring multimodal representations in this way can improve both performance and interpretability. Nevertheless, most shared–private frameworks still treat modalities as equally reliable during fusion and seldom provide an explicit, sample-wise mechanism to suppress unreliable modalities under noise or partial observation.
Information-theoretic and contrastive objectives have also become important tools for organizing multimodal representations. Information bottleneck-based methods impose constraints that encourage the model to learn minimal sufficient unimodal and multimodal representations, thereby reducing redundancy and noise in the joint feature space [27]. Contrastive frameworks encourage samples with similar labels to lie close in the representation space, which improves generalization when test distributions differ from those seen during training [28,29,30]. Other work uses contrastive feature decomposition to separate complementary and redundant components, helping to alleviate cross-modal redundancy and sharpen decision boundaries [31]. In related research on aspect-level sentiment analysis, momentum contrastive learning has been adapted via queue-based encoders, demonstrating that MOCO-style objectives can effectively leverage large sets of negative examples [32]. Active-learning-based frameworks further combine uncertainty sampling with hierarchical contrastive losses at unimodal and multimodal levels, enabling better use of limited labeled data and guiding fusion under modality inconsistency [33]. Despite their effectiveness, these contrastive or information-theoretic approaches are typically studied without jointly addressing (i) sample-adaptive modality reliability in fusion and (ii) robustness evaluation under missing-modality conditions within a lightweight shared–private backbone.
Our work is closely related to these shared–private and contrastive frameworks, but differs in how it combines them in a practical multimodal sentiment analysis setting. Instead of proposing an entirely new architecture, we retain the widely used MISA backbone and augment it with three lightweight components: a reliability-aware gating module that controls modality contributions in a sample-adaptive manner, an MOCO-style contrastive objective that regularizes cross-modal representations, and a modality correlation (GMC) module that explicitly models inter-modality relationships. This design allows the model to adjust the influence of each modality for different inputs and encourages more informative interactions within the shared–private representation space. Our experimental results show that this approach yields consistent, though moderate, improvements over the original MISA and several strong baselines, and provides a more systematic analysis of robustness under missing-modality scenarios as well as the associated engineering cost.

3. Methods

3.1. Task Definition and Notation

We focus on utterance-level multimodal sentiment analysis using text, audio, and visual modalities. Let the input sequences be:
$X_t \in \mathbb{R}^{T_t \times d_t}, \quad X_a \in \mathbb{R}^{T_a \times d_a}, \quad X_v \in \mathbb{R}^{T_v \times d_v}$
where $T_m$ and $d_m$ denote the sequence length and feature dimension of modality $m \in \{t, a, v\}$, respectively. The target is the sentiment intensity $y \in \mathbb{R}$ associated with each utterance.
In the regression setting, the model predicts a continuous score $\hat{y} \in \mathbb{R}$ to approximate $y$. In the classification setting, the continuous score is mapped to a discrete label $c \in \mathcal{C}$ (e.g., binary, 5-class, or 7-class sentiment). Unless otherwise stated, we use $m \in \{t, a, v\}$ to index modalities and $d$ to denote the unified hidden dimension after projection into a common space. Lowercase letters (e.g., $h_m$, $g_m$, $p_m$) represent vectors, and uppercase letters (e.g., $X_m$, $H$) represent matrices or sequences.
In practice, each sequence $X_m$ is first encoded by a modality-specific encoder into an utterance-level vector: we use a pretrained BERT-based encoder with masked mean pooling for text, and two-layer bidirectional RNNs for audio and visual streams, from which we take the last hidden states. For notational simplicity, we still denote these utterance-level features (or their linear projections into the shared hidden size $d$) by $X_m$. Based on these utterance-level representations, we next introduce the overall MISA-GMC pipeline. A nomenclature list of the main symbols and abbreviations is provided in Appendix A (Table A1) for quick reference.
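To make the encoding step concrete, the following is a minimal PyTorch sketch of the audio/visual branch (a two-layer bidirectional RNN whose last hidden states are projected to the shared size $d$) and of the masked mean pooling applied to BERT token states for text. The use of LSTM cells and the specific layer sizes are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class AudioVisualEncoder(nn.Module):
    """Two-layer bidirectional RNN; the last hidden states of both
    directions are concatenated and projected to the shared size d."""
    def __init__(self, in_dim: int, hidden: int, d: int):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, d)

    def forward(self, x):                      # x: (B, T_m, d_m)
        _, (h_n, _) = self.rnn(x)              # h_n: (num_layers * 2, B, hidden)
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # top-layer fwd/bwd states
        return self.proj(last)                 # (B, d)

def masked_mean_pool(token_states, attention_mask):
    """Masked mean pooling over BERT token states -> utterance-level vector."""
    mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
    summed = (token_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-6)
    return summed / counts                         # (B, hidden)
```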

3.2. Overall Architecture of MISA-GMC

As illustrated in Figure 1a–d, the proposed MISA-GMC follows a three-stage pipeline consisting of (i) shared–private disentanglement, (ii) reliability-aware gating, and (iii) token-level fusion for prediction. Figure 1a presents the concise inference-time computation flow, while the remaining panels summarize auxiliary components used to regularize representation learning during training.
The processing pipeline operates in three stages:
  • Disentanglement: Given the unimodal input from modality m ∈ {T, A, V}, the encoder maps it into a latent space and decomposes the representation into a shared component Sm and a private component Pm. The shared–private separation is encouraged by auxiliary objectives (e.g., reconstruction/consistency constraints) so that Sm captures modality-invariant sentiment cues, whereas Pm preserves modality-specific information, reducing redundancy and improving interpretability.
  • Reliability-Aware Gating: To adaptively select reliable information for each sample, we introduce a lightweight gating unit that takes the shared/private pair (Sm, Pm) and outputs a gate αm with entries in (0, 1). The gate is used to mix shared and private cues into a gated modality representation gm. A gate regularization term is further applied only during training to discourage ambiguous gate values and stabilize reliability estimation, making the sample-wise reweighting behavior more consistent.
  • Fusion and Prediction: After gating, the model forms a short token sequence from the gated features and private cues, i.e., {gm} and {pm}, and performs cross-modal interaction modeling with a lightweight Transformer-based fusion module. The fused representation is then fed into a prediction head to output the final sentiment score/class. This design keeps the fusion stage compact while still enabling effective cross-modal reasoning through self-attention over a fixed-length token sequence.
Although training employs additional regularization branches to improve representation quality, the deployment-time computation remains concise and follows only the main path in Figure 1a (encoding → shared/private decomposition → gating → token-level fusion → prediction). In particular, the contrastive regularization modules (including the momentum encoder, memory queue, and contrastive losses) are introduced as training-only objectives and are discarded at inference, thereby not increasing inference-time overhead. The added components are intentionally lightweight: the gating unit is a small MLP per modality, and the fusion module operates on a short token sequence.
Regarding scalability, the fusion stage naturally generalizes to M modalities by constructing a 2M-token sequence (one gated token and one private token per modality). The pairwise correlation modeling scales with the number of modality pairs (on the order of M(M − 1)/2) and can be reduced by sampling a subset of modality pairs when M is large.

3.3. Shared–Private Representation Learning

Based on the projected unimodal features described in Section 3.2, we follow the shared–private paradigm of MISA to decouple modality-invariant and modality-specific information [9]. Let $\bar{x}_m \in \mathbb{R}^d$ denote the projected representation of modality $m \in \{t, a, v\}$ (text, acoustic, visual). Two lightweight feed-forward networks are applied to obtain a shared code $s_m$ and a private code $p_m$ for each modality:
$s_m = f_s(\bar{x}_m), \qquad p_m = f_p^{(m)}(\bar{x}_m)$
where $f_s$ is shared across modalities and $f_p^{(m)}(\cdot)$ is modality-specific.
To encourage the shared codes of different modalities to lie in a common subspace, we minimize an alignment loss on the shared representations:
$\mathcal{L}_{\text{sim}} = \frac{1}{3}\left( \lVert s_t - s_v \rVert_2^2 + \lVert s_t - s_a \rVert_2^2 + \lVert s_v - s_a \rVert_2^2 \right)$
which is implemented as the mean squared error between pairs of shared codes. This alignment term is used as a soft regularizer after projecting all modalities into the same hidden space, encouraging a coarse modality-invariant structure rather than enforcing identical representations across heterogeneous signals. In practice, it works together with the supervised task objective and the additional contrastive regularizers, so modality-invariant semantics are not assumed to be fully captured by this term alone.
Meanwhile, to prevent the shared and private codes from collapsing to similar directions, we adopt a simple orthogonality surrogate that penalizes their element-wise overlap:
$\mathcal{L}_{\text{diff}} = \frac{1}{3} \sum_{m \in \{t, a, v\}} \frac{1}{d} \lVert s_m \odot p_m \rVert_1$
where $\odot$ denotes the Hadamard product and $\lVert \cdot \rVert_1$ is the $\ell_1$ norm. In implementation, this corresponds to the mean absolute value of $s_m \odot p_m$, which is a stable and easy-to-optimize proxy for strict orthogonality. While this element-wise overlap penalty is not a hard orthogonality constraint, it discourages co-activation between shared and private codes and empirically reduces redundancy. Moreover, the separation is reinforced by using distinct parameterizations for shared and private encoders and by the reconstruction objective, which together mitigate degenerate solutions where the two codes collapse to similar directions.
Finally, to ensure that the combination of shared and private components preserves the information in the projected modality embeddings, we reconstruct $\bar{x}_m$ from their sum. Concretely, we define
$\tilde{x}_m = g_{\text{rec}}^{(m)}(s_m + p_m), \qquad \mathcal{L}_{\text{recon}} = \frac{1}{3} \sum_{m \in \{t, a, v\}} \lVert \tilde{x}_m - \bar{x}_m \rVert_2^2$
where $g_{\text{rec}}^{(m)}$ is a single linear reconstruction head for modality $m$. Note that here we reconstruct the projected modality embeddings $\bar{x}_m$ (i.e., the outputs of the projection layers), rather than the raw low-level acoustic/visual features. We reconstruct the projected embeddings (instead of raw low-level inputs) mainly for stability and efficiency: the upstream acoustic/visual features are high-dimensional and often pre-extracted, making raw reconstruction expensive and not directly aligned with the downstream prediction space. The goal of this reconstruction is to preserve information in the shared hidden space used by the model, while fine-grained modality-specific cues are retained in the private codes and further exposed to the fusion module.
The overall shared–private regularization term of our model is then
$\mathcal{L}_{\text{MISA}} = \mathcal{L}_{\text{diff}} + \mathcal{L}_{\text{sim}} + \mathcal{L}_{\text{recon}}$
which follows the spirit of the original MISA implementation [9], while using the above explicit and reproducible formulations.
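For reference, the three regularizers admit a compact implementation. The sketch below (PyTorch, dictionary-keyed by modality) mirrors the formulas above; the tensor shapes and the dictionary interface are illustrative choices rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def misa_regularizers(s, p, x_bar, rec_heads):
    """Compute L_sim + L_diff + L_recon for modalities m in {t, a, v}.

    s, p, x_bar: dicts mapping 't'/'a'/'v' to (B, d) tensors
    rec_heads:   dict of per-modality linear reconstruction heads
    """
    mods = ['t', 'a', 'v']

    # Pairwise MSE between shared codes (coarse modality-invariant alignment).
    pairs = [('t', 'v'), ('t', 'a'), ('v', 'a')]
    l_sim = sum(F.mse_loss(s[m], s[n]) for m, n in pairs) / 3.0

    # Element-wise overlap penalty between shared and private codes
    # (soft surrogate for orthogonality).
    l_diff = sum((s[m] * p[m]).abs().mean() for m in mods) / 3.0

    # Reconstruct the projected embedding from s_m + p_m.
    l_recon = sum(F.mse_loss(rec_heads[m](s[m] + p[m]), x_bar[m])
                  for m in mods) / 3.0

    return l_sim + l_diff + l_recon
```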

3.4. Lightweight Gating Mechanism

Although the shared–private decomposition separates invariant and modality-specific information, directly feeding both components into the fusion module may not be optimal when some modalities are noisy or weakly relevant. We therefore introduce a lightweight gating mechanism that adaptively reweights the contributions of shared and private codes for each modality.
For modality $m$, we first concatenate its private and shared codes and normalize them:
$h_m^{\text{in}} = \mathrm{LN}\!\left( [\, p_m ; s_m \,] \right)$
where $\mathrm{LN}$ denotes Layer Normalization and $[\cdot\,;\cdot]$ is vector concatenation. The normalized vector is then processed by a small gating MLP:
$u_m = W_2^{(m)} \, \phi\!\left( W_1^{(m)} h_m^{\text{in}} + b_1^{(m)} \right) + b_2^{(m)}$
where $W_1^{(m)}, W_2^{(m)}$ and $b_1^{(m)}, b_2^{(m)}$ are modality-specific parameters and $\phi$ is a ReLU activation. Layer Normalization before the MLP stabilizes the feature distribution of the concatenated codes, which we found important in practice. Although the gate introduces modality-specific parameters, the gating branch is deliberately low-capacity (a small two-layer MLP) and acts as a reweighting function rather than a new feature generator. In addition to Layer Normalization, standard regularization in training (e.g., dropout/weight decay and early stopping in the optimization pipeline) helps mitigate overfitting, especially on smaller datasets.
We obtain a soft gate for each modality by applying a temperature-scaled and bias-shifted sigmoid:
$\alpha_m = \sigma\!\left( \frac{u_m - b_g}{\tau_g} \right), \qquad \alpha_m \in (0, 1)^d$
where $\tau_g > 0$ is a temperature parameter and $b_g$ is a scalar bias. The temperature controls how sharp the gating becomes: smaller values produce more binary-like decisions but may saturate the sigmoid under noisy inputs, whereas larger values yield smoother interpolation. The bias shifts the default preference between shared and private information. In practice, we use a moderate temperature and a simple bias setting so that the gate stays in a stable operating regime, and we tune these values within a narrow range to avoid extreme saturation. The gate $\alpha_m$ controls how much the shared code $s_m$ versus the private code $p_m$ contributes to the mixed representation:
$\tilde{h}_m = \alpha_m \odot s_m + (1 - \alpha_m) \odot p_m$
To avoid over-reliance on the newly introduced gating branch in the early stage of training, we further introduce a scalar gate strength $g_s \in [0, 1]$ that linearly interpolates between the original shared representation and the mixed code:
$g_m = (1 - g_s)\, s_m + g_s\, \tilde{h}_m$
In our experiments, $g_s$ is linearly scheduled from 0.3 to 0.7 over training epochs, starting from a regime close to the vanilla MISA backbone and gradually shifting towards the gated representations. At inference time, we directly use the final value of $g_s$ reached by this schedule, without additional tuning. To improve stability under severe noise or partial observation, the gate is applied on normalized codes and is encouraged to avoid ambiguous mid-range outputs through the gate regularizer, which reduces sensitivity to small perturbations. The scalar gate-strength schedule acts as a simple curriculum: it keeps the model closer to the shared representation early in training and gradually increases the influence of the gated mixture, which empirically stabilizes optimization. We also found that the behavior remains stable when the start/end values of the schedule are varied within a small range, suggesting that the mechanism does not rely on a single fragile scheduling choice. For missing-modality cases, the absent modality can be represented by a neutral placeholder in the hidden space (e.g., a masked/zero embedding), allowing the model to operate without architectural changes; the fusion module then naturally relies more on the remaining informative modalities. After gating, an optional LayerNorm is applied to $g_m$ as a lightweight post-normalization.
To encourage decisive routing behavior, we regularize the gate activations via
$\mathcal{L}_{\text{gate}} = \sum_{m \in \{t, a, v\}} \mathbb{E}\!\left[ \alpha_m \odot (1 - \alpha_m) \right]$
which penalizes ambiguous gate values around 0.5 and pushes $\alpha_m$ towards more binary-like decisions. Intuitively, this mechanism allows the model to down-weight modalities whose private signals conflict with the shared sentiment while still preserving useful modality-specific cues through the residual path.
When the gating module is disabled (for ablation or parity with the MISA baseline), we simply set $g_m = s_m$ and drop $\mathcal{L}_{\text{gate}}$ from the objective.
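A minimal sketch of the gating path follows the chain described above (LayerNorm → two-layer MLP → temperature-scaled sigmoid → shared/private mixing → gate-strength residual). The hidden width, the default bias $b_g = 0$, and the MLP structure are illustrative assumptions; the temperature default of 1.4 is taken from the hyperparameter settings in Section 4.3.

```python
import torch
import torch.nn as nn

class ReliabilityGate(nn.Module):
    """Per-modality gate mixing shared (s_m) and private (p_m) codes."""
    def __init__(self, d: int, tau_g: float = 1.4, b_g: float = 0.0):
        super().__init__()
        self.norm = nn.LayerNorm(2 * d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.tau_g, self.b_g = tau_g, b_g
        self.post_norm = nn.LayerNorm(d)

    def forward(self, s_m, p_m, gate_strength: float):
        h_in = self.norm(torch.cat([p_m, s_m], dim=-1))
        u_m = self.mlp(h_in)
        alpha = torch.sigmoid((u_m - self.b_g) / self.tau_g)       # (B, d), in (0, 1)
        mixed = alpha * s_m + (1.0 - alpha) * p_m                  # gated mixture
        g_m = (1.0 - gate_strength) * s_m + gate_strength * mixed  # residual to s_m
        # Gate regularizer for this modality; the caller sums over modalities.
        l_gate = (alpha * (1.0 - alpha)).mean()
        return self.post_norm(g_m), l_gate
```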

3.5. MOCO-Enhanced Cross-Modal Contrastive Learning (GMC)

To further exploit cross-sample and cross-modal relationships, we place an MOCO-style contrastive learning scheme on top of the gated representations. We first build an intra-modal MOCO backbone to regularize each modality separately, and subsequently extend it to a pairwise cross-modal objective, referred to as global modality contrast (GMC). Both objectives are used only during training as auxiliary losses and are discarded at inference time.

3.5.1. MOCO-Style Contrastive Framework

For each modality $m \in \{t, a, v\}$, we construct a query encoder $f_q^{(m)}$ and a momentum-updated key encoder $f_k^{(m)}$. Given the gated codes $g_m$, we obtain
$q_m = f_q^{(m)}(g_m), \qquad k_m = f_k^{(m)}(g_m)$
which are $\ell_2$-normalized to unit length. The key embeddings are enqueued into a modality-specific memory bank $\mathcal{K}_m$ and simultaneously dequeued to keep a fixed size. The memory banks are updated at every training step using the current mini-batch as keys. For a query–key pair $(q_m, k_m)$ and a negative set $\mathcal{K}_m$, we use the standard InfoNCE loss:
$\ell(q_m, k_m, \mathcal{K}_m) = -\log \frac{\exp(q_m^{\top} k_m / \tau)}{\exp(q_m^{\top} k_m / \tau) + \sum_{k^{-} \in \mathcal{K}_m} \exp(q_m^{\top} k^{-} / \tau)}$
where $\tau$ is a temperature. The intra-modal MOCO loss averages this objective over the three modalities:
$\mathcal{L}_{\text{moco}} = \frac{1}{3} \sum_{m \in \{t, a, v\}} \ell(q_m, k_m, \mathcal{K}_m)$
This loss encourages each modality to form compact clusters in its own representation space, which stabilizes training and improves robustness, especially when some modalities are partially missing or noisy.
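The loss above can be sketched with a queue-based InfoNCE term and a momentum (EMA) update of the key encoder, following standard MoCo conventions. The temperature of 0.07 and momentum of 0.999 are illustrative defaults that are not specified in the text.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau: float = 0.07):
    """InfoNCE with one positive key and a queue of negatives.

    q, k_pos: (B, d) L2-normalized query/key embeddings
    queue:    (K, d) L2-normalized negative keys from the memory bank
    """
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue.t()                           # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)          # -log softmax of the positive

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    """EMA update of the key encoder from the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```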

3.5.2. Modality Correlation via Pairwise Contrast (GMC)

Beyond intra-modal regularization, we would like to explicitly capture how different modalities interact. Conceptually, the three modalities can be viewed as nodes of an implicit fully connected directed graph, where each directed edge $m \to n$ is not a message-passing operation but is simply associated with a cross-modal contrastive loss term. The term “graph” is thus used only in a loose sense to describe the optimization structure over modality pairs: we do not employ any graph neural network, adjacency-matrix computation, or explicit message passing, and GMC is implemented purely as a set of pairwise contrastive terms.
Concretely, we reuse the same query and key encoders as above and consider all six ordered modality pairs:
$(t \to a),\; (t \to v),\; (a \to t),\; (a \to v),\; (v \to t),\; (v \to a)$
For a pair $m \to n$, we treat $q_m = f_q^{(m)}(g_m)$ as the query and $k_n = f_k^{(n)}(g_n)$ as the positive key, while reusing the queue $\mathcal{K}_n$ of modality $n$ as negatives. The corresponding InfoNCE loss is defined in the same way as above. The GMC loss averages over all directed pairs:
$\mathcal{L}_{\text{gmc}} = \frac{1}{6} \sum_{(m, n) \in \mathcal{P}} \ell(q_m, k_n, \mathcal{K}_n)$
where $\mathcal{P}$ is the set of six ordered modality pairs. Intuitively, this encourages each modality to be predictive of the others in the embedding space, yielding a fully connected “graph” of pairwise alignment relations without explicitly running a GNN.
Both $\mathcal{L}_{\text{moco}}$ and $\mathcal{L}_{\text{gmc}}$ are only used during training as auxiliary regularizers; at inference time, the momentum encoder and queues are not involved, so the model does not incur additional computational cost compared with the backbone.
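The GMC term then amounts to averaging the same InfoNCE objective over the six ordered modality pairs, as in the following self-contained sketch (the temperature value is again an illustrative assumption).

```python
import itertools
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau: float = 0.07):
    """InfoNCE with one positive key and a queue of negatives (all L2-normalized)."""
    logits = torch.cat([(q * k_pos).sum(-1, keepdim=True), q @ queue.t()], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

def gmc_loss(q, k, queues, tau: float = 0.07):
    """Average InfoNCE over all six ordered modality pairs.

    q, k:    dicts of (B, d) normalized query/key embeddings per modality
    queues:  dict of (K, d) memory banks per modality
    """
    pairs = list(itertools.permutations(['t', 'a', 'v'], 2))  # six ordered pairs
    return sum(info_nce(q[m], k[n], queues[n], tau) for m, n in pairs) / len(pairs)
```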

3.6. Multimodal Fusion and Objective

Given the gated representations and private codes, we construct a six-token sequence for each sample:
$H = [\, g_t, p_t, g_a, p_a, g_v, p_v \,] \in \mathbb{R}^{6 \times d}$
where the tokens are ordered by modality and role (gated vs. private). This sequence is fed into a lightweight Transformer encoder with a single self-attention layer:
$\tilde{H} = \mathrm{TransformerEncoder}(H) \in \mathbb{R}^{6 \times d}$
This token-based fusion design naturally extends to $M$ modalities by forming a $2M$-token sequence (one gated token and one private token per modality). With a fixed number of modalities, the fusion cost remains bounded; when $M$ increases, the training-time pairwise correlation regularization can be made more scalable by restricting it to a sparse subset of modality pairs or sampling pairs per iteration, while keeping the inference-time path unchanged.
We then flatten $\tilde{H}$ to obtain a fused representation:
$z = \mathrm{vec}(\tilde{H}) \in \mathbb{R}^{6d}$
For the robust fusion head used in our main experiments, we apply LayerNorm and dropout to $z$, followed by a small MLP. In the classification setting, the final prediction is
$\hat{y} = \mathrm{softmax}(W_c z + b_c)$
while in the regression setting we use a linear projection
$\hat{y} = W_r z + b_r$
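For concreteness, a minimal sketch of the token-level fusion and prediction head is given below. The number of attention heads, feed-forward width, and dropout rate are illustrative assumptions (and assume $d$ divisible by the number of heads); the output dimension is 1 for regression and would be set to the number of classes for classification.

```python
import torch
import torch.nn as nn

class TokenFusionHead(nn.Module):
    """Single-layer Transformer fusion over the 6-token sequence, then a small MLP head."""
    def __init__(self, d: int, out_dim: int = 1, dropout: float = 0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4,
                                           dim_feedforward=2 * d,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Sequential(nn.LayerNorm(6 * d), nn.Dropout(dropout),
                                  nn.Linear(6 * d, d), nn.ReLU(),
                                  nn.Linear(d, out_dim))

    def forward(self, g, p):
        # g, p: dicts of (B, d) gated / private codes per modality
        tokens = torch.stack([g['t'], p['t'], g['a'], p['a'], g['v'], p['v']],
                             dim=1)                      # (B, 6, d)
        fused = self.encoder(tokens)                     # (B, 6, d)
        z = fused.flatten(start_dim=1)                   # (B, 6d)
        return self.head(z)                              # score (regression) or logits
```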
Let $\mathcal{L}_{\text{task}}$ denote the standard task loss (cross-entropy for classification or mean squared error for regression). In practice, all loss terms are computed as mini-batch means and are jointly optimized end-to-end with a single optimizer and a shared learning-rate schedule. To ensure that the supervised objective remains the primary driving signal, we balance auxiliary regularizers using scalar coefficients and staged weight schedules. Specifically, the contrastive weights are gradually introduced (warm-up) and then decayed, while the gate regularization coefficient is kept small, which prevents auxiliary objectives from dominating early optimization and stabilizes convergence.
We further adopt validation-based early stopping and select the checkpoint with the best validation performance, which provides an additional safeguard against over-optimizing auxiliary losses at the expense of the task performance. The complete training objective of MISA-GMC combines the task loss, shared–private regularizers, contrastive losses, and gate regularization:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda_{\text{diff}} \mathcal{L}_{\text{diff}} + \lambda_{\text{sim}} \mathcal{L}_{\text{sim}} + \lambda_{\text{recon}} \mathcal{L}_{\text{recon}} + \lambda_{\text{moco}} \mathcal{L}_{\text{moco}} + \lambda_{\text{gmc}} \mathcal{L}_{\text{gmc}} + \gamma_{\text{gate}} \mathcal{L}_{\text{gate}}$
Here $\lambda_{\text{diff}}$, $\lambda_{\text{sim}}$, and $\lambda_{\text{recon}}$ are fixed trade-off coefficients, while $\lambda_{\text{moco}}$ and $\lambda_{\text{gmc}}$ are epoch-dependent weights following a warmup–hold–decay schedule: they are gradually increased from a small initial value to a preset maximum, held there, and then optionally decayed in later epochs using either a cosine or linear policy. This scheduling prevents the auxiliary contrastive objectives from dominating early optimization. The scalar $\gamma_{\text{gate}}$ controls the strength of the gate regularization term and is set to a small value in our experiments.
Overall, the fusion Transformer, contrastive objectives, and gating mechanism work together to provide a flexible yet lightweight multimodal sentiment analysis framework that can adaptively exploit informative modalities while suppressing noisy ones.

4. Experiments

4.1. Datasets

This study evaluates the proposed model on three widely used multimodal sentiment analysis datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS [1,2,27]. CMU-MOSI contains 2199 opinion video clips annotated with sentiment intensity on a continuous scale from −3 to 3, and provides synchronized text, audio, and visual modalities for each sample [1]. CMU-MOSEI is a large-scale, cross-domain multimodal sentiment dataset with 22,856 annotated video segments, following the same continuous sentiment scale and offering tri-modal features, thus enabling more robust evaluation under diverse real-world conditions [2]. CH-SIMS is a Chinese multimodal sentiment analysis dataset comprising 2281 video clips with independently annotated text, audio, and visual modalities, making it particularly representative for sentiment understanding in Chinese linguistic and cultural contexts [34].
All experiments strictly follow the official training, validation, and test splits of each dataset to ensure fair comparison and reproducibility. Table 1 summarizes the partitioning of the training, validation, and test sets for the three datasets used in this study. For all datasets, we adopt the unaligned tri-modal features provided by the MMSA framework, so that the proposed MISA-GMC model is evaluated under the same experimental setting as the original MISA. Next, we introduce the baseline methods for a fair comparison.

4.2. Baseline Methods

To comprehensively evaluate the effectiveness of the proposed MISA-GMC model, we compare it against 11 representative baseline models that cover the major methodological paradigms in multimodal sentiment analysis. These paradigms include tensor fusion, low-rank factorization, cross-modal attention, mutual-information-based representation learning, adaptive multimodal transformation, modality-invariant learning, and temporal modeling. Specifically, TFN models high-order multimodal interactions through tensor outer-product fusion [4], whereas LMF and its extended variant MLMF employ low-rank factorization to approximate high-order interactions with significantly reduced computational cost [5]. MulT uses cross-modal attention to allow one modality to guide feature extraction in others, thereby alleviating modality misalignment [7].
ALMT enhances multimodal representation learning through adaptive hyper-modality transformation and guided cross-modal interaction [12], while MMIM improves multimodal fusion by maximizing hierarchical mutual information across modalities [11]. Self-MM leverages self-supervised reconstruction and consistency constraints to improve robustness in noisy or low-resource scenarios [10]. CENet introduces a cross-modal enhancement mechanism that strengthens inter-modality feature interactions [35], whereas TETFN integrates a text-enhanced Transformer-based fusion architecture with temporal modeling to better handle sequential multimodal inputs [36]. MTFN further extends multi-tensor fusion with cross-modal modeling to capture richer multimodal dynamics [37].
Finally, MISA explicitly separates modality-invariant and modality-specific feature subspaces to alleviate modality heterogeneity and serves as a strong modern baseline for multimodal sentiment analysis [9]. Together, these 11 baselines span the core methodological directions in multimodal sentiment analysis and provide a comprehensive benchmark for assessing the advantages of the proposed MISA-GMC model. All baseline results are reproduced using publicly available implementations and the official hyperparameter settings within the MMSA framework, under the same data splits, pre-extracted unimodal features, and evaluation pipeline as those used for our model. This setting ensures that the observed performance differences primarily reflect fusion and reliability modeling, rather than variations in upstream encoders.
Regarding the scope of comparisons, although we include Transformer-based fusion baselines that are compatible with this feature protocol (e.g., MulT and text-enhanced Transformer variants), many recent end-to-end Transformer-based approaches and large pretrained multimodal foundation models operate on raw modalities and rely on different backbone encoders and large-scale pretraining data. Incorporating them would require replacing the MMSA feature extractors and adopting substantially different training budgets, which would confound the comparison and make it not directly fair or feature-compatible in this setting. Therefore, we exclude such models from the main benchmark and leave a unified raw-input, end-to-end evaluation with pretrained multimodal encoders/foundation models as future work. Then, we detail the implementation settings to ensure reproducibility.

4.3. Experimental Environment and Parameters

All experiments were conducted on a workstation equipped with a single NVIDIA GeForce RTX 4090 GPU (24 GB VRAM), an Intel Xeon Platinum 8470Q CPU with 20 virtual cores, and 90 GB of system memory. The software environment was Ubuntu 20.04 with Python 3.8, PyTorch 2.0.0, and CUDA 11.8. The implementation is built upon the open-source MMSA framework, and we keep the procedures for data preprocessing, feature loading, and metric computation consistent with prior work for all models.
Unless otherwise specified, the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ is used. For the proposed MISA-GMC and the MISA baseline, the batch size is set to 16 on CMU-MOSI and 64 on both CMU-MOSEI and CH-SIMS. The maximum number of training epochs is 24, and an early-stopping strategy with a patience of eight epochs based on the validation loss is employed; the checkpoint that achieves the best validation performance is used for testing. The same training schedule is adopted for the ablation studies and missing-modality experiments.
The MISA-GMC-specific hyperparameters are tuned within a narrow range around a common default, and the detailed settings on different datasets are summarized in Table 2. The gating module uses a temperature parameter $\tau_g = 1.4$ and a gate strength scheduled from 0.3 to 0.7 on all datasets, while the regularization coefficient $\gamma_{\text{gate}}$ is set to 0.4 on MOSI and MOSEI and reduced to 0.1 on SIMS to account for the different label scales. The weight of the MOCO-style contrastive loss $\lambda_{\text{moco}}$ is scheduled between 0.05 and 0.12: it is linearly warmed up during the first eight epochs, kept constant for the next eight epochs, and then decayed using a cosine schedule in the remaining epochs. The GMC loss weight $\lambda_{\text{gmc}}$ is scheduled in the range [0, 0.05] (or [0.02, 0.05] on SIMS) with a six-epoch warm-up, a four-epoch plateau, and a subsequent linear decay. Reconstruction, similarity, and difference losses inherited from MISA use dataset-dependent coefficients ($\lambda_{\text{diff}} = 0.1$, $\lambda_{\text{sim}} = 0.3$ on MOSI; $\lambda_{\text{diff}} = 0.3$, $\lambda_{\text{sim}} = 0.8$ on MOSEI; and $\lambda_{\text{diff}} = 0.3$, $\lambda_{\text{sim}} = 1.0$ on SIMS), while the weights for shared/private factors and reconstruction terms follow the original MISA settings.
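As an illustration, the warmup–hold–decay policy for $\lambda_{\text{moco}}$ on CMU-MOSI/MOSEI can be written as the following sketch; decaying back to the lower bound of the range is an assumption, since the text only states that a cosine decay is applied in the remaining epochs.

```python
import math

def lambda_moco(epoch: int, lo: float = 0.05, hi: float = 0.12,
                warmup: int = 8, hold: int = 8, total: int = 24) -> float:
    """Warmup-hold-decay schedule for the MOCO loss weight:
    linear warm-up over the first 8 epochs, constant for the next 8,
    then a cosine decay (here assumed to return towards the lower bound)."""
    if epoch < warmup:                         # linear warm-up
        return lo + (hi - lo) * epoch / warmup
    if epoch < warmup + hold:                  # plateau
        return hi
    # cosine decay over the remaining epochs
    t = (epoch - warmup - hold) / max(1, total - warmup - hold)
    return lo + 0.5 * (hi - lo) * (1.0 + math.cos(math.pi * t))
```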
For all baseline models, we adopt their officially released configurations in the MMSA framework and keep the input features, data splits, and evaluation pipeline identical to those used for MISA-GMC. Where applicable, common choices such as the optimizer type, maximum number of epochs, and early-stopping criterion are harmonized. Under this experimental setup, the observed performance differences are mainly attributed to the proposed architectural modifications rather than to implementation details.

4.4. Evaluation Metrics

Following standard practice in multimodal sentiment analysis, we report both regression-style and classification-style metrics. From the regression perspective, mean absolute error (MAE) measures the average deviation between the predicted sentiment scores and the ground-truth annotations, where smaller values indicate better performance. In addition, the Pearson correlation coefficient (Corr) is used to quantify the linear correlation between predictions and annotations, reflecting how well the model preserves the relative ordering of sentiment intensities.
From the classification perspective, we adopt accuracy and F1 scores under different label granularities. On CMU-MOSI and CMU-MOSEI, we follow the MMSA framework and report binary accuracy and F1 in two variants: Has0_Acc2/Has0_F1 are computed on all test samples, whereas Non0_Acc2/Non0_F1 exclude neutral (zero) instances to focus on non-zero sentiments. We further report Mult_Acc5 and Mult_Acc7, which correspond to five-class and seven-class sentiment classification derived from the continuous scores, thereby evaluating the models at finer-grained intensity levels.
On the CH-SIMS dataset, we use Mult_Acc2, Mult_Acc3, and Mult_Acc5 to measure binary, three-class, and five-class accuracies, respectively, together with the overall F1 score, MAE, and Corr. Taken together, these metrics provide a comprehensive evaluation of both coarse-grained polarity prediction and fine-grained sentiment intensity modeling under potentially imbalanced label distributions.
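For reproducibility, the regression metrics and the seven-class accuracy can be computed as in the sketch below; the clip-and-round binning of continuous scores into seven classes follows the common MMSA convention and is stated here as an assumption rather than taken from the text.

```python
import numpy as np

def mosi_metrics(preds, labels):
    """MAE, Pearson Corr, and Acc-7 from continuous sentiment scores in [-3, 3]."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mae = np.abs(preds - labels).mean()
    corr = np.corrcoef(preds, labels)[0, 1]
    # Assumed convention: clip to [-3, 3] and round to the nearest integer bin.
    p7 = np.clip(np.round(preds), -3, 3)
    l7 = np.clip(np.round(labels), -3, 3)
    acc7 = float((p7 == l7).mean())
    return {"MAE": mae, "Corr": corr, "Mult_Acc7": acc7}
```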

4.5. Overall Performance and Reliability Analysis

4.5.1. Benchmark Results on CMU-MOSI, CMU-MOSEI, and CH-SIMS

Table 3, Table 4 and Table 5 report the overall performance of all models on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets. For brevity, the Acc-2 metric is written as Has0-Acc2/Non0-Acc2, and the F1 metric as Has0-F1/Non0-F1. The results show that the proposed MISA-GMC consistently improves over the MISA baseline and remains competitive with other strong multimodal fusion methods.
On the CMU-MOSI dataset, MISA-GMC achieves an Acc-7 of 45.92, which is clearly higher than traditional models such as TFN (37.17) and MulT (33.24), and also surpasses the strongest baseline MMIM (45.04). Its binary accuracy and F1 scores reach 82.94/85.21 and 82.75/85.10, respectively, while MAE is reduced to 0.712 and Corr increases to 0.795. Compared with the original MISA model (Acc-7 of 43.29, Acc-2 of 81.34/83.08, F1 of 81.28/83.08, MAE of 0.785, and Corr of 0.764), MISA-GMC yields consistent gains across all metrics.
On the CMU-MOSEI dataset, the improvement is also evident. MISA-GMC attains an Acc-7 of 51.62, improving upon MISA’s 49.26. The Acc-2 and F1 scores increase to 84.18/85.50 and 84.25/85.27, respectively, while MAE decreases from 0.585 (MISA) to 0.560. The Pearson correlation coefficient remains at a comparable level (0.756 vs. 0.760), indicating that the proposed model enhances accuracy without sacrificing the correlation structure of the predictions.
On the CH-SIMS dataset (Table 5), the proposed model obtains the best five-class accuracy, with an Acc-5 of 42.67, slightly outperforming Self-MM (42.23) and clearly improving over MISA (41.14). At the same time, Acc-3, Acc-2, and F1 are improved to 64.11, 78.34, and 77.56, respectively. MAE and Corr are refined from 0.458/0.552 (MISA) to 0.437/0.571. These results confirm that MISA-GMC enhances the robustness and fine-grained sentiment modeling ability of the original MISA framework on all three benchmarks.
Overall, the results indicate that the proposed MISA-GMC exhibits good generalization ability across different datasets, languages, and label granularities. On the two English benchmark datasets (CMU-MOSI and CMU-MOSEI), MISA-GMC consistently outperforms the MISA baseline in terms of Acc-7/Acc-2 and F1, while simultaneously reducing MAE and maintaining comparable or even higher Corr. On the Chinese CH-SIMS dataset, the model still achieves the best or highly competitive performance on Acc-5, Acc-3, Acc-2, and F1, together with lower MAE and higher Corr. These consistent gains across three heterogeneous benchmarks suggest that the proposed framework does not merely overfit a single dataset, but instead provides a more robust and transferable solution for multimodal sentiment analysis.

4.5.2. Statistical Reliability via Repeated Runs with Fixed Seeds

To further confirm that the above improvements are not due to a favorable initialization, we conduct a repeated-run evaluation under a fixed set of seeds {1, 11, 111, 1111, 11111} while keeping all other settings unchanged. Following this protocol, we report the mean ± standard deviation (Std) over five runs on two representative benchmarks, CMU-MOSI and CH-SIMS (Table 6 and Table 7), where Acc-2/F1 on CMU-MOSI are reported under both Has0 and Non0 protocols (Non0 excludes neutral samples).
As shown in Table 6, MISA-GMC consistently outperforms the MISA backbone on CMU-MOSI, improving Acc-7 from 41.43 ± 2.25 to 44.42 ± 1.02 and reducing MAE from 0.796 ± 0.028 to 0.727 ± 0.021, while also yielding a higher Corr (0.784 ± 0.009 vs. 0.772 ± 0.011). Similar trends are observed on CH-SIMS (Table 7), where MISA-GMC achieves higher Acc-5/Acc-3/Acc-2 and improves Corr from 0.541 ± 0.027 to 0.574 ± 0.015 with a smaller MAE (0.442 ± 0.007 vs. 0.459 ± 0.013). Importantly, the variability across seeds is generally reduced for key metrics (e.g., Acc-7 and Corr on CMU-MOSI, Corr and MAE on CH-SIMS), indicating more stable optimization and improved reliability.
For a clearer visualization, Figure 2 plots the mean performance with Std as error bars (MAE and Corr are scaled by 100 for readability). Furthermore, Appendix B (Table A2 and Table A3) reports the corresponding 95% confidence intervals (CIs) of the mean based on the Student-t distribution (with $t_{0.975,4} = 2.776$), which provides an additional statistical quantification of the observed gains.
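The reported confidence intervals can be reproduced with a few lines of NumPy, using the $t_{0.975,4} = 2.776$ critical value quoted above:

```python
import numpy as np

def mean_ci95(values, t_crit: float = 2.776):
    """95% CI of the mean for n = 5 runs with the Student-t value t_{0.975,4}."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean, std = x.mean(), x.std(ddof=1)        # sample standard deviation
    half = t_crit * std / np.sqrt(n)
    return mean, (mean - half, mean + half)
```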
Overall, the repeated-run evaluation under fixed seeds confirms that the performance gains of MISA-GMC are consistent and reproducible, rather than arising from a single favorable run. In addition to higher average scores, the results exhibit comparable or lower variability on key metrics, indicating improved training stability. Building on these reliability results, we further provide qualitative/diagnostic analyses and robustness tests in Section 4.6.

4.6. Further Experimental Analysis and Interpretability

4.6.1. Confusion Matrix Analysis

To further examine the behavior of MISA-GMC, the 7-class confusion matrix and per-category accuracies on the CMU-MOSI test set are illustrated in Figure 3. MISA-GMC achieves relatively high accuracies on the central sentiment levels NG, WN, NE, WP, and PS, where most samples lie along the diagonal. In particular, the model shows strong performance on the positive side (WP and PS), with per-class accuracies around 0.4–0.6. Most remaining errors arise from confusion between adjacent intensity levels, such as SN↔NG, NG↔WN, and WP↔PS, and from neutral NE being misclassified as weakly negative or weakly positive. This pattern suggests that MISA-GMC has learned the overall ordinal structure of sentiment categories, while the main difficulty lies in distinguishing very subtle intensity differences, which partly explains why the Acc-7 score is lower than the binary metrics.
Compared with the vanilla MISA baseline, MISA-GMC achieves higher Acc-7 and F1 scores on the CMU-MOSI dataset (see Table 3). The confusion-matrix pattern in Figure 3 is consistent with these quantitative gains: most samples for the central sentiment levels stay close to the diagonal, and the per-category accuracies are relatively high, especially for WP and PS. This indicates that the proposed gating, contrastive-learning, and GMC modules help the model calibrate decision boundaries and suppress the tendency to over-predict dominant classes. As a result, the improvements observed in the aggregate Acc-7 and F1 metrics are also reflected at the level of individual sentiment categories, further confirming the effectiveness of MISA-GMC for fine-grained sentiment prediction.

4.6.2. Gating Heatmap Visualization

To better understand how the gating mechanism in MISA-GMC reweights different modalities, we visualize in Figure 4 the correlation matrix of the gated representations on the CMU-MOSI test set. The heatmap shows the average pairwise correlations among G-Text, G-Audio, and G-Visual, where all diagonal entries are close to 1.0, indicating strong internal consistency within each gated modality.
The off-diagonal correlations exhibit a clear pattern: the correlation between G-Text and G-Audio is around 0.25, while the correlations between G-Text, G-Visual, and G-Audio–G-Visual are close to zero. This suggests that the gating module maintains a moderate alignment between text and audio—two modalities that carry most of the sentiment signal—while keeping the visual stream relatively complementary rather than redundant. As a result, MISA-GMC learns coordinated yet diverse gated features, which benefits the subsequent fusion stage and contributes to the overall improvement in sentiment prediction performance.

4.6.3. Ablation Study

We conduct ablation studies on CMU-MOSI and CH-SIMS to examine the contribution of each component. Starting from the MISA baseline, we gradually add the Gating module, MOCO-based contrastive learning, and the proposed GMC correlation modeling; the results are reported in Table 8 and Table 9.
On the CMU-MOSI dataset, we observe a non-monotonic trend. Compared with the MISA baseline, adding either the Gating module or the MOCO-based contrastive learning alone generally improves most metrics, indicating that adaptive modality reweighting and contrastive learning can, respectively, suppress noisy signals and enhance the discriminability of the learned representations. However, when MOCO and Gating are enabled simultaneously, the binary metrics (Acc-2 and F1) are further improved, while the fine-grained Acc-7 as well as MAE and Corr exhibit a noticeable degradation. This suggests that simply stacking MOCO and Gating may introduce a trade-off between coarse-grained polarity classification and fine-grained sentiment intensity modeling. After introducing the proposed GMC module on top of them, the model achieves the best overall performance, with especially large gains in Acc-7 and a clear reduction in MAE, while maintaining competitive Acc-2 and F1 scores. This confirms that GMC effectively alleviates the above trade-off and allows Gating and MOCO to work in a more complementary manner while preserving the ordinal sentiment structure.
A similar phenomenon can be observed on the CH-SIMS dataset. Adding Gating or MOCO alone leads to steady improvements, whereas their combination yields slight fluctuations on some metrics. With the additional GMC correlation modeling, the full configuration achieves the best overall trade-off in terms of Acc-5, Acc-2, F1, MAE, and Corr, leading to a more balanced and stable performance. This indicates that the proposed GMC module not only works well on English MOSI, but also helps stabilize the effect of stacked modules and improve robustness and generalization on Chinese multimodal sentiment data.
Overall, the ablation results in Table 8 and Table 9 and Figure 5 show that while Gating and MOCO are individually beneficial, their naive combination may cause performance drops on certain metrics; adding GMC on top leads to the final MISA-GMC model, which provides the most stable and comprehensive improvement.

4.6.4. Robustness Under Missing Modalities

To evaluate how robust the models are when some modalities are unavailable, we simulate three missing-modality settings on CMU-MOSI and CH-SIMS: Missing V (T+A), Missing A (T+V) and Missing A&V (T). The detailed results are given in Table 10 and Table 11; Figure 6 and Figure 7 present radar plots that summarize the most representative metrics for each dataset.
On CMU-MOSI, both models experience performance drops once a modality is removed, but MISA-GMC generally keeps better accuracy and error than the vanilla MISA.
  • When the visual modality is removed (T+A), Acc-7 decreases from 43.29 to 40.27 for MISA and to 41.98 for MISA-GMC. MISA-GMC also has a smaller MAE (0.765 vs. 0.805), while the Corr of MISA is slightly higher (0.778 vs. 0.771).
  • When the acoustic modality is removed (T+V), MISA-GMC almost preserves the full-modality Acc-7 (45.04 vs. 45.92), whereas MISA drops to 39.65. Both MAE and Corr are better for MISA-GMC in this case (0.733 vs. 0.798, 0.786 vs. 0.766).
  • In the text-only setting (T), MISA-GMC still achieves higher Acc-7 (42.71 vs. 41.84) and lower MAE (0.748 vs. 0.794), and also slightly higher Corr (0.778 vs. 0.766).
Overall, on MOSI, the radar charts show that under all three missing-modality settings the MISA-GMC curve is larger on Non0-Acc-2 and Mult-Acc-7, and usually closer to the center on MAE, indicating milder degradation in most metrics.
On CH-SIMS, the robustness advantage of MISA-GMC is mainly reflected in the classification metrics.
  • With all three modalities, MISA-GMC improves Acc-5 (42.67 vs. 41.14), Acc-2 (78.34 vs. 76.59), and F1 (77.56 vs. 76.01), and reduces MAE (0.437 vs. 0.458).
  • Under Missing V (T+A) and Missing A (T+V), MISA-GMC consistently obtains higher Acc-5/Acc-2/F1 than MISA and lower MAE; Corr is also higher in these two cases.
  • In the most extreme text-only case (T), MISA-GMC still clearly improves Acc-5 (34.57 vs. 27.79), Acc-2 (74.62 vs. 66.74), and F1 (73.96 vs. 68.04), but at the cost of a slightly larger MAE (0.494 vs. 0.493) and a noticeably lower Corr (0.450 vs. 0.574).
The SIMS radar plots reflect exactly this pattern: the MISA-GMC polygon is larger on F1 and Mult-Acc-5 for all three missing-modality settings, while the Corr dimension becomes worse only in the text-only configuration.
Taken together, these results show that MISA-GMC is generally more robust than MISA when modalities are missing: in most settings it achieves higher accuracies and lower MAE, and comparable or better correlation. At the same time, the SIMS text-only case reveals a trade-off where MISA-GMC sacrifices some correlation in exchange for clearly better classification performance, providing a more nuanced view of its robustness.
Modality contribution analysis. Beyond robustness, the controlled missing-modality settings provide a quantitative proxy for each modality’s contribution: if removing one modality causes a larger performance drop, that modality is empirically more influential for sentiment prediction on the given dataset. On CMU-MOSI, removing audio results in only a small drop in Acc-7 (45.92 → 45.04, −0.88), whereas removing visual leads to a much larger drop (45.92 → 41.98, −3.94), suggesting that visual cues contribute more than acoustic cues in this benchmark under our setting. On CH-SIMS, the pattern differs: removing acoustic yields a larger Acc-5 degradation (42.67 → 37.64, −5.03) than removing visual (42.67 → 40.70, −1.97), indicating a stronger role of acoustic cues for Chinese sentiment recognition in this dataset. In all cases, the text-only configuration exhibits the largest overall degradation, highlighting the dominant role of language while also showing that auxiliary modalities provide complementary gains when available.

4.6.5. Efficiency and Engineering Cost Analysis

Table 12 compares the computational efficiency and resource usage of representative methods on the CMU-MOSI dataset, including the average training time per epoch, the peak GPU memory during training and inference (reported using the CUDA reserved-memory metric), the inference latency (milliseconds per batch with batch size = 64) and inference throughput (samples per second), as well as the FP32 weight size and checkpoint size. Overall, the compared methods exhibit clear trade-offs among inference speed, training time, and memory footprint. For instance, CENet and Self-MM achieve higher inference throughput (1971 and 1780 samples/s, respectively) with lower latency (31.60 and 35.03 ms/batch), while CENet shows higher memory usage. ALMT has the highest peak GPU memory in both training and inference (6.99 GB and 6.9941 GB), indicating greater resource demand. MMIM requires the longest training time (6.5 s/epoch) and also shows relatively lower inference efficiency (1452 samples/s with 42.93 ms/batch latency). MISA exhibits a moderate cost profile (3.4 s/epoch; 4.1816 GB inference peak; 1529 samples/s throughput; 39.14 ms/batch latency).
In a direct comparison with the baseline MISA, MISA-GMC largely preserves the model size (FP32 weights: 443 MB vs. 442 MB) and shows only a slight change in peak GPU memory (training peak: 4.26 GB vs. 4.18 GB; inference peak: 4.2636 GB vs. 4.1816 GB). Meanwhile, the additional modules introduce measurable computational overhead: the training time increases from 3.4 to 4.1 s/epoch, and the inference throughput decreases from 1529 to 1295 samples/s, with the corresponding latency increasing from 39.14 to 48.18 ms/batch. These results indicate that MISA-GMC maintains a stable memory footprint and model size with a moderate increase in training and inference time; together with the performance improvements reported in Table 3 and Table 5, the table provides a quantitative reference for the performance–efficiency trade-off.
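For reference, the quantities in Table 12 can be probed with a simple forward-pass-only loop such as the sketch below. This is not the exact measurement script used in our experiments: the model interface, warm-up length, and data handling are assumptions, and absolute numbers depend on hardware and batch size.

```python
import time
import torch

@torch.no_grad()
def profile_inference(model, batches, device="cuda", warmup=5):
    """Measure average latency (ms/batch), throughput (samples/s), and peak
    reserved CUDA memory (GB) for forward passes only."""
    model.eval().to(device)
    batches = [tuple(x.to(device) for x in b) for b in batches]

    for b in batches[:warmup]:              # warm-up excludes one-off CUDA setup cost
        model(*b)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    n_samples = 0
    start = time.perf_counter()
    for b in batches:
        model(*b)
        n_samples += b[0].size(0)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    latency_ms = 1000.0 * elapsed / len(batches)
    throughput = n_samples / elapsed
    peak_gb = torch.cuda.max_memory_reserved(device) / 1024**3
    return latency_ms, throughput, peak_gb
```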

5. Conclusions

In this paper, we proposed MISA-GMC, an enhanced multimodal sentiment analysis framework built on top of the widely used MISA backbone. On the basis of the original shared–private decomposition, MISA-GMC integrates three lightweight components: a reliability-aware gated fusion module that adjusts modality contributions on a per-sample basis, an MOCO-style momentum contrastive learning scheme that regularizes cross-modal representations, and a modality correlation (GMC) module that encourages informative global inter-modality dependencies via a compact correlation graph in the feature space. These designs aim to address several typical challenges in practical MSA, including noisy non-verbal signals, partial modality absence, and brittle fusion behavior when different modalities carry conflicting evidence.
Experiments on three heterogeneous benchmarks—CMU-MOSI, CMU-MOSEI, and CH-SIMS—demonstrate that MISA-GMC generally improves over the vanilla MISA and several representative baselines across both classification and regression metrics. On the two English datasets, the model generally achieves higher Acc-7/Acc-2 and F1 while reducing MAE and maintaining comparable or higher correlation; on the Chinese CH-SIMS dataset, it achieves the best or highly competitive Acc-5/Acc-3/Acc-2 and F1 with lower MAE and higher Corr. Analyses based on confusion matrices and modality-wise gating heatmaps show that MISA-GMC tends to concentrate errors on adjacent sentiment levels and learns more interpretable modality behaviors, where unreliable modalities are effectively down-weighted. Missing-modality experiments further indicate that the framework usually suffers milder performance degradation than MISA when one or more modalities are removed. However, we observe that the performance gains can be metric-dependent and occasionally come with trade-offs in extreme settings (e.g., slight degradation on certain fine-grained indicators or correlation under the most challenging modality-missing configurations), suggesting that robustness and fine-grained intensity modeling may not always improve simultaneously.
From an engineering-cost perspective, these gains are achieved with only modest additional overhead in training time and model size, while keeping peak GPU memory usage almost unchanged, suggesting that MISA-GMC remains practical for real deployment in resource-constrained settings where unimodal features are available. Since the momentum contrastive objectives are used only during training, they do not introduce extra inference-time computation; thus, the deployment cost is mainly determined by the unimodal encoders and the lightweight gating/fusion heads. In real-world applications (e.g., online social media monitoring or on-device human–computer interaction), practical constraints such as inference latency, throughput, and memory/compute budgets can be critical. While our design keeps the added inference footprint limited, a dedicated evaluation of latency/throughput under different hardware budgets and streaming conditions is still needed to fully characterize real-world practicality.
Despite these benefits, MISA-GMC still inherits several limitations of MISA-style frameworks. The current implementation relies on pre-extracted unimodal features, adopts relatively simple and fixed design choices for the GMC structure and gating schedule, and restricts reliability-aware control to the feature level, which may limit its adaptability to more complex, longer, or more diverse multimodal sequences. This feature-level setting also prevents a direct and fair comparison with large end-to-end pretrained multimodal foundation models, which we consider an important direction for future work. Moreover, our empirical evaluation focuses on three curated sentiment datasets, and the behavior of the model in truly open-domain or highly noisy real-world scenarios remains to be fully explored. In addition, we have not systematically studied scalability when extending to more modalities or very large-scale datasets, nor have we benchmarked end-to-end deployment latency in resource-limited environments.
In future work, we plan to combine MISA-GMC with end-to-end pretrained multimodal encoders to reduce reliance on fixed upstream features, and to investigate more principled reliability-aware gating and adaptive sparse correlation graphs, potentially under the lens of causal modeling and uncertainty estimation. We also aim to extend the proposed design principles—reliability-aware gating, MOCO-style cross-modal contrastive regularization, and explicit modality correlation modeling—to other multimodal understanding tasks beyond sentiment analysis, further evaluating their robustness and engineering cost in broader application scenarios. Motivated by the above deployment-related limitations, we will also explore scalability-oriented designs and latency-aware optimizations (e.g., lightweight encoders and compression-friendly implementations) to facilitate real-time or on-device applications.

Author Contributions

Conceptualization, Z.D.; methodology, Z.D. and Y.W.; software, Z.D. and Z.W.; validation, Z.D. and X.Y.; formal analysis, Z.W.; investigation, X.Y.; resources, X.Y.; data curation, Z.W.; writing—original draft preparation, Z.D. and Y.W.; writing—review and editing, Y.W. and S.-K.I.; visualization, Z.W.; supervision, Y.W. and S.-K.I.; project administration, S.-K.I.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Macao Polytechnic University under grant No. RP/FCA-03/2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available from their official sources. CMU-MOSI and CMU-MOSEI are available via the CMU Multimodal Data SDK at https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK (accessed on 25 December 2025). The CH-SIMS dataset is available from the THUIAR Group at https://thuiar.github.io/sims.github.io/chsims (accessed on 25 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To improve readability and avoid interrupting the main technical flow, this appendix provides a concise nomenclature table (Table A1) that summarizes potentially ambiguous symbols and frequently used abbreviations. This allows readers to quickly look up key terms (e.g., evaluation protocols and hyperparameters) while keeping the main text focused on methodology and experimental findings.
Table A1. Selected nomenclature and abbreviations used in this study.
Symbol/Abbrev. | Meaning
MMSA | Open-source multimodal sentiment analysis framework used for preprocessing and metric computation.
MOCO | Momentum Contrast; queue-based contrastive learning scheme used as a regularization component.
GMC | Cross-modal contrastive objective (the "GMC" correlation modeling term used in our method).
Has0_Acc2/Has0_F1 | Binary accuracy/F1 computed on all test samples (including neutral "0").
Non0_Acc2/Non0_F1 | Binary accuracy/F1 excluding neutral (zero) instances.
Mult_AccK | K-class accuracy derived from discretizing continuous sentiment scores.
Acc-K | Shorthand used in some tables/figures for Mult_AccK (e.g., Acc-2/3/5 on CH-SIMS).
α_m | Soft gate for modality m, produced by a temperature-scaled and bias-shifted sigmoid.
ḡ_m | Final gated representation used for fusion, obtained by interpolating the shared code and the mixed code via the gate strength g_s.
τ_g, b_g | Gating temperature and bias in Equation (9), controlling gate sharpness and default preference.
g_s | Gate strength (curriculum scalar), scheduled from 0.3 to 0.7 over epochs.
λ_MOCO, λ_gmc; q_MOCO, m_MOCO, τ_MOCO, d_MOCO; γ_gate | Loss weights and key MoCo hyperparameters (ranges/values summarized in Table 2).
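To make the gating-related symbols above more concrete, the following minimal PyTorch sketch shows one plausible form of the per-modality gating unit described in Figure 1c. It is illustrative only: Equation (9) is not reproduced here, so the MLP width, the sign of the bias shift, and the exact way α_m and g_s mix S_m, P_m, and the mixed code are assumptions.

```python
import torch
import torch.nn as nn

class GatingUnitSketch(nn.Module):
    """Illustrative per-modality reliability gate (not the exact Equation (9))."""

    def __init__(self, dim: int, tau_g: float = 1.4, b_g: float = 0.0):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        # Small MLP producing a scalar gate logit; the hidden width is an assumption.
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.tau_g = tau_g
        self.b_g = b_g

    def forward(self, s_m: torch.Tensor, p_m: torch.Tensor, g_s: float = 0.5):
        # Temperature-scaled, bias-shifted sigmoid gate alpha_m in (0, 1).
        z = self.mlp(self.norm(torch.cat([s_m, p_m], dim=-1)))
        alpha_m = torch.sigmoid((z + self.b_g) / self.tau_g)
        # Assumed mixing: alpha_m trades off shared vs. private information,
        # and the curriculum scalar g_s interpolates the shared code and the mixed code.
        mixed = alpha_m * s_m + (1.0 - alpha_m) * p_m
        g_bar_m = (1.0 - g_s) * s_m + g_s * mixed
        return g_bar_m, alpha_m

# Example: one modality with batch size 4 and feature dimension 128 (illustrative).
gate = GatingUnitSketch(dim=128)
g_bar, alpha = gate(torch.randn(4, 128), torch.randn(4, 128), g_s=0.3)
```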

Appendix B

To evaluate robustness, we repeat all experiments for MISA and MISA-GMC using five fixed seeds {1, 11, 111, 1111, 11111}. Table A2 and Table A3 report the 95% confidence intervals (CIs) of the mean, computed as $\bar{x} \pm t_{0.975,4}\, s/\sqrt{5}$ with $t_{0.975,4} = 2.776$, where $\bar{x}$ and $s$ denote the sample mean and standard deviation over the five runs. These results quantify the variability across seeds and support the reliability of the reported improvements; a minimal computation sketch is given after Table A3.
Table A2. 95% confidence intervals (5 fixed seeds) for MISA and MISA-GMC on CMU-MOSI. This table reports 95% CIs for the main CMU-MOSI metrics (Has0/Non0 Acc-2 and F1, Acc-7, MAE, and Corr) over five runs with fixed seeds {1,11,111,1111,11111}. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Metric | MISA | MISA-GMC
Acc-7↑ | [38.64, 44.22] | [43.15, 45.69]
Acc-2 (Has0)↑ | [80.42, 82.56] | [81.87, 83.31]
Acc-2 (Non0)↑ | [82.17, 83.93] | [83.68, 85.64]
F1 (Has0)↑ | [80.38, 82.58] | [81.82, 83.08]
F1 (Non0)↑ | [82.23, 83.99] | [83.67, 85.51]
MAE↓ | [0.761, 0.831] | [0.701, 0.753]
Corr↑ | [0.758, 0.786] | [0.773, 0.795]
Table A3. 95% confidence intervals (5 fixed seeds) for MISA and MISA-GMC on CH-SIMS. This table reports 95% CIs for CH-SIMS metrics (Acc-2/3/5, F1, MAE, and Corr) over five runs with fixed seeds {1,11,111,1111,11111}. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Metric | MISA | MISA-GMC
Acc-5↑ | [36.90, 41.44] | [38.68, 43.50]
Acc-3↑ | [61.72, 63.88] | [62.25, 65.33]
Acc-2↑ | [74.97, 79.07] | [76.60, 79.28]
F1↑ | [74.05, 78.37] | [75.47, 78.15]
MAE↓ | [0.443, 0.475] | [0.433, 0.451]
Corr↑ | [0.507, 0.575] | [0.555, 0.593]
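For completeness, the interval computation can be reproduced with a few lines of Python; the sketch below implements the t-based formula above, and the five example scores are hypothetical values used only to illustrate the calculation.

```python
import numpy as np
from scipy import stats

def mean_ci(values, confidence=0.95):
    """t-based CI of the mean over n seed runs: mean ± t_{1-(1-c)/2, n-1} * s / sqrt(n)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                                      # sample standard deviation
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)   # ≈ 2.776 for n = 5
    half_width = t_crit * s / np.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical Acc-7 scores from five seeds, for illustration only.
print(mean_ci([43.2, 44.1, 44.9, 45.6, 44.3]))
```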

References

  1. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259. [Google Scholar] [CrossRef]
  2. Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar] [CrossRef]
  3. Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal Sentiment Analysis: A Systematic Review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges and Future Directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
  4. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114. [Google Scholar] [CrossRef]
  5. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2247–2256. [Google Scholar] [CrossRef]
  6. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.-P. Memory Fusion Network for Multi-View Sequential Learning. Proc. AAAI Conf. Artif. Intell. 2018, 32, 5634–5641. [Google Scholar] [CrossRef]
  7. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar] [CrossRef]
  8. Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.-P. Multi-Attention Recurrent Network for Human Communication Comprehension. Proc. AAAI Conf. Artif. Intell. 2018, 32, 5642–5649. [Google Scholar] [CrossRef]
  9. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  10. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. arXiv 2021, arXiv:2102.04830. [Google Scholar] [CrossRef]
  11. Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9180–9192. [Google Scholar]
  12. Zhang, H.; Wang, Y.; Yin, G.; Yang, H.; Wang, Z.; Li, D. Learning Language-Guided Adaptive Hyper-Modality Representation for Multimodal Sentiment Analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 756–767. [Google Scholar]
  13. Sun, T.; Wang, W.; Jing, L.; Cui, Y.; Song, X.; Nie, L. Counterfactual Reasoning for Out-of-Distribution Multimodal Sentiment Analysis. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 15–23. [Google Scholar]
  14. Sun, T.; Ni, J.; Wang, W.; Jing, L.; Wei, Y.; Nie, L. General Debiasing for Multimodal Sentiment Analysis. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5861–5869. [Google Scholar]
  15. Mai, S.; Zeng, Y.; Zheng, S.; Hu, H. Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2023, 14, 2276–2289. [Google Scholar] [CrossRef]
  16. Mai, S.; Zeng, Y.; Hu, H. Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations. IEEE Trans. Multimed. 2023, 25, 4121–4134. [Google Scholar] [CrossRef]
  17. Lu, Q.; Sun, X.; Gao, Z.; Long, Y.; Feng, J.; Zhang, H. Coordinated-Joint Translation Fusion Framework with Sentiment-Interactive Graph Convolutional Networks for Multimodal Sentiment Analysis. Inf. Process. Manag. 2024, 61, 103538. [Google Scholar] [CrossRef]
  18. Sun, L.; Lian, Z.; Liu, B.; Tao, J. Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2024, 15, 309–325. [Google Scholar] [CrossRef]
  19. Mai, S.; Sun, Y.; Zeng, Y.; Hu, H. Excavating Multimodal Correlation for Representation Learning. Inf. Fusion 2023, 91, 542–555. [Google Scholar] [CrossRef]
  20. Wu, Q.; Shao, Y.; Wang, J.; Sun, X. Learning Optimal Multimodal Information Bottleneck Representations. arXiv 2025, arXiv:2505.19996. [Google Scholar] [CrossRef]
  21. Sun, Y.; Liu, Z.; Sheng, Q.Z.; Chu, D.; Yu, J.; Sun, H. Similar Modality Completion-Based Multimodal Sentiment Analysis under Uncertain Missing Modalities. Inf. Fusion 2024, 110, 102454. [Google Scholar] [CrossRef]
  22. Lin, R.; Hu, H. MissModal Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis. Trans. Assoc. Comput. Linguist. 2023, 11, 1686–1702. [Google Scholar] [CrossRef]
  23. Guo, Z.; Jin, T.; Zhao, Z. Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers); Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 1726–1736. [Google Scholar]
  24. Li, M.; Yang, D.; Liu, Y.; Wang, S.; Chen, J.; Wang, S.; Wei, J.; Jiang, Y.; Xu, Q.; Hou, X.; et al. Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning. Proc. Adv. Neural Inf. Process. Syst. 2024, 37, 28515–28536. [Google Scholar] [CrossRef]
  25. Fang, Y.; Wu, S.; Zhang, S.; Huang, C.; Zeng, T.; Xing, X.; Walsh, S.; Yang, G. Dynamic Multimodal Information Bottleneck for Multimodality Classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar] [CrossRef]
  26. Yang, J.; Yu, Y.; Niu, D.; Guo, W.; Xu, Y. ConFEDE: Contrastive Feature Decomposition for Multimodal Sentiment Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023; pp. 7617–7630. [Google Scholar]
  27. Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-Space Representation Interaction Network for Multimodal Sentiment Analysis. Knowl.-Based Syst. 2024, 283, 111149. [Google Scholar] [CrossRef]
  28. Mai, S.; Zeng, Y.; Xiong, A.; Hu, H. Injecting Multimodal Information into Pre-Trained Language Model for Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2025, 16, 2074–2089. [Google Scholar] [CrossRef]
  29. Mu, J.; Nie, F.; Wang, W.; Xu, J.; Zhang, J.; Liu, H. MOCOLNet: A Momentum Contrastive Learning Network for Multimodal Aspect-Level Sentiment Analysis. IEEE Trans. Knowl. Data Eng. 2024, 36, 8787–8800. [Google Scholar] [CrossRef]
  30. Fan, C.; Zhu, K.; Tao, J.; Yi, G.; Xue, J.; Lv, Z. Multi-Level Contrastive Learning Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2025, 16, 207–222. [Google Scholar] [CrossRef]
  31. Guo, X.; Yang, C.; Liu, Y.; Yuan, C. AL-HCL: Active Learning and Hierarchical Contrastive Learning for Multimodal Sentiment Analysis With Fusion Guidance. In IEEE Transactions on Affective Computing; IEEE: New York, NY, USA, 2024. [Google Scholar] [CrossRef]
  32. Kim, K.; Park, S. AOBERT: All-Modalities-in-One BERT for Multimodal Sentiment Analysis. Inf. Fusion 2023, 92, 37–45. [Google Scholar] [CrossRef]
  33. Wang, S.; Ratnavelu, K.; Shibghatullah, A.S.B. UEFN: Efficient Uncertainty Estimation Fusion Network for Reliable Multimodal Sentiment Analysis. Appl. Intell. 2025, 55, 171. [Google Scholar] [CrossRef]
  34. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar] [CrossRef]
  35. Wang, D.; Liu, S.; Wang, Q.; Tian, Y.; He, L.; Gao, X. Cross-Modal Enhancement Network for Multimodal Sentiment Analysis. IEEE Trans. Multimed. 2023, 25, 4909–4921. [Google Scholar] [CrossRef]
  36. Wang, D.; Guo, X.; Tian, Y.; He, L.; Gao, X. TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis. Pattern Recognit. 2023, 136, 109259. [Google Scholar] [CrossRef]
  37. Yan, X.; Xue, H.; Jiang, S.; Liu, Z. Multimodal Sentiment Analysis Using Multi-Tensor Fusion Network with Cross-Modal Modeling. Appl. Artif. Intell. 2022, 36, 2000688. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed MISA-GMC framework. (a) Inference-time pipeline. For each modality m ∈ {T, A, V}, the input sequence is encoded and decomposed into a shared representation S_m and a private representation P_m. A reliability-aware gating mechanism produces a scalar gate α_m to adaptively mix shared and private information, yielding a fused modality feature ḡ_m. The fused features from all modalities are then tokenized and fed into a lightweight Transformer-based fusion/prediction head to output the final sentiment prediction. (b) Shared–private decomposition and reconstruction regularization. The shared/private factors are learned with auxiliary training objectives (e.g., reconstruction and consistency constraints) to encourage informative shared factors while preserving modality-specific private factors, thereby reducing redundancy and promoting disentanglement. (c) Gating unit (per modality). The gate is computed by concatenating S_m and P_m, followed by LayerNorm and a small MLP with a sigmoid activation to obtain α_m ∈ (0, 1). A gate regularization term is applied during training to discourage ambiguous gates and stabilize the reliability estimation. The mixed output is denoted as ḡ_m. (d) Training-only contrastive regularization (discarded at inference). Two complementary contrastive branches are introduced to improve representation quality: (i) intra-modal MoCo for each modality, implemented with a query encoder, a momentum-updated key encoder, a memory queue, and an InfoNCE loss; and (ii) cross-modal GMC that performs pairwise contrastive learning across modalities (Text–Audio–Visual) to strengthen modality correlation modeling. These contrastive branches (including momentum encoders and queues) are used only in training and are removed at inference, so the deployment-time computation follows only the concise pipeline in (a).
Figure 2. Mean ± Std over five pre-defined seeds on CMU-MOSI and CH-SIMS. Bars denote the mean performance across five runs with seeds {1,11,111,1111,11111}, and error bars indicate the standard deviation. MAE and Corr are scaled by 100 for visualization.
Figure 3. Performance analysis of MISA-GMC on the CMU-MOSI test set. (Left) The 7-class confusion matrix. (Right) Per-category classification accuracy. The sentiment labels are abbreviated as follows: SN (Strong Negative), NG (Negative), WN (Weak Negative), NE (Neutral), WP (Weak Positive), PS (Positive), and SP (Strong Positive).
Figure 4. Correlation heatmap of gated modality representations of MISA-GMC on the CMU-MOSI test set. The matrix shows the average pairwise correlations among G-Text, G-Audio, and G-Visual. Here, G-Text, G-Audio, and G-Visual denote the gated representations of the text, audio, and visual modalities, respectively.
Figure 5. Visualization of ablation results on the CMU-MOSI and CH-SIMS datasets. The (left) panel shows the results on the CMU-MOSI dataset, and the (right) panel shows the results on the CH-SIMS dataset. The bar charts compare the performance of different model variants (MISA, MISA + GATE, MISA + MOCO, MISA + MOCO + GATE, and MISA + MOCO + GATE + GMC) under multiple evaluation metrics. Although adding Gating or MoCo alone generally improves performance, their simple combination leads to fluctuations on some metrics; after further introducing the GMC module, the final MISA-GMC model (MISA + MOCO + GATE + GMC) achieves the most stable and comprehensive improvements.
Figure 6. Radar plots of MISA vs. MISA-GMC on CMU-MOSI under different missing-modality settings. The three radar plots visualize the key metrics (Non0-Acc-2, Mult-Acc-7, MAE, and Corr) of MISA (blue) and MISA-GMC (orange) on CMU-MOSI. From left to right, the panels correspond to Missing V (T+A), Missing A (T+V), and Missing A & V (T).
Figure 7. Radar plots of MISA vs. MISA-GMC on CH-SIMS under different missing-modality settings. The three radar plots visualize the key metrics (F1, Mult-Acc-5, MAE, and Corr) of MISA (blue) and MISA-GMC (orange) on the CH-SIMS dataset. From left to right, the panels correspond to Missing V (T+A), Missing A (T+V), and Missing A & V (T).
Table 1. Partitioning of the training, validation, and test sets of different datasets.
Dataset | Train | Valid | Test | Total | Language
CMU-MOSEI | 16,326 | 1871 | 4659 | 22,856 | English
CMU-MOSI | 1284 | 229 | 686 | 2199 | English
CH-SIMS | 1368 | 456 | 457 | 2281 | Chinese
Table 2. Key hyper-parameters of MISA-GMC on different datasets.
Parameter | CMU-MOSI | CMU-MOSEI | CH-SIMS
Batch size | 16 | 64 | 64
Dropout | 0.2 | 0.5 | 0
λ_diff | 0.1 | 0.3 | 0.3
λ_sim | 0.3 | 0.8 | 1
Gating temperature τ_g | 1.4 | 1.4 | 1.4
γ_gate | 0.4 | 0.4 | 0.1
Gate strength g_s (range) | [0.3, 0.7] | [0.3, 0.7] | [0.3, 0.7]
λ_MOCO (range) | [0.05, 0.12] | [0.05, 0.12] | [0.05, 0.12]
λ_gmc (range) | [0, 0.05] | [0, 0.05] | [0.02, 0.05]
q_MOCO | 4096 | 4096 | 4096
m_MOCO | 0.995 | 0.995 | 0.995
τ_MOCO | 0.06 | 0.06 | 0.06
d_MOCO | 256 | 256 | 256
Table 3. Experimental results on the CMU-MOSI dataset. The results of the proposed model, MISA-GMC (Ours), are boldfaced to facilitate readability and comparison with classical baselines such as TFN, MulT, and CENet. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Model | Acc-7↑ | Acc-2 (Has0/Non0)↑ | F1 (Has0/Non0)↑ | MAE↓ | Corr↑
TFN | 37.17 | 78.57/80.03 | 78.35/79.88 | 0.915 | 0.662
MulT | 33.24 | 79.30/81.10 | 79.14/81.01 | 0.932 | 0.676
LMF | 31.05 | 77.99/79.12 | 78.04/79.24 | 1.006 | 0.678
CENet | 43 | 82.36/83.84 | 82.35/83.88 | 0.747 | 0.794
MMIM | 45.04 | 82.65/84.45 | 82.63/84.48 | 0.741 | 0.794
Self-MM | 42.13 | 83.53/85.06 | 83.51/85.09 | 0.753 | 0.795
ALMT | 42.27 | 81.78/82.93 | 81.83/83.03 | 0.765 | 0.79
TETFN | 44.75 | 82.51/84.30 | 82.45/84.29 | 0.728 | 0.788
MISA | 43.29 | 81.34/83.08 | 81.28/83.08 | 0.785 | 0.764
Ours | 45.92 | 82.94/85.21 | 82.75/85.10 | 0.712 | 0.795
Table 4. Experimental results on the CMU-MOSEI dataset. The results of the proposed model, MISA-GMC (Ours), are boldfaced to facilitate readability and comparison with classical baselines such as TFN, MulT, and CENet. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Model | Acc-7↑ | Acc-2 (Has0/Non0)↑ | F1 (Has0/Non0)↑ | MAE↓ | Corr↑
TFN | 51.68 | 82.23/81.81 | 81.95/81.24 | 0.58 | 0.709
MulT | 50.35 | 82.79/83.71 | 82.69/83.28 | 0.581 | 0.724
LMF | 53.27 | 82.08/84.62 | 82.43/84.53 | 0.553 | 0.739
CENet | 52.52 | 83.17/86.02 | 83.49/85.93 | 0.549 | 0.772
MMIM | 50.03 | 83.32/83.30 | 83.12/82.83 | 0.58 | 0.733
Self-MM | 50.38 | 84.89/85.94 | 84.97/85.76 | 0.572 | 0.767
ALMT | 50.53 | 84.07/85.61 | 84.08/85.31 | 0.557 | 0.767
MISA | 49.26 | 83.24/85.03 | 83.43/84.87 | 0.585 | 0.76
Ours | 51.62 | 84.18/85.50 | 84.25/85.27 | 0.56 | 0.756
Table 5. Experimental results on the CH-SIMS dataset. The results of the proposed model, MISA-GMC (Ours), are boldfaced to facilitate readability and comparison. Acc-2, Acc-3, and Acc-5 denote binary, three-class, and five-class accuracies, respectively; F1 is the F1 score; MAE↓ is the mean absolute error (lower is better); and Corr↑ is the Pearson correlation coefficient (higher is better).
Model | Acc-5↑ | Acc-3↑ | Acc-2↑ | F1↑ | MAE↓ | Corr↑
TFN | 40.48 | 64.33 | 77.46 | 75.7 | 0.44 | 0.589
MulT | 39.17 | 64.11 | 78.12 | 77.93 | 0.446 | 0.586
LMF | 41.36 | 68.05 | 78.56 | 78.33 | 0.438 | 0.591
CENet | 21.01 | 42.89 | 66.74 | 60.48 | 0.586 | 0.088
MMIM | 37.2 | 60.83 | 75.93 | 75.44 | 0.472 | 0.497
Self-MM | 42.23 | 67.61 | 78.99 | 78.56 | 0.415 | 0.6
ALMT | 39.82 | 64.99 | 78.99 | 78.45 | 0.445 | 0.559
MTFN | 40.7 | 65.21 | 78.56 | 78.06 | 0.464 | 0.549
MLMF | 31.07 | 65.43 | 80.53 | 80.58 | 0.432 | 0.643
TETFN | 39.39 | 64.99 | 79.21 | 78.96 | 0.427 | 0.593
MISA | 41.14 | 63.89 | 76.59 | 76.01 | 0.458 | 0.552
Ours | 42.67 | 64.11 | 78.34 | 77.56 | 0.437 | 0.571
Table 6. Robustness of MISA and MISA-GMC on CMU-MOSI over five pre-defined seeds (mean ± Std). Results are reported as mean ± standard deviation over five runs with seeds {1,11,111,1111,11111} under identical settings. Acc-2/F1 are shown for both the Has0 and Non0 protocols, where Non0 excludes neutral samples. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better. The results of MISA-GMC (Ours) are boldfaced to facilitate readability and comparison.
Model | Acc-7↑ | Acc-2 (Has0/Non0)↑ | F1 (Has0/Non0)↑ | MAE↓ | Corr↑
MISA | 41.43 ± 2.25 | 81.49 ± 0.86/83.05 ± 0.71 | 81.48 ± 0.89/83.11 ± 0.71 | 0.796 ± 0.028 | 0.772 ± 0.011
MISA-GMC | 44.42 ± 1.02 | 82.59 ± 0.58/84.66 ± 0.79 | 82.45 ± 0.51/84.59 ± 0.74 | 0.727 ± 0.021 | 0.784 ± 0.009
Table 7. Robustness of MISA and MISA-GMC on CH-SIMS over five pre-defined seeds (mean ± Std). Results are reported as mean ± standard deviation over five runs with seeds {1,11,111,1111,11111} using the same training and evaluation setup. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better. The results of MISA-GMC (Ours) are boldfaced to facilitate readability and comparison.
Model | Acc-5↑ | Acc-3↑ | Acc-2↑ | F1↑ | MAE↓ | Corr↑
MISA | 39.17 ± 1.83 | 62.80 ± 0.87 | 77.02 ± 1.65 | 76.21 ± 1.74 | 0.459 ± 0.013 | 0.541 ± 0.027
MISA-GMC | 41.09 ± 1.94 | 63.79 ± 1.24 | 77.94 ± 1.08 | 76.81 ± 1.08 | 0.442 ± 0.007 | 0.574 ± 0.015
Table 8. Ablation study on CMU-MOSI. This table reports the incremental performance gains of each component added to the MISA baseline, including the Gating module, MoCo contrastive learning, and the proposed GMC correlation modeling. The final configuration MISA + MOCO + GATE + GMC represents our full model (MISA-GMC), whose results are boldfaced to facilitate readability and comparison. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Model | Acc-7↑ | Acc-2 (Has0/Non0)↑ | F1 (Has0/Non0)↑ | MAE↓ | Corr↑
MISA | 43.29 | 81.34/83.08 | 81.28/83.08 | 0.785 | 0.764
MISA + GATE | 43.88 | 82.22/83.84 | 82.16/83.84 | 0.744 | 0.787
MISA + MOCO | 43.59 | 81.63/84.30 | 81.50/84.26 | 0.752 | 0.779
MISA + MOCO + GATE | 40.96 | 83.38/85.67 | 83.18/85.56 | 0.751 | 0.778
MISA + MOCO + GATE + GMC | 45.92 | 82.94/85.21 | 82.75/85.10 | 0.712 | 0.795
Table 9. Ablation study on CH-SIMS. This table reports the incremental performance gains of each component added to the MISA baseline, including the Gating module, MOCO contrastive learning, and the proposed GMC correlation modeling. The final configuration MISA + MOCO + GATE + GMC represents our full model (MISA-GMC), whose results are boldfaced to facilitate readability and comparison. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Model | Acc-5↑ | Acc-2↑ | F1↑ | MAE↓ | Corr↑
MISA | 41.14 | 76.59 | 76.01 | 0.458 | 0.552
MISA + GATE | 39.82 | 77.02 | 76.64 | 0.442 | 0.560
MISA + MOCO | 40.04 | 77.68 | 77.54 | 0.439 | 0.553
MISA + MOCO + GATE | 38.73 | 77.24 | 76.89 | 0.449 | 0.534
MISA + MOCO + GATE + GMC | 42.67 | 78.34 | 77.56 | 0.437 | 0.571
Table 10. Robustness of MISA and MISA-GMC on CMU-MOSI under missing-modality settings. The table reports 7-class accuracy (Acc-7), binary accuracy (Acc-2, Has0/Non0), binary F1 (Has0/Non0), mean absolute error (MAE), and correlation (Corr) of MISA and MISA-GMC on CMU-MOSI. “T+A”, “T+V”, and “T” denote the cases where the visual, acoustic, and both acoustic and visual modalities are removed, respectively. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Method | Acc-7↑ | Acc-2 (Has0/Non0)↑ | F1 (Has0/Non0)↑ | MAE↓ | Corr↑
MISA (T+A+V) | 43.29 | 81.34/83.08 | 81.28/83.08 | 0.785 | 0.764
MISA (T+A) | 40.27 | 80.59/82.38 | 80.52/82.39 | 0.805 | 0.778
MISA (T+V) | 39.65 | 80.99/82.50 | 80.95/82.52 | 0.798 | 0.766
MISA (T) | 41.84 | 81.34/82.32 | 81.39/82.42 | 0.794 | 0.766
MISA-GMC (T+A+V) | 45.92 | 82.94/85.21 | 82.75/85.10 | 0.712 | 0.795
MISA-GMC (T+A) | 41.98 | 81.63/84.15 | 81.48/84.08 | 0.765 | 0.771
MISA-GMC (T+V) | 45.04 | 82.51/84.60 | 82.37/84.54 | 0.733 | 0.786
MISA-GMC (T) | 42.71 | 82.36/84.45 | 82.21/84.37 | 0.748 | 0.778
Table 11. Robustness of MISA and MISA-GMC on CH-SIMS under missing-modality settings. This table reports 5-class accuracy (Acc-5), 3-class accuracy (Acc-3), binary accuracy (Acc-2), F1, mean absolute error (MAE), and correlation (Corr) of MISA and MISA-GMC on the CH-SIMS dataset. “T+A”, “T+V”, and “T” denote the cases where the visual, acoustic, and both acoustic and visual modalities are removed, respectively. Metrics marked with ↓ indicate that lower values are better, whereas metrics marked with ↑ indicate that higher values are better.
Method | Acc-5↑ | Acc-3↑ | Acc-2↑ | F1↑ | MAE↓ | Corr↑
MISA (T+A+V) | 41.14 | 63.89 | 76.59 | 76.01 | 0.458 | 0.552
MISA (T+A) | 38.25 | 61.84 | 74.62 | 75.06 | 0.457 | 0.535
MISA (T+V) | 32.12 | 61.97 | 74.88 | 75.46 | 0.474 | 0.531
MISA (T) | 27.79 | 60.61 | 66.74 | 68.04 | 0.493 | 0.574
MISA-GMC (T+A+V) | 42.67 | 64.11 | 78.34 | 77.56 | 0.437 | 0.571
MISA-GMC (T+A) | 40.7 | 64.55 | 77.02 | 76.8 | 0.444 | 0.561
MISA-GMC (T+V) | 37.64 | 61.71 | 77.02 | 76.58 | 0.449 | 0.555
MISA-GMC (T) | 34.57 | 59.08 | 74.62 | 73.96 | 0.494 | 0.45
Table 12. Computational efficiency and resource usage comparison on CMU-MOSI (batch size = 64). This table shows the average training time per epoch, peak GPU memory during training and inference (reported as CUDA reserved memory), inference latency (milliseconds per batch), and throughput (samples per second) measured using forward-pass-only timing, as well as FP32 weight size and checkpoint size. We boldfaced the results of MISA-GMC (Ours) to facilitate readability and comparison.
Model | Time/Epoch (s) | Train Peak (GB) | Infer Peak (GB) | Infer Latency (ms/Batch) | Infer Throughput (Samples/s) | Weights (MB) | Ckpt (MB)
CENet | 2.9 | 5.32 | 5.3223 | 31.60 | 1971 | 464 | 417
MMIM | 6.5 | 3.77 | 3.7676 | 42.93 | 1452 | 439 | 418
Self-MM | 3.1 | 4.07 | 4.0703 | 35.03 | 1780 | 410 | 391
ALMT | 3.7 | 6.99 | 6.9941 | 43.04 | 1450 | 448 | 427
MISA | 3.4 | 4.18 | 4.1816 | 39.14 | 1529 | 442 | 422
Ours | 4.1 | 4.26 | 4.2636 | 48.18 | 1295 | 443 | 435
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
