CrossSent: Cross-Modal Attention with Pairwise Ranking Regularization for Multi-Modal Sentiment

Liu, Jiaxiong; Qi, Ke; Liao, Zhiwen; Yuan, Feixiang; Zhuo, Wen

doi:10.3390/electronics15061157

by Jiaxiong Liu, Ke Qi *, Zhiwen Liao, Feixiang Yuan and Wen Zhuo

School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1157; https://doi.org/10.3390/electronics15061157

Submission received: 30 January 2026 / Revised: 19 February 2026 / Accepted: 25 February 2026 / Published: 11 March 2026

(This article belongs to the Section Artificial Intelligence)

Abstract

Multi-modal sentiment analysis (MSA) aims to accurately identify users’ emotional states by integrating textual, acoustic, and visual modalities. However, existing methods often suffer from insufficient cross-modal interaction, rigid fusion strategies, and limited sensitivity to subtle sentiment-level differences, which severely restrict model generalization and robustness. To address these issues, this paper proposes CrossSent, a multi-modal sentiment analysis framework that combines cross-modal attention with pairwise ranking regularization. Specifically, a Gated Multi-modal Residual Adapter (GMRA) is introduced to dynamically integrate heterogeneous features through gated residual connections, effectively mitigating modality asynchrony and noise interference. Meanwhile, a Monotonic Pairwise Ranking (MPR) regularization enhances discrimination among fine-grained sentiment levels. Furthermore, an Error-Interval Ordinal Inconsistency (EIOI) loss is designed to tolerate small prediction deviations, improving both stability and robustness. Experimental results on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate that CrossSent consistently surpasses state-of-the-art baselines across key metrics. For instance, it achieves 89.78% binary accuracy and 52.1% seven-class accuracy on CMU-MOSI, 87.72% and 54.7% on CMU-MOSEI, and 80.41%, 62.36%, and 43.54% for three- and five-level CH-SIMS tasks, with reduced mean absolute errors of 0.563, 0.513, and 0.408, respectively. We further report ordinal-consistency measures (QWK and level-jump statistics) to complement conventional metrics and quantify level-wise agreement. These results validate the effectiveness and generalization capability of the proposed framework.

Keywords:

multi-modal sentiment analysis; cross-modal attention; ordinal regression; pairwise ranking; robust fusion

1. Introduction

With the rapid growth of social media, short-video, and live-streaming platforms, people increasingly express emotions and attitudes through a mixture of text, audio, and visual signals. This trend has fueled the development of multi-modal sentiment analysis (MSA), which integrates complementary cues across modalities to improve robustness and accuracy over unimodal approaches, and it has shown strong potential in applications such as public opinion monitoring, market intelligence, online education, and mental health screening [1]. Recent work also highlights the growing practical demand for robust emotion/sentiment understanding from user-generated content and multi-modal signals [2,3].

Despite significant progress, three challenges remain prominent in current MSA research: (i) limited depth of cross-modal interaction, (ii) rigid or static fusion strategies, and (iii) insufficient modeling of ordinal sentiment levels. Early fusion architectures such as Tensor Fusion Networks (TFNs) [4] and Low-rank Multi-modal Fusion (LMF) [5] capture higher-order inter-modal relations but often suffer from feature redundancy, high computational overhead, and static fusion rules. Dynamic gating has also been explored: for example, MAG-BERT [6] reduces the influence of noisy modalities via gated modulation, yet its fusion depth is relatively shallow and the model struggles to capture fine-grained sentiment distinctions.

For cross-modal interaction, Multi-modal Transformer (MulT) [7] pioneered Transformer-based modeling for unaligned sequences and achieved notable gains, but its interaction pathways are fixed and may be sensitive to modality asynchrony or quality fluctuations. Representation-decoupling methods such as MISA [8] separate modality-invariant and modality-specific features to enhance generalization, yet their fusion remains largely static and may underutilize local cross-modal structures. A growing body of work continues to refine fusion and interaction—e.g., contrastive or enhancement-based designs and graph/attention variants—but the aforementioned limitations persist in realistic, noisy settings.

A second thread concerns ordinal sentiment modeling. Sentiment intensity is naturally ordered; however, many MSA systems overlook monotonic constraints among levels, leading to non-monotone predictions or level skipping. Rank-consistent ordinal regression (CORAL) [9] improves continuity by imposing cumulative binary constraints, but it is designed for unimodal settings and does not address challenges raised by multi-modal fusion. More recently, trustworthy multi-modal fusion in the ordinal space has been explored to model uncertainty [10], yet its fusion policy remains static and thus less adaptive to modality quality shifts and temporal asynchrony.

This paper proposes CrossSent, a multi-modal sentiment analysis framework for fine-grained sentiment prediction. The objective is to improve robustness under modality asynchrony and noise, and the study investigates how to (i) adaptively inject acoustic and visual cues into a textual backbone, (ii) enforce ordinal-consistent learning to enhance discrimination and reduce level skipping, and (iii) adopt tolerance-aware optimization to mitigate minor annotation uncertainty. Concretely, CrossSent introduces three components:

The Gated Multi-modal Residual Adapter (GMRA) performs dynamic cross-modal fusion by injecting visual and acoustic cues into textual representations through cross-modal attention and gated residual connections. The gating adaptively controls information flow, effectively suppressing modality asynchrony and noise.
Monotonic Pairwise Ranking (MPR) encodes pairwise ordering constraints between samples according to their sentiment levels. By enforcing consistent pairwise relations, it enhances fine-grained discrimination and mitigates level skipping.
Error-Interval Ordinal Inconsistency (EIOI) loss defines a tolerance interval for ordinal predictions, penalizing only deviations that violate ordinal consistency beyond an acceptable margin. This improves robustness to label uncertainty and enhances overall model stability.

We conduct comprehensive experiments on three standard benchmarks: CMU-MOSI, CMU-MOSEI, and the Chinese dataset CH-SIMS. CrossSent achieves consistent improvements on binary accuracy, multi-level (e.g., seven-class) accuracy, and mean absolute error (MAE), outperforming competitive baselines across datasets. Ablation studies further validate the individual effectiveness of GMRA, MPR, and EIOI as well as their complementary benefits when combined, demonstrating the practicality and robustness of CrossSent for complex, in-the-wild multi-modal sentiment analysis.

2. Related Work

2.1. Modality Fusion Mechanisms

Modality fusion aims to effectively integrate textual, acoustic, and visual features to yield expressive sentiment representations. Early strategies relied on tensor-based operators, such as Tensor Fusion Networks (TFNs) [4] and Low-rank Multi-modal Fusion (LMF) [5]. A TFN explicitly enumerates higher-order inter-modal interactions via outer products, but it often incurs severe feature redundancy, high computational cost, and overfitting risks. To reduce complexity, LMF introduces low-rank tensor factorization for efficient fusion; however, its fusion remains essentially static, lacking adaptation to modality-quality fluctuations and information degradation.

To enhance fusion dynamics, MAG-BERT [6] uses gated modulation to down-weight noisy modalities and improve robustness. Despite this progress, its interaction depth is shallow, and the mechanism struggles to capture deeper semantic relations across modalities for fine-grained ordinal prediction. Recent enhancement-based designs further strengthen fused representations, e.g., a cross-modal enhancement network [11], yet they generally do not explicitly model modality uncertainty nor provide sufficiently adaptive control under quality shifts. Attention-driven fusion has also been widely explored: Bi-Bimodal fusion [12] and dynamic cross-modal networks [13] employ attention to adapt to modality reliability changes, but residual issues such as modality asynchrony and local misalignment remain only partially addressed.

2.2. Cross-Modal Interaction Methods

Cross-modal interaction focuses on capturing semantic relations across modalities. Multi-modal Transformer (MulT) [7] pioneers cross-modal attention for unaligned sequences and alleviates asynchrony to a certain extent, yet its interaction pathways are pre-defined and relatively inflexible for real-world dynamics in modality quality. Representation decoupling provides another angle: MISA [8] separates modality-invariant and modality-specific components to improve generalization, but its interaction is largely global and static, underutilizing local cross-modal structure. Self-supervised multi-task learning further stabilizes modality-specific signals (Self-MM [14]), though the granularity of cross-modal matching is still limited for fine-grained sentiment levels. More recently, graph-based interaction deepens cross-modal reasoning: hierarchical graph attention networks (HGAMNs) [15] enrich the interaction hierarchy, while Multi-modal Transformer variants [16] push finer-grained alignment; nevertheless, flexibility to handle strong asynchrony and quality variation remains an open challenge.

2.3. Ordinal Sentiment Prediction

Sentiment intensity is intrinsically ordinal. Many MSA systems, however, overlook monotonic constraints across levels, which can cause non-monotone predictions and level skipping. CORAL [9] improves continuity via cumulative binary constraints for rank-consistent ordinal regression, but it is primarily designed for unimodal settings and does not address multi-modal fusion challenges. Trustworthy multi-modal fusion in an ordinal space [10] models uncertainty to enhance robustness, yet the fusion policy is still static and less responsive to modality-quality shifts and temporal asynchrony. Additional efforts explore multi-output/ordinal formulations with convolutional or multi-loss designs [17] and dedicated multi-modal ordinal regression networks [18], underscoring the value of ordinal modeling for fine-grained sentiment prediction.

3. Methodology

3.1. Framework Overview

As shown in Figure 1, CrossSent takes aligned text, audio, and visual features as input and injects non-text cues into the textual backbone via a Gated Multi-modal Residual Adapter (GMRA) for multi-modal fusion. In particular, GMRA serves as a lightweight plug-in module that enables cross-modal interaction while preserving the strong textual representations learned by the pretrained encoder. The fused representation is fed into a regression head to predict a continuous sentiment score. During training, we optimize the regression loss together with two ordinal-aware regularizers, MPR and EIOI, to encourage ordinal consistency and robustness to annotation noise. Overall, this design couples effective multi-modal fusion with ordinal-consistent learning, leading to more reliable fine-grained sentiment regression under realistic noisy conditions. Importantly, each component targets a distinct failure mode: GMRA addresses cross-modal asynchrony and modality-quality fluctuation via controllable cue injection, MPR enforces ordinal monotonicity through relative-order constraints, and EIOI improves robustness to small annotation discrepancies through an explicit tolerance band.

3.2. Gated Multi-Modal Residual Adapter (GMRA)

One major challenge in multi-modal sentiment analysis (MSA) lies in handling inter-modal asynchrony and dynamic variation of modality quality. Traditional fusion strategies are mostly static (e.g., simple concatenation or linear combination), which makes it difficult to capture fine-grained, dynamic interactions among modalities and to adapt the fusion strength when modality quality changes. Although Transformer-based cross-modal attention (e.g., MulT [7]) enables interaction across modalities, its interaction depth and adaptability remain insufficient for complex, noisy scenarios.

To address these issues, we propose a Gated Multi-modal Residual Adapter (GMRA) that achieves fine-grained token-level cross-modal injection and dynamic control of fusion strength, thereby improving robustness and generalization. As illustrated in Figure 2, GMRA performs cross-modal attention to inject visual and acoustic cues into text and then applies a gating mechanism to adaptively regulate the contribution of each non-text modality. Finally, a lightweight residual adapter refines the fused representation while preserving the original textual backbone, enabling stable optimization under modality asynchrony and quality fluctuations. The motivation is that non-text cues are often reliable only for certain tokens and unreliable for others (e.g., silent, occluded, or misaligned segments); hence, non-text information should be selectively injected rather than uniformly mixed. GMRA realizes this selectivity via token-level injected signals and fine-grained gates.

3.2.1. Modality Feature Embedding and Preprocessing

We consider three modalities: the textual feature sequence $X^{T} \in \mathbb{R}^{L \times d_{T}}$, the acoustic feature sequence $X^{A} \in \mathbb{R}^{L \times d_{A}}$, and the visual feature sequence $X^{V} \in \mathbb{R}^{L \times d_{V}}$. Here, $L$ is the sequence length, and $d_{T}$, $d_{A}$, $d_{V}$ are the feature dimensions of the text, audio, and vision modalities, respectively. To unify the dimensionality across modalities, we project all features into a common $d$-dimensional space via modality-specific linear mappings:

$$H^{T} = X^{T} W_{T}, \quad H^{A} = X^{A} W_{A}, \quad H^{V} = X^{V} W_{V} \tag{1}$$

where $W_{T} \in \mathbb{R}^{d_{T} \times d}$, $W_{A} \in \mathbb{R}^{d_{A} \times d}$, and $W_{V} \in \mathbb{R}^{d_{V} \times d}$ are trainable modality-specific projection parameters.

3.2.2. Cross-Modal Multi-Head Attention Mechanism

To achieve deeper and more effective feature interaction among modalities, we adopt a cross-modal multi-head attention mechanism. Specifically, the textual modality features are used as the Query, while the concatenated acoustic and visual modality features serve as the Key and Value. Formally, the attention inputs are defined as

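$$Q = H^{T} W_{Q}, \quad K = [H^{A}; H^{V}] W_{K}, \quad V = [H^{A}; H^{V}] W_{V} \tag{2}$$

where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d \times d}$ are trainable parameters of the attention mechanism. Since $W_{K}$ and $W_{V}$ are $d \times d$, the concatenation $[H^{A}; H^{V}]$ stacks the two sequences along the token dimension. Note that $W_{V}$ in Equation (2) is the value projection of the attention layer and is distinct from the visual projection $W_{V} \in \mathbb{R}^{d_{V} \times d}$ in Equation (1).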

The cross-modal multi-head attention is computed as

$$H_{cross} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W_{O} \tag{3}$$

where the $i$-th attention head is defined as

$$\mathrm{head}_{i} = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{\top}}{\sqrt{d_{k}}}\right) V_{i} \tag{4}$$

and $Q_{i}, K_{i}, V_{i}$ denote the query, key, and value matrices for the $i$-th head, respectively. Here, $d_{k} = d / h$ is the dimensionality of each attention head, $h$ is the number of attention heads, and $W_{O} \in \mathbb{R}^{d \times d}$ is the output projection matrix.

Through this cross-modal attention mechanism, fine-grained feature interaction between modalities is achieved. This enables the model to learn enhanced cross-modal representations $H_{cross} \in \mathbb{R}^{L \times d}$ that capture contextual dependencies between textual, acoustic, and visual streams. Notably, since $H_{cross}$ is computed with text tokens as Query, it forms an injected cue signal that is explicitly aligned to the textual timeline, which is crucial under inter-modal asynchrony.
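The projection and attention steps in Equations (1)–(4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random, the toy sizes are arbitrary, biases, masking, and dropout are omitted, and the value projection is named `W_val` (a hypothetical name) to keep it apart from the visual projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(M, h):
    # (length, d) -> (h, length, d // h)
    length, d = M.shape
    return M.reshape(length, h, d // h).transpose(1, 0, 2)

def cross_modal_attention(H_T, H_A, H_V, W_Q, W_K, W_val, W_O, h):
    """Text tokens act as Query; stacked audio/visual tokens act as Key/Value."""
    L, d = H_T.shape
    d_k = d // h
    H_AV = np.concatenate([H_A, H_V], axis=0)        # (2L, d), stacked along tokens
    Q = split_heads(H_T @ W_Q, h)                    # (h, L, d_k)
    K = split_heads(H_AV @ W_K, h)                   # (h, 2L, d_k)
    V = split_heads(H_AV @ W_val, h)                 # (h, 2L, d_k)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)  # (h, L, 2L)
    heads = attn @ V                                 # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d)  # Concat(head_1, ..., head_h)
    return concat @ W_O, attn

# Toy example with arbitrary (assumed) sizes and random weights.
rng = np.random.default_rng(0)
L, d_T, d_A, d_V, d, h = 5, 12, 6, 7, 8, 2
H_T = rng.normal(size=(L, d_T)) @ rng.normal(size=(d_T, d))  # Eq. (1), text
H_A = rng.normal(size=(L, d_A)) @ rng.normal(size=(d_A, d))  # Eq. (1), audio
H_V = rng.normal(size=(L, d_V)) @ rng.normal(size=(d_V, d))  # Eq. (1), vision
W_Q, W_K, W_val, W_O = (rng.normal(size=(d, d)) for _ in range(4))

H_cross, attn = cross_modal_attention(H_T, H_A, H_V, W_Q, W_K, W_val, W_O, h)
print(H_cross.shape, attn.shape)
```

Because the Query comes from the text stream, the output keeps one row per text token regardless of how many audio/visual tokens are attended to.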

3.2.3. Gated Fusion and Residual Connection

Although the above cross-modal attention mechanism enables the initial fusion of multi-modal information, it does not explicitly consider the quality differences among modalities, which may introduce noise. Therefore, we propose a gating fusion mechanism to dynamically control the fusion strength across modalities.

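$$g = \sigma([H^{T}; H_{cross}] W_{g} + b_{g}), \quad g \in [0, 1]^{L \times d} \tag{5}$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, and $W_{g} \in \mathbb{R}^{2d \times d}$ and $b_{g} \in \mathbb{R}^{d}$ are trainable parameters. Since $W_{g}$ has $2d$ rows, $[H^{T}; H_{cross}] \in \mathbb{R}^{L \times 2d}$ here denotes concatenation along the feature dimension. The gating matrix $g$ dynamically adjusts the fusion strength between modalities.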

The final fused representation is computed as

$$H_{fusion} = H^{T} + g \odot H_{cross} \tag{6}$$

where $\odot$ denotes element-wise multiplication. This gating design enables the fused representation to adaptively control the amount of injected cross-modal information and mitigate noise amplification caused by modality quality variations or asynchrony. In particular, the token-wise, element-wise gate $g \in [0, 1]^{L \times d}$ allows the model to suppress unreliable injected cues at specific tokens/features (e.g., silent or occluded segments), rather than applying a uniform fusion strength.
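The gate and residual addition of Equations (5) and (6) can be illustrated with a short NumPy sketch. The weights and sizes below are random placeholders chosen for illustration only; the second call uses a strongly negative bias to show that a closed gate leaves the text representation essentially unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(H_T, H_cross, W_g, b_g):
    """Eq. (5): g = sigmoid([H_T; H_cross] W_g + b_g); Eq. (6): H_T + g * H_cross."""
    z = np.concatenate([H_T, H_cross], axis=-1) @ W_g + b_g  # (L, 2d) @ (2d, d) -> (L, d)
    g = sigmoid(z)                                           # one gate per token and feature
    return H_T + g * H_cross, g

# Toy example with arbitrary (assumed) sizes and random parameters.
rng = np.random.default_rng(0)
L, d = 4, 6
H_T = rng.normal(size=(L, d))
H_cross = rng.normal(size=(L, d))
W_g = 0.1 * rng.normal(size=(2 * d, d))

H_fusion, g = gated_residual_fusion(H_T, H_cross, W_g, np.zeros(d))

# A strongly negative bias closes every gate, so the injected signal is suppressed.
H_closed, g_closed = gated_residual_fusion(H_T, H_cross, W_g, np.full(d, -50.0))
print(np.allclose(H_closed, H_T))
```

The residual form means the text backbone is the default output, and the gate only decides how much of the attended audio/visual signal is added on top.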

GMRA is closely related to prior multi-modal fusion paradigms, including cross-modal attention for interaction (MulT [7]), gating-based multi-modal injection into a language backbone (MAG-BERT [6]), and early static tensor/low-rank fusion (TFN [4], LMF [5]). However, GMRA is designed as a lightweight residual-adapter plug-in for targeted cue injection into the textual stream: text tokens serve as Query, while audio–visual cues are jointly used as Key/Value to produce an injected signal aligned to the textual timeline. Unlike MAG-style gated addition, which typically relies on coarser modulation of text features using pooled multi-modal cues, GMRA first constructs an attention-based injected signal $H_{cross}$ aligned to each text token and then regulates it via a fine-grained token-wise, element-wise gate $g \in [0, 1]^{L \times d}$ before residual addition. This explicitly controls the amount of non-text information added to each textual token under modality asynchrony and quality fluctuation while preserving the pretrained text representation as the primary carrier. Compared with MulT-style deep cross-modal transformers with stacked cross-attention blocks across modality streams, GMRA acts as a lightweight plug-in that injects non-text cues with minimal architectural change, which is particularly suitable for stable fine-tuning on noisy sentiment data.

3.3. Monotonic Pairwise Ranking Regularization (MPR)

In multi-modal sentiment analysis (MSA), sentiment-level prediction is essentially an ordinal regression problem where sentiment levels exhibit clear monotonic relationships. However, existing methods often simplify this task into general regression or classification, leading to discontinuous or inconsistent predictions across adjacent sentiment levels. To address this issue, we propose a Monotonic Pairwise Ranking (MPR) mechanism, which explicitly enforces ordinal ranking constraints among samples to maintain monotonic consistency and enhance fine-grained prediction.

Given a batch of $B$ samples, the ground-truth sentiment labels are defined as

$$Y = \{y_{1}, y_{2}, \dots, y_{B}\}, \quad y_{i} \in \mathbb{R} \tag{7}$$

and the model's predicted sentiment scores are

$$\hat{Y} = \{\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{B}\}, \quad \hat{y}_{i} \in \mathbb{R} \tag{8}$$

To construct effective ranking pairs, we define the pair set:

$$P = \{(i, j) \mid |y_{i} - y_{j}| \geq \gamma, \; 1 \leq i, j \leq B, \; i \neq j\} \tag{9}$$

where $\gamma$ is a threshold controlling whether the label difference is significant enough to form a ranking pair (typically $\gamma = 1$).

For each $(i, j) \in P$, the MPR objective encourages consistency between the ordering of predictions and ground-truth labels. The pairwise ranking loss is defined as

$$L_{MPR} = \frac{1}{|P|} \sum_{(i, j) \in P} \max\{0, \; m(|y_{i} - y_{j}|) - \mathrm{sgn}(y_{i} - y_{j}) (\hat{y}_{i} - \hat{y}_{j})\} \tag{10}$$

where $\mathrm{sgn}(\cdot)$ denotes the sign function:

$$\mathrm{sgn}(x) = \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ 0, & x = 0. \end{cases} \tag{11}$$

The adaptive margin $m(|y_{i} - y_{j}|)$ varies with the ordinal difference:

$$m(|y_{i} - y_{j}|) = \alpha \cdot |y_{i} - y_{j}|^{\beta} \tag{12}$$

where $\alpha > 0$ and $\beta \geq 1$ are hyperparameters controlling the margin magnitude and sensitivity.

Intuitively, if two samples $(i, j)$ have a large label gap (e.g., $|y_{i} - y_{j}| \geq 2$), the model is required to preserve a proportional margin in predictions. When the predicted difference fails to satisfy this margin, the pair contributes a positive penalty; otherwise, the loss is zero.
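The pair construction and hinge penalty of Equations (9)–(12) can be written out in plain Python as below. This is a sketch for illustration only: the default values `alpha=1.0` and `beta=1.0` are assumptions (the text only requires $\alpha > 0$ and $\beta \geq 1$), and returning zero for an empty pair set is a convention chosen here, since Equation (10) is undefined when $|P| = 0$.

```python
def sign(x):
    # Sign function of Eq. (11): returns 1, -1, or 0.
    return (x > 0) - (x < 0)

def mpr_loss(y_true, y_pred, gamma=1.0, alpha=1.0, beta=1.0):
    """Monotonic pairwise ranking loss over all ordered pairs with label gap >= gamma."""
    n = len(y_true)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and abs(y_true[i] - y_true[j]) >= gamma]   # Eq. (9)
    if not pairs:
        return 0.0  # assumed convention for batches with no valid pair
    total = 0.0
    for i, j in pairs:
        gap = abs(y_true[i] - y_true[j])
        margin = alpha * gap ** beta                              # Eq. (12)
        agreement = sign(y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
        total += max(0.0, margin - agreement)                     # Eq. (10)
    return total / len(pairs)

labels = [-2.0, 0.0, 2.0]
loss_ordered = mpr_loss(labels, [-2.0, 0.0, 2.0])   # predictions respect every margin
loss_reversed = mpr_loss(labels, [2.0, 0.0, -2.0])  # predictions are fully reversed
loss_near_ties = mpr_loss([0.0, 0.5], [1.0, -1.0])  # gap below gamma, so no pair is formed
print(loss_ordered, loss_reversed, loss_near_ties)
```

With these defaults, correctly ordered predictions that match the label gaps incur no penalty, reversed predictions are penalized in proportion to the label gap, and near-tied labels are ignored entirely.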

By explicitly modeling ordinal consistency, the MPR mechanism effectively reduces discontinuity and reversals in sentiment-level prediction. In practice, MPR is optimized jointly with regression-based losses (e.g., MSE and EIOI) to enhance both prediction accuracy and ordinal robustness in fine-grained sentiment estimation.

MPR is related to ordinal/robust learning objectives that improve monotonicity and stability (CORAL [9] and ordinal-space trustworthy fusion [10]), but it is tailored to fine-grained multi-modal sentiment regression by explicitly modeling relative ordering among samples. Compared with standard fixed-margin ranking objectives, MPR makes two noise-aware choices for continuous sentiment annotation: (i) it constructs ranking pairs only when $|y_{i} - y_{j}| \geq \gamma$, which focuses supervision on meaningful label differences and avoids enforcing potentially unreliable orderings for near-tied samples, and (ii) it uses an ordinal-gap-aware adaptive margin $m(|y_{i} - y_{j}|)$ that scales with label distance, so that larger ordinal gaps are encouraged to produce proportionally larger prediction separations, directly penalizing level-skipping and local order reversals that may persist under a constant margin.

3.4. Error-Interval Ordinal Inconsistency Loss (EIOI)

In multi-modal sentiment analysis, annotation labels are often subjective and noisy. Different annotators may assign slightly different sentiment intensities for similar expressions, and even the same annotator can introduce small variations. To handle such uncertainty, we propose an Error-Interval Ordinal Inconsistency (EIOI) loss that introduces a tolerance interval to avoid penalizing minor deviations and improves prediction stability and generalization.

Formally, let the model-predicted sentiment intensities be

$$\hat{Y} = \{\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{B}\}, \quad \hat{y}_{i} \in \mathbb{R} \tag{13}$$

and the corresponding ground-truth sentiment labels be

$$Y = \{y_{1}, y_{2}, \dots, y_{B}\}, \quad y_{i} \in \mathbb{R} \tag{14}$$

For each $y_{i}$, we define a tolerance range $[y_{i} - \delta, y_{i} + \delta]$ with a hyperparameter $\delta > 0$ controlling the acceptable deviation. The EIOI loss is formulated as

$$L_{EIOI} = \frac{1}{B} \sum_{i = 1}^{B} \left(\max\{0, \; |\hat{y}_{i} - y_{i}| - \delta\}\right)^{2} \tag{15}$$

When $|\hat{y}_{i} - y_{i}| \leq \delta$, the prediction is considered acceptable and the loss is zero; otherwise, only the excess deviation contributes to the loss, allowing smooth penalization beyond the tolerance band. This encourages the model to focus on significant prediction errors rather than label noise.
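Equation (15) amounts to a squared error with a dead zone of half-width $\delta$. The plain-Python sketch below illustrates this on three made-up predictions; the value `delta=0.5` is picked only for the example.

```python
def eioi_loss(y_true, y_pred, delta=0.5):
    """Eq. (15): squared excess of the absolute error over the tolerance delta."""
    excess = [max(0.0, abs(p - t) - delta) for t, p in zip(y_true, y_pred)]
    return sum(e ** 2 for e in excess) / len(y_true)

def mse_loss(y_true, y_pred):
    # Plain mean squared error, shown only for comparison.
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

labels = [1.0, -2.0, 0.0]
preds = [1.3, -1.0, 0.5]   # absolute errors: 0.3, 1.0, 0.5

loss_eioi = eioi_loss(labels, preds, delta=0.5)
loss_mse = mse_loss(labels, preds)
print(loss_eioi, loss_mse)
```

Only the second prediction exceeds the tolerance band, so it is the only one that contributes to the EIOI value, whereas all three contribute to the MSE.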

The value of $\delta$ can be tuned according to the dataset granularity. For coarse-grained tasks (e.g., 5- or 7-level scales), a larger $\delta$ (typically $0.5 \sim 1.0$) is recommended, while for fine-grained continuous regression, a smaller $\delta$ ($0.2 \sim 0.5$) ensures balanced precision and stability.

Finally, the total loss is defined as

$$L_{total} = L_{MSE} + \lambda_{mpr} L_{MPR} + \lambda_{eioi} L_{EIOI} \tag{16}$$

where $\lambda_{mpr}$ and $\lambda_{eioi}$ control the contributions of MPR and EIOI, respectively. This joint objective improves both predictive accuracy and ordinal consistency of the model in fine-grained sentiment regression tasks.

EIOI is related to robust regression objectives that reduce sensitivity to small label perturbations, and it complements ordinal-consistent learning such as CORAL [9] and ordinal-space robustness modeling [10]. Different from standard robust losses that apply a fixed shape everywhere, EIOI introduces an explicit tolerance interval $[y_{i} - \delta, y_{i} + \delta]$ and assigns zero penalty to predictions within this band. Thus, the optimization focuses on deviations that exceed the acceptable ordinal error range, which is particularly suitable for subjective sentiment annotations where minor disagreements are common. In terms of positioning, EIOI can be viewed as a tolerance-band robust regression objective whose goal is to explicitly encode an acceptable ordinal error interval for subjective sentiment labels. In CrossSent, EIOI provides absolute-deviation tolerance, while MPR provides relative-order consistency; together they reduce both noise-driven over-penalization and ranking reversals, limiting level-skipping under noisy supervision.

4. Experiments

4.1. Datasets and Evaluation Metrics

We adopt three widely used multi-modal sentiment analysis datasets: CMU-MOSI [19], CMU-MOSEI [20], and CH-SIMS [21]. CMU-MOSI and CMU-MOSEI are standard English multi-modal sentiment analysis datasets, each containing opinion video clips annotated with sentiment intensities ranging from $-3$ to $+3$. Each sample includes synchronized textual transcripts, acoustic features, and visual representations. CH-SIMS is a large-scale Chinese multi-modal sentiment analysis dataset, in which each video segment contains aligned text, audio, and visual modalities with corresponding continuous sentiment annotations. This dataset supports both regression and classification formulations of MSA.

Dataset statistics: The detailed statistics of the three datasets are summarized in Table 1.

Classification tasks: Following prior work, we report accuracy under different granularity levels, including binary accuracy (ACC2), three-class accuracy (ACC3), five-class accuracy (ACC5), and seven-class accuracy (ACC7). We also compute macro-average F1-score to evaluate overall classification balance.

Regression tasks: For continuous sentiment regression, we adopt two widely used evaluation metrics: (i) mean absolute error (MAE), which measures the average deviation between predicted and ground-truth sentiment intensities, and (ii) Pearson correlation coefficient (Corr), which evaluates the linear correlation between predictions and labels. All results are reported on the official test splits for a fair comparison with existing methods.
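For reference, the two regression metrics can be computed as in the following plain-Python sketch. The label and prediction values are made up for illustration and are not taken from any dataset.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_corr(y_true, y_pred):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

labels = [-3.0, -1.0, 0.0, 2.0, 3.0]
preds = [-2.5, -1.5, 0.5, 1.5, 2.5]

mae = mean_absolute_error(labels, preds)
corr = pearson_corr(labels, preds)
print(mae, corr)
```

Note that Corr is invariant to an affine rescaling of the predictions, so it complements MAE rather than duplicating it.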

Ordinal-consistency metrics: Since sentiment intensity is inherently ordinal, we additionally report two ordinal-specific measures to substantiate the benefits of the proposed ordinal regularizers (MPR and EIOI). First, we compute quadratic weighted kappa (QWK), which measures agreement between predicted and ground-truth ordinal levels with quadratic penalties for larger level deviations. Second, we report level-jump statistics to quantify level-skipping behavior, including (i) $\mathrm{Jump}_{2+}$, the ratio of samples whose absolute difference between predicted and ground-truth levels is at least 2, and (ii) MeanAbsJump, the mean absolute difference between predicted and ground-truth levels.

To compute these ordinal metrics, we discretize continuous scores into ordered levels following standard practice. For MOSI/MOSEI (label range $[-3, 3]$), we round predictions and labels to the 7-level set $\{-3, -2, -1, 0, 1, 2, 3\}$. For CH-SIMS (label range $[-1, 1]$), we primarily use a 5-level discretization by scaling scores by 2 and rounding to $\{-2, -1, 0, 1, 2\}$, and we also report a 3-level variant by rounding to $\{-1, 0, 1\}$ for reference. These ordinal metrics are reported on the official test splits.
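The discretization and the ordinal metrics described above can be sketched in plain Python as follows. Two details are assumptions made for this illustration, since the text does not specify them: ties are rounded half up via `floor(x + 0.5)`, and out-of-range predictions are clipped to the level set. QWK follows the standard definition with weights $(i - j)^2 / (K - 1)^2$.

```python
import math

def discretize(scores, scale=1.0, lo=-3, hi=3):
    # Scale, round half up (assumed tie rule), and clip to the level set (assumed).
    return [min(hi, max(lo, math.floor(s * scale + 0.5))) for s in scores]

def level_jump_stats(true_lv, pred_lv):
    diffs = [abs(t - p) for t, p in zip(true_lv, pred_lv)]
    jump2_plus = sum(d >= 2 for d in diffs) / len(diffs)  # Jump_{2+}
    mean_abs_jump = sum(diffs) / len(diffs)               # MeanAbsJump
    return jump2_plus, mean_abs_jump

def quadratic_weighted_kappa(true_lv, pred_lv, lo, hi):
    K, n = hi - lo + 1, len(true_lv)
    observed = [[0] * K for _ in range(K)]
    for t, p in zip(true_lv, pred_lv):
        observed[t - lo][p - lo] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(K)) for j in range(K)]
    num = den = 0.0
    for i in range(K):
        for j in range(K):
            w = (i - j) ** 2 / (K - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n   # expected count under independence
    # den is zero only if all labels and predictions share a single level.
    return 1.0 - num / den if den > 0 else float("nan")

# Made-up MOSI-style scores in [-3, 3].
labels = discretize([-2.6, -1.2, 0.1, 1.4, 2.7])
preds = discretize([-2.4, -0.4, 0.2, 1.1, 0.4])
jump2_plus, mean_abs_jump = level_jump_stats(labels, preds)
qwk = quadratic_weighted_kappa(labels, preds, lo=-3, hi=3)

# CH-SIMS-style 5-level discretization: scale by 2, then round.
sims_levels = discretize([-0.8, 0.3, 1.0], scale=2.0, lo=-2, hi=2)
print(labels, preds, jump2_plus, mean_abs_jump, qwk, sims_levels)
```

In this toy case a single prediction skips three levels, which is visible in $\mathrm{Jump}_{2+}$ and lowers QWK far more than the two one-level errors do.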

4.2. Implementation Details

We evaluate CrossSent on two English datasets (CMU-MOSI and CMU-MOSEI) and one Chinese dataset (CH-SIMS) using the official train/valid/test splits. For text, we fine-tune a RoBERTa-large backbone on English datasets and a BERT-family backbone on CH-SIMS, following the model-path based backend selection in our implementation. For audio and vision, we use publicly released, pre-extracted features. Because A/V feature dimensionalities vary across datasets, we keep dataset-specific A/V dimensions and project them into the shared fusion space of $d = 768$ inside GMRA, ensuring a consistent multi-modal interaction space (Table 2).

We adopt a deterministic token-level alignment and fixed-length input construction pipeline (Table 3). For MOSI/MOSEI, words are tokenized into subwords; word-level discrete tags are expanded to subwords using token-inversion indices, and the sequence is truncated/padded to a fixed maximum length $T = 50$. For CH-SIMS, continuous A/V streams are resampled to the token length and written into a fixed buffer following the [CLS] + content + padding layout. A unified attention mask is used to ignore padded positions.
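The word-to-subword expansion and fixed-length padding for MOSI/MOSEI can be illustrated with a small plain-Python sketch. The function names, the form of the inversion indices (one source-word index per subword), and the zero padding value are assumptions made for this example; special tokens such as [CLS] are left out for brevity, and the short `max_len=6` stands in for $T = 50$.

```python
def expand_to_subwords(word_feats, inversions):
    # inversions[k] is the index of the word that produced subword k (assumed format).
    return [list(word_feats[w]) for w in inversions]

def pad_or_truncate(seq, max_len, feat_dim):
    # Truncate to max_len, then pad with zero vectors; mask marks real positions with 1.
    seq = seq[:max_len]
    n_pad = max_len - len(seq)
    mask = [1] * len(seq) + [0] * n_pad
    padded = seq + [[0.0] * feat_dim for _ in range(n_pad)]
    return padded, mask

# Two words; the first is split into three subwords, the second into one.
word_feats = [[0.1, 0.2], [0.3, 0.4]]   # hypothetical word-level A/V values
inversions = [0, 0, 0, 1]

sub_feats = expand_to_subwords(word_feats, inversions)
padded, mask = pad_or_truncate(sub_feats, max_len=6, feat_dim=2)
print(padded)
print(mask)
```

Repeating each word-level value across its subwords keeps the non-text sequences the same length as the text token sequence, which is what allows a single attention mask to cover all three modalities.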

For baseline comparisons, we report numbers from the original papers, since many methods do not release public implementations and/or adopt different backbones and feature-extraction pipelines that are not directly comparable under a unified retraining setting.

All experiments are conducted on a single NVIDIA A100 GPU (Table 3). We train with AdamW and a linear warmup schedule, apply gradient clipping, and select the best checkpoint based on validation MSE. The overall objective is MSE regression augmented with ordinal-aware regularizers (MPR and EIOI), where the regularizer weights are activated via a cosine ramp-up over epochs (Table 3). Dataset-dependent hyperparameters are summarized in Table 2. We also performed a small sensitivity check by perturbing the GMRA injection depth $l^{\star}$ and the regularizer weights $(\lambda_{mpr}, \lambda_{eioi})$ around the tuned values, and observed consistent trends, with the best performance attained near the reported settings. Our experiments follow the widely used pre-extracted feature setting for audio and vision; end-to-end optimization of acoustic/visual encoders and deployment-oriented pipelines are left for future work.
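One plausible form of the cosine ramp-up on the regularizer weights is sketched below. The exact schedule is not given in the text, so the half-cosine shape, the `ramp_epochs` parameter, and the example weights are all assumptions for illustration.

```python
import math

def cosine_ramp(epoch, ramp_epochs):
    # Rises smoothly from 0 at epoch 0 to 1 at ramp_epochs, then stays at 1 (assumed shape).
    if ramp_epochs <= 0:
        return 1.0
    t = min(max(epoch, 0), ramp_epochs) / ramp_epochs
    return 0.5 * (1.0 - math.cos(math.pi * t))

def total_loss(l_mse, l_mpr, l_eioi, lam_mpr, lam_eioi, epoch, ramp_epochs):
    # Eq. (16) with both regularizer weights scaled by the same ramp factor.
    r = cosine_ramp(epoch, ramp_epochs)
    return l_mse + r * lam_mpr * l_mpr + r * lam_eioi * l_eioi

# Hypothetical loss values and weights, used only to show the schedule.
for epoch in (0, 5, 10, 20):
    print(epoch, cosine_ramp(epoch, ramp_epochs=10),
          total_loss(0.8, 0.4, 0.2, lam_mpr=0.5, lam_eioi=0.5,
                     epoch=epoch, ramp_epochs=10))
```

Under this schedule the model is trained on MSE alone at the start, and the ordinal regularizers reach their full weights only after the ramp completes.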

4.3. Computational Complexity and Inference Efficiency

We analyze the computational overhead introduced by the proposed GMRA injection in terms of parameter count and inference efficiency. As summarized in Table 3, CrossSent injects GMRA once at $l^{\star} = 1$ with a fixed maximum token length $T = 50$. The dominant extra cost stems from attention-style interactions inside GMRA, whose complexity is on the order of $O(T_{q} T_{k} d)$ (with hidden size $d$); under our token-level alignment where $T_{q} \approx T_{k} \approx T$, this is approximately $O(T^{2} d)$ per injected layer. Other components (e.g., gating and lightweight projections) add comparatively minor overhead.
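As a rough worked figure under the settings stated in this paper ($T = 50$ from this section and $d = 768$ from Section 4.2), and ignoring constant factors and the number of heads, the leading term evaluates to:

```latex
T^{2} d = 50^{2} \times 768 = 2500 \times 768 = 1.92 \times 10^{6}
```

This is an order-of-magnitude estimate of the multiply-accumulate count for the attention scores in one injected layer, not a measured cost.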

Efficiency comparison: We compare the full model with a baseline variant where GMRA is disabled, keeping all other settings identical. Both variants are evaluated on the same NVIDIA A100 GPU with the same input shape ($T = 50$). We measure forward-only inference latency with warmup = 20 and iters = 50, applying CUDA synchronization for accurate timing. We report single-sample latency (batch = 1), throughput (batch = 64), and peak GPU memory footprint.
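A timing protocol of this kind can be sketched with a framework-agnostic harness. The function names and the callback interface are assumptions made for illustration; only the warmup and iteration counts (20 and 50) come from the text.

```python
import time
from typing import Callable


def measure_latency_ms(run_once: Callable[[], None],
                       sync: Callable[[], None] = lambda: None,
                       warmup: int = 20,
                       iters: int = 50) -> float:
    """Average wall-clock time per call in milliseconds.

    `sync` is called before starting and before stopping the timer so that
    asynchronous device work is finished when timestamps are taken.
    """
    for _ in range(warmup):          # untimed runs to stabilize caches and clocks
        run_once()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    sync()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / iters


def throughput_samples_per_s(batch_size: int, batch_latency_ms: float) -> float:
    """Samples processed per second given the latency of one batch."""
    return batch_size / (batch_latency_ms / 1000.0)
```

With PyTorch, `run_once` would typically wrap a forward pass under `torch.no_grad()` and `sync` would be `torch.cuda.synchronize`; without the synchronization, the timer may stop before queued GPU kernels have finished.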

Results: Table 4 shows that enabling GMRA introduces 8.89M additional parameters (125.25M → 134.14M, +7.10%). The batch-1 inference latency increases from 12.04 to 13.42 ms/sample (+11.42%), while the batch-64 throughput decreases from 1489.19 to 1392.97 samples/s (−6.46%). Peak GPU memory slightly increases from 702.6 MB to 712.0 MB (+1.33%). Overall, GMRA incurs a modest inference overhead while preserving practical efficiency.
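The relative overheads quoted above follow directly from the raw values in Table 4. The short check below recomputes them; the dictionary keys and the helper name are illustrative.

```python
def relative_change_pct(baseline: float, full: float) -> float:
    """Signed percentage change from the baseline to the full model."""
    return 100.0 * (full - baseline) / baseline


# (baseline with GMRA disabled, full model with GMRA enabled), from Table 4
table4 = {
    "params_M": (125.248, 134.138),
    "latency_ms_b1": (12.043, 13.418),
    "throughput_b64": (1489.191, 1392.969),
    "peak_mem_MB": (702.60, 711.97),
}

overheads = {name: round(relative_change_pct(base, full), 2)
             for name, (base, full) in table4.items()}
```

The resulting values are +7.10% parameters, +11.42% latency, −6.46% throughput, and +1.33% peak memory, matching the overhead row of Table 4.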

4.4. Performance Comparison

To comprehensively evaluate the proposed CrossSent model, we conduct comparisons on CMU-MOSI, CMU-MOSEI, and CH-SIMS. We benchmark our model against several state-of-the-art methods, including MulT [7], MISA [8], Self-MM [14], BBFN [22], CubeMLP [23], TETFN [24], MIMM [25], VLP2MSA [26], CAGC [27], FMFN [28], CMLG [29], and ULMD [30]. Results on CMU-MOSI, CMU-MOSEI, and CH-SIMS are shown in Table 5, Table 6 and Table 7, respectively.

For transparency, we clarify the source of the compared results. Unless otherwise specified, all baseline numbers are quoted from the original papers (reported results), as many methods do not release public code and/or use different backbones and feature-extraction pipelines that are not directly comparable under a unified retraining setting. In contrast, the results of CrossSent are obtained from our own runs using the official dataset splits and the alignment/training protocol described in Section 4.2. We treat cross-paper baselines as reference results rather than strictly controlled comparisons and therefore emphasize the performance of CrossSent under our fixed protocol.

Results on CMU-MOSI: CrossSent achieves the best performance across all key metrics. It attains 89.78% ACC2 and 89.75% F1-score, surpassing strong baselines. On the fine-grained seven-class task, CrossSent reaches 52.1%, markedly higher than CAGC (44.8%) and ULMD (47.8%). For regression evaluation, it yields the lowest MAE (0.563) and the highest Corr (0.878), indicating more precise and consistent sentiment estimation.

Results on CMU-MOSEI: On the larger-scale MOSEI dataset, CrossSent again achieves the best overall performance. It reaches 87.72% ACC2 and 87.71% F1-score and maintains competitive fine-grained accuracy (ACC7 = 54.7%). Moreover, it produces the lowest MAE (0.513) and the highest Corr (0.805), demonstrating robust generalization across modalities in more diverse scenarios.

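Results on CH-SIMS: On the Chinese CH-SIMS dataset, CrossSent achieves the best ACC5 (43.54%) and Corr (0.622) and matches the lowest MAE (0.408, tied with ALMT), indicating stable continuous prediction in a second language. Its ACC3 (62.36%) and ACC2 (80.41%) are below the strongest baselines in Table 7 (65.86% for CubeMLP and ALMT; 81.18% for TETFN), so the gains on this dataset are concentrated in the five-class and regression metrics.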

Overall analysis: Across both English and Chinese benchmarks, the improvements mainly come from three complementary components: GMRA for dynamic feature fusion, MPR for ordinal consistency, and EIOI for tolerance-aware regression. Together, they enhance fusion depth, adaptivity to modality quality, and fine-grained sentiment precision. Moreover, as sentiment prediction is ordinal in nature, we report ordinal-specific measures in Table 8. CrossSent achieves strong agreement under quadratic weighted kappa (QWK) on CMU-MOSI and CMU-MOSEI (0.8459 and 0.7541) and exhibits little level-jumping there, as indicated by low Jump₂₊ (0.0583 and 0.0434) and MeanAbsJump (0.5773 and 0.5068); agreement on CH-SIMS is weaker (QWK 0.3965, Jump₂₊ 0.2560). These results are consistent with the intended effect of the ordinal-aware objectives (MPR and EIOI) and complement accuracy- and regression-oriented metrics, although Table 8 reports only the full model.
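For readers who want to reproduce the ordinal measures, the following is a minimal sketch of QWK and the two level-jump statistics as they are defined in the caption of Table 8. It assumes that labels and predictions have already been discretized to integer levels `0 .. num_levels - 1`; how continuous scores are mapped to levels (for example, rounding CMU-MOSI scores in [−3, 3] and shifting by 3) is an assumption left to the caller.

```python
from typing import Sequence, Tuple


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             num_levels: int) -> float:
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_levels))
                 for j in range(num_levels)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den if den > 0 else 1.0


def jump_stats(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (Jump2+, MeanAbsJump): share of samples off by >= 2 levels, and mean |diff|."""
    diffs = [abs(t - p) for t, p in zip(y_true, y_pred)]
    jump2 = sum(1 for d in diffs if d >= 2) / len(diffs)
    mean_abs = sum(diffs) / len(diffs)
    return jump2, mean_abs
```

QWK equals 1 for perfect agreement and penalizes a two-level error four times as much as a one-level error, which is why it complements exact-match accuracy such as ACC7.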

4.5. Ablation Study

To isolate the contribution of each core component in CrossSent (GMRA, MPR, and EIOI), we conduct ablation experiments on two English benchmarks (CMU-MOSI and CMU-MOSEI) and one Chinese benchmark (CH-SIMS). We create three variants by removing MPR, EIOI, or GMRA while keeping all other settings unchanged. We report ACC/F1/MAE/Corr as the primary ablation metrics for direct comparison with prior MSA studies; ordinal-consistency measures (e.g., QWK and level-jump statistics) are presented in the main results as supplementary characterization and are not separately ablated. The results are summarized in Tables 9–11.

4.5.1. Ablation Analysis on CMU-MOSI Dataset

We first conduct ablation experiments on the CMU-MOSI dataset. The detailed results are summarized in Table 9.

Table 9 shows that adding MPR slightly reduces coarse-grained polarity performance on MOSI (ACC2/F1: 89.78/89.75 vs. 90.09/90.06 without MPR) while consistently improving fine-grained ordinal discrimination and regression fidelity (ACC7: 52.1 vs. 50.7; MAE/Corr: 0.563/0.878 vs. 0.584/0.868). This trade-off is expected because ACC2 is obtained by thresholding a continuous sentiment score around the neutral boundary, where a small number of borderline samples can flip the binary decision without reflecting a genuine improvement in the underlying sentiment geometry. By contrast, MPR imposes ordinal-gap-aware pairwise constraints that explicitly regularize relative ordering across sentiment levels, encouraging smoother monotone separation and suppressing local order reversals; consequently, it may slightly perturb decisions near the binary threshold, yet it yields a more faithful global ordinal structure that is better captured by ACC7 and correlation-based regression metrics.
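The difference between the two metric families can be illustrated with a small sketch. The conventions used here are assumptions: ACC2 is computed as negative versus non-negative, and ACC7 rounds scores to the nearest integer level in [−3, 3]; other conventions exist in the MSA literature.

```python
import math
from typing import Sequence


def to_level7(score: float) -> int:
    """Round half away from zero, then clip to the integer range [-3, 3]."""
    sign = 1 if score >= 0 else -1
    rounded = sign * math.floor(abs(score) + 0.5)
    return max(-3, min(3, rounded))


def acc2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Binary accuracy: negative (< 0) versus non-negative (>= 0)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t >= 0) == (p >= 0))
    return hits / len(y_true)


def acc7(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Seven-class accuracy after discretizing labels and predictions."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if to_level7(t) == to_level7(p))
    return hits / len(y_true)


# A near-neutral sample flips ACC2 but not ACC7; a one-level error does the opposite.
y_true = [0.2, 2.4]
y_pred = [-0.1, 1.4]
```

In this toy example the first sample is an ACC2 miss although its error is only 0.3, while the second is an ACC2 hit despite an error of 1.0, so the two metrics can move in opposite directions for small changes in the predictions.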

When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC7 also drops to 50.7%, MAE rises to 0.576, and Corr decreases to 0.868. This demonstrates that EIOI effectively introduces tolerance for annotation uncertainty and mitigates over-sensitivity to subtle label variations, thereby improving stability and generalization.
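The idea of tolerating small deviations can be illustrated with a generic tolerance-band (epsilon-insensitive) penalty. This is not the EIOI definition from Section 3.4; the function name and the band width `eps = 0.25` are hypothetical and serve only to show how errors inside a band can be left unpenalized.

```python
from typing import Sequence


def tolerance_band_penalty(y_true: Sequence[float], y_pred: Sequence[float],
                           eps: float = 0.25) -> float:
    """Mean squared excess error beyond a tolerance band of half-width `eps`.

    Errors with |pred - label| <= eps contribute zero, so small annotation
    noise does not drive the gradient.
    """
    excess = [max(0.0, abs(p - t) - eps) for t, p in zip(y_true, y_pred)]
    return sum(e * e for e in excess) / len(excess)
```

A prediction of 1.2 for a label of 1.0 incurs no penalty under this sketch, whereas a prediction of 2.0 is penalized only for the 0.75 by which it exceeds the band.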

Finally, removing the Gated Multi-modal Residual Adapter (GMRA) causes the largest performance decline. ACC7 decreases to 50.1%, MAE increases to 0.592, and Corr drops to 0.864. This suggests that GMRA is essential for lightweight feature-level fusion via a residual adapter and for adaptively balancing modality quality differences. Without GMRA, the model struggles to handle asynchronous or noisy modality inputs, resulting in weaker overall performance.

4.5.2. Ablation Analysis on CMU-MOSEI Dataset

We further perform ablation experiments on the CMU-MOSEI dataset, and the detailed results are shown in Table 10.

On the larger and more diverse MOSEI benchmark (Table 10), the polarity–ordinal tension becomes much weaker: removing MPR causes a small but consistent degradation in fine-grained accuracy (ACC7: 54.7 vs. 54.4), while ACC2/F1 change only marginally (87.72/87.71 vs. 87.45/87.46) and MAE/Corr remain essentially stable (0.513/0.804 vs. 0.512/0.805). This suggests that, under a broader distribution, the binary metric is less dominated by a handful of near-boundary cases, and the main contribution of MPR is to stabilize the ordinal structure rather than to shift coarse-grained polarity, which aligns with its goal of improving level-wise separability and reducing potential level skipping.

When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC7 also decreases to 54.4% and ACC2 further declines to 87.23%. Although Corr (0.804) remains steady and MAE is nearly constant, EIOI is beneficial for handling label uncertainty and reducing over-sensitivity to noisy annotations, which helps maintain robustness and generalization on large datasets.

In contrast, removing the Gated Multi-modal Residual Adapter (GMRA) leads to a more obvious performance drop: ACC7 falls to 53.7%, ACC2 declines to 86.90%, MAE rises to 0.520, and Corr decreases to 0.796. This demonstrates that GMRA effectively manages modality quality variations and temporal asynchrony, enhancing the model’s robustness and generalization performance.

4.5.3. Ablation Analysis on CH-SIMS Dataset

Finally, we conduct detailed ablation experiments on the CH-SIMS dataset, and the results are reported in Table 11.

For CH-SIMS (Table 11), MPR provides clearer benefits for overall robustness: removing MPR leads to a noticeable drop in polarity performance (ACC2/F1: 80.41/80.06 vs. 78.86/78.55) and also degrades regression quality (MAE/Corr: 0.408/0.622 vs. 0.418/0.601), while multi-class accuracies remain nearly unchanged (ACC5/ACC3: 43.54/62.36 vs. 43.42/62.36). Given that CH-SIMS is annotated with continuous scores and exhibits different linguistic and distributional characteristics, enforcing monotone pairwise constraints helps suppress local order reversals induced by subjective labeling noise, thereby improving the ordinal geometry and the fidelity of continuous prediction beyond what discrete multi-class accuracy alone can reflect.

When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC3 decreases from 62.36% to 61.92%, and ACC2 and F1 also drop (to 79.38% and 79.44%, respectively). In regression, MAE increases to 0.416 and Corr decreases to 0.615, suggesting that EIOI improves tolerance to annotation uncertainty and stabilizes regression under noise.

Finally, removing the Gated Multi-modal Residual Adapter (GMRA) yields the largest drop in multi-class accuracy: ACC5 decreases to 43.32% and ACC3 to 61.26%. ACC2/F1 (79.89/79.64), MAE (0.415), and Corr (0.606) also degrade, although on these four metrics removing MPR hurts more (Table 11). This suggests that GMRA plays an important role in deep cross-modal fusion and in adaptively balancing modality quality and asynchrony, which supports robustness and generalization in the Chinese setting.

4.5.4. Summary of Ablation Experiments

Overall: Across MOSI, MOSEI, and CH-SIMS, GMRA, MPR, and EIOI provide complementary benefits; together they deliver the best accuracy, ordinal consistency, and robustness.

GMRA: Removing GMRA consistently degrades fine-grained classification and regression (MAE↑, Corr↓), confirming its role in dynamic cross-modal fusion and robustness to modality asynchrony and quality variations.

MPR: Without MPR, fine-grained performance and stability decrease, particularly on MOSI and CH-SIMS, indicating that pairwise ranking improves ordinal separability and mitigates level-jumping.

EIOI: Removing EIOI leads to higher MAE and lower Corr on MOSI and CH-SIMS (both are essentially unchanged on MOSEI), with moderate accuracy degradation on all three datasets, suggesting that its interval-based design enhances robustness against label noise and annotation uncertainty.
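As an illustration of the pairwise ranking idea summarized for MPR, the following sketch implements a generic gap-weighted margin ranking penalty. The parameter names and defaults mirror the MPR entries in Table 3 (margin, min_gap, pow_gap, max_pairs), but how those parameters enter the actual MPR loss of Section 3.3 is an assumption here, as are the function name and the random subsampling of pairs.

```python
import itertools
import random
from typing import Sequence


def pairwise_ranking_penalty(y_true: Sequence[float], y_pred: Sequence[float],
                             margin: float = 0.30, min_gap: float = 0.30,
                             pow_gap: float = 1.0, max_pairs: int = 8192,
                             seed: int = 0) -> float:
    """Mean hinge penalty over ordered pairs whose labels differ by >= min_gap.

    For a pair (i, j) with y_true[i] > y_true[j], the prediction gap
    y_pred[i] - y_pred[j] should be at least `margin`; violations are
    weighted by the label gap raised to `pow_gap`.
    """
    pairs = [(i, j) for i, j in itertools.permutations(range(len(y_true)), 2)
             if y_true[i] - y_true[j] >= min_gap]
    if len(pairs) > max_pairs:  # cap the pair count for large batches
        pairs = random.Random(seed).sample(pairs, max_pairs)
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        gap = y_true[i] - y_true[j]
        violation = max(0.0, margin - (y_pred[i] - y_pred[j]))
        total += (gap ** pow_gap) * violation
    return total / len(pairs)
```

Predictions that preserve the label ordering with a sufficient margin incur zero penalty, while order reversals between distant sentiment levels are penalized most, which matches the stated goal of suppressing level skipping.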

4.6. Generalization Experiments

To further verify the generalization capability of the proposed CrossSent model under cross-dataset sentiment prediction scenarios, we conduct zero-shot generalization experiments on two English datasets, CMU-MOSI and CMU-MOSEI. Specifically, we train CrossSent on one dataset and directly evaluate it on the other, without any fine-tuning or adaptation, to assess the model’s ability to generalize across data domains and sentiment distributions. The experimental results are summarized in Table 12.

When the CrossSent model is trained on the smaller CMU-MOSI dataset and tested on the larger and more diverse CMU-MOSEI dataset, ACC2 decreases to 82.63% (a drop of 7.15 percentage points from 89.78% on MOSI), ACC7 decreases to 46.31%, Corr decreases to 0.714, MAE increases to 0.646, and F1 drops to 82.5% (Table 12 lists Corr multiplied by 100 and F1 as a fraction). Although all metrics decline, the model retains usable cross-dataset performance, indicating that CrossSent can transfer from a smaller dataset to a larger and more complex one.

When the CrossSent model is trained on the larger CMU-MOSEI dataset and tested on the smaller CMU-MOSI dataset, ACC2 reaches 86.43%, which is 1.29 percentage points below the in-domain MOSEI result (87.72%) and 3.35 points below in-domain training on MOSI (89.78%). Relative to MOSI → MOSI, ACC7 drops from 52.18% to 49.85%, Corr decreases from 0.878 to 0.845, MAE increases from 0.563 to 0.621, and F1 decreases from 89.7% to 86.4%. Overall, the degradation is smaller than in the MOSI → MOSEI direction, suggesting that representations learned from large-scale data transfer well to smaller datasets.

A comparison between intra-dataset and cross-dataset performance (MOSI→MOSI vs. MOSI→MOSEI; MOSEI→MOSEI vs. MOSEI→MOSI) reveals that generalizing from the smaller dataset to the larger one incurs a larger drop, while training on the larger dataset yields better transfer. This indicates that large-scale training improves robustness to distribution shift and enhances cross-dataset generalization.
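The two ways of reading the transfer gap can be made explicit with the ACC2 values from Table 12. The helper names below are illustrative: one measures the drop against the in-domain result of the training dataset (the comparison used in the text), the other against a model trained in-domain on the test dataset.

```python
# ACC2 (%) from Table 12, keyed by (training set, test set)
ACC2 = {
    ("MOSI", "MOSI"): 89.78,
    ("MOSI", "MOSEI"): 82.63,
    ("MOSEI", "MOSEI"): 87.72,
    ("MOSEI", "MOSI"): 86.43,
}


def drop_vs_source(train: str, test: str) -> float:
    """Drop relative to evaluating the same model on its own training domain."""
    return round(ACC2[(train, train)] - ACC2[(train, test)], 2)


def drop_vs_target(train: str, test: str) -> float:
    """Drop relative to a model trained in-domain on the test dataset."""
    return round(ACC2[(test, test)] - ACC2[(train, test)], 2)
```

Under either reading the MOSI → MOSEI direction loses more (7.15 and 5.09 points) than MOSEI → MOSI (1.29 and 3.35 points), which supports the observation that training on the larger dataset transfers better.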

Overall, these experiments indicate that the full CrossSent model (with GMRA, MPR, and EIOI) retains competitive accuracy under cross-dataset evaluation, highlighting its potential in real-world multi-modal sentiment prediction.

4.7. Visualization and Comparative Analysis

t-SNE analysis (CMU-MOSI, 686 test clips): We visualize the fused tri-modal embeddings of two variants, BackBone (the full model) and w/o MPR, using t-SNE (Figure 3 and Figure 4).

BackBone: Clear sentiment geometry is observed: strong positives and negatives occupy opposite extremes, while neutrals cluster near the center.

w/o MPR: Clusters become blurrier with heavier overlap near the center, and the sentiment gradient weakens, indicating reduced ordinal separability.

4.8. Attention Heatmap Analysis

Attention heatmap analysis (CMU-MOSI): We compare the averaged self-attention maps (layer 12) of the BackBone (full model) and w/o MPR variants (Figure 5 and Figure 6).

BackBone: The full model exhibits more focused attention patterns and clearer contextual alignment.

w/o MPR: The attention map becomes more diffuse, suggesting weaker ordinal awareness and reduced interpretability.

5. Conclusions

This paper proposes CrossSent, a multi-modal sentiment analysis framework for fine-grained continuous sentiment prediction. CrossSent integrates three complementary components: GMRA injects acoustic–visual cues into a textual backbone through token-aligned cross-modal attention with gated residual updates to improve robustness under modality asynchrony and quality variation; MPR introduces ordinal-consistent pairwise constraints to enhance separability across sentiment levels and reduce level skipping; and EIOI adopts a tolerance-band objective that avoids over-penalizing minor annotation deviations, stabilizing optimization under noisy supervision. Experiments on two English benchmarks (CMU-MOSI and CMU-MOSEI) and one Chinese benchmark (CH-SIMS) demonstrate consistent improvements on both regression- and classification-oriented metrics, and ablation/visualization analyses further verify the complementary contributions of the three components. Moreover, an efficiency study shows that enabling GMRA introduces only modest overhead in parameters, latency, throughput, and memory footprint under a standard GPU setting, supporting the practical feasibility of the proposed design. The ordinal-consistency evaluation provides additional evidence of improved level-wise agreement.

In practical applications, CrossSent is well-suited for sentiment assessment on user-generated multi-modal content, where text often provides a reliable semantic anchor while acoustic and visual signals can be noisy or intermittently informative. The token-wise gated injection mechanism allows the model to selectively exploit non-text cues without amplifying unreliable segments, and the ordinal-consistent learning objectives yield more stable sentiment scores for downstream tasks such as ranking, trend monitoring, and risk-sensitive decision making.

Although CrossSent achieves strong performance, several limitations remain and merit further investigation:

(1): Our experiments follow the widely adopted pre-extracted feature setting for audio and vision; thus, the reported gains mainly reflect improved fusion and ordinal-consistent learning under fixed A/V representations rather than end-to-end optimization of acoustic/visual encoders, which may affect deployment-oriented conclusions.
(2): Results are reported under a controlled single-run protocol with a fixed random seed to ensure reproducibility; however, this does not replace multi-seed evaluation with statistical significance testing or confidence intervals, which would better quantify the robustness of the observed gains.
(3): Efficiency measurements are conducted with a fixed sequence length ($T = 50$) and a single GPU type; inference characteristics may vary across different hardware platforms, batch sizes, and deployment configurations.

In future work, we will extend CrossSent to end-to-end training with learnable acoustic/visual encoders, conduct multi-seed evaluation with statistical testing, and perform broader deployment-oriented benchmarking across sequence lengths and devices.

Author Contributions

Conceptualization, K.Q.; methodology, J.L.; software, J.L.; validation, Z.L. and F.Y.; formal analysis, J.L.; investigation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, K.Q. and W.Z.; supervision, K.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets used in this study are publicly available. CMU-MOSI and CMU-MOSEI can be obtained from Carnegie Mellon University, and CH-SIMS is available from the original authors upon request.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI) to assist with language polishing, grammar checking, and improving the clarity and readability of the manuscript. The tool was not used to generate scientific content, data, results, or interpretations. All content was critically reviewed, revised, and approved by the authors, who take full responsibility for the originality, accuracy, and integrity of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lai, S.; Xu, H.; Hu, X.; Ren, Z.; Liu, Z. Multimodal Sentiment Analysis: A Survey. arXiv 2023, arXiv:2305.07611.
2. Oprea, S.-V.; Bâra, A. Extracting Emotions from Customer Reviews Using Text Mining, Large Language Models and Fine-Tuning Strategies. J. Theor. Appl. Electron. Commer. Res. 2025, 20, 221.
3. Jayanthi, S.; Arumugam, S.S. Multimodal Sentiment Analysis Integrating Text, Audio, and Video for Emotion Detection. In Proceedings of the 2024 International Conference on Sustainable Communication Networks and Application (ICSCNA), Theni, India, 11–13 December 2024; pp. 1736–1741.
4. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114.
5. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient Low-rank Multimodal Fusion with Modality-specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2247–2256.
6. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P. MAG-BERT: Injecting multimodal information in the BERT structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 2359–2369.
7. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 6558–6569.
8. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131.
9. Cao, W.; Mirjalili, V.; Raschka, S. Rank consistent ordinal regression for neural networks. Pattern Recognit. Lett. 2020, 140, 325–331.
10. Xie, Z.; Zhang, H.; Ye, M.; Sun, K. Trustworthy multimodal fusion for sentiment analysis in ordinal sentiment space. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7657–7670.
11. Wang, D.; Mao, Z.; Wang, Z.; Chen, G. Cross Modal Enhancement Network for Multimodal Sentiment Analysis. IEEE Trans. Multimed. 2023, 25, 4213–4224.
12. Han, W.; Chen, H.; Gelbukh, A.; Zadeh, A.; Morency, L.P.; Poria, S. Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis. In Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI), Montréal, QC, Canada, 18–22 October 2021; pp. 6–15.
13. Zhang, X.; Zhao, T.; Wu, F. Dynamic Cross-Modal Network for Multimodal Sentiment Analysis. IEEE Trans. Multimed. 2021, 23, 4496–4507.
14. Yu, W.; Xu, H.; Ma, Y.; Wu, J.; Zou, J.; Zhang, W.; Yang, K. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 1–6 August 2021; pp. 6359–6370.
15. Lin, Z.; Liang, B.; Long, Y.; Dang, Y.; Yang, M.; Zhang, M.; Xu, R. Modeling Intra- and Inter-Modal Relations: Hierarchical Graph Contrastive Learning for Multimodal Sentiment Analysis. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea, 12–17 October 2022; Available online: https://aclanthology.org/2022.coling-1.622/ (accessed on 24 February 2026).
16. Kim, K.; Park, S. AOBERT: All-modalities-in-One BERT for Multimodal Sentiment Analysis. Inf. Fusion 2023, 92, 37–45.
17. Wu, Z.; Gong, Z.; Koo, J.; Hirschberg, J. Multimodal Multi-loss Fusion Network for Sentiment Analysis. arXiv 2023, arXiv:2308.00264.
18. Liu, Y.; Zhang, H.; Zhao, H. Multimodal Ordinal Regression Network for Fine-grained Sentiment Prediction. Expert Syst. Appl. 2024, 269, 126274.
19. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259.
20. Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246.
21. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 3718–3727.
22. Zhang, H.; Wang, W.; Yu, T. Bilateral Branch Fusion Network for Multimodal Sentiment Analysis. Knowl.-Based Syst. 2022, 245, 108632.
23. Sun, H.; Wang, H.; Liu, J.; Chen, Y.W.; Lin, L. CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), Lisboa, Portugal, 10–14 October 2022; pp. 4357–4365.
24. Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis. Pattern Recognit. 2023, 136, 109259.
25. Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9180–9192.
26. Yi, G.; Fan, C.; Zhu, K.; Lv, Z.; Liang, S.; Wen, Z.; Pei, G.; Li, T.; Tao, J. VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis. Knowl.-Based Syst. 2024, 283, 111136.
27. Sun, K.; Xie, Z.; Ye, M.; Zhang, H. Contextual Augmented Global Contrast for Multimodal Intent Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 26963–26973.
28. Li, X.; Zhang, H.; Dong, Z.; Cheng, X.; Liu, Y.; Zhang, X. Learning Fine-grained Representation with Token-level Alignment for Multimodal Sentiment Analysis. Expert Syst. Appl. 2024, 269, 126274.
29. Wang, R.; Yang, Q.; Xu, Q.; Zhang, X.; Zhang, Z.; Niu, J. Transformer-based Correlation Mining Network with Self-supervised Label Generation for Multimodal Sentiment Analysis. Neurocomputing 2025, 618, 129163.
30. Zhu, L.; Zhao, H.; Zhu, Z.; Zhang, C.; Kong, X. Multimodal Sentiment Analysis with Unimodal Label Generation and Modality Decomposition. Inf. Fusion 2025, 116, 102787.

Figure 1. Framework overview of CrossSent. The model integrates cross-modal attention via GMRA and employs MPR and EIOI as training regularizers.

Figure 2. Structure of the Gated Multi-modal Residual Adapter (GMRA).

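Figure 3. t-SNE visualization of fused tri-modal embeddings on the CMU-MOSI test set: BackBone (full model).

Figure 4. t-SNE visualization of fused tri-modal embeddings on the CMU-MOSI test set: w/o MPR.

Figure 5. Averaged self-attention heatmap (layer 12) on CMU-MOSI: BackBone (full model).

Figure 6. Averaged self-attention heatmap (layer 12) on CMU-MOSI: w/o MPR.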

Table 1. Statistics of the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets.

Dataset	Train	Valid	Test	Total
MOSI	1284	229	686	2199
MOSEI	16,326	1871	4659	22,856
CH-SIMS	1368	456	457	2281

Table 2. Dataset-specific input dimensions and tuned hyperparameters. $D_{a}$/$D_{v}$ denote acoustic/visual input feature dimensions (after preprocessing), and d denotes the text hidden size (TEXT_DIM), which is also the shared fusion space into which A/V features are projected via GMRA.

Dataset	$D_{a}$	$D_{v}$	d	Batch	Epochs	LR	Dropout	$(λ_{MPR}, λ_{EIOI})$
CMU-MOSI	74	27	768	8	26	$7.5 \times 10^{- 6}$	0.28	(0.50, 0.06)
CMU-MOSEI	74	35	768	8	26	$9.0 \times 10^{- 6}$	0.28	(0.40, 0.05)
CH-SIMS	33	39	768	16	35	$1.25 \times 10^{- 5}$	0.28	(0.70, 0.07)

Table 3. Core reproducibility protocol and key model settings shared across datasets.

Item	Value
Hardware	single NVIDIA A100 GPU
Max length T	50
MOSI/MOSEI truncation	truncate subwords to $T - 3$ before adding special tokens
CH-SIMS buffer layout	write A/V to indices $1 : 1 + L_{tok}$ under `[CLS] + content + padding`
GMRA injection	injected once at $l^{★} = 1$ (0-based)
GMRA attention heads	8
GMRA dropout	0.2
Adapter bottleneck/LayerScale	48/ $1 \times 10^{- 4}$
Optimizer/ $ϵ$ /wd	AdamW/ $1 \times 10^{- 8}$ /0
Warmup schedule	linear warmup; proportion = 0.4 of total steps
Grad accumulation/clip	1/max norm = 2
Ramp-up	cosine ramp over epochs
MPR internal params	margin = 0.30, min_gap = 0.30, pow_gap = 1.0, max_pairs = 8192
EIOI start epoch	0
Seed	6758
Model selection	best checkpoint by validation MSE

Table 4. Inference efficiency comparison between the full model and the baseline variant with GMRA disabled (RoBERTa backbone, $T = 50$, single NVIDIA A100 GPU).

Setting	#Params (M)	Latency (ms/Sample, b = 1)	Throughput (Samples/s, b = 64)	Peak Mem (MB)
Baseline (GMRA disabled)	125.248	12.043	1489.191	702.60
Full (GMRA enabled)	134.138	13.418	1392.969	711.97
Overhead	+8.89 (+7.10%)	+11.42%	−6.46%	+9.37 (+1.33%)

Table 5. Performance comparison on CMU-MOSI dataset.

Model	ACC2	F1	ACC7	MAE	Corr
MulT	84.10	83.90	–	0.861	0.711
MISA	82.10	82.03	–	0.804	0.764
ConFEDE	85.52	85.52	42.2	0.742	0.784
ALMT	85.82	85.86	46.7	0.705	0.795
Self-MM	85.98	85.95	–	0.713	0.798
MIMM	86.06	85.98	46.6	0.700	0.800
VLP2MSA	86.28	86.26	–	0.696	0.813
CAGC	85.70	85.60	44.8	0.775	0.774
FMFN	86.00	86.10	–	0.728	0.792
CMLG	86.43	86.42	–	0.706	0.798
ULMD	85.82	85.71	47.8	0.700	0.799
CrossSent (ours)	89.78	89.75	52.1	0.563	0.878

Table 6. Performance comparison on CMU-MOSEI dataset.

Model	ACC2	F1	ACC7	MAE	Corr
MulT	82.5	82.3	–	0.580	0.713
MISA	84.23	83.97	–	0.568	0.717
Self-MM	85.17	85.30	–	0.530	0.765
ConFEDE	85.82	85.83	54.8	0.522	0.780
VLP2MSA	85.97	85.89	–	0.535	0.770
MIMM	85.97	85.94	54.2	0.526	0.772
ALMT	85.99	86.05	53.6	0.530	0.774
FMFN	86.00	86.10	–	0.535	0.772
CMLG	85.75	85.54	–	0.547	0.758
ULMD	85.95	85.91	53.8	0.531	0.770
CrossSent (ours)	87.72	87.71	54.7	0.513	0.805

Table 7. Performance comparison on CH-SIMS dataset.

Model	ACC5	ACC3	ACC2	F1	MAE	Corr
MulT	37.94	64.77	78.56	79.66	0.453	0.564
BBFN	40.92	61.05	78.12	77.88	0.430	0.564
Self-MM	41.53	65.47	80.04	80.44	0.425	0.595
CubeMLP	41.79	65.86	77.68	77.59	0.419	0.593
TETFN	41.79	63.24	81.18	80.24	0.420	0.577
ALMT	43.11	65.86	78.77	78.71	0.408	0.594
FMFN	–	–	80.70	80.70	0.416	0.598
CMLG	–	–	80.96	80.94	0.415	0.581
CrossSent (ours)	43.54	62.36	80.41	80.06	0.408	0.622

Table 8. Ordinal-specific evaluation of CrossSent on three benchmarks. For MOSI/MOSEI, QWK is computed on 7-level discretization. For CH-SIMS, we report QWK on 5-level discretization and additionally provide a 3-level variant for reference. Jump₂₊ denotes the ratio of samples with an absolute level difference $\geq 2$; MeanAbsJump is the mean absolute level difference.

Dataset	QWK	QWK₃	Jump₂₊	MeanAbsJump	MeanAbsJump₃
CMU-MOSI	0.8459	–	0.0583	0.5773	–
CMU-MOSEI	0.7541	–	0.0434	0.5068	–
CH-SIMS	0.3965	0.3811	0.2560	1.0503	0.5295

Table 9. Ablation study results on CMU-MOSI dataset.

Setting	ACC7	ACC2	F1	MAE	Corr
Backbone	52.1	89.78	89.75	0.563	0.878
w/o MPR	50.7	90.09	90.06	0.584	0.868
w/o EIOI	50.7	89.48	89.41	0.576	0.868
w/o GMRA	50.1	89.32	89.25	0.592	0.864

Table 10. Ablation study results on CMU-MOSEI dataset.

Setting	ACC7	ACC2	MAE	Corr	F1
Backbone	54.7	87.72	0.513	0.804	87.71
w/o MPR	54.4	87.45	0.512	0.805	87.46
w/o EIOI	54.4	87.23	0.512	0.804	87.24
w/o GMRA	53.7	86.90	0.520	0.796	86.94

Table 11. Ablation study results on CH-SIMS dataset.

Setting	ACC5	ACC3	ACC2	F1	MAE	Corr
Backbone	43.54	62.36	80.41	80.06	0.408	0.622
w/o MPR	43.42	62.36	78.86	78.55	0.418	0.601
w/o EIOI	43.54	61.92	79.38	79.44	0.416	0.615
w/o GMRA	43.32	61.26	79.89	79.64	0.415	0.606

Table 12. Zero-shot generalization results on English datasets.

Training → Test	ACC2	ACC7	Corr (×100)	MAE	F1 (fraction)
MOSI → MOSI	89.78	52.18	87.81	0.563	0.897
MOSI → MOSEI	82.63	46.31	71.36	0.646	0.825
MOSEI → MOSEI	87.72	54.77	80.49	0.513	0.877
MOSEI → MOSI	86.43	49.85	84.51	0.621	0.864

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

