1. Introduction
With the rapid growth of social media, short-video, and live-streaming platforms, people increasingly express emotions and attitudes through a mixture of text, audio, and visual signals. This trend has fueled the development of multi-modal sentiment analysis (MSA), which integrates complementary cues across modalities to improve robustness and accuracy over unimodal approaches, and it has shown strong potential in applications such as public opinion monitoring, market intelligence, online education, and mental health screening [1]. Recent work also highlights the growing practical demand for robust emotion/sentiment understanding from user-generated content and multi-modal signals [2,3].
Despite significant progress, three challenges remain prominent in current MSA research: (i) limited depth of cross-modal interaction, (ii) rigid or static fusion strategies, and (iii) insufficient modeling of ordinal sentiment levels. Early fusion architectures such as Tensor Fusion Networks (TFNs) [4] and Low-rank Multi-modal Fusion (LMF) [5] capture higher-order inter-modal relations but often suffer from feature redundancy, high computational overhead, and static fusion rules. Dynamic gating has also been explored: MAG-BERT [6] reduces the influence of noisy modalities via gated modulation, yet its fusion depth is relatively shallow and the model struggles to capture fine-grained sentiment distinctions.
For cross-modal interaction, Multi-modal Transformer (MulT) [7] pioneered Transformer-based modeling for unaligned sequences and achieved notable gains, but its interaction pathways are fixed and may be sensitive to modality asynchrony or quality fluctuations. Representation-decoupling methods such as MISA [8] separate modality-invariant and modality-specific features to enhance generalization, yet their fusion remains largely static and may underutilize local cross-modal structures. A growing body of work continues to refine fusion and interaction (e.g., contrastive or enhancement-based designs and graph/attention variants), but the aforementioned limitations persist in realistic, noisy settings.
A second thread concerns ordinal sentiment modeling. Sentiment intensity is naturally ordered; however, many MSA systems overlook monotonic constraints among levels, leading to non-monotone predictions or level skipping. Rank-consistent ordinal regression (CORAL) [9] improves continuity by imposing cumulative binary constraints, but it is designed for unimodal settings and does not address challenges raised by multi-modal fusion. More recently, trustworthy multi-modal fusion in the ordinal space has been explored to model uncertainty [10], yet its fusion policy remains static and thus less adaptive to modality quality shifts and temporal asynchrony.
This paper proposes CrossSent, a multi-modal sentiment analysis framework for fine-grained sentiment prediction. The objective is to improve robustness under modality asynchrony and noise, and the study investigates how to (i) adaptively inject acoustic and visual cues into a textual backbone, (ii) enforce ordinal-consistent learning to enhance discrimination and reduce level skipping, and (iii) adopt tolerance-aware optimization to mitigate minor annotation uncertainty. Concretely, CrossSent introduces three components:
The Gated Multi-modal Residual Adapter (GMRA) performs dynamic cross-modal fusion by injecting visual and acoustic cues into textual representations through cross-modal attention and gated residual connections. The gating adaptively controls information flow, mitigating the effects of modality asynchrony and noise.
Monotonic Pairwise Ranking (MPR) encodes pairwise ordering constraints between samples according to their sentiment levels. By enforcing consistent pairwise relations, it enhances fine-grained discrimination and mitigates level skipping.
Error-Interval Ordinal Inconsistency (EIOI) loss defines a tolerance interval for ordinal predictions, penalizing only deviations that violate ordinal consistency beyond an acceptable margin. This improves robustness to label uncertainty and enhances overall model stability.
We conduct comprehensive experiments on three standard benchmarks: CMU-MOSI, CMU-MOSEI, and the Chinese dataset CH-SIMS. CrossSent achieves consistent improvements on binary accuracy, multi-level (e.g., seven-class) accuracy, and mean absolute error (MAE), outperforming competitive baselines across datasets. Ablation studies further validate the individual effectiveness of GMRA, MPR, and EIOI as well as their complementary benefits when combined, demonstrating the practicality and robustness of CrossSent for complex, in-the-wild multi-modal sentiment analysis.
3. Methodology
3.1. Framework Overview
As shown in Figure 1, CrossSent takes aligned text, audio, and visual features as input and injects non-text cues into the textual backbone via a Gated Multi-modal Residual Adapter (GMRA) for multi-modal fusion. In particular, GMRA serves as a lightweight plug-in module that enables cross-modal interaction while preserving the strong textual representations learned by the pretrained encoder. The fused representation is fed into a regression head to predict a continuous sentiment score. During training, we optimize the regression loss together with two ordinal-aware regularizers, MPR and EIOI, to encourage ordinal consistency and robustness to annotation noise. Overall, this design couples effective multi-modal fusion with ordinal-consistent learning, leading to more reliable fine-grained sentiment regression under realistic noisy conditions. Importantly, each component targets a distinct failure mode: GMRA addresses cross-modal asynchrony and modality-quality fluctuation via controllable cue injection, MPR enforces ordinal monotonicity through relative-order constraints, and EIOI improves robustness to small annotation discrepancies through an explicit tolerance band.
3.2. Gated Multi-Modal Residual Adapter (GMRA)
One major challenge in multi-modal sentiment analysis (MSA) lies in handling inter-modal asynchrony and dynamic variation of modality quality. Traditional fusion strategies are mostly static (e.g., simple concatenation or linear combination), which makes it difficult to capture fine-grained, dynamic interactions among modalities and to adapt the fusion strength when modality quality changes. Although Transformer-based cross-modal attention (e.g., MulT [7]) enables interaction across modalities, its interaction depth and adaptability remain insufficient for complex, noisy scenarios.
To address these issues, we propose a Gated Multi-modal Residual Adapter (GMRA) that achieves fine-grained token-level cross-modal injection and dynamic control of fusion strength, thereby improving robustness and generalization. As illustrated in Figure 2, GMRA performs cross-modal attention to inject visual and acoustic cues into text and then applies a gating mechanism to adaptively regulate the contribution of each non-text modality. Finally, a lightweight residual adapter refines the fused representation while preserving the original textual backbone, enabling stable optimization under modality asynchrony and quality fluctuations. The motivation is that non-text cues are often reliable only at certain tokens and unreliable at others (e.g., silent, occluded, or misaligned segments); hence, they should be injected selectively rather than mixed in uniformly. GMRA realizes this selectivity via token-level injected signals and fine-grained gates.
3.2.1. Modality Feature Embedding and Preprocessing
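We consider three modalities: the textual feature sequence $X_t \in \mathbb{R}^{L \times d_t}$, the acoustic feature sequence $X_a \in \mathbb{R}^{L \times d_a}$, and the visual feature sequence $X_v \in \mathbb{R}^{L \times d_v}$. Here, $L$ is the sequence length, and $d_t$, $d_a$, and $d_v$ are the feature dimensions of the text, audio, and vision modalities, respectively. To unify the dimensionality across modalities, we project all features into a common $d$-dimensional space via modality-specific linear mappings:

$$H_m = X_m W_m, \quad m \in \{t, a, v\},$$

where $W_t \in \mathbb{R}^{d_t \times d}$, $W_a \in \mathbb{R}^{d_a \times d}$, and $W_v \in \mathbb{R}^{d_v \times d}$ are trainable modality-specific projection parameters.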
3.2.2. Cross-Modal Multi-Head Attention Mechanism
To achieve deeper and more effective feature interaction among modalities, we adopt a cross-modal multi-head attention mechanism. Specifically, the textual modality features are used as the Query, while the concatenated acoustic and visual modality features serve as the Key and Value. Formally, the attention inputs are defined as

$$Q = H_t W_Q, \quad K = [H_a; H_v]\, W_K, \quad V = [H_a; H_v]\, W_V,$$

where $H_t$, $H_a$, and $H_v$ are the projected text, acoustic, and visual features, and $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are trainable parameters of the attention mechanism.

The cross-modal multi-head attention is computed as

$$H_{\text{att}} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O,$$

where the $i$-th attention head is defined as

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$

and $Q_i$, $K_i$, and $V_i$ denote the query, key, and value matrices for the $i$-th head, respectively. Here, $d_k$ is the dimensionality of each attention head, $h$ is the number of attention heads, and $W_O$ is the output projection matrix.

Through this cross-modal attention mechanism, fine-grained feature interaction between modalities is achieved. This enables the model to learn enhanced cross-modal representations that capture contextual dependencies between textual, acoustic, and visual streams. Notably, since the attention output $H_{\text{att}}$ is computed with text tokens as Query, it forms an injected cue signal that is explicitly aligned to the textual timeline, which is crucial under inter-modal asynchrony.
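The attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the toy shapes, the random weights, and the choice to concatenate audio and visual tokens along the sequence axis are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(h_t, h_a, h_v, w_q, w_k, w_v, w_o, num_heads):
    """Text tokens (Query) attend over audio and visual tokens (Key/Value)."""
    L, d = h_t.shape
    d_k = d // num_heads
    # Assumption: audio and visual tokens are stacked along the sequence axis.
    h_av = np.concatenate([h_a, h_v], axis=0)            # (2L, d)
    q, k, v = h_t @ w_q, h_av @ w_k, h_av @ w_v

    def split(x):
        # (seq, d) -> (num_heads, seq, d_k)
        return x.reshape(x.shape[0], num_heads, d_k).transpose(1, 0, 2)

    q_h, k_h, v_h = split(q), split(k), split(v)
    scores = q_h @ k_h.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, L, 2L)
    weights = softmax(scores, axis=-1)
    heads = weights @ v_h                                  # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d)        # merge heads back to (L, d)
    return concat @ w_o, weights

# Toy inputs (hypothetical sizes): 6 tokens, hidden size 8, 2 heads.
rng = np.random.default_rng(0)
L, d, h = 6, 8, 2
h_t, h_a, h_v = (rng.normal(size=(L, d)) for _ in range(3))
w_q, w_k, w_v, w_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

h_att, attn = cross_modal_attention(h_t, h_a, h_v, w_q, w_k, w_v, w_o, h)
print(h_att.shape, attn.shape)  # (6, 8) (2, 6, 12)
```

Because the Query comes from the text stream, the output has one row per text token regardless of how many audio or visual positions are attended over, which is what keeps the injected signal aligned to the textual timeline.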
3.2.3. Gated Fusion and Residual Connection
Although the above cross-modal attention mechanism enables the initial fusion of multi-modal information, it does not explicitly consider the quality differences among modalities, which may introduce noise. Therefore, we propose a gating fusion mechanism to dynamically control the fusion strength across modalities. The gate $g$ is produced by a learned affine transformation with trainable parameters $W_g$ and $b_g$, followed by the sigmoid activation function $\sigma(\cdot)$. The gating matrix $g$ dynamically adjusts the fusion strength between modalities.

The final fused representation is computed as

$$H_{\text{fused}} = H_t + g \odot H_{\text{att}},$$

where $H_t$ denotes the projected text features, $H_{\text{att}}$ the cross-modal attention output, and ⊙ element-wise multiplication. This gating design enables the fused representation to adaptively control the amount of injected cross-modal information and mitigate noise amplification caused by modality quality variations or asynchrony. In particular, the token-wise, element-wise gate $g$ allows the model to suppress unreliable injected cues at specific tokens/features (e.g., silent or occluded segments), rather than applying a uniform fusion strength.
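A minimal NumPy sketch of the gated residual injection follows. The gate input (the text features concatenated with the injected signal) is an assumption, since the text only states that the gate is a sigmoid with trainable parameters; all names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(h_t, h_att, w_g, b_g):
    """Return h_t + g * h_att, where g is a token-wise, element-wise gate in (0, 1)."""
    # Assumption: the gate sees both the text features and the injected signal.
    gate_in = np.concatenate([h_t, h_att], axis=-1)   # (L, 2d)
    g = sigmoid(gate_in @ w_g + b_g)                   # (L, d)
    return h_t + g * h_att, g

rng = np.random.default_rng(0)
L, d = 6, 8
h_t = rng.normal(size=(L, d))       # text features
h_att = rng.normal(size=(L, d))     # injected cross-modal signal
w_g = rng.normal(size=(2 * d, d)) * 0.1
b_g = np.zeros(d)

fused, g = gated_residual_fusion(h_t, h_att, w_g, b_g)

# A strongly negative bias closes the gate, so the output falls back to the text features.
closed, g_closed = gated_residual_fusion(h_t, h_att, w_g, np.full(d, -50.0))
```

The closed-gate case shows why the residual form is convenient: when the non-text cue is judged unreliable, the text representation passes through essentially unchanged.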
GMRA is closely related to prior multi-modal fusion paradigms, including cross-modal attention for interaction (MulT [7]) and gating-based multi-modal injection into a language backbone (MAG-BERT [6]), as well as early static tensor/low-rank fusion (TFN [4], LMF [5]). However, GMRA is designed as a lightweight residual-adapter plug-in for targeted cue injection into the textual stream: text tokens serve as Query, while audio–visual cues are jointly used as Key/Value to produce an injected signal aligned to the textual timeline. Different from shallow modulation or coarse modality weighting, GMRA further applies token-wise, element-wise gating on the injected cross-modal signal before residual addition, which explicitly controls the amount of non-text information added to each textual token under modality asynchrony and quality fluctuation while preserving the pretrained text representation as the primary carrier. In terms of positioning, GMRA differs from MAG-style gated addition that typically relies on coarser modulation of text features using pooled multi-modal cues: GMRA first constructs an attention-based injected signal $H_{\text{att}}$ aligned to each text token and then regulates it via a fine-grained gate $g$ before residual addition. Compared with MulT-style deep cross-modal transformers with stacked cross-attention blocks across modality streams, GMRA acts as a lightweight plug-in that injects non-text cues with minimal architectural change, which is particularly suitable for stable fine-tuning on noisy sentiment data.
3.3. Monotonic Pairwise Ranking Regularization (MPR)
In multi-modal sentiment analysis (MSA), sentiment-level prediction is essentially an ordinal regression problem where sentiment levels exhibit clear monotonic relationships. However, existing methods often simplify this task into general regression or classification, leading to discontinuous or inconsistent predictions across adjacent sentiment levels. To address this issue, we propose a Monotonic Pairwise Ranking (MPR) mechanism, which explicitly enforces ordinal ranking constraints among samples to maintain monotonic consistency and enhance fine-grained prediction.
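Given a batch of $B$ samples, the ground-truth sentiment labels are

$$\mathbf{y} = [y_1, y_2, \ldots, y_B],$$

and the model’s predicted sentiment scores are

$$\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_B].$$

To construct effective ranking pairs, we define the pair set:

$$\mathcal{P} = \{(i, j) \mid |y_i - y_j| \geq \delta\},$$

where $\delta$ is a threshold controlling whether the label difference is significant enough to form a ranking pair.

For each $(i, j) \in \mathcal{P}$, the MPR objective encourages consistency between the ordering of predictions and ground-truth labels. The pairwise ranking loss is defined as

$$\mathcal{L}_{\text{MPR}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \max\bigl(0,\; m_{ij} - \mathrm{sign}(y_i - y_j)\,(\hat{y}_i - \hat{y}_j)\bigr),$$

where $\mathrm{sign}(\cdot)$ denotes the sign function:

$$\mathrm{sign}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0. \end{cases}$$

The adaptive margin $m_{ij}$ varies with the ordinal difference $|y_i - y_j|$ and is governed by two hyperparameters that control the margin magnitude and sensitivity.

Intuitively, if two samples have a large label gap, the model is required to preserve a proportional margin in predictions. When the predicted difference fails to satisfy this margin, the pair contributes a positive penalty; otherwise, the loss is zero.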
By explicitly modeling ordinal consistency, the MPR mechanism effectively reduces discontinuity and reversals in sentiment-level prediction. In practice, MPR is optimized jointly with regression-based losses (e.g., MSE and EIOI) to enhance both prediction accuracy and ordinal robustness in fine-grained sentiment estimation.
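The pairwise objective can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the paper's code: the margin form `alpha * gap ** beta`, the averaging over pairs, and the default values of `delta`, `alpha`, and `beta` are all hypothetical choices.

```python
def mpr_loss(y_true, y_pred, delta=1.0, alpha=0.1, beta=1.0):
    """Hinge-style pairwise ranking loss over pairs whose label gap is at least delta."""
    total, n_pairs = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(y_true[i] - y_true[j])
            if gap < delta or gap == 0:
                continue                      # near-tied pair: no ordering is enforced
            sign = 1.0 if y_true[i] > y_true[j] else -1.0
            margin = alpha * gap ** beta      # assumed margin form; grows with the label gap
            total += max(0.0, margin - sign * (y_pred[i] - y_pred[j]))
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

labels = [-2.0, 0.0, 2.0]
well_ordered = mpr_loss(labels, [-1.5, 0.1, 1.8])   # ordering and margins satisfied
reversed_order = mpr_loss(labels, [1.0, 0.0, -1.0])  # every pair is inverted
print(well_ordered, reversed_order)
```

Pairs that are correctly ordered with enough separation contribute nothing, so the gradient signal comes only from reversals and from pairs that are ordered correctly but too close together.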
MPR is related to ordinal/robust learning objectives that improve monotonicity and stability (CORAL [9] and ordinal-space trustworthy fusion [10]), but it is tailored to fine-grained multi-modal sentiment regression by explicitly modeling relative ordering among samples. Specifically, we construct ranking pairs using an ordinal-gap threshold ($\delta$) to focus supervision on meaningful label differences and to reduce the influence of minor annotation noise. Moreover, MPR adopts an ordinal-gap-aware adaptive margin $m_{ij}$ that scales with label distance, rather than using a fixed margin, so that larger ordinal gaps are encouraged to have proportionally larger prediction separation, directly suppressing level-skipping and local order reversals. Compared with standard fixed-margin ranking objectives, MPR makes two noise-aware choices for continuous sentiment annotation: (i) it filters supervision using $\delta$ to avoid enforcing potentially unreliable orderings for near-tied samples, and (ii) it scales the margin with ordinal distance so that large gaps are encouraged to produce proportionally larger prediction separations, directly penalizing level-skipping that may persist under a constant margin.
3.4. Error-Interval Ordinal Inconsistency Loss (EIOI)
In multi-modal sentiment analysis, annotation labels are often subjective and noisy. Different annotators may assign slightly different sentiment intensities for similar expressions, and even the same annotator can introduce small variations. To handle such uncertainty, we propose an Error-Interval Ordinal Inconsistency (EIOI) loss that introduces a tolerance interval to avoid penalizing minor deviations and improves prediction stability and generalization.
Formally, let the model-predicted sentiment intensities be

$$\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_B],$$

and the corresponding ground-truth sentiment labels be

$$\mathbf{y} = [y_1, y_2, \ldots, y_B].$$

For each sample $i$, we define a tolerance range $[y_i - \epsilon,\; y_i + \epsilon]$ with a hyperparameter $\epsilon$ controlling acceptable deviation. The EIOI loss penalizes only the excess deviation $\max(0,\; |\hat{y}_i - y_i| - \epsilon)$, aggregated over the batch.

When $|\hat{y}_i - y_i| \leq \epsilon$, the prediction is considered acceptable and the loss is zero; otherwise, only the excess deviation contributes to the loss, allowing smooth penalization beyond the tolerance band. This encourages the model to focus on significant prediction errors rather than label noise.

The value of $\epsilon$ can be tuned according to the dataset granularity. For coarse-grained tasks (e.g., 5- or 7-level scales), a larger $\epsilon$ is recommended, while for fine-grained continuous regression, a smaller $\epsilon$ ensures balanced precision and stability.
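The tolerance-band idea can be sketched as follows. This is a hedged illustration only: the linear penalty on the excess, the batch mean, and the default tolerance `eps=0.25` are assumptions, not values taken from the paper.

```python
def eioi_loss(y_true, y_pred, eps=0.25):
    """Mean excess absolute error beyond the tolerance band [y - eps, y + eps]."""
    # Assumption: the excess deviation is penalized linearly and averaged over the batch.
    excess = [max(0.0, abs(p - t) - eps) for t, p in zip(y_true, y_pred)]
    return sum(excess) / len(excess)

y_true = [1.0, -2.0, 0.5]
y_pred = [1.2, -1.0, 0.5]   # errors: 0.2 (inside the band), 1.0 (outside), 0.0
loss = eioi_loss(y_true, y_pred)
print(loss)  # 0.25
```

Only the second sample contributes (an excess of 0.75 spread over three samples), which shows how small disagreements within the band are ignored entirely.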
Finally, the total loss combines the regression objective with the ordinal-aware regularizers, where a weighting coefficient $\lambda$ controls the contribution of EIOI. This joint objective improves both predictive accuracy and ordinal consistency of the model in fine-grained sentiment regression tasks.
EIOI is related to robust regression objectives that reduce sensitivity to small label perturbations, and it complements ordinal-consistent learning such as CORAL [9] and ordinal-space robustness modeling [10]. Different from standard robust losses that apply a fixed shape everywhere, EIOI introduces an explicit tolerance interval and assigns zero penalty to predictions within this band. Thus, the optimization focuses on deviations that exceed the acceptable ordinal error range, which is particularly suitable for subjective sentiment annotations where minor disagreements are common. In CrossSent, EIOI provides absolute-deviation robustness, while MPR provides relative-order consistency; together they address both tolerance-aware accuracy and monotonicity, reducing level-skipping under noisy supervision. In terms of positioning, EIOI can be viewed as a tolerance-band robust regression objective; our goal is to explicitly encode an acceptable ordinal error interval for subjective sentiment labels. Combined with MPR (relative-order constraints), EIOI (absolute-deviation tolerance) jointly reduces both noise-driven over-penalization and ranking reversals.
4. Experiments
4.1. Datasets and Evaluation Metrics
We adopt three widely used multi-modal sentiment analysis datasets: CMU-MOSI [19], CMU-MOSEI [20], and CH-SIMS [21]. CMU-MOSI and CMU-MOSEI are standard English multi-modal sentiment analysis datasets, each containing opinion video clips annotated with sentiment intensities ranging from $-3$ to $+3$. Each sample includes synchronized textual transcripts, acoustic features, and visual representations. CH-SIMS is a large-scale Chinese multi-modal sentiment analysis dataset, in which each video segment contains aligned text, audio, and visual modalities with corresponding continuous sentiment annotations. This dataset supports both regression and classification formulations of MSA.
Dataset statistics: The detailed statistics of the three datasets are summarized in Table 1.
Classification tasks: Following prior work, we report accuracy under different granularity levels, including binary accuracy (ACC2), three-class accuracy (ACC3), five-class accuracy (ACC5), and seven-class accuracy (ACC7). We also compute macro-average F1-score to evaluate overall classification balance.
Regression tasks: For continuous sentiment regression, we adopt two widely used evaluation metrics: (i) mean absolute error (MAE), which measures the average deviation between predicted and ground-truth sentiment intensities, and (ii) Pearson correlation coefficient (Corr), which evaluates the linear correlation between predictions and labels. All results are reported on the official test splits for a fair comparison with existing methods.
Ordinal-consistency metrics: Since sentiment intensity is inherently ordinal, we additionally report two ordinal-specific measures to substantiate the benefits of the proposed ordinal regularizers (MPR and EIOI). First, we compute quadratic weighted kappa (QWK), which measures agreement between predicted and ground-truth ordinal levels with quadratic penalties for larger level deviations. Second, we report level-jump statistics to quantify level-skipping behavior, including (i) Jump$_{2+}$, the ratio of samples whose absolute level difference is at least 2, and (ii) MeanAbsJump, the mean absolute level difference.
To compute these ordinal metrics, we discretize continuous scores into ordered levels following standard practice. For MOSI/MOSEI (label range $[-3, 3]$), we round predictions and labels to the 7-level set $\{-3, -2, -1, 0, 1, 2, 3\}$. For CH-SIMS (label range $[-1, 1]$), we primarily use a 5-level discretization by scaling scores by 2 and rounding to $\{-2, -1, 0, 1, 2\}$, and we also report a 3-level variant by rounding to $\{-1, 0, 1\}$ for reference. These ordinal metrics are reported on the official test splits.
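The discretization and the ordinal metrics described above can be sketched in plain Python. The function names are illustrative, the toy scores are made up, and the tie-breaking at exact .5 boundaries (Python's `round` rounds halves to the nearest even integer) is an implementation detail the text does not specify.

```python
def discretize(scores, scale=1.0, lo=-3, hi=3):
    """Scale, round to the nearest integer level, and clip to [lo, hi]."""
    # Note: Python's round() sends exact halves to the nearest even integer.
    return [int(min(hi, max(lo, round(s * scale)))) for s in scores]

def quadratic_weighted_kappa(true_lv, pred_lv, lo, hi):
    """Standard QWK: 1 - (weighted observed disagreement / weighted expected disagreement)."""
    k, n = hi - lo + 1, len(true_lv)
    obs = [[0.0] * k for _ in range(k)]
    for t, p in zip(true_lv, pred_lv):
        obs[t - lo][p - lo] += 1
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den if den else 1.0

def jump_stats(true_lv, pred_lv):
    """Return (ratio of level differences >= 2, mean absolute level difference)."""
    diffs = [abs(t - p) for t, p in zip(true_lv, pred_lv)]
    return sum(d >= 2 for d in diffs) / len(diffs), sum(diffs) / len(diffs)

# Toy MOSI-style scores in [-3, 3] (made-up values).
true_lv = discretize([-2.6, -1.2, 0.1, 1.4, 2.8])   # [-3, -1, 0, 1, 3]
pred_lv = discretize([-2.4, 0.9, 0.3, 1.6, 2.7])    # [-2, 1, 0, 2, 3]
qwk = quadratic_weighted_kappa(true_lv, pred_lv, -3, 3)
jump2, mean_jump = jump_stats(true_lv, pred_lv)

# CH-SIMS-style 5-level variant: scale by 2, then round to {-2, ..., 2}.
sims_lv = discretize([-1.0, -0.4, 0.0, 0.6, 1.0], scale=2, lo=-2, hi=2)
```

In the toy example only one of five samples differs by two or more levels, so the jump ratio is 0.2 and the mean absolute level difference is 0.8.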
4.2. Implementation Details
We evaluate CrossSent on two English datasets (CMU-MOSI and CMU-MOSEI) and one Chinese dataset (CH-SIMS) using the official train/valid/test splits. For text, we fine-tune a RoBERTa-large backbone on English datasets and a BERT-family backbone on CH-SIMS, following the model-path based backend selection in our implementation. For audio and vision, we use publicly released, pre-extracted features. Because A/V feature dimensionalities vary across datasets, we keep dataset-specific A/V dimensions and project them into the shared $d$-dimensional fusion space inside GMRA, ensuring a consistent multi-modal interaction space (Table 2).
We adopt a deterministic token-level alignment and fixed-length input construction pipeline (Table 3). For MOSI/MOSEI, words are tokenized into subwords; word-level discrete tags are expanded to subwords using token-inversion indices, and the sequence is truncated/padded to a fixed maximum length. For CH-SIMS, continuous A/V streams are resampled to the token length and written into a fixed buffer following the [CLS] + content + padding layout. A unified attention mask is used to ignore padded positions.
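The alignment steps in this paragraph can be sketched as three small helpers. They are hypothetical illustrations: the nearest-index resampling rule, the use of a padding value in the [CLS] slot of the non-text buffer, and all function names are assumptions, since the text does not give these details.

```python
def expand_to_subwords(word_feats, subword_to_word):
    """Repeat each word-level feature for every subword piece of that word."""
    return [word_feats[w] for w in subword_to_word]

def resample(stream, target_len):
    """Nearest-index resampling of a frame sequence to target_len steps (assumed rule)."""
    n = len(stream)
    return [stream[min(n - 1, (i * n) // target_len)] for i in range(target_len)]

def to_fixed_buffer(content, max_len, pad):
    """Lay out a [CLS] slot + content + padding and build the matching attention mask."""
    content = content[: max_len - 1]                  # leave room for the [CLS] slot
    n_pad = max_len - 1 - len(content)
    buf = [pad] + content + [pad] * n_pad             # assumption: [CLS] slot holds the pad value
    mask = [1] * (1 + len(content)) + [0] * n_pad     # 1 = real position, 0 = padding
    return buf, mask

# Word-level features expanded to four subword pieces (word 1 was split in two).
expanded = expand_to_subwords([0.1, 0.2, 0.3], [0, 1, 1, 2])

# Six audio frames resampled to three tokens, then placed in a length-6 buffer.
frames = resample([10, 20, 30, 40, 50, 60], 3)
buf, mask = to_fixed_buffer(frames, max_len=6, pad=0)
print(expanded, frames, buf, mask)
```

Scalars stand in for feature vectors here to keep the example short; the same indexing applies when each element is a vector.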
For baseline comparisons, we report numbers from the original papers, since many methods do not release public implementations and/or adopt different backbones and feature-extraction pipelines that are not directly comparable under a unified retraining setting.
All experiments are conducted on a single NVIDIA A100 GPU (Table 3). We train with AdamW and a linear warmup schedule, apply gradient clipping, and select the best checkpoint based on validation MSE. The overall objective is MSE regression augmented with ordinal-aware regularizers (MPR and EIOI), where the regularizer weights are activated via a cosine ramp-up over epochs (Table 3). Dataset-dependent hyperparameters are summarized in Table 2. We also performed a small sensitivity check by perturbing the GMRA injection depth and the regularizer weights around the tuned values, and observed consistent trends with the best performance attained near the reported settings. We clarify that our experiments follow the widely used pre-extracted feature setting for audio and vision; end-to-end optimization of acoustic/visual encoders and deployment-oriented pipelines are left for future work.
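One common way to realize a cosine ramp-up of regularizer weights is sketched below. The half-cosine shape, the ramp length of 5 epochs, the weight values, and the additive form of the combined objective are assumptions for illustration; the paper's exact schedule is given in its Table 3, which is not reproduced here.

```python
import math

def cosine_ramp(epoch, ramp_epochs, max_weight):
    """Rise smoothly from 0 to max_weight over ramp_epochs, then stay constant."""
    if ramp_epochs <= 0 or epoch >= ramp_epochs:
        return max_weight
    return max_weight * 0.5 * (1.0 - math.cos(math.pi * epoch / ramp_epochs))

def combined_loss(mse, mpr, eioi, epoch, ramp_epochs=5, w_mpr=0.1, w_eioi=0.1):
    """Assumed objective: MSE plus ramped MPR and EIOI terms (weights are hypothetical)."""
    return (mse
            + cosine_ramp(epoch, ramp_epochs, w_mpr) * mpr
            + cosine_ramp(epoch, ramp_epochs, w_eioi) * eioi)

schedule = [round(cosine_ramp(e, 5, 1.0), 3) for e in range(7)]
print(schedule)  # [0.0, 0.095, 0.345, 0.655, 0.905, 1.0, 1.0]
```

With this shape the regularizers are inactive at the first epoch, so early training is driven by the regression loss alone, and they reach full strength once the ramp completes.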
4.3. Computational Complexity and Inference Efficiency
We analyze the computational overhead introduced by the proposed GMRA injection in terms of parameter count and inference efficiency. As summarized in Table 3, CrossSent injects GMRA once, at a single backbone depth, with a fixed maximum token length $L$. The dominant extra cost stems from attention-style interactions inside GMRA, whose complexity is on the order of $O(L_q L_{kv} d)$ for $L_q$ query (text) positions and $L_{kv}$ key/value (audio-visual) positions (with hidden size $d$); under our token-level alignment where the non-text streams are aligned to the $L$ text tokens, this is approximately $O(L^2 d)$ per injected layer. Other components (e.g., gating and lightweight projections) add comparatively minor overhead.
Efficiency comparison: We compare the full model with a baseline variant where GMRA is disabled, keeping all other settings identical. Both variants are evaluated on the same NVIDIA A100 GPU with the same input shape. We measure forward-only inference latency with warmup = 20 and iters = 50, applying CUDA synchronization for accurate timing. We report single-sample latency (batch = 1), throughput (batch = 64), and peak GPU memory footprint.
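The measurement protocol (warmup iterations, timed iterations, synchronization before reading the clock) can be sketched with a small framework-agnostic harness. The function names and the dummy workload are hypothetical; on a GPU the `sync` argument would be a device synchronization call such as PyTorch's `torch.cuda.synchronize`, which is omitted here so the example runs on CPU.

```python
import time

def measure_latency(fn, warmup=20, iters=50, sync=None):
    """Average wall-clock seconds per call after warmup; sync() flushes pending async work."""
    for _ in range(warmup):          # warmup calls are excluded from timing
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync:
        sync()                       # make sure queued work finished before stopping the clock
    return (time.perf_counter() - start) / iters

def relative_change(base, new):
    """Percentage change of new relative to base."""
    return (new - base) / base * 100.0

calls = {"n": 0}

def dummy_forward():
    # Stand-in for a model forward pass.
    calls["n"] += 1
    return sum(range(1000))

latency = measure_latency(dummy_forward, warmup=20, iters=50)
param_overhead = relative_change(125.25, 134.14)   # parameter counts in millions, from the text
print(f"{latency * 1e3:.4f} ms/call, parameter overhead {param_overhead:.2f}%")
```

Synchronizing before both clock reads matters on GPUs because kernel launches return before the work completes; without it the timer would measure launch overhead rather than execution time.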
Results: Table 4 shows that enabling GMRA introduces 8.89M additional parameters (125.25M → 134.14M, +7.10%). The batch-1 inference latency increases from 12.04 to 13.42 ms/sample (+11.42%), while the batch-64 throughput decreases from 1489.19 to 1392.97 samples/s (−6.46%). Peak GPU memory slightly increases from 702.6 MB to 712.0 MB (+1.33%). Overall, GMRA incurs a modest inference overhead while preserving practical efficiency.
4.4. Performance Comparison
To comprehensively evaluate the proposed CrossSent model, we conduct comparisons on CMU-MOSI, CMU-MOSEI, and CH-SIMS. We benchmark our model against several state-of-the-art methods, including MulT [7], MISA [8], Self-MM [14], BBFN [22], CubeMLP [23], TETFN [24], MIMM [25], VLP2MSA [26], CAGC [27], FMFN [28], CMLG [29], and ULMD [30]. Results on CMU-MOSI, CMU-MOSEI, and CH-SIMS are shown in Table 5, Table 6, and Table 7, respectively.
For transparency, we clarify the source of the compared results. Unless otherwise specified, all baseline numbers are quoted from the original papers (reported results), as many methods do not release public code and/or use different backbones and feature-extraction pipelines that are not directly comparable under a unified retraining setting. In contrast, the results of CrossSent are obtained from our own runs using the official dataset splits and the alignment/training protocol described in Section 4.2. We treat cross-paper baselines as reference results rather than strictly controlled comparisons and therefore emphasize the performance of CrossSent under our fixed protocol.
Results on CMU-MOSI: CrossSent achieves the best performance across all key metrics. It attains 89.78% ACC2 and 89.75% F1-score, surpassing strong baselines. On the fine-grained seven-class task, CrossSent reaches 52.1%, markedly higher than CAGC (44.8%) and ULMD (47.8%). For regression evaluation, it yields the lowest MAE (0.563) and the highest Corr (0.878), indicating more precise and consistent sentiment estimation.
Results on CMU-MOSEI: On the larger-scale MOSEI dataset, CrossSent again achieves the best overall performance. It reaches 87.72% ACC2 and 87.71% F1-score and maintains competitive fine-grained accuracy (ACC7 = 54.7%). Moreover, it produces the lowest MAE (0.513) and the highest Corr (0.805), demonstrating robust generalization across modalities in more diverse scenarios.
Results on CH-SIMS: CrossSent exhibits consistent superiority on the Chinese CH-SIMS dataset. It achieves the best ACC5 (43.54%) and Corr (0.622) and attains the lowest MAE (0.408), indicating stable continuous prediction and robust cross-lingual generalization.
Overall analysis: Across both English and Chinese benchmarks, the improvements mainly come from three complementary components: GMRA for dynamic feature fusion, MPR for ordinal consistency, and EIOI for tolerance-aware regression. Together, they enhance fusion depth, adaptivity to modality quality, and fine-grained sentiment precision. Moreover, as sentiment prediction is ordinal in nature, we report ordinal-specific measures in Table 8. CrossSent achieves strong agreement under quadratic weighted kappa (QWK) and exhibits reduced level-jumping behavior, as indicated by low Jump$_{2+}$ and MeanAbsJump. These results provide direct evidence that the proposed ordinal-aware objectives (MPR and EIOI) improve ordinal consistency beyond accuracy- and regression-oriented metrics.
4.5. Ablation Study
To isolate the contribution of each core component in CrossSent (GMRA, MPR, and EIOI), we conduct ablation experiments on two English benchmarks (CMU-MOSI and CMU-MOSEI) and one Chinese benchmark (CH-SIMS). We create three variants by removing MPR, EIOI, or GMRA while keeping all other settings unchanged. We report ACC/F1/MAE/Corr as the primary ablation metrics for direct comparison with prior MSA studies; ordinal-consistency measures (e.g., QWK and level-jump statistics) are presented in the main results as supplementary characterization and are not separately ablated. The results are summarized in Tables 9–11.
4.5.1. Ablation Analysis on CMU-MOSI Dataset
We first conduct ablation experiments on the CMU-MOSI dataset. The detailed results are summarized in Table 9.
Table 9 shows that adding MPR slightly reduces coarse-grained polarity performance on MOSI (ACC2/F1: 89.78/89.75 vs. 90.09/90.06 without MPR) while consistently improving fine-grained ordinal discrimination and regression fidelity (ACC7: 52.1 vs. 50.7; MAE/Corr: 0.563/0.878 vs. 0.584/0.868). This trade-off is expected because ACC2 is obtained by thresholding a continuous sentiment score around the neutral boundary, where a small number of borderline samples can flip the binary decision without reflecting a genuine improvement in the underlying sentiment geometry. By contrast, MPR imposes ordinal-gap-aware pairwise constraints that explicitly regularize relative ordering across sentiment levels, encouraging smoother monotone separation and suppressing local order reversals; consequently, it may slightly perturb decisions near the binary threshold, yet it yields a more faithful global ordinal structure that is better captured by ACC7 and correlation-based regression metrics.
When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC7 also drops to 50.7%, MAE rises to 0.576, and Corr decreases to 0.868. This demonstrates that EIOI effectively introduces tolerance for annotation uncertainty and mitigates over-sensitivity to subtle label variations, thereby improving stability and generalization.
Finally, removing the Gated Multi-modal Residual Adapter (GMRA) causes the largest performance decline. ACC7 decreases to 50.1%, MAE increases to 0.592, and Corr drops to 0.864. This suggests that GMRA is essential for lightweight feature-level fusion via residual adapter and for adaptively balancing modality quality differences. Without GMRA, the model struggles to handle asynchronous or noisy modality inputs, resulting in weaker overall performance.
4.5.2. Ablation Analysis on CMU-MOSEI Dataset
We further perform ablation experiments on the CMU-MOSEI dataset, and the detailed results are shown in Table 10.
On the larger and more diverse MOSEI benchmark (Table 10), the polarity–ordinal tension becomes much weaker: removing MPR causes a small but consistent degradation in fine-grained accuracy (ACC7: 54.7 vs. 54.4), while ACC2/F1 change only marginally (87.72/87.71 vs. 87.45/87.46) and MAE/Corr remain essentially stable (0.513/0.804 vs. 0.512/0.805). This suggests that, under a broader distribution, the binary metric is less dominated by a handful of near-boundary cases, and the main contribution of MPR is to stabilize the ordinal structure rather than to shift coarse-grained polarity, which aligns with its goal of improving level-wise separability and reducing potential level skipping.
When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC7 also decreases to 54.4% and ACC2 further declines to 87.23%. Although Corr (0.804) remains steady and MAE is nearly constant, EIOI is beneficial for handling label uncertainty and reducing over-sensitivity to noisy annotations, which helps maintain robustness and generalization on large datasets.
In contrast, removing the Gated Multi-modal Residual Adapter (GMRA) leads to a more obvious performance drop: ACC7 falls to 53.7%, ACC2 declines to 86.90%, MAE rises to 0.520, and Corr decreases to 0.796. This demonstrates that GMRA effectively manages modality quality variations and temporal asynchrony, enhancing the model’s robustness and generalization performance.
4.5.3. Ablation Analysis on CH-SIMS Dataset
Finally, we conduct detailed ablation experiments on the CH-SIMS dataset, and the results are reported in Table 11.
For CH-SIMS (Table 11), MPR provides clearer benefits for overall robustness: removing MPR leads to a noticeable drop in polarity performance (ACC2/F1: 80.41/80.06 vs. 78.86/78.55) and also degrades regression quality (MAE/Corr: 0.408/0.622 vs. 0.418/0.601), while multi-class accuracies remain nearly unchanged (ACC5/ACC3: 43.54/62.36 vs. 43.42/62.36). Given that CH-SIMS is annotated with continuous scores and exhibits different linguistic and distributional characteristics, enforcing monotone pairwise constraints helps suppress local order reversals induced by subjective labeling noise, thereby improving the ordinal geometry and the fidelity of continuous prediction beyond what discrete multi-class accuracy alone can reflect.
When Error-Interval Ordinal Inconsistency (EIOI) is removed, ACC3 decreases from 62.36% to 61.92%, and ACC2 and F1 also drop, to 79.38% and 79.44%, respectively. In regression, MAE increases to 0.416 and Corr decreases to 0.615, suggesting that EIOI improves tolerance to annotation uncertainty and stabilizes regression under noise.
Finally, removing the Gated Multi-modal Residual Adapter (GMRA) yields the largest drop in ACC3. Most metrics decline, with ACC3 decreasing to 61.26% (the lowest among the three ablations) and Corr dropping to 0.606. This suggests that GMRA plays a pivotal role in deep cross-modal fusion and in adaptively balancing modality quality and asynchrony, which enhances robustness and generalization in the Chinese setting.
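As a reading aid, the CH-SIMS comparisons above can be turned into per-metric degradations with a few lines of Python. This is only a sketch: the dictionaries contain just the values quoted in the prose (metrics not stated in the text are omitted), and the helper name `degradation` is ours, not part of CrossSent.

```python
# CH-SIMS ablation values quoted in the text above (Table 11).
# Only metrics stated in the prose are included; the rest are omitted.
FULL = {"ACC2": 80.41, "F1": 80.06, "ACC3": 62.36, "ACC5": 43.54,
        "MAE": 0.408, "Corr": 0.622}
ABLATIONS = {
    "w/o MPR": {"ACC2": 78.86, "F1": 78.55, "ACC3": 62.36, "ACC5": 43.42,
                "MAE": 0.418, "Corr": 0.601},
    "w/o EIOI": {"ACC2": 79.38, "F1": 79.44, "ACC3": 61.92,
                 "MAE": 0.416, "Corr": 0.615},
    "w/o GMRA": {"ACC3": 61.26, "Corr": 0.606},
}
LOWER_IS_BETTER = {"MAE"}  # for MAE, an increase is a degradation


def degradation(full, ablated):
    """Per-metric degradation; a positive value means the ablated model is worse."""
    out = {}
    for name, value in ablated.items():
        if name in LOWER_IS_BETTER:
            delta = value - full[name]
        else:
            delta = full[name] - value
        out[name] = round(delta, 3)
    return out


if __name__ == "__main__":
    for variant, metrics in ABLATIONS.items():
        print(variant, degradation(FULL, metrics))
```

Expressing every metric as "positive means worse" makes the three ablations directly comparable, regardless of whether a metric is an accuracy or an error.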
4.5.4. Summary of Ablation Experiments
Overall: Across MOSI, MOSEI, and CH-SIMS, GMRA, MPR, and EIOI provide complementary benefits; together they deliver the best accuracy, ordinal consistency, and robustness.
GMRA: Removing GMRA consistently degrades fine-grained classification and regression (MAE↑, Corr↓), confirming its role in dynamic cross-modal fusion and robustness to modality asynchrony and quality variations.
MPR: Without MPR, fine-grained performance and stability decrease, particularly on MOSI and CH-SIMS, indicating that pairwise ranking improves ordinal separability and mitigates level skipping.
EIOI: Removing EIOI leads to higher MAE and lower Corr with moderate accuracy degradation, showing that its interval-based design enhances robustness against label noise and annotation uncertainty.
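To make the two objective families named in this summary more concrete, the sketch below shows one generic form of each: a margin-based pairwise ranking hinge (in the spirit of MPR) and an epsilon-insensitive tolerance-band error (in the spirit of EIOI). These are standard stand-ins, not the exact CrossSent losses defined in the method section; the function names and the `margin` and `eps` defaults are hypothetical.

```python
def pairwise_ranking_loss(preds, labels, margin=0.1):
    """Hinge loss over all pairs with different labels.

    For each pair, the sample with the higher label should receive a
    prediction that is higher by at least `margin` (hypothetical value).
    """
    total, count = 0.0, 0
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            if labels[i] == labels[j]:
                continue  # ties carry no ordering constraint
            sign = 1.0 if labels[i] > labels[j] else -1.0
            total += max(0.0, margin - sign * (preds[i] - preds[j]))
            count += 1
    return total / count if count else 0.0


def tolerance_band_loss(preds, labels, eps=0.25):
    """Epsilon-insensitive absolute error.

    Errors with |pred - label| <= eps (hypothetical band width) cost
    nothing, so small annotation deviations are not penalized.
    """
    errors = [max(0.0, abs(p - y) - eps) for p, y in zip(preds, labels)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    labels = [-1.0, 1.0]
    print(pairwise_ranking_loss([0.0, 1.0], labels))  # correctly ordered pair
    print(pairwise_ranking_loss([1.0, 0.0], labels))  # reversed pair is penalized
    print(tolerance_band_loss([1.0, 2.0], [1.2, 3.0]))
```

The ranking term only constrains relative order, while the band term only constrains absolute error outside the tolerance, which is one way to see why the two can be complementary.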
4.6. Generalization Experiments
To further verify the generalization capability of the proposed CrossSent model under cross-dataset sentiment prediction scenarios, we conduct zero-shot generalization experiments on two English datasets, CMU-MOSI and CMU-MOSEI. Specifically, we train CrossSent on one dataset and directly evaluate it on the other, without any fine-tuning or adaptation, to assess the model’s ability to generalize across data domains and sentiment distributions. The experimental results are summarized in Table 12.
When the CrossSent model is trained on the smaller CMU-MOSI dataset and tested on the larger and more diverse CMU-MOSEI dataset, ACC2 decreases to 82.63% (a drop of 7.15 percentage points compared to 89.78% on MOSI), ACC7 decreases to 46.31%, Corr decreases to 0.7136, MAE increases to 0.646, and F1 drops to 82.5%. Although all metrics decline, the model retains reasonable cross-dataset performance, indicating that CrossSent can transfer from a smaller dataset to a larger and more complex one.
When the CrossSent model is trained on the larger CMU-MOSEI dataset and tested on the smaller CMU-MOSI dataset, ACC2 reaches 86.43%, representing only a 1.29 percentage point decrease compared to 87.72% on MOSEI. ACC7 drops to 49.85%, Corr decreases to 0.8451, MAE increases to 0.621, and F1 decreases to 86.4%. Overall, the degradation in binary polarity is minor, suggesting that representations learned from large-scale data transfer well to smaller datasets.
A comparison between intra-dataset and cross-dataset performance (MOSI→MOSI vs. MOSI→MOSEI; MOSEI→MOSEI vs. MOSEI→MOSI) reveals that generalizing from the smaller dataset to the larger one incurs a larger drop, while training on the larger dataset yields better transfer. This indicates that large-scale training improves robustness to distribution shift and enhances cross-dataset generalization.
Overall, these experiments indicate that CrossSent, with its three key modules (GMRA, MPR, and EIOI), retains useful performance under cross-dataset generalization scenarios, highlighting its potential in real-world multi-modal sentiment prediction.
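For readers who want to reproduce this kind of comparison, the following sketch computes the four headline metrics from continuous predictions and the in-domain versus cross-dataset gap. It rests on assumptions that are ours rather than the paper's: labels lie in [-3, 3], ACC7 is obtained by clipping and rounding, and ACC2 uses the negative versus non-negative convention (other conventions exclude neutral samples). The helper names `evaluate` and `transfer_gap` are hypothetical.

```python
import math


def evaluate(preds, labels):
    """Return MAE, Pearson Corr, ACC7 and ACC2 (accuracies as fractions in [0, 1])."""
    n = len(labels)
    mae = sum(abs(p - y) for p, y in zip(preds, labels)) / n

    # Pearson correlation between predictions and labels.
    mean_p, mean_y = sum(preds) / n, sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    corr = cov / math.sqrt(var_p * var_y)

    # ACC7: clip to [-3, 3] and round to the nearest integer level
    # (Python's round() uses round-half-to-even).
    def level(x):
        return round(max(-3.0, min(3.0, x)))

    acc7 = sum(level(p) == level(y) for p, y in zip(preds, labels)) / n

    # ACC2: negative vs. non-negative (assumed convention).
    acc2 = sum((p >= 0) == (y >= 0) for p, y in zip(preds, labels)) / n
    return {"MAE": mae, "Corr": corr, "ACC7": acc7, "ACC2": acc2}


def transfer_gap(in_domain, cross_domain):
    """Drop in percentage points when moving from in-domain to cross-dataset testing."""
    return round(in_domain - cross_domain, 2)


if __name__ == "__main__":
    # ACC2 values quoted in the text above.
    print("MOSI -> MOSEI gap:", transfer_gap(89.78, 82.63))
    print("MOSEI -> MOSI gap:", transfer_gap(87.72, 86.43))
```

In a zero-shot setting, `evaluate` would be applied to the target test set using a model trained only on the source dataset, with no fine-tuning in between.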
4.7. Visualization and Comparative Analysis
t-SNE analysis (CMU-MOSI, 686 test clips): We visualize fused tri-modal embeddings of two variants, BackBone and w/o MPR, using t-SNE (Figure 3 and Figure 4).
BackBone: Clear sentiment geometry is observed: strong positives and negatives occupy opposite extremes, while neutrals cluster near the center.
w/o MPR: Clusters become blurrier with heavier overlap near the center, and the sentiment gradient weakens, indicating reduced ordinal separability.
4.8. Attention Heatmap Analysis
Attention heatmap analysis (CMU-MOSI): We compare the averaged self-attention (layer 12) of the BackBone and w/o MPR models (Figure 5 and Figure 6).
BackBone: The full model exhibits more focused attention patterns and clearer contextual alignment.
w/o MPR: The attention map becomes more diffuse, suggesting weaker ordinal awareness and reduced interpretability.
5. Conclusions
This paper proposes CrossSent, a multi-modal sentiment analysis framework for fine-grained continuous sentiment prediction. CrossSent integrates three complementary components: GMRA injects acoustic–visual cues into a textual backbone through token-aligned cross-modal attention with gated residual updates to improve robustness under modality asynchrony and quality variation; MPR introduces ordinal-consistent pairwise constraints to enhance separability across sentiment levels and reduce level skipping; and EIOI adopts a tolerance-band objective that avoids over-penalizing minor annotation deviations, stabilizing optimization under noisy supervision. Experiments on two English benchmarks (CMU-MOSI and CMU-MOSEI) and one Chinese benchmark (CH-SIMS) demonstrate consistent improvements on both regression- and classification-oriented metrics, and ablation/visualization analyses further verify the complementary contributions of the three components. Moreover, an efficiency study shows that enabling GMRA introduces only modest overhead in parameters, latency, throughput, and memory footprint under a standard GPU setting, supporting the practical feasibility of the proposed design. The ordinal-consistency evaluation provides additional evidence of improved level-wise agreement.
In practical applications, CrossSent is well-suited for sentiment assessment on user-generated multi-modal content, where text often provides a reliable semantic anchor while acoustic and visual signals can be noisy or intermittently informative. The token-wise gated injection mechanism allows the model to selectively exploit non-text cues without amplifying unreliable segments, and the ordinal-consistent learning objectives yield more stable sentiment scores for downstream tasks such as ranking, trend monitoring, and risk-sensitive decision making.
Although CrossSent achieves strong performance, several limitations remain and merit further investigation:
- (1) Our experiments follow the widely adopted pre-extracted feature setting for audio and vision; thus, the reported gains mainly reflect improved fusion and ordinal-consistent learning under fixed A/V representations rather than end-to-end optimization of acoustic/visual encoders, which may affect deployment-oriented conclusions.
- (2) Results are reported under a controlled single-run protocol with a fixed random seed to ensure reproducibility; however, this does not replace multi-seed evaluation with statistical significance testing or confidence intervals, which would better quantify the robustness of the observed gains.
- (3) Efficiency measurements are conducted with a fixed sequence length () and a single GPU type; inference characteristics may vary across different hardware platforms, batch sizes, and deployment configurations.
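The multi-seed evaluation mentioned in limitation (2) could be summarized along the following lines. This is a minimal sketch, assuming one scalar score per seed and a two-sided 95% Student-t interval; the function name `mean_with_ci` is hypothetical, and the numbers in the demo are synthetic placeholders, not experimental results.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for 1 to 9 degrees of freedom
# (standard table values), enough for up to 10 seeds.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
         6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_with_ci(scores):
    """Return (mean, half_width) of a 95% t-interval over per-seed scores."""
    n = len(scores)
    if n < 2 or (n - 1) not in T_975:
        raise ValueError("expected between 2 and 10 per-seed scores")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    half_width = T_975[n - 1] * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Synthetic placeholder scores for illustration only.
    mean, half = mean_with_ci([1.0, 2.0, 3.0])
    print(f"{mean:.3f} +/- {half:.3f}")
```

With only a handful of seeds the t-interval is wide, which is itself informative: gains of a few tenths of a point may fall within seed-to-seed variation.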
In future work, we will extend CrossSent to end-to-end training with learnable acoustic/visual encoders, conduct multi-seed evaluation with statistical testing, and perform broader deployment-oriented benchmarking across sequence lengths and devices.