4.1. Datasets
We evaluated our model on three multimodal datasets: CMU-MOSI [7], CMU-MOSEI [8], and CH-SIMS [15] (see Table 2).
CMU-MOSI. CMU-MOSI contains 2199 opinion-level annotated samples from 93 YouTube vlogs by 89 speakers, with sentiment scores ranging from −3 to 3. The dataset is divided into 1284 training, 229 validation, and 686 test samples.
CMU-MOSEI. CMU-MOSEI, the largest dataset for sentence-level sentiment analysis, consists of 23,453 annotated clips from 1000 speakers in 3228 YouTube videos. It is divided into 16,326 training, 1871 validation, and 4659 test samples, with sentiment scores from −3 to 3.
CH-SIMS. The CH-SIMS dataset, a Chinese single and multimodal sentiment analysis benchmark, comprises 2281 video clips from 60 raw videos. It contains 1368 training, 456 validation, and 457 test samples, each labeled with sentiment scores ranging from −1 to 1, allowing a complete sentiment analysis.
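For reference, the split sizes and label ranges listed above can be collected into a small configuration dictionary; the key names below are illustrative and not taken from our codebase.

```python
# Split sizes and label ranges as stated above; key names are illustrative only.
DATASET_STATS = {
    "CMU-MOSI":  {"train": 1284,  "valid": 229,  "test": 686,  "label_range": (-3.0, 3.0)},
    "CMU-MOSEI": {"train": 16326, "valid": 1871, "test": 4659, "label_range": (-3.0, 3.0)},
    "CH-SIMS":   {"train": 1368,  "valid": 456,  "test": 457,  "label_range": (-1.0, 1.0)},
}

for name, s in DATASET_STATS.items():
    print(f"{name}: {s['train']}/{s['valid']}/{s['test']} train/valid/test samples, "
          f"labels in {s['label_range']}")
```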
All experiments were conducted using Python 3.10.4 with the PyTorch 1.2.0 framework accelerated by CUDA 11.4. Model training was performed on a system equipped with an Intel Core i5-12500H processor and an NVIDIA GeForce RTX 3090 GPU (24 GB memory). Due to the varying characteristics of the three datasets, we adopt dataset-specific hyperparameter settings, which are detailed in Table 3.
4.4. Results and Analysis
The results of the HTRN on CMU-MOSI and CMU-MOSEI are shown in Table 4, compared to several state-of-the-art baselines. On CMU-MOSI, the HTRN achieves the best performance on all metrics. Specifically, it surpasses the second-best method, ConFEDE, by 0.8% on Acc-2 and 0.9% on the F1 score, and improves Acc-5 and Acc-7 by 4.9% and 1.6%, respectively. Compared to the TFN and LMF, the HTRN produces a lower MAE (0.716) and a higher correlation (0.794), reflecting more precise regression and better modeling of multimodal interactions. The TFN and LMF suffer from limited fusion expressiveness or high computational costs, while the HTRN maintains both efficiency and accuracy. Although PS-Mixer achieves competitive classification results, its performance on fine-grained metrics remains less consistent, indicating the HTRN’s advantage in robustness.
On CMU-MOSEI, the HTRN also delivers strong results. It achieves 54.9% on Acc-5, outperforming all competing models. For Acc-2 and F1, the HTRN slightly exceeds PS-Mixer and ConFEDE, showing comparable performance with top-tier methods. Although gains over baselines such as MFM or MISA are smaller than those on CMU-MOSI, the HTRN demonstrates stable improvements and strong generalization. Prior methods focus on static modality fusion or lack effective temporal alignment, while the dynamic token-role design of the HTRN allows for better integration of modality-specific and shared information.
Notably, CMU-MOSI is a smaller dataset with binary sentiment labels primarily based on monologue videos, which makes it prone to overfitting but suitable for observing performance under limited data. In contrast, CMU-MOSEI provides more diverse speakers and topics, along with more sentiment levels (from −3 to +3), thus serving as a more challenging task for model generalization. Our consistent results across both datasets confirm the HTRN’s robustness in both low-resource and large-scale settings.
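For clarity, the metrics reported in Table 4 (MAE, Corr, Acc-7, Acc-5, Acc-2, and F1) can be computed as in the following sketch; the binning and zero-label conventions follow common practice in the CMU-MOSI/MOSEI literature and may differ in minor details from the exact evaluation script used here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Regression-style MSA metrics; binning follows common CMU-MOSI/MOSEI practice."""
    mae = np.mean(np.abs(preds - labels))
    corr = pearsonr(preds, labels)[0]

    # Acc-7 / Acc-5: round to the nearest integer and clip to the class range.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3) == np.clip(np.round(labels), -3, 3))
    acc5 = np.mean(np.clip(np.round(preds), -2, 2) == np.clip(np.round(labels), -2, 2))

    # Binary accuracy / F1 on non-zero labels (negative vs. positive).
    nz = labels != 0
    acc2 = np.mean((preds[nz] > 0) == (labels[nz] > 0))
    f1 = f1_score(labels[nz] > 0, preds[nz] > 0)

    return {"MAE": mae, "Corr": corr, "Acc-7": acc7, "Acc-5": acc5, "Acc-2": acc2, "F1": f1}
```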
In addition, to verify the generality of the HTRN, we jointly trained the CMU-MOSI and CMU-MOSEI training sets and evaluated them on their respective test sets. To accommodate the heterogeneity of the two datasets, we designed independent modal embedding linear layers for each dataset while sharing the parameters of the backbone model. Experimental results show that this joint training strategy across datasets effectively improves the overall performance of the two datasets, especially in the classification accuracy and correlation metrics.
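A minimal sketch of this joint-training setup, with dataset-specific embedding layers in front of a shared backbone and head, is shown below; the module structure, the toy backbone, and the feature dimensions are placeholders rather than the actual HTRN implementation.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the shared backbone; pools each modality over time and mixes them."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Linear(3 * d_model, d_model)

    def forward(self, proj: dict) -> torch.Tensor:
        pooled = [proj[m].mean(dim=1) for m in ("t", "a", "v")]   # (B, d_model) each
        return torch.relu(self.mix(torch.cat(pooled, dim=-1)))

class JointMSAModel(nn.Module):
    """Dataset-specific modality embeddings in front of a shared backbone and head."""
    def __init__(self, feat_dims: dict, d_model: int):
        super().__init__()
        # One independent linear embedding per dataset and per modality.
        self.embeds = nn.ModuleDict({
            ds: nn.ModuleDict({m: nn.Linear(dim, d_model) for m, dim in dims.items()})
            for ds, dims in feat_dims.items()
        })
        self.backbone = ToyBackbone(d_model)     # parameters shared across datasets
        self.head = nn.Linear(d_model, 1)        # shared regression head

    def forward(self, dataset: str, inputs: dict) -> torch.Tensor:
        proj = {m: self.embeds[dataset][m](x) for m, x in inputs.items()}
        return self.head(self.backbone(proj)).squeeze(-1)

# Hypothetical per-dataset feature sizes; the real dimensions follow the feature extractors.
feat_dims = {"mosi": {"t": 768, "a": 5, "v": 20}, "mosei": {"t": 768, "a": 74, "v": 35}}
model = JointMSAModel(feat_dims, d_model=128)
batch = {"t": torch.randn(4, 50, 768), "a": torch.randn(4, 50, 5), "v": torch.randn(4, 50, 20)}
print(model("mosi", batch).shape)   # torch.Size([4])
```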
Specifically, on CMU-MOSI, the HTRN‡ achieved leading performance on multiple key metrics, including Acc-7, Acc-5, F1, and Corr. Acc-7 improved to 47.2% and Corr to 0.812, indicating stronger discriminative ability in multi-level classification and a better fit in affective intensity modeling. It is worth noting that, as can be seen from Table 3, the HTRN‡ achieves faster convergence by simply increasing the batch size to 128 under the same CrossTrans depth and epoch settings as CMU-MOSI, demonstrating good training efficiency and stability. On CMU-MOSEI, the HTRN‡ also improved the Acc-5 and Corr metrics, reaching 56.6% and 0.785, respectively, further verifying its strong generalization ability on large-scale multimodal datasets.
This indicates that the HTRN can effectively capture the commonalities between CMU-MOSI and CMU-MOSEI despite their differences in topic distribution, sample size, and sentiment expression. Driven by the improved multimodal semantic alignment and modeling capabilities, joint training avoids the risk of overfitting to a single dataset. In addition, it improves the model’s generalization ability for sentiment semantics across datasets. This advantage is due to the text-guided dynamic role modeling mechanism, which enables the model to more accurately identify the semantic core in multi-source data, thereby improving downstream sentiment discrimination performance.
We note that on Acc-7 of CMU-MOSEI, neither the single training nor the joint training of the HTRN performed as expected, in contrast to the improvement trend of other metrics. This may be due to the following factors: First, the definition of Acc-7 focuses more on distinguishing the intensity of intermediate emotions, while CMU-MOSEI contains many neutral or mildly emotional expressions, which are more subjective and introduce greater uncertainty in model decisions. Second, since the distribution of emotional labels in CMU-MOSEI is relatively biased towards neutral and light polarities, it is difficult for the model to effectively draw the decision boundary between different categories. To further optimize this indicator, subsequent work may consider introducing a more refined label redistribution mechanism or fuzzy boundary processing for neutral regions to enhance the HTRN’s recognition ability under ambiguous emotional expressions.
As shown in Table 5, the HTRN consistently outperforms all baseline methods on the challenging CH-SIMS dataset across all evaluation metrics. In the fine-grained five-class classification task, the HTRN achieves an Acc-5 of 43.98%, outperforming the closest competitor, LMF, by a significant margin of 3.45 percentage points. Similarly, the HTRN achieves 68.71% on Acc-3, demonstrating superior capability over early fusion models like the TFN (65.12%) and modality interaction-based approaches like MuLT (64.77%).
Regarding Acc-2, the HTRN demonstrates strong performance by attaining 80.31% accuracy and an F1 score of 80.23, outperforming MuLT (78.56%, 79.66 F1) and MISA (76.54%, 76.59 F1). The HTRN achieves the lowest MAE (0.394) and the highest correlation (0.628), demonstrating improved prediction precision and more substantial alignment with actual sentiment trends.
These improvements can be attributed to the HTRN’s novel hierarchical text-guided alignment strategy, which dynamically guides the crossmodal fusion process and effectively captures fine-grained intermodality dependencies. Furthermore, the SIF module introduces structured perturbations into the fusion process, enhancing the HTRN’s robustness by mitigating modality-specific noise and irrelevant variance. This dual strategy enables the HTRN to preserve semantic consistency while selectively amplifying emotionally salient cues, making it particularly effective across diverse classification levels.
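As one plausible reading of the text-guided alignment idea, the sketch below uses textual features as attention queries over the audio and visual sequences; the query/key assignment, dimensions, and layer composition are assumptions for illustration, not the exact TAL definition.

```python
import torch
import torch.nn as nn

class TextGuidedAlign(nn.Module):
    """One plausible realization of a text-guided alignment layer.

    Text features serve as queries that pull sentiment-relevant content from the
    audio and visual sequences; sizes and the residual scheme are illustrative.
    """
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, video):
        # text: (B, Lt, D), audio: (B, La, D), video: (B, Lv, D)
        t2a, _ = self.attn_a(text, audio, audio)   # text queries, audio keys/values
        t2v, _ = self.attn_v(text, video, video)   # text queries, video keys/values
        return self.norm(text + t2a + t2v)         # residual, text-centred fusion
```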
As shown in Table 6, we compare the size and computational complexity of the HTRN with several representative MSA models on CMU-MOSI, excluding the shared BERT-based text encoder for a fair comparison [40,46,47]. Despite its relatively compact architecture with only 2.51 M parameters and 4.76 G FLOPs, the HTRN achieves the highest Acc-2 score of 86.3%, outperforming all baseline models in both efficiency and effectiveness. Compared to ConFEDE, which achieves 85.5% but at the cost of 20.12 M parameters and a heavy computational burden of 134.16 G FLOPs, the HTRN offers an 8× reduction in parameters and a 28× reduction in FLOPs, while still improving performance by a notable margin. Even when compared with highly efficient models such as MAG-BERT (1.22 M, 3.91 G) and MISA (3.10 M, 1.7 G), the HTRN achieves a better balance between model size and performance, outperforming them by 1.9% and 2.9% on NP Acc-2, respectively.
These results highlight the efficiency of the hierarchical alignment and structured perturbation strategy of the HTRN, which enables expressive crossmodal representation without incurring significant computational cost. Rather than relying on deep and computationally expensive fusion networks, the HTRN uses a lightweight yet highly effective architecture that selectively integrates multimodal cues, achieving competitive accuracy with significantly fewer resources. This makes the HTRN a promising choice for real-world deployment scenarios where both performance and efficiency are critical.
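The parameter and FLOP figures in Table 6 can be reproduced in spirit with a profiler such as thop; the snippet below is a hedged sketch and not necessarily the tooling used for the reported numbers.

```python
import torch
from thop import profile  # third-party profiler; reports multiply-accumulates, often quoted as FLOPs

def model_cost(model: torch.nn.Module, example_inputs: tuple):
    """Return (trainable parameters in M, MACs in G) for one forward pass."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    macs, _ = profile(model, inputs=example_inputs, verbose=False)
    return n_params / 1e6, macs / 1e9

# Example with a stand-in module; the Table 6 comparison excludes the shared BERT text encoder.
toy = torch.nn.Sequential(torch.nn.Linear(768, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))
print(model_cost(toy, (torch.randn(1, 768),)))
```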
Figure 4 compares the ROC and PR curves of MISA, MuLT, and the HTRN on the three datasets under the Acc-2 setting. On CMU-MOSI, the HTRN achieves an ROC AUC of 0.913, higher than that of MuLT (0.871) and MISA (0.902), demonstrating better discriminative ability. In the corresponding PR curve, the HTRN also achieves the highest AP value (0.888), outperforming MISA (0.881) and MuLT (0.831).
On CMU-MOSEI, the HTRN demonstrates superior performance, achieving an AUC of 0.849, which slightly exceeds that of MuLT (0.837) and MISA (0.840). The AP value in its PR curve is 0.843, which is also better than that of MISA (0.830) and MuLT (0.826), indicating that the HTRN still has robust sentiment recognition capabilities in complex and diverse natural language contexts.
On the Chinese dataset CH-SIMS, the HTRN achieves an AUC of 0.823, substantially outperforming MISA (0.791) and slightly surpassing MuLT (0.819), demonstrating its robust cross-lingual generalization capability. In the PR curve, the AP value of the HTRN is 0.743, which is also higher than that of MISA (0.687) and MuLT (0.693), further verifying the model’s broad adaptability and strong generalization ability in Chinese MSA.
In general, the HTRN outperforms existing mainstream methods across datasets and evaluation metrics; its ROC and PR curves in particular demonstrate superior classification performance, verifying its effectiveness and robustness in MSA.
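The AUC and AP values in Figure 4 correspond to the binary (Acc-2) view of the task; a sketch of how they can be computed from continuous sentiment predictions with scikit-learn is given below. The binarization convention (non-zero labels, threshold at zero) is an assumption following common CMU-MOSI practice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def binary_curve_scores(preds: np.ndarray, labels: np.ndarray):
    """AUC / AP for the binary (Acc-2) view of sentiment prediction.

    Continuous scores are binarized at zero on non-zero labels; this convention
    may differ in detail from the exact protocol behind Figure 4.
    """
    nz = labels != 0
    y_true = (labels[nz] > 0).astype(int)
    y_score = preds[nz]                      # higher score = more positive sentiment
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)
```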
4.5. Ablation Studies
4.5.1. Comparison of Effects of Different Modalities
To evaluate the impact of each modality on model performance and the contribution of the auxiliary modalities to text-guided feature extraction, we conducted ablation studies on CMU-MOSI and CMU-MOSEI, as shown in Table 7, focusing on NP Acc-2 and Acc-7 on CMU-MOSI and on Acc-5 and MAE on CMU-MOSEI.
For clarity of notation, we denote the text-guided approach in the TAL (Text-Guided Alignment Layer) as TG, and the corresponding audio-guided and video-guided variants as AG and VG, respectively.
Table 7 further demonstrates that text-guided feature alignment plays the most significant role compared with guidance from the non-text modalities: TG consistently outperforms AG and VG on both CMU-MOSI and CMU-MOSEI. Specifically, TG achieves the highest Acc-2 (86.3%) and Acc-5 (54.9%) scores, as well as the lowest MAE (0.531), indicating its superior ability to capture sentiment-relevant features.
In addition, the results highlight the greater contribution of the audio modality, suggesting that audio provides more auxiliary information than the visual modality. Specifically, removing the audio input results in a 2.1% drop in NP Acc-2, whereas removing the video input leads to a minor decrease of 0.9%, highlighting the crucial role of the audio modality in the HTRN.
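A minimal sketch of a modality-removal ablation is shown below; zero-masking one input stream is one common realization, and the exact protocol behind Table 7 (for example, retraining without the corresponding branch) may differ. The `evaluate` call and `batch` dictionary in the usage comment are placeholders.

```python
import torch

def ablate_modality(inputs: dict, modality: str) -> dict:
    """Return a copy of the batch with one modality's features replaced by zeros.

    Zero-masking is one common ablation choice; retraining the model without the
    corresponding branch is a stricter alternative.
    """
    out = dict(inputs)
    out[modality] = torch.zeros_like(inputs[modality])
    return out

# e.g. metrics_without_audio = evaluate(model, ablate_modality(batch, "a"))
```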
4.5.2. Textual Feature Scale Analysis
In Table 8, we analyze how textual features at different hierarchical scales affect the performance of the HTRN on CMU-MOSI. The experimental results clearly illustrate the effectiveness of the progressive guidance strategy adopted in the HTRN. Specifically, we observe that each level of textual representation (low-, mid-, and high-level) contributes differently to performance, revealing its unique role in guiding the alignment of the non-text modalities. To further validate this observation, we enable different scales of textual features to guide the fusion of non-textual features in the experiments. For layers that do not require language guidance, we replace the TAL with a simple MLP layer. As the experimental results show, enabling hierarchical textual guidance significantly improves performance, confirming the importance of multi-scale semantic cues in the fusion process.
The low-level text representation, which captures shallow and local semantics, already provides strong baseline performance (84.5% accuracy, 84.8% F1 score). However, its ability to handle complex or abstract expressions is limited. Performance remains competitive when using the mid-level representation, indicating its capacity to generalize contextual patterns. Similarly, the high-level representation excels at capturing abstract and global semantics, as evidenced by its strong correlation score (0.782), suggesting its advantage in guiding the global alignment of audio and visual signals.
More importantly, as we progressively combine multiple scales of textual features, the model consistently improves across almost all metrics. For example, two-scale combinations already lead to a better correlation (up to 0.781) and the lowest MAE (0.715), demonstrating that complementary semantic cues from different layers are beneficial. The complete integration of all three levels achieves the best overall performance, with 86.3% accuracy, an 86.4% F1 score, and a correlation of 0.794. This confirms the importance of multi-scale textual guidance in building a semantically coherent multimodal feature space.
In summary, these results empirically substantiate the fundamental principle underlying our text-guided alignment design. Hierarchical textual features offer increasingly abstract and complementary cues that progressively refine the alignment of non-textual modalities. The HTRN can dynamically bridge the semantic gap between modalities by leveraging such multi-scale representations, leading to more accurate and robust sentiment predictions.
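For illustration, the sketch below extracts low-, mid-, and high-level textual representations from a BERT encoder's hidden states, which could then guide successive fusion stages; the specific layer indices (4, 8, 12) and the bert-base-uncased checkpoint are assumptions, not the HTRN's actual configuration.

```python
import torch
from transformers import BertModel, BertTokenizer

def hierarchical_text_features(sentences, low=4, mid=8, high=12):
    """Return three scales of textual features taken from BERT hidden states.

    The layer choice (4 / 8 / 12) is illustrative; the HTRN's actual scales follow
    its own text encoder configuration.
    """
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    enc = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = bert(**enc).hidden_states         # tuple of 13 tensors: embeddings + 12 layers
    return hidden[low], hidden[mid], hidden[high]  # low-, mid-, high-level text features
```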
4.5.3. Impact of Shuffle Token Insertion Positions
Table 9 compares different insertion positions of the shuffle token (interval insertion, front insertion, end insertion, and no insertion) on CMU-MOSI and CMU-MOSEI. As shown in Table 9, inserting shuffle tokens at fixed intervals (the HTRN setting) outperforms the other strategies across both datasets, achieving the highest Acc-2 (86.3% for CMU-MOSI, 86.4% for CMU-MOSEI) and the lowest MAE. This improvement stems from balanced local perturbation: uniformly distributed empty tokens preserve sequential coherence while enhancing feature diversity, similar to structured dropout. Additionally, truncating to the first N tokens compresses redundant information, forcing the model to focus on discriminative features from the early stages.
In contrast, front and end insertion disrupt sequence integrity by concentrating perturbations in localized regions (e.g., erasing initial cues in front insertion), leading to context fragmentation and higher MAE. The baseline without insertion suffers from overfitting to noisy segments in the full-length sequences (50 tokens), highlighting the necessity of structured regularization. Thus, the HTRN’s interval-based design optimally balances robustness and efficiency.
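A minimal sketch of interval insertion followed by truncation to the first N tokens is given below; the interval, the number of retained tokens, and the initialization of the empty token are placeholders, not the exact SIF settings.

```python
import torch
import torch.nn as nn

class ShuffleTokenInserter(nn.Module):
    """Insert a learnable empty token every `interval` positions, then keep the first `n_keep`.

    Interval insertion plus truncation mirrors the idea described above; the specific
    interval, n_keep, and token initialization are illustrative placeholders.
    """
    def __init__(self, d_model: int, interval: int = 5, n_keep: int = 10):
        super().__init__()
        self.shuffle_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.interval, self.n_keep = interval, n_keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, L, D)
        B, L, D = x.shape
        chunks, tok = [], self.shuffle_token.expand(B, 1, D)
        for start in range(0, L, self.interval):
            chunks.append(x[:, start:start + self.interval])
            chunks.append(tok)                             # perturbation after each chunk
        out = torch.cat(chunks, dim=1)
        return out[:, : self.n_keep]                       # truncate to the first n_keep tokens
```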
4.5.4. Visualization of Token Importance in SIF
Figure 5 presents our analysis of the visual modality input of the HTRN on CMU-MOSI. We adopt the Integrated Gradients (IG) method to measure the importance of each visual token to the final prediction and visualize the results to verify the model’s attention regions and decision basis when processing video modality information.
In Figure 5, we analyze the first two samples from the CMU-MOSI test set, focusing on all visual tokens before applying the SIF module, as well as the top 10 tokens retained by the HTRN after applying the SIF module.
Figure 5a,c illustrate the importance distribution of all visual tokens for each sample. The most important positions are highlighted with red dots, and relative importance is indicated by color intensity. The x-axis represents the position of the tokens in the original sequence, while the y-axis shows the corresponding importance scores. We compute the overall importance of each token by summing its attributions across the feature dimensions using the IG method. Notably, within the interval of the first 10 tokens, marked by green dashed boxes, the importance scores are significantly higher than those at other positions. This suggests that the HTRN tends to focus more on the early part of the input sequence.
To further explore the attention distribution of the HTRN within the selected input range, Figure 5b,d show the importance scores of each retained token. The results show that the HTRN’s attention to these 10 tokens is clearly selective rather than evenly distributed. This phenomenon supports the design concept of the SIF module in the HTRN: retaining key information while suppressing redundant information. It should be noted that the HTRN does not assign the same weight to all frames. Instead, through the SIF module, the HTRN automatically learns the 10 most representative tokens, which more effectively represent the core information of the modality and significantly reduce redundant interference, further verifying the effectiveness of the SIF mechanism in information compression and key content extraction.
Moreover, comparing the two samples reveals a degree of dynamic attention distribution. The high-importance positions within the top 10 tokens are not identical across samples; for example, the model focuses on tokens 0, 2, and 6 in sample 1, whereas it shifts attention to tokens 1, 3, and 6 in sample 2. This variation reflects the model’s adaptability to different inputs, showing its ability to flexibly adjust attention based on contextual information and make more semantically aligned multimodal inferences.
In general, Figure 5 not only uncovers the token utilization pattern of the model in the visual modality but also provides an effective tool for interpreting multimodal fusion strategies. The IG-based analysis successfully identifies the key inputs driving the model’s decisions. It supports our hypothesis that effective crossmodal fusion can be achieved using only the top 10 visual tokens. This insight offers strong evidence for the lightweight and efficient design of the HTRN.
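The token-importance scores in Figure 5 can be obtained with an off-the-shelf Integrated Gradients implementation such as Captum's; the sketch below assumes the model's forward pass can be wrapped to vary only the visual sequence and to return one scalar sentiment score per sample.

```python
import torch
from captum.attr import IntegratedGradients

def visual_token_importance(model, text, audio, video):
    """Per-token importance of the visual sequence via Integrated Gradients.

    Assumes the model can be called as model(text, audio, video) and yields one
    scalar per sample; attribution is summed over the feature dimension, as in
    the Figure 5 analysis.
    """
    def forward_on_video(v):
        return model(text, audio, v).squeeze(-1)   # (B,) so no target index is needed

    ig = IntegratedGradients(forward_on_video)
    attr = ig.attribute(video, baselines=torch.zeros_like(video))  # (B, Lv, D)
    return attr.sum(dim=-1)                                        # (B, Lv) token-level scores
```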
4.5.5. Visualization of Attention in TAL
In Figure 6, we show the average attention heatmaps for the last layer of the TAL on CMU-MOSI. These heatmaps visually interpret how the HTRN attends to different modalities during the matching process. It can be seen that the attention weight assigned to the audio modality is consistently higher than that of the visual modality. This indicates that the HTRN tends to rely more on audio features when constructing a fused multimodal representation, which can be attributed to the audio containing richer and more direct affective cues, such as variations in rhythm, tone, or pitch, providing valuable information for MSA.
To further validate this observation, we refer to the results shown in Table 7, where a modality deletion experiment is conducted. Specifically, we compare the performance degradation of the HTRN when the audio or visual modality is removed. The results show that excluding audio leads to a more significant performance degradation across multiple metrics (classification accuracy, F1 score, and correlation) compared to removing the visual modality. This quantitative evidence is consistent with the attention analysis and confirms the central role of the audio modality in the HTRN.
The attention visualization and the modality-removal results further confirm this conclusion: compared with the visual modality, the audio modality in the HTRN provides more auxiliary, emotion-related information. By adaptively focusing on the more informative modality, the HTRN effectively uses the most relevant signals for alignment and prediction, a key factor in its superior performance in MSA.
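For completeness, a sketch of how Figure 6-style heatmaps can be produced, by averaging a (batch, heads, queries, keys) attention tensor over batch and heads, is shown below; it assumes the last TAL layer exposes its attention weights in this layout.

```python
import torch
import matplotlib.pyplot as plt

def plot_mean_attention(attn_weights: torch.Tensor, title: str = "TAL last-layer attention"):
    """Average an attention tensor of shape (batch, heads, queries, keys) and plot a heatmap.

    Assumes the last alignment layer returns per-head attention weights; the shape
    convention matches nn.MultiheadAttention with average_attn_weights=False.
    """
    mean_attn = attn_weights.mean(dim=(0, 1)).detach().cpu().numpy()   # (queries, keys)
    plt.imshow(mean_attn, aspect="auto", cmap="viridis")
    plt.colorbar(label="attention weight")
    plt.xlabel("key position")
    plt.ylabel("query position")
    plt.title(title)
    plt.show()
```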
4.5.6. Visualization of Different Representations
Figure 7 illustrates t-SNE visualizations of different feature representations on the three datasets. The left column of Figure 7 presents the initial unimodal features extracted from the text, vision, and audio modalities, where each modality forms a distinct cluster, indicating their separability in the original feature space. The right column of Figure 7 shows the distribution of the learned multimodal representations F, where positive samples are marked in blue and negative samples in red. The clear separation between the two groups suggests that the model effectively captures discriminative information, facilitating better classification. This visualization highlights the importance of integrating multimodal information to enhance the representation learning process.
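A sketch of the Figure 7-style projection with scikit-learn's t-SNE is shown below; the perplexity and other settings are illustrative defaults rather than the values used for the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_scatter(features: np.ndarray, labels: np.ndarray, title: str):
    """Project (N, D) features to 2-D with t-SNE and color points by binary sentiment."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    pos, neg = labels > 0, labels <= 0
    plt.scatter(emb[pos, 0], emb[pos, 1], c="tab:blue", s=8, label="positive")
    plt.scatter(emb[neg, 0], emb[neg, 1], c="tab:red", s=8, label="negative")
    plt.legend()
    plt.title(title)
    plt.show()
```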
4.5.7. Convergence Performance
To further explore the convergence behavior of the HTRN, we present a comparative analysis of the MAE training and validation curves on CMU-MOSI alongside several representative baselines, including MuLT, Self-MM, MISA, and ConFEDE. As shown in Figure 8, the HTRN converges fastest on the training set among all methods and consistently achieves lower validation errors with fewer oscillations, reflecting its strong generalizability and its ability to converge efficiently and reliably.
Convergence, in the context of multimodal fusion networks, refers to the stability and reliability of model optimization under various modality combinations. Poor convergence often indicates overfitting or unstable gradient flow, especially when complex attention-based fusion mechanisms are involved. In contrast, the HTRN benefits from its hierarchical text-guided alignment and structured modality perturbation, which regularize the fusion process and guide the optimization toward more stable and semantically meaningful representations.
The convergence curves visually illustrate that the HTRN reaches an optimal state more efficiently and robustly than its counterparts, highlighting that the proposed design improves both performance and training reliability, a crucial factor for real-world deployment.
4.5.8. Real Case
To further validate the effectiveness of the HTRN, we present two representative samples from the CH-SIMS test set in
Figure 9. In the HTRN, the text modality (T) serves as the primary source for sentiment prediction, while audio (A) and visual (V) modalities act as auxiliary sources. The TAL module allows the auxiliary modalities to modulate the textual sentiment representation, especially when non-verbal cues convey emotions that are not explicitly expressed in text. Additionally, our SIF module effectively minimizes redundant information to optimize the retention of informative features within each individual modality.
In the first sample, the speaker reflects on an emotional reaction to music. The overall ground truth sentiment is Weak Negative (M: 1), mainly driven by the visual modality (V: 2), while both text and audio are labeled as neutral (T: 0, A: 0). Despite the lack of clear sentiment in the transcript, the HTRN correctly predicts a Weak Negative sentiment (M: 1), demonstrating its ability to leverage non-verbal cues such as facial expressions to compensate for the neutrality in text.
In the second sample, the speaker expresses determination and optimism. The text modality shows a strongly positive sentiment (T: 4), but both audio and visual cues are neutral (A: 0, V: 0). The overall sentiment label, however, is Weak Negative (M: 1), which our model accurately predicts. This indicates that the model is capable of detecting inconsistencies between the excessively positive textual content and the comparatively muted emotional cues present in the audio and visual modalities, thereby mitigating textual bias and achieving alignment with the speaker’s true emotional state.