1. Introduction
Video, as a multi-modal data source integrating visual streams (e.g., facial expressions and gestures), linguistic content (e.g., subtitles and captions), and acoustic information (e.g., human voice and ambient sounds), offers a variety of affective signals for understanding human psychology and social dynamics. To fully exploit this potential, a wide range of studies have focused on emotion recognition based on human reactions under controlled recording conditions. For instance, various multi-modal frameworks [1,2,3,4,5] have been developed and evaluated using professionally recorded videos that exhibit structured consistency, such as those in the CMU-MOSI [6] and IEMOCAP [7] datasets. Although these efforts have advanced multi-modal fusion techniques for emotion recognition, they primarily focus on analyzing the emotional signals displayed by the people in the video, rather than uncovering the emotional responses evoked in the viewers. In contrast, emotion recognition in user-generated videos (UGVs) aims to uncover the subconscious emotional responses elicited by diverse video stimuli—responses that are not tied to on-screen human subjects but rather to the affective impact of the video content on viewers. With the explosive growth of self-media and user-generated content, emotion analysis in UGVs has become an area of significant academic and commercial interest, ultimately enabling real-world applications such as targeted advertising and video recommendation systems [8].
Most existing methods in this field focus on addressing the affective gap. They mainly achieve this by designing a learnable module that performs sequence-to-sequence mapping, a paradigm we term the encoder-only approach. While the state-of-the-art method [9] in this paradigm introduces multi-layer self-attention mechanisms for effective temporal modeling, it suffers from quadratic complexity with respect to the sequence length, imposing significant computational and memory burdens. Furthermore, emotional expression in videos is often characterized by intense affective peaks rather than a linear progression. A standard encoder-only model, which processes all temporal elements continuously, risks diluting the impact of critical segments and introducing noise from neutral or irrelevant frames. To address this, we introduce a decoder paradigm that achieves adaptive attention to crucial emotional segments with linear complexity.
Simultaneously, training multi-layer structures from scratch is challenging due to limited training data and a significant domain discrepancy, leading to suboptimal emotional representation learning. To mitigate similar issues, previous methods have relied on auxiliary data [10,11] or more sophisticated training strategies [12]. However, no prior studies have explored the potential of prior knowledge encoded in models of other modalities. A recent study [13] reveals that transformer models can improve performance on a target modality by exploiting modality-complementary knowledge from an auxiliary modality, even when the auxiliary modality is seemingly irrelevant and without requiring extra auxiliary data. This effect is achieved by repurposing general token mapping and interaction rules across modalities using two identically structured transformers, each trained on a distinct modality. While effective, this approach is constrained by its architectural rigidity and fails to leverage modality-specific knowledge. Inspired by this finding, we propose a novel approach that eliminates the need for paired, identical models, allowing for a more flexible and explicit transfer of modality-specific prior knowledge.
To this end, we introduce a novel decoder architecture that establishes a prototype evolution pathway to progressively bridge the affective gap. As illustrated in Figure 1, unlike previous encoder-only methods, our approach utilizes a multi-layer decoder. This decoder is repurposed from a pretrained text encoder to incorporate its rich prior knowledge for affective understanding, and it works in concert with a neutral prototype. It gradually guides the prototype’s evolution toward distinct affective attributes, dynamically generating emotion-specific representations aligned with the visual content. The decoder thus leverages large-scale linguistic knowledge to ground and rationalize the entire prototype evolution pathway. This reframes the affective gap challenge into a more tractable prototype shift problem within the target domain, effectively avoiding the domain shift instability associated with cross-domain transfer. Furthermore, it enables a direct and stable comparison between the evolved prototype and the semantic anchors of predefined emotional categories, thereby facilitating robust and interpretable classification.
More specifically, we introduce CLIP to build our prototype evolution framework in an encoder–decoder manner, motivated by its inherent cross-modal alignment that naturally bridges the visual and linguistic domains. However, the CLIP image encoder extracts only frame-level features and lacks temporal modeling for videos. To this end, we employ a lightweight sequence-to-sequence encoder with linear time complexity, which efficiently fuses separate frame-level tokens into coherent video representations. Our full approach, termed Prototype Evolution CLIP (PE-CLIP), incorporates cross-attention layers to enable fine-grained cross-modal interaction. These layers allow the model to selectively attend to the visual features most relevant to the emotion prototype, thereby minimizing the risk of diluting critical segments. In this process, the emotion prototype—represented as a set of tokens—undergoes progressive refinement through residual connections that integrate relevant visual cues extracted via cross-attention, effectively evolving it toward distinct affective attributes. To fully exploit the affective potential of this architecture, we design a category-aggregated prompt that enhances inter-category correlations without relying on external knowledge. We further argue that using prototypes with varying emotional tendencies for evolution can compromise the accuracy of the results, whereas a prototype with the correct emotional tendency should theoretically yield a lower loss. Based on this hypothesis, we design bidirectional supervision to leverage this supervisory signal. Last but not least, PE-CLIP is a parameter-efficient framework that balances memory costs and high performance: we convert only the deeper layers of the CLIP text encoder into PE-CLIP decoder layers, ensuring that backpropagation traverses only the final few transformer layers rather than the entire model.
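To make the evolution step concrete, the sketch below illustrates one plausible decoder layer in which prototype tokens query the temporally modeled frame features through cross-attention and are refined via residual updates. It is a minimal PyTorch sketch under our own naming and layer choices, not the exact PE-CLIP implementation.

```python
import torch
import torch.nn as nn

class PrototypeEvolutionLayer(nn.Module):
    """Illustrative decoder layer: prototype tokens attend to visual tokens
    via cross-attention and are refined with residual updates (hypothetical names)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, prototype: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # prototype: (B, P, D) tokens of the evolving emotion prototype
        # video_tokens: (B, T, D) temporally modeled frame features
        attended, _ = self.cross_attn(self.norm_q(prototype),
                                      self.norm_kv(video_tokens),
                                      self.norm_kv(video_tokens))
        prototype = prototype + attended                    # residual integration of visual cues
        prototype = prototype + self.ffn(self.norm_ffn(prototype))
        return prototype

# Stacking several such layers (e.g., repurposed from the top CLIP text-encoder blocks)
# progressively evolves a neutral prototype toward an emotion-specific representation.
```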
Our contributions can be summarized as follows: (1) We propose PE-CLIP, an encoder–decoder framework that progressively bridges the affective gap by evolving a neutral prototype toward emotion-specific representations with rich linguistic knowledge. (2) We design category-aggregated prompts and bidirectional supervision to enhance inter-category correlation and stabilize the prototype evolution pathway. (3) PE-CLIP is highly parameter-efficient while achieving new state-of-the-art performance in emotion recognition for UGVs.
The remainder of this paper is organized as follows. Section 2 reviews related work on video emotion recognition, vision-language pretraining, and prototype-based learning. Section 3 details the proposed PE-CLIP framework. Section 4 describes the experimental setup, datasets, and experimental results. Finally, Section 5 concludes the paper and discusses potential directions for future work.
4. Experiments
4.1. Datasets
Figure 3 shows the distribution of categories in the experimental datasets. Consistent with previous research, we evaluate our method on the Ekman-6 [34] and VideoEmotion-8 [35] datasets. The VideoEmotion-8 dataset contains 1101 videos sourced from video websites (YouTube and Flickr) with annotations of eight emotion classes, averaging 107 s in duration. In the original annotation procedure, each video was labeled by multiple human annotators based on the emotion they perceived the video would evoke in viewers. For example, the disgust category includes videos containing content (e.g., trypophobia-inducing imagery) that elicits psychological or physiological aversion in viewers. Thus, the labels directly operationalize viewer-evoked affect. For a fair comparison with existing works [8,12,31], we conduct ten runs of experiments with various 2:1 train-test splits following prior protocols, reporting the mean accuracy across all runs.
Similarly, the Ekman-6 dataset consists of 1637 YouTube and Flickr videos labeled with six basic emotions, featuring an average duration of 112 s. The construction of Ekman-6 involves a mixture of annotation sources. Some videos are labeled based on the emotion they directly evoke in viewers (e.g., landscape scenes annotated as joy), while others are labeled based on the emotions expressed by individuals appearing in the video (e.g., videos of happy children). Despite this heterogeneity, both types of labels ultimately correspond to emotions that viewers can plausibly experience—either through direct affective response to the content or through empathy with depicted individuals. Therefore, the dataset remains suitable for our viewer-evoked affect task. We follow the 1:1 train-test split configuration used in previous studies and experiment with various random seeds for averaging. The video samples collected from different users inherently ensure subject independence in user-generated video datasets.
Additionally, we employ the MusicVideo-6 dataset [36], which is composed of online-sourced music videos characterized by substantial diversity in geography, language, cultural context, and instrumentation. The available version includes 1979 music video clips, each approximately 30 s long. Following the emotion categories defined in references [37,38,39], the fine-grained emotions conveyed in these videos are categorized into six basic types: exciting, fear, neutral, relaxation, sad, and tension. These categories were originally selected to represent the emotional responses typically elicited in viewers by music videos, as validated by subjective listening tests, thereby grounding them in viewer-evoked affect. We conduct experiments under three different 2:1 train-test splits on the MusicVideo-6 dataset.
4.2. Implementation Details
For all datasets, we first extract frames at 30 fps using FFmpeg (Version 4.4). Each video is then uniformly divided into 64 segments along the temporal axis. During the training phase, we randomly sample one frame per segment, whereas during evaluation we adopt center-frame sampling within each segment to ensure reproducibility. For spatial normalization, frames are resized to a resolution of 224 × 224 while preserving aspect ratios by center-cropping during inference. To mitigate overfitting during training, we implement a light data augmentation strategy consisting of random resized cropping with scale parameters sampled from [0.64, 1] and aspect ratios constrained within [3/4, 4/3]. Notably, all frames within a given video sequence are subjected to identical spatial cropping coordinates to preserve temporal consistency.
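As a reference for the sampling and augmentation pipeline above, the following sketch shows segment-wise frame index selection and video-consistent random resized cropping; the helper names are ours, and the torchvision calls are one possible way to realize the described parameters.

```python
import random
from typing import List

from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def sample_frame_indices(num_frames: int, num_segments: int = 64, training: bool = True) -> List[int]:
    """Pick one frame per temporal segment: a random frame during training,
    the center frame during evaluation."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len), start + 1)
        idx = random.randrange(start, end) if training else (start + end - 1) // 2
        indices.append(min(idx, num_frames - 1))
    return indices

def crop_video_consistently(frames: List[Image.Image], out_size: int = 224) -> List[Image.Image]:
    """Sample one random-resized-crop (scale [0.64, 1], ratio [3/4, 4/3]) and apply the
    same crop coordinates to every frame to preserve temporal consistency."""
    i, j, h, w = T.RandomResizedCrop.get_params(frames[0], scale=(0.64, 1.0), ratio=(3 / 4, 4 / 3))
    return [TF.resized_crop(f, i, j, h, w, [out_size, out_size]) for f in frames]
```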
PE-CLIP uses CLIP ViT-B/32 as the backbone to extract frame-level visual features. The temporal modeling module comprises three randomly initialized Mamba blocks. We use an embedding dimension of 512 with 8 heads in both the Transformer and Mamba components. Specifically, the inner dimension, the projection dimension, and the state dimension in Mamba are set to 384, 16, and 4, respectively. The balancing hyperparameter is set to 0.7, the temperature hyperparameter is set to 0.01, and the starting decoder layer d is set to 8.
For optimization, we utilize the Adam optimizer and train for 50 epochs with a batch size of 16 and a learning rate of 2 × 10⁻⁴. All experiments adopt single-label emotion classification following standard evaluation practices in emotion recognition, with performance quantified through Top-1 accuracy (%).
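For reference, the hyperparameters listed above can be grouped into a single configuration object; only the values come from this section, while the field names are our own shorthand rather than the official code.

```python
from dataclasses import dataclass

@dataclass
class PEClipConfig:
    # Values from Section 4.2; field names are illustrative.
    backbone: str = "CLIP ViT-B/32"
    num_segments: int = 64
    embed_dim: int = 512
    num_heads: int = 8
    mamba_layers: int = 3
    mamba_inner_dim: int = 384
    mamba_proj_dim: int = 16
    mamba_state_dim: int = 4
    balance_weight: float = 0.7     # balancing hyperparameter
    temperature: float = 0.01       # temperature hyperparameter
    start_decoder_layer: int = 8    # starting decoder layer d
    epochs: int = 50
    batch_size: int = 16
    learning_rate: float = 2e-4     # Adam optimizer
```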
4.3. Comparison with SOTAs
4.3.1. Results on VideoEmotion-8 and Ekman-6
In Table 1, we compare PE-CLIP with other state-of-the-art approaches on the VideoEmotion-8 and Ekman-6 datasets. The superiority of our framework becomes particularly evident when compared with previous UGV benchmarks. Our PE-CLIP establishes new state-of-the-art performance by achieving 64.5% and 66.7% on VideoEmotion-8 and Ekman-6, respectively, outperforming previous advanced non-CLIP approaches such as MPEM [40] by significant margins of +6.9% and +5.4%. Notably, CLIP-based models exhibit inherent advantages over conventional visual backbones: even a basic full fine-tuning method without temporal dependency modeling (MM-VEMA) or a simple baseline surpasses conventional visually pretrained methods on each dataset. This suggests that vision-language pretraining provides superior foundational representations for emotion recognition tasks. Note that most previous visually pretrained approaches adopt a multi-modal pipeline to capture modality-complementary features, except IFPN, which uses unimodal features; however, it introduces an auxiliary emotional image dataset to diminish the domain gap.
Among CLIP-based methods, our approach also shows significant improvement. While TE [9] achieves competitive results on UGV benchmarks, our architecture demonstrates +4.7% and +1.9% improvements on VE-8 and EK-6, respectively, showing the outstanding performance of our PE-CLIP for emotion understanding in UGVs.
4.3.2. Results on MusicVideo-6
We also conduct SOTA experiments on the MusicVideo-6 dataset. The experimental results are reported in Table 2. We reproduce the most recent state-of-the-art approaches whose code has been released, following the best settings in the corresponding papers and only adjusting the number of segments so that each approach can make full use of memory during training. Although our PE-CLIP has the advantage of saving training memory, we still experiment with a few-segment condition to promote a fair comparison. We again find that CLIP-based methods demonstrate significant improvement compared with other methods. Meanwhile, CLIP-based methods reduce the reliance on the number of frames, as they do not require successive frames to capture 3D visual features. In contrast, 3D-CNN methods are restricted by the numerous input frames they require, struggling to find a balance between memory and performance; they also require 4 to 6 times longer training time than CLIP-based methods. The non-overlapping standard deviations between our method and the best competitor further corroborate the robustness of our approach. Given the consistent superiority across all datasets and all experimental runs, concerns about false positives due to multiple comparisons are effectively mitigated.
The confusion matrix in Figure 4 demonstrates the classification performance of PE-CLIP across six emotion categories. The model achieved strong overall performance, and these results indicate that high-arousal emotions are effectively distinguished by the model. However, two categories presented notable challenges. The neutral emotion achieved moderate performance, with misclassifications distributed across fear, sad, and tension. This pattern suggests that neutral states may share overlapping features with low-arousal emotions. More significantly, sad demonstrated the lowest accuracy, with 7% of samples misclassified as neutral, 4% as fear, and 5% as tension. This indicates that sad emotional expressions exhibit ambiguous features overlapping with other low-valence, low-arousal states. The observed confusion patterns align with established psychological theories of emotion, where emotions sharing similar positions in the valence-arousal space exhibit higher misclassification rates. The primary confusion between sad and neutral emotions is theoretically consistent, as both represent low-arousal states with potentially overlapping features. Meanwhile, the minimal off-diagonal values between high-valence emotions (exciting) and low-valence emotions (sad, tension) demonstrate clear separation in the feature space.
Comparing the results on VideoEmotion-8/Ekman-6 and MusicVideo-6, we observe a substantial performance gap, with PE-CLIP achieving 91.5% on MusicVideo-6 versus 64.5% and 66.7% on the UGVs benchmarks. This disparity stems from dataset characteristics. MusicVideo-6 contains professionally produced music videos featuring consistent visual aesthetics within each emotion category and professional editing rhythms, whereas VideoEmotion-8 and Ekman-6 consist of noisy user-generated videos with diverse content and inconsistent filming quality. These results underscore the robustness of PE-CLIP to content variability, confirming the effectiveness of our framework for real-world emotion recognition tasks.
4.4. Ablation Study and Analysis
4.4.1. The Effects of Components
We report a few selected ablation studies in Table 3. We evaluate the effectiveness of our major extensions based on the proposed encoder–decoder framework. The component ablation study starts with our baseline, where we strip the temporal modeling module and bidirectional supervision, adopting learnable tokens instead of designed prompt embeddings. It only maintains the basic encoder–decoder framework and prompt-based classifier weights, but does not utilize pretrained weights in the 4-layer decoder, which means all parameters have to be trained except the frozen CLIP image encoder.
We first experiment with the core improvements of our method. The pretrained decoder improves the accuracy by +1.7% on both datasets, which highlights the significance of textual priors for our approach. Then, we introduce a 3-layer Mamba to serve as the temporal modeling module, which improves the baseline by +2.0% and +2.6%, respectively. This result illustrates that MH-Mamba is effective in extracting the temporal dependencies hidden in video clips. Interestingly, the type of prototype and bidirectional supervision demonstrate a complex interplay. As evidenced in the last four rows of Table 3, using a CA prompt instead of learnable tokens in the decoder decreases the top-1 accuracy by 0.1% and 0.5%. When we adopt the bidirectional supervision strategy, however, the CA prompt overtakes learnable tokens, surpassing them by +0.6% on both datasets. This phenomenon not only reflects the effectiveness of the proposed bidirectional supervision but also reveals the potential of “query engineering,” mirroring the evolution of “prompt engineering” in language models. We leave these parts for future exploration. Overall, under the optimal setting, PE-CLIP boosts the top-1 accuracy of the baseline from 61.1% and 88.3% to 64.5% and 91.5% on the two datasets, respectively.
4.4.2. Impact of Starting Decoder Layer
We analyze how different starting decoder layers
affect model performance and memory efficiency. The results are summarized in
Figure 5. As can be seen in the figure, training parameters decrease linearly as the decoder layer starts later. Meanwhile, the performance of the model peaks at
. Therefore, for an optimal trade-off between performance and memory efficiency, it is empirically better to use only a few of the top layers as the PE-CLIP decoder. Notably, the evolution pathway length is defined as
, i.e., the number of decoder layers involved in the iterative prototype refinement. The consistent performance improvement observed as
decreases from 11 to 8 (corresponding to longer evolution pathways) provides qualitative evidence for the effectiveness of the proposed prototype evolution mechanism. This trend suggests that allowing prototypes to evolve through more layers enables finer-grained iterative updates, which in turn enhances the model’s ability to capture nuanced emotional cues in user-generated videos.
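A minimal sketch of this layer-splitting idea is given below, assuming the CLIP text encoder exposes its 12 transformer blocks as a module list; attribute and function names are illustrative and may differ from the released implementation.

```python
import torch.nn as nn

def split_text_encoder(text_blocks: nn.ModuleList, start_layer: int = 8):
    """Freeze blocks [0, start_layer) and keep blocks [start_layer, len) trainable as
    PE-CLIP decoder layers, so backpropagation only traverses the top few layers."""
    frozen, decoder = nn.ModuleList(), nn.ModuleList()
    for idx, block in enumerate(text_blocks):
        if idx < start_layer:
            for p in block.parameters():
                p.requires_grad = False
            frozen.append(block)
        else:
            decoder.append(block)  # later adapted with cross-attention to visual tokens
    return frozen, decoder
```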
4.4.3. Experiment on Various Classifier Weights
We implement this experiment on a variant of PE-CLIP to ensure architectural consistency, as different prompts would introduce prototype variables in the decoder when using the full PE-CLIP. Specifically, we strip the bidirectional supervision from PE-CLIP and adopt learnable tokens instead of a CA prompt to build the variant. This design isolates the impact of classifier-weight generation strategies from other components. We follow the configurations described above during the experiment.
We compare our category-aggregated prompt with three representative approaches: (1) Trainable linear classifier: the conventional approach that learns classifier weights through supervised training, where each weight vector is randomly initialized and optimized via backpropagation; (2) Emotional prompt-based classifier [31]: generates classifier weights by fine-tuning the CLIP text encoder on category-specific text templates (e.g., “this picture evokes a feeling like {classname}”) and using the resulting embeddings; (3) CoOp prompt-based classifier [42]: implements learnable context vectors that replace hand-crafted prompts, optimizing continuous prompt embeddings through gradient descent while keeping CLIP’s parameters frozen.
Table 4 demonstrates the superiority of our category-aggregated prompt across all datasets. First, the trainable linear classifier underperforms prompt-based methods, suggesting that learned weights struggle to capture semantic relationships in CLIP’s cross-modal embedding space. Second, we observe that CoOp’s learnable prompt shows improvements over the emotional prompt but remains inferior to our method, indicating that simple category name substitution lacks sufficient contextual modeling, while our explicit “with/without” contrast mechanism provides more discriminative context. The +0.3–0.6% margins over CoOp highlight the effectiveness of explicit category relationship modeling through negative tagging contrast.
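All prompt-based variants above share the same underlying recipe of deriving classifier weights from text embeddings; the sketch below illustrates that recipe with the open-source OpenAI `clip` package, using a simplified template rather than the exact CA prompt.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

@torch.no_grad()
def build_prompt_classifier(class_names, template="a video that evokes a feeling of {}"):
    """Encode one prompt per emotion category with the frozen CLIP text encoder and
    use the L2-normalized embeddings as classifier weights (illustrative recipe)."""
    model, _ = clip.load("ViT-B/32", device="cpu")
    tokens = clip.tokenize([template.format(c) for c in class_names])
    weights = model.encode_text(tokens).float()
    return weights / weights.norm(dim=-1, keepdim=True)  # (num_classes, D)

# Usage: logits = video_features @ build_prompt_classifier(classes).T / 0.01
```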
4.4.4. Experiment on Temporal Modeling Methods
We compare various temporal modeling methods with the proposed MH-Mamba on both the VideoEmotion-8 and MusicVideo-6 datasets, as shown in Table 5. The results demonstrate the consistent advantage of our approach across different benchmarks.
Our evaluation begins with a 3-layer Transformer and a 3-layer Mamba module for temporal modeling. The Transformer achieves 61.5% on VideoEmotion-8 and 88.2% on MusicVideo-6, falling 1.3% and 1.8% below the non-temporal Identity setup. In contrast, Mamba surpasses the baseline by +0.3% and +0.9%, reaching 63.1% and 90.9%. This indicates that while Transformers may suffer from overfitting, Mamba proves more effective in capturing temporal dependencies.
We further examine traditional recurrent architectures including LSTM and GRU under various configurations. Both methods underperform compared to our approach, with LSTM achieving 60.6% and GRU obtaining 60.1% on VideoEmotion-8. The performance gap also appears on MusicVideo-6, where MH-Mamba achieves the highest accuracy of 90.9%, outperforming LSTM at 88.1% and GRU at 86.5%. These consistent improvements across datasets validate the effectiveness of our MH-Mamba in temporal emotion understanding.
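For completeness, the temporal heads compared in this ablation can be organized as a simple factory; Identity, Transformer, LSTM, and GRU are standard PyTorch modules, whereas Mamba requires an external selective-state-space implementation and is only indicated by a comment. The factory itself is an illustrative sketch, not part of the original code.

```python
import torch.nn as nn

def make_temporal_module(kind: str, dim: int = 512, layers: int = 3, heads: int = 8) -> nn.Module:
    """Return a temporal head operating on (B, T, D) frame features.
    Note: LSTM/GRU forward returns (output, state), so a thin wrapper is needed in practice."""
    if kind == "identity":
        return nn.Identity()
    if kind == "transformer":
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=layers)
    if kind == "lstm":
        return nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
    if kind == "gru":
        return nn.GRU(dim, dim, num_layers=layers, batch_first=True)
    # kind == "mamba": plug in an external Mamba block stack (not shown here)
    raise ValueError(f"unknown temporal module: {kind}")
```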
4.4.5. Experiment on Number of Segments
We investigate the impact of different segment numbers on model performance, with comprehensive results presented in Table 6. The analysis reveals a consistent tendency across both the VideoEmotion-8 and MusicVideo-6 datasets, where performance initially improves with an increasing number of segments before eventually declining.
Our experiments demonstrate that 64 segments achieves optimal performance on both benchmarks, yielding 64.5% on VideoEmotion-8 and 91.5% on MusicVideo-6. When using fewer segments, the model captures insufficient temporal information, resulting in suboptimal performance of 61.3% and 89.9% with 16 segments on the respective datasets. Notably, extending to 128 segments leads to performance degradation, with accuracy dropping to 63.1% on VideoEmotion-8 and 90.8% on MusicVideo-6. This phenomenon suggests that excessive segmentation may introduce redundant information or noise that hinders effective temporal modeling.
4.5. Prompt Sensitivity Analysis
To examine whether the CA prompt is robust to linguistic variations and does not rely on superficial prompt artifacts, we conduct a series of controlled experiments on the MusicVideo-6 dataset. All results, obtained by sampling 16 frames per video, are averaged over three data splits. These experiments demonstrate that the model performance is reasonably stable across various prompt changes. As shown in Table 7, the performance variations caused by different prompt modifications are generally small; the largest drop is 0.6% when all three modifications are combined, while changing the separator even slightly improves accuracy. All variants maintain competitive performance. We therefore conclude that the proposed prompt design does not suffer from severe prompt artifacts.
4.5.1. Prefix Word
We replace the prefix word “without” with “no” and the prefix word “with” with “only” in the prompts. The original prompts containing “with” and “without” achieve the highest performance, while the alternative wordings cause a slight degradation. The relatively small performance gap suggests that the model is not overly sensitive to exact wording, although “without” appears to be a suitable choice.
4.5.2. Separator
We vary the delimiter used to concatenate category tags within prompts: (a) semicolon + space (original), and (b) comma + space. The separator has negligible impact, indicating that the decoder primarily relies on the semantic content rather than the specific punctuation.
4.5.3. Category Order
We reverse the order of all emotion tags while keeping the prefix and separator unchanged. The accuracy drops by 0.5%, which is slightly larger than the variation caused by prefix word changes but still within a small range. This suggests that the CA prompt functions reasonably well across different orders.
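To make the three perturbations explicit, the snippet below generates prompt variants from a simplified stand-in template (a positive tag for the target category and negative tags for the others); the actual CA prompt wording may differ, so this is only an illustration of the sensitivity protocol.

```python
def build_ca_prompt(target, all_classes, with_word="with", without_word="without",
                    sep="; ", reverse=False):
    """Simplified stand-in for the category-aggregated prompt: one positive tag for
    the target category and negative tags for all other categories."""
    others = [c for c in all_classes if c != target]
    if reverse:
        others = list(reversed(others))
    tags = [f"{with_word} {target}"] + [f"{without_word} {c}" for c in others]
    return sep.join(tags)

classes = ["exciting", "fear", "neutral", "relaxation", "sad", "tension"]
print(build_ca_prompt("sad", classes))                                       # original wording
print(build_ca_prompt("sad", classes, with_word="only", without_word="no"))  # prefix-word variant
print(build_ca_prompt("sad", classes, sep=", "))                             # separator variant
print(build_ca_prompt("sad", classes, reverse=True))                         # reversed category order
```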
4.5.4. Length of Prototype
We also investigate the impact of the number of prototype tokens on model performance on the VideoEmotion-8 dataset. In this experiment, we sample 64 frames per video, remove the bidirectional supervision, and adopt learnable prototypes, which provides a suitable setting for studying the effect of prototype token length. As shown in Table 8, we evaluate prototypes with varying lengths: 2, 14, 26, 38, and 50 tokens. The results indicate that using only 2 tokens yields substantially lower performance, while increasing to 14 tokens achieves the best results. Further increasing the number of tokens to 26, 38, or 50 leads to a slight performance drop but remains stable around 63.1% accuracy. This suggests that the model is not highly sensitive to the exact number of prototype tokens as long as a sufficient length is provided.
4.6. Visualization
Figure 6 compares the t-SNE visualizations of feature representations between our PE-CLIP and the baseline model. Our method demonstrates significantly tighter and more distinct cluster formations in the affective feature space, as quantitatively validated by silhouette coefficients of 0.093 for PE-CLIP versus 0.045 for our baseline model. This improvement confirms the superior discriminative capability brought by our prototype evolution design, which better preserves the inherent affective structure and separability across emotional categories.
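The comparison can be reproduced with scikit-learn, assuming the pooled video-level features and their emotion labels are available as NumPy arrays; the snippet is a minimal sketch of the visualization and silhouette computation.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def plot_feature_space(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """t-SNE projection of video-level features; the silhouette coefficient
    (higher = tighter, better-separated clusters) is shown in the title."""
    score = silhouette_score(features, labels)
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(f"{title} (silhouette = {score:.3f})")
    plt.show()
```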
4.7. Memory Efficiency
Although simpler encoder-only approaches [8] can match the computational efficiency of PE-CLIP, they are weaker in performance. Among methods with competitive performance, PE-CLIP demonstrates lower computational requirements. TE [9], which ranks second in performance, introduces CLIP’s visual features as a third modality. Its multi-layer cross-modal temporal enhancement module, trained to enhance inter-modal correlation, requires substantially greater computational resources than any other approach. In contrast, our method achieves state-of-the-art performance while maintaining remarkable parameter efficiency, establishing an optimal balance between computational requirements and model effectiveness.
Although it requires more computation, PE-CLIP retains its advantage in memory efficiency. As illustrated in Table 9, MM-VEMA and ST-Adapter require substantially greater memory capacity when processing the same number of input frames, and this efficiency advantage persists even when PE-CLIP handles longer sequences. Our PE-CLIP adopts a parameter-efficient architecture similar to ST-Adapter [33], preserving the pretrained backbone while training only a small subset of modules with minimal additional parameters. This combination of parameter efficiency and operational effectiveness makes our method well suited to practical video understanding applications.