1. Introduction
Video, as a multi-modal data source integrating visual streams (e.g., facial expressions and gestures), linguistic content (e.g., subtitles and captions), and acoustic information (e.g., human voice and ambient sounds), offers a variety of affective signals for understanding human psychology and social dynamics. To fully exploit this potential, a wide range of studies have focused on emotion recognition based on human reactions under controlled recording conditions. For instance, various multi-modal frameworks [1,2,3,4,5] have been developed and evaluated using professionally recorded videos that exhibit structured consistency, such as those in the CMU-MOSI [6] and IEMOCAP [7] datasets. Although these efforts have advanced multi-modal fusion techniques for emotion recognition, they primarily focus on analyzing the emotional signals displayed by the people in the video, rather than uncovering the emotional responses evoked in the viewers. In contrast, emotion recognition in user-generated videos (UGVs) aims to uncover the subconscious emotional responses elicited by diverse video stimuli—responses that are not tied to on-screen human subjects but rather to the affective impact of the video content on viewers. With the explosive growth of self-media and user-generated content, emotion analysis in UGVs has become an area of significant academic and commercial interest, ultimately enabling real-world applications such as targeted advertising and video recommendation systems [8].
Most existing methods in this field focus on addressing the affective gap. They mainly achieve this by designing a learnable module that performs sequence-to-sequence mapping, a paradigm we term the encoder-only approach. While the state-of-the-art method [9] in this paradigm introduces multi-layer self-attention mechanisms for effective temporal modeling, it suffers from quadratic complexity with respect to the sequence length, imposing significant computational and memory burdens. Furthermore, emotional expression in videos is often characterized by intense affective peaks rather than a linear progression. A standard encoder-only model, which processes all temporal elements continuously, risks diluting the impact of critical segments and introducing noise from neutral or irrelevant frames. To address this, we introduce a decoder paradigm that achieves adaptive attention to crucial emotional segments with linear complexity.
Simultaneously, training multi-layer structures from scratch is challenging due to limited training data and a significant domain discrepancy, leading to suboptimal emotional representation learning. To mitigate similar issues, previous methods have relied on auxiliary data [10,11] or more sophisticated training strategies [12]. However, no prior studies have explored the potential of prior knowledge encoded in models of other modalities. A recent study [13] reveals that transformer models can improve performance on a target modality by exploiting modality-complementary knowledge from an auxiliary modality, even when the auxiliary modality is seemingly irrelevant and without requiring extra auxiliary data. This effect is achieved by repurposing general token mapping and interaction rules across modalities using two identically structured transformers, each trained on a distinct modality. While effective, this approach is constrained by its architectural rigidity and fails to leverage modality-specific knowledge. Inspired by this finding, we propose a novel approach that eliminates the need for paired, identical models, allowing for a more flexible and explicit transfer of modality-specific prior knowledge.
To this end, we introduce a novel decoder architecture that establishes a prototype evolution pathway to progressively bridge the affective gap. As illustrated in Figure 1, unlike previous encoder-only methods, our approach utilizes a multi-layer decoder. This decoder is repurposed from a pretrained text encoder to incorporate its rich prior knowledge for affective understanding, and it works in concert with a neutral prototype. It gradually guides the prototype’s evolution toward distinct affective attributes, dynamically generating emotion-specific representations aligned with the visual content. The decoder thus leverages large-scale linguistic knowledge to ground and rationalize the entire prototype evolution pathway. This reframes the affective gap challenge into a more tractable prototype shift problem within the target domain, effectively avoiding the domain shift instability associated with cross-domain transfer. Furthermore, it enables a direct and stable comparison between the evolved prototype and the semantic anchors of predefined emotional categories, thereby facilitating robust and interpretable classification.
More specifically, we introduce CLIP to build our prototype evolution framework in an encoder–decoder manner, motivated by its inherent cross-modal alignment that naturally bridges the visual and linguistic domains. However, the CLIP image encoder extracts only frame-level features and lacks temporal modeling for videos. To this end, we employ a lightweight sequence-to-sequence encoder with linear time complexity, which efficiently fuses separate frame-level tokens into coherent video representations. Our full approach, termed Prototype Evolution CLIP (PE-CLIP), incorporates cross-attention layers to enable fine-grained cross-modal interaction. These layers allow the model to selectively attend to the visual features most relevant to the emotion prototype, thereby minimizing the risk of diluting critical segments. In this process, the emotion prototype—represented as a set of tokens—undergoes progressive refinement through residual connections that integrate relevant visual cues extracted via cross-attention, effectively evolving it toward distinct affective attributes. To fully exploit the affective potential of this architecture, we design a category-aggregated prompt that enhances inter-category correlations without relying on external knowledge. We further argue that using prototypes with varying emotional tendencies for evolution can compromise the accuracy of the results, whereas a prototype with the correct emotional tendency should theoretically yield a lower loss. Based on this hypothesis, we design bidirectional supervision to leverage this supervisory signal. Last but not least, PE-CLIP is a parameter-efficient framework that balances memory costs and high performance: we convert only the deeper layers of the CLIP text encoder into PE-CLIP decoder layers, ensuring that backpropagation traverses only the final few transformer layers rather than the entire model.
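To make the evolution step concrete, the sketch below illustrates one plausible decoder layer in which prototype tokens query the temporally modeled frame features through cross-attention and are refined via residual updates. It is a minimal PyTorch sketch under our own naming and layer choices, not the exact PE-CLIP implementation.

```python
import torch
import torch.nn as nn

class PrototypeEvolutionLayer(nn.Module):
    """Illustrative decoder layer: prototype tokens attend to visual tokens
    via cross-attention and are refined with residual updates (hypothetical names)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, prototype: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # prototype: (B, P, D) tokens of the evolving emotion prototype
        # video_tokens: (B, T, D) temporally modeled frame features
        attended, _ = self.cross_attn(self.norm_q(prototype),
                                      self.norm_kv(video_tokens),
                                      self.norm_kv(video_tokens))
        prototype = prototype + attended                    # residual integration of visual cues
        prototype = prototype + self.ffn(self.norm_ffn(prototype))
        return prototype

# Stacking several such layers (e.g., repurposed from the top CLIP text-encoder blocks)
# progressively evolves a neutral prototype toward an emotion-specific representation.
```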
Our contributions can be summarized as follows: (1) We propose PE-CLIP, an encoder–decoder framework that progressively bridges the affective gap by evolving a neutral prototype toward emotion-specific representations with rich linguistic knowledge. (2) We design category-aggregated prompts and bidirectional supervision to enhance inter-category correlation and stabilize the prototype evolution pathway. (3) PE-CLIP is highly parameter-efficient while achieving new state-of-the-art performance in emotion recognition for UGVs.
The remainder of this paper is organized as follows. Section 2 reviews related work on video emotion recognition, vision-language pretraining, and prototype-based learning. Section 3 details the proposed PE-CLIP framework. Section 4 describes the experimental setup, datasets, and experimental results. Finally, Section 5 concludes the paper and discusses potential directions for future work.
4. Experiments
4.1. Datasets
Figure 3 shows the distribution of categories in the experimental datasets. Consistent with previous research, we evaluate our method on the Ekman-6 [34] and VideoEmotion-8 [35] datasets. The VideoEmotion-8 dataset contains 1101 videos sourced from video websites (YouTube and Flickr) with annotations of eight emotion classes, averaging 107 s in duration. In the original annotation procedure, each video was labeled by multiple human annotators based on the emotion they perceived the video would evoke in viewers. For example, the disgust category includes videos containing content (e.g., trypophobia-inducing imagery) that elicits psychological or physiological aversion in viewers. Thus, the labels directly operationalize viewer-evoked affect. For a fair comparison with existing works [8,12,31], we conduct ten runs of experiments with various 2:1 train-test splits following prior protocols, reporting the mean accuracy across all runs.
Similarly, the Ekman-6 dataset consists of 1637 YouTube and Flickr videos labeled with six basic emotions, featuring an average duration of 112 s. The construction of Ekman-6 involves a mixture of annotation sources. Some videos are labeled based on the emotion they directly evoke in viewers (e.g., landscape scenes annotated as joy), while others are labeled based on the emotions expressed by individuals appearing in the video (e.g., videos of happy children). Despite this heterogeneity, both types of labels ultimately correspond to emotions that viewers can plausibly experience—either through direct affective response to the content or through empathy with depicted individuals. Therefore, the dataset remains suitable for our viewer-evoked affect task. We follow the 1:1 train-test split configuration used in previous studies and experiment with various random seeds for averaging. The video samples collected from different users inherently ensure subject independence in user-generated video datasets.
Additionally, we employ the MusicVideo-6 dataset [36], which is composed of online-sourced music videos characterized by substantial diversity in geography, language, cultural context, and instrumentation. The available version includes 1979 music video clips, each approximately 30 s long. Following the emotion categories defined in references [37,38,39], the fine-grained emotions conveyed in these videos are categorized into six basic types: exciting, fear, neutral, relaxation, sad, and tension. These categories were originally selected to represent the emotional responses typically elicited in viewers by music videos, as validated by subjective listening tests, thereby grounding them in viewer-evoked affect. We conduct experiments under three different 2:1 train-test splits on the MusicVideo-6 dataset.
4.2. Implementation Details
For all datasets, we first extract frames at 30 fps using FFmpeg (Version 4.4). Each video is then uniformly divided into 64 segments along the temporal axis. During the training phase, we randomly sample one frame per segment, whereas during evaluation we adopt center-frame sampling within each segment to ensure reproducibility. For spatial normalization, frames are resized to a resolution of 224 × 224 while preserving aspect ratios by center-cropping during inference. To mitigate overfitting during training, we implement a light data augmentation strategy consisting of random resized cropping with scale parameters sampled from [0.64, 1] and aspect ratios constrained within [3/4, 4/3]. Notably, all frames within a given video sequence are subjected to identical spatial cropping coordinates to preserve temporal consistency.
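As a reference for the sampling and augmentation pipeline above, the following sketch shows segment-wise frame index selection and video-consistent random resized cropping; the helper names are ours, and the torchvision calls are one possible way to realize the described parameters.

```python
import random
from typing import List

from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def sample_frame_indices(num_frames: int, num_segments: int = 64, training: bool = True) -> List[int]:
    """Pick one frame per temporal segment: a random frame during training,
    the center frame during evaluation."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len), start + 1)
        idx = random.randrange(start, end) if training else (start + end - 1) // 2
        indices.append(min(idx, num_frames - 1))
    return indices

def crop_video_consistently(frames: List[Image.Image], out_size: int = 224) -> List[Image.Image]:
    """Sample one random-resized-crop (scale [0.64, 1], ratio [3/4, 4/3]) and apply the
    same crop coordinates to every frame to preserve temporal consistency."""
    i, j, h, w = T.RandomResizedCrop.get_params(frames[0], scale=(0.64, 1.0), ratio=(3 / 4, 4 / 3))
    return [TF.resized_crop(f, i, j, h, w, [out_size, out_size]) for f in frames]
```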
PE-CLIP uses CLIP ViT-B/32 as the backbone to extract frame-level visual features. The temporal modeling module comprises three randomly initialized Mamba blocks. We use an embedding dimension of 512 with 8 heads in both the Transformer and Mamba components. Specifically, the inner dimension, the projection dimension, and the state dimension in Mamba are set to 384, 16, and 4, respectively. The balancing hyperparameter is set to 0.7, the temperature hyperparameter is set to 0.01, and the starting decoder layer d is set to 8.
For optimization, we utilize the Adam optimizer and train for 50 epochs with a batch size of 16 and a learning rate of 2 × 10⁻⁴. All experiments adopt single-label emotion classification following standard evaluation practices in emotion recognition, with performance quantified through Top-1 accuracy (%).
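For reference, the hyperparameters listed above can be grouped into a single configuration object; only the values come from this section, while the field names are our own shorthand rather than the official code.

```python
from dataclasses import dataclass

@dataclass
class PEClipConfig:
    # Values from Section 4.2; field names are illustrative.
    backbone: str = "CLIP ViT-B/32"
    num_segments: int = 64
    embed_dim: int = 512
    num_heads: int = 8
    mamba_layers: int = 3
    mamba_inner_dim: int = 384
    mamba_proj_dim: int = 16
    mamba_state_dim: int = 4
    balance_weight: float = 0.7     # balancing hyperparameter
    temperature: float = 0.01       # temperature hyperparameter
    start_decoder_layer: int = 8    # starting decoder layer d
    epochs: int = 50
    batch_size: int = 16
    learning_rate: float = 2e-4     # Adam optimizer
```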
4.3. Comparison with SOTAs
4.3.1. Results on VideoEmotion-8 and Ekman-6
In Table 1, we compare PE-CLIP with other state-of-the-art approaches on the VideoEmotion-8 and Ekman-6 datasets. The superiority of our framework becomes particularly evident when compared with previous UGV benchmarks. Our PE-CLIP establishes new state-of-the-art performance by achieving 64.5% and 66.7% on VideoEmotion-8 and Ekman-6, respectively, outperforming previous advanced non-CLIP approaches such as MPEM [40] by significant margins of +6.9% and +5.4%. Notably, CLIP-based models exhibit inherent advantages over conventional visual backbones: even a basic full fine-tuning method without temporal dependency modeling (MM-VEMA) or a simple baseline surpasses conventional visually pretrained methods on each dataset. This suggests that vision-language pretraining provides superior foundational representations for emotion recognition tasks. Note that most previous visually pretrained approaches adopt a multi-modal pipeline to capture modality-complementary features, except IFPN, which uses unimodal features; however, it introduces an auxiliary emotional image dataset to diminish the domain gap.
Among CLIP-based methods, our approach also shows significant improvement. While TE [9] achieves competitive results on UGV benchmarks, our architecture demonstrates +4.7% and +1.9% improvements on VE-8 and EK-6, respectively, showing the outstanding performance of our PE-CLIP for emotion understanding in UGVs.
4.3.2. Results on MusicVideo-6
We also conduct SOTA experiments on the MusicVideo-6 dataset. The experimental results are reported in Table 2. We reproduce the most recent state-of-the-art approaches whose code has been released, following the best settings in the corresponding papers and only adjusting the number of segments so that each approach can make full use of memory during training. Although our PE-CLIP has the advantage of saving training memory, we still experiment with a few-segment condition to promote a fair comparison. We again find that CLIP-based methods demonstrate significant improvement compared with other methods. Meanwhile, CLIP-based methods reduce the reliance on the number of frames, as they do not require successive frames to capture 3D visual features. In contrast, 3D-CNN methods are restricted by the numerous input frames they require, struggling to find a balance between memory and performance; they also require 4 to 6 times longer training time than CLIP-based methods. The non-overlapping standard deviations between our method and the best competitor further corroborate the robustness of our approach. Given the consistent superiority across all datasets and all experimental runs, concerns about false positives due to multiple comparisons are effectively mitigated.
The confusion matrix in Figure 4 demonstrates the classification performance of PE-CLIP across six emotion categories. The model achieved strong overall performance, and these results indicate that high-arousal emotions are effectively distinguished by the model. However, two categories presented notable challenges. The neutral emotion achieved moderate performance, with misclassifications distributed across fear, sad, and tension. This pattern suggests that neutral states may share overlapping features with low-arousal emotions. More significantly, sad demonstrated the lowest accuracy, with 7% of samples misclassified as neutral, 4% as fear, and 5% as tension. This indicates that sad emotional expressions exhibit ambiguous features overlapping with other low-valence, low-arousal states. The observed confusion patterns align with established psychological theories of emotion, where emotions sharing similar positions in the valence-arousal space exhibit higher misclassification rates. The primary confusion between sad and neutral emotions is theoretically consistent, as both represent low-arousal states with potentially overlapping features. Meanwhile, the minimal off-diagonal values between high-valence emotions (exciting) and low-valence emotions (sad, tension) demonstrate clear separation in the feature space.
Comparing the results on VideoEmotion-8/Ekman-6 and MusicVideo-6, we observe a substantial performance gap, with PE-CLIP achieving 91.5% on MusicVideo-6 versus 64.5% and 66.7% on the UGVs benchmarks. This disparity stems from dataset characteristics. MusicVideo-6 contains professionally produced music videos featuring consistent visual aesthetics within each emotion category and professional editing rhythms, whereas VideoEmotion-8 and Ekman-6 consist of noisy user-generated videos with diverse content and inconsistent filming quality. These results underscore the robustness of PE-CLIP to content variability, confirming the effectiveness of our framework for real-world emotion recognition tasks.
4.4. Ablation Study and Analysis
4.4.1. The Effects of Components
We report a few selected ablation studies in Table 3. We evaluate the effectiveness of our major extensions based on the proposed encoder–decoder framework. The component ablation study starts with our baseline, where we strip the temporal modeling module and bidirectional supervision, adopting learnable tokens instead of designed prompt embeddings. It only maintains the basic encoder–decoder framework and prompt-based classifier weights, but does not utilize pretrained weights in the 4-layer decoder, which means all parameters have to be trained except the frozen CLIP image encoder.
We first experiment with the core improvements of our method. The pretrained decoder improves the accuracy by +1.7% on both datasets, which highlights the significance of textual priors for our approach. Then, we introduce a 3-layer Mamba to serve as the temporal modeling module, which improves the baseline by +2.0% and +2.6%, respectively. This result illustrates that MH-Mamba is effective in extracting the temporal dependencies hidden in video clips. Interestingly, the type of prototype and bidirectional supervision demonstrate a complex interplay. As evidenced in the last four rows of Table 3, using a CA prompt instead of learnable tokens in the decoder decreases the top-1 accuracy by 0.1% and 0.5%. When we adopt the bidirectional supervision strategy, however, the CA prompt overtakes learnable tokens, surpassing them by +0.6% on both datasets. This phenomenon not only reflects the effectiveness of the proposed bidirectional supervision but also reveals the potential of “query engineering,” mirroring the evolution of “prompt engineering” in language models. We leave these parts for future exploration. Overall, under the optimal setting, PE-CLIP boosts the top-1 accuracy of the baseline from 61.1% and 88.3% to 64.5% and 91.5% on the two datasets, respectively.
4.4.2. Impact of Starting Decoder Layer
We analyze how different starting decoder layers
affect model performance and memory efficiency. The results are summarized in
Figure 5. As can be seen in the figure, training parameters decrease linearly as the decoder layer starts later. Meanwhile, the performance of the model peaks at
. Therefore, for an optimal trade-off between performance and memory efficiency, it is empirically better to use only a few of the top layers as the PE-CLIP decoder. Notably, the evolution pathway length is defined as
, i.e., the number of decoder layers involved in the iterative prototype refinement. The consistent performance improvement observed as
decreases from 11 to 8 (corresponding to longer evolution pathways) provides qualitative evidence for the effectiveness of the proposed prototype evolution mechanism. This trend suggests that allowing prototypes to evolve through more layers enables finer-grained iterative updates, which in turn enhances the model’s ability to capture nuanced emotional cues in user-generated videos.
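A minimal sketch of this layer-splitting idea is given below, assuming the CLIP text encoder exposes its 12 transformer blocks as a module list; attribute and function names are illustrative and may differ from the released implementation.

```python
import torch.nn as nn

def split_text_encoder(text_blocks: nn.ModuleList, start_layer: int = 8):
    """Freeze blocks [0, start_layer) and keep blocks [start_layer, len) trainable as
    PE-CLIP decoder layers, so backpropagation only traverses the top few layers."""
    frozen, decoder = nn.ModuleList(), nn.ModuleList()
    for idx, block in enumerate(text_blocks):
        if idx < start_layer:
            for p in block.parameters():
                p.requires_grad = False
            frozen.append(block)
        else:
            decoder.append(block)  # later adapted with cross-attention to visual tokens
    return frozen, decoder
```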
4.4.3. Experiment on Various Classifier Weights
We implement this experiment on a variant of PE-CLIP to ensure architectural consistency, as different prompts would introduce prototype variables in the decoder when using the full PE-CLIP. Specifically, we strip the bidirectional supervision from PE-CLIP and adopt learnable tokens instead of a CA prompt to build the variant. This design isolates the impact of classifier-weight generation strategies from other components. We follow the configurations described above during the experiment.
We compare our category-aggregated prompt with three representative approaches: (1) Trainable linear classifier: the conventional approach that learns classifier weights through supervised training, where each weight vector is randomly initialized and optimized via backpropagation; (2) Emotional prompt-based classifier [31]: generates classifier weights by fine-tuning the CLIP text encoder on category-specific text templates (e.g., “this picture evokes a feeling like {classname}”) and using the resulting embeddings; (3) CoOp prompt-based classifier [42]: implements learnable context vectors that replace hand-crafted prompts, optimizing continuous prompt embeddings through gradient descent while keeping CLIP’s parameters frozen.
Table 4 demonstrates the superiority of our category-aggregated prompt across all datasets. First, the trainable linear classifier underperforms prompt-based methods, suggesting that learned weights struggle to capture semantic relationships in CLIP’s cross-modal embedding space. Second, we observe that CoOp’s learnable prompt shows improvements over the emotional prompt but remains inferior to our method, indicating that simple category name substitution lacks sufficient contextual modeling, while our explicit “with/without” contrast mechanism provides more discriminative context. The +0.3–0.6% margins over CoOp highlight the effectiveness of explicit category relationship modeling through negative tagging contrast.
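All prompt-based variants above share the same underlying recipe of deriving classifier weights from text embeddings; the sketch below illustrates that recipe with the open-source OpenAI `clip` package, using a simplified template rather than the exact CA prompt.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

@torch.no_grad()
def build_prompt_classifier(class_names, template="a video that evokes a feeling of {}"):
    """Encode one prompt per emotion category with the frozen CLIP text encoder and
    use the L2-normalized embeddings as classifier weights (illustrative recipe)."""
    model, _ = clip.load("ViT-B/32", device="cpu")
    tokens = clip.tokenize([template.format(c) for c in class_names])
    weights = model.encode_text(tokens).float()
    return weights / weights.norm(dim=-1, keepdim=True)  # (num_classes, D)

# Usage: logits = video_features @ build_prompt_classifier(classes).T / 0.01
```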
4.4.4. Experiment on Temporal Modeling Methods
We compare various temporal modeling methods with the proposed MH-Mamba on both the VideoEmotion-8 and MusicVideo-6 datasets, as shown in Table 5. The results demonstrate the consistent advantage of our approach across different benchmarks.
Our evaluation begins with a 3-layer Transformer and a 3-layer Mamba module for temporal modeling. The Transformer achieves 61.5% on VideoEmotion-8 and 88.2% on MusicVideo-6, falling 1.3% and 1.8% below the non-temporal Identity setup. In contrast, Mamba surpasses the baseline by +0.3% and +0.9%, reaching 63.1% and 90.9%. This indicates that while Transformers may suffer from overfitting, Mamba proves more effective in capturing temporal dependencies.
We further examine traditional recurrent architectures including LSTM and GRU under various configurations. Both methods underperform compared to our approach, with LSTM achieving 60.6% and GRU obtaining 60.1% on VideoEmotion-8. The performance gap also appears on MusicVideo-6, where MH-Mamba achieves the highest accuracy of 90.9%, outperforming LSTM at 88.1% and GRU at 86.5%. These consistent improvements across datasets validate the effectiveness of our MH-Mamba in temporal emotion understanding.
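For completeness, the temporal heads compared in this ablation can be organized as a simple factory; Identity, Transformer, LSTM, and GRU are standard PyTorch modules, whereas Mamba requires an external selective-state-space implementation and is only indicated by a comment. The factory itself is an illustrative sketch, not part of the original code.

```python
import torch.nn as nn

def make_temporal_module(kind: str, dim: int = 512, layers: int = 3, heads: int = 8) -> nn.Module:
    """Return a temporal head operating on (B, T, D) frame features.
    Note: LSTM/GRU forward returns (output, state), so a thin wrapper is needed in practice."""
    if kind == "identity":
        return nn.Identity()
    if kind == "transformer":
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=layers)
    if kind == "lstm":
        return nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
    if kind == "gru":
        return nn.GRU(dim, dim, num_layers=layers, batch_first=True)
    # kind == "mamba": plug in an external Mamba block stack (not shown here)
    raise ValueError(f"unknown temporal module: {kind}")
```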
4.4.5. Experiment on Number of Segments
We investigate the impact of different segment numbers on model performance, with comprehensive results presented in Table 6. The analysis reveals a consistent tendency across both the VideoEmotion-8 and MusicVideo-6 datasets, where performance initially improves with an increasing number of segments before eventually declining.
Our experiments demonstrate that 64 segments achieves optimal performance on both benchmarks, yielding 64.5% on VideoEmotion-8 and 91.5% on MusicVideo-6. When using fewer segments, the model captures insufficient temporal information, resulting in suboptimal performance of 61.3% and 89.9% with 16 segments on the respective datasets. Notably, extending to 128 segments leads to performance degradation, with accuracy dropping to 63.1% on VideoEmotion-8 and 90.8% on MusicVideo-6. This phenomenon suggests that excessive segmentation may introduce redundant information or noise that hinders effective temporal modeling.
4.5. Prompt Sensitivity Analysis
To examine whether the CA prompt is robust to linguistic variations and does not rely on superficial prompt artifacts, we conduct a series of controlled experiments on the MusicVideo-6 dataset. All results, obtained by sampling 16 frames per video, are averaged over three data splits. These experiments demonstrate that the model performance is reasonably stable across various prompt changes. As shown in Table 7, the performance variations caused by different prompt modifications are generally small; the largest drop is 0.6% when all three modifications are combined, while changing the separator even slightly improves accuracy. All variants maintain competitive performance. We therefore conclude that the proposed prompt design does not suffer from severe prompt artifacts.
4.5.1. Prefix Word
We replace the prefix word “without” with “no” and the prefix word “with” with “only” in the prompts. The original prompts containing “with” and “without” achieve the highest performance, while the alternative wordings cause a slight degradation. The relatively small performance gap suggests that the model is not overly sensitive to exact wording, although “without” appears to be a suitable choice.
4.5.2. Separator
We vary the delimiter used to concatenate category tags within prompts: (a) semicolon + space (original), and (b) comma + space. The separator has negligible impact, indicating that the decoder primarily relies on the semantic content rather than the specific punctuation.
4.5.3. Category Order
We reverse the order of all emotion tags while keeping the prefix and separator unchanged. The accuracy drops by 0.5%, which is slightly larger than the variation caused by prefix word changes but still within a small range. This suggests that the CA prompt functions reasonably well across different orders.
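To make the three perturbations explicit, the snippet below generates prompt variants from a simplified stand-in template (a positive tag for the target category and negative tags for the others); the actual CA prompt wording may differ, so this is only an illustration of the sensitivity protocol.

```python
def build_ca_prompt(target, all_classes, with_word="with", without_word="without",
                    sep="; ", reverse=False):
    """Simplified stand-in for the category-aggregated prompt: one positive tag for
    the target category and negative tags for all other categories."""
    others = [c for c in all_classes if c != target]
    if reverse:
        others = list(reversed(others))
    tags = [f"{with_word} {target}"] + [f"{without_word} {c}" for c in others]
    return sep.join(tags)

classes = ["exciting", "fear", "neutral", "relaxation", "sad", "tension"]
print(build_ca_prompt("sad", classes))                                       # original wording
print(build_ca_prompt("sad", classes, with_word="only", without_word="no"))  # prefix-word variant
print(build_ca_prompt("sad", classes, sep=", "))                             # separator variant
print(build_ca_prompt("sad", classes, reverse=True))                         # reversed category order
```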
4.5.4. Length of Prototype
We also investigate the impact of the number of prototype tokens on model performance on the VideoEmotion-8 dataset. In this experiment, we sample 64 frames per video, remove the bidirectional supervision, and adopt learnable prototypes, which provides a suitable setting for studying the effect of prototype token length. As shown in Table 8, we evaluate prototypes with varying lengths: 2, 14, 26, 38, and 50 tokens. The results indicate that using only 2 tokens yields substantially lower performance, while increasing to 14 tokens achieves the best results. Further increasing the number of tokens to 26, 38, or 50 leads to a slight performance drop but remains stable around 63.1% accuracy. This suggests that the model is not highly sensitive to the exact number of prototype tokens as long as a sufficient length is provided.
4.6. Visualization
Figure 6 compares the t-SNE visualizations of feature representations between our PE-CLIP and the baseline model. Our method demonstrates significantly tighter and more distinct cluster formations in the affective feature space, as quantitatively validated by silhouette coefficients of 0.093 for PE-CLIP versus 0.045 for our baseline model. This improvement confirms the superior discriminative capability brought by our prototype evolution design, which better preserves the inherent affective structure and separability across emotional categories.
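The comparison can be reproduced with scikit-learn, assuming the pooled video-level features and their emotion labels are available as NumPy arrays; the snippet is a minimal sketch of the visualization and silhouette computation.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def plot_feature_space(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """t-SNE projection of video-level features; the silhouette coefficient
    (higher = tighter, better-separated clusters) is shown in the title."""
    score = silhouette_score(features, labels)
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(f"{title} (silhouette = {score:.3f})")
    plt.show()
```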
4.7. Memory Efficiency
Although simpler encoder-only approaches [8] can match the computational efficiency of PE-CLIP, they are weaker in performance. Among methods with competitive performance, PE-CLIP demonstrates lower computational requirements. TE [9], which ranks second in performance, introduces CLIP’s visual features as a third modality. Its multi-layer cross-modal temporal enhancement module, trained to enhance inter-modal correlation, requires substantially greater computational resources than any other approach. In contrast, our method achieves state-of-the-art performance while maintaining remarkable parameter efficiency, establishing an optimal balance between computational requirements and model effectiveness.
Although it requires more computation, PE-CLIP retains its advantage in memory efficiency. As illustrated in Table 9, MM-VEMA and ST-Adapter require substantially greater memory capacity when processing the same number of input frames, and this efficiency advantage persists even when PE-CLIP handles longer sequences. Our PE-CLIP adopts a parameter-efficient architecture similar to ST-Adapter [33], preserving the pretrained backbone while training only a small subset of modules with minimal additional parameters. This combination of parameter efficiency and operational effectiveness makes our method well suited to practical video understanding applications.