1. Introduction
Emotional cues and facial expressions play a fundamental role in human communication, as they modulate the dynamics of social interaction and directly influence individual behaviour [
1]. The ability to perceive and interpret these affective signals is therefore essential for facilitating natural exchanges. Importantly, human emotions are not communicated solely through facial expressions; they can also be inferred from vocal characteristics, prosody, body movements, gestures, and even the semantic content of speech [
2], underscoring the inherently multimodal nature of affective communication.
In contrast to humans, who possess an innate capacity for affect recognition, machines do not naturally exhibit this ability, although they can be trained to do so [
3,
4]. Equipping computational systems with emotion-recognition capabilities can significantly enhance the naturalness and fluidity of human–machine interactions (HMIs), improving user experience [
5]. This is one of the motivations underlying the development of different expression-recognition tasks.
Depending on the modalities employed, this problem takes different forms: Facial Expression Recognition (FER) when only visual cues are used, Speech Emotion Recognition (SER) when relying on audio signals, Text Emotion Recognition (TER) when analysing textual content, and Multimodal Emotion Recognition (MER) when multiple modalities are considered jointly [
2]. Most existing studies focus on the six basic emotions, anger, happiness, disgust, fear, surprise, and sadness, introduced by Ekman in [
3], typically supplemented with the neutral state [
6,
7]. However, such a formulation often fails to capture the subtleties and complexities of real-world affective displays [
8].
Such a limitation often arises from the datasets commonly used in these tasks, which frequently annotate only the previously mentioned basic categories, such as MELD [
9], AffectNet [
10], and RAF-DB [
11]. Other datasets include a few additional categories (e.g.,
calm in RAVDESS [
12], or
excitement and
frustration in IEMOCAP [
13]), yet these do not fully address the compositional nature of human emotions. Even when intensity annotations are provided, they are often discarded in favour of a standard multi-class classification formulation.
In this context, the BlEmoRe dataset [
14] was introduced to encourage the development of models capable of recognising not only discrete emotions but also blended emotional states and their relative salience. This work presents a novel attention-based architecture for audio–video integration, along with a comprehensive analysis of the performance of the feature encoders provided with the BlEmoRe dataset.
This paper is structured as follows.
Section 2 reviews relevant literature on feature representations for audio and visual modalities, multimodal fusion strategies, and common limitations of existing affective computing datasets.
Section 3 outlines the characteristics of the BlEmoRe dataset and the specific considerations relevant to this work.
Section 4 details the proposed methodology, including the temporal pooling mechanisms, multimodal fusion architectures, loss functions, evaluation metrics, and experimental setup. Finally, the results obtained are presented and analyzed in
Section 5, while
Section 6 summarizes the main findings and discusses potential directions for future research.
The main contributions of this work are as follows:
A novel multimodal fusion architecture that surpasses the baseline introduced in the original BlEmoRe paper for both presence and salience prediction in blended emotion recognition.
A comprehensive analysis of the feature encoders provided in the BlEmoRe dataset, including an evaluation of their unimodal performance with different temporal pooling strategies.
The design and application of a combined loss function tailored for salience-aware blended emotion recognition.
3. Dataset
The BlEmoRe dataset [
14] is an audiovisual dataset consisting of video recordings in which professional actors are instructed to portray either single or blended emotions through facial expressions, body movements, and non-linguistic sounds. The database comprises 58 actors and a total of 3050 video samples. Among these, 1390 clips depict one of six discrete emotional states:
anger,
disgust,
fear,
happiness,
sadness, or
neutral. The remaining 1660 clips represent pairwise blends between two non-neutral emotions.
For blended samples, the dataset provides additional annotations specifying the relative salience, or proportions, of each constituent emotion. Three salience configurations are defined: (i) 50/50, indicating equal prominence of both emotions; (ii) 70/30, where the first emotion in the label is more dominant than the second; and (iii) 30/70, where the second emotion is more dominant. It is important to emphasize that actors were instructed to express blended emotions simultaneously rather than sequentially, such that the entire clip conveys a stable emotional mixture instead of a temporal transition between emotions.
The dataset follows a subject-independent train/test split, comprising 43 actors for training and 15 actors for testing, although the test split is currently private. The training split is divided into five folders, each containing approximately 500 videos, with no overlapping identities between folders and a balanced distribution of emotions. Of these five folders, one is used for validation, while the remaining four are used for training. In
Figure 1, several representative frames from the dataset are presented. The videos were recorded at 50 frames per second with a resolution of 1920 × 1080 pixels, and their duration ranges from 1 to 30 s, although the majority fall between 2 and 10 s. Finally, the dataset creators also provide pre-extracted features obtained from a set of visual, audio, and audiovisual encoders as a baseline.
4. Methodology
In this section, the data preprocessing is first described in detail. In addition, the unimodal and multimodal architectures proposed are introduced. Finally, the implemented loss function, the evaluation metrics, and the experimental setup are presented to ensure full reproducibility.
4.1. Data Preprocessing
This work focuses on multimodal fusion strategies and model architectures. Therefore, only the pre-extracted features were used rather than training feature encoders end-to-end. The selection of encoders was based on the unimodal benchmark results reported in [
14]. Specifically, the two best-performing encoders for each modality in the unimodal evaluation and a single multimodal encoder were selected.
For the visual modality, the selected encoders were ImageBind [
35] and VideoMAE V2 [
36]. ImageBind is a model developed by Meta that learns a joint embedding space across six different modalities; in this work, however, only its image encoder was used. VideoMAE V2 is a video-based encoder designed to operate directly on the temporal dimension by processing spatio-temporal 3D patches. Specifically, the ViT-B/16 variant was employed, which produces one feature vector every 16 frames, thereby reducing temporal resolution while preserving coarse temporal information.
For the audio modality, WavLM [
27] and HuBERT [
26] were used. WavLM is a self-supervised learning model developed by Microsoft and trained on more than 90,000 h of audio for speech representation learning. HuBERT, developed by Meta, addresses the same task and is trained using a masked prediction objective over k-means cluster assignments derived from the input audio.
In addition, HiCMAE [
43], a multimodal encoder designed for audiovisual emotion recognition, was also considered. This model employs an asymmetric encoder-decoder architecture in which audio and visual encoders are jointly learned and shared across modality-specific decoders. It was included due to its superior performance in salience prediction as reported in the original benchmark.
Consequently, the experimental setup considers five encoders in total: two visual encoders (ImageBind and VideoMAE V2), two audio encoders (WavLM and HuBERT), and one multimodal encoder (HiCMAE). The five selected encoders represent well-established models that have achieved state-of-the-art performance across multiple affective computing benchmarks. As reported in [
28], these encoders consistently rank among the top performers in unimodal evaluations across diverse databases. Selecting these models therefore ensures that the multimodal fusion process is built upon high-quality and reliable feature representations, rather than weaker or unstable embeddings that could hinder downstream performance.
Since the original samples consist of videos and not all encoders are designed to handle temporal information, a preprocessing step is required. For each video, a sequence of $N$ feature vectors was extracted, each one with dimensionality $D$. Because the videos vary in duration, the resulting value of $N$ differs across samples. To address this issue, a temporal pooling procedure was applied.
Let $L$ denote a fixed target sequence length. Three cases were considered. When $N > L$, the sequence was uniformly divided into segments of size $N/L$, and a random frame was selected from each segment as its single representative frame. Conversely, if $N < L$, zero-padding was applied to extend the sequence until the target length was reached. Finally, if $L$ matches the original sequence length $N$, no temporal preprocessing is applied to the input. Zero-padding was chosen over alternatives such as extrapolation because it preserves the original temporal structure without introducing synthetic data, ensuring that the architecture’s attention mechanisms focus exclusively on the real dynamics of the sequence.
This standardization step was particularly important because $N$ depends not only on the video duration but also on the encoder used. Among the five encoders utilized, ImageBind’s output frequency is frame-based (one vector per frame), while WavLM and HuBERT extract features at a temporal resolution of 20 ms. In contrast, VideoMAE V2 and HiCMAE are inherently designed to process video clips and therefore generate significantly fewer feature vectors, typically around 16 and 4 per video, respectively, which is more than an order of magnitude fewer than those produced by the other encoders.
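The three cases of the temporal standardization step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `standardize_length`, the integer segment boundaries, and the returned boolean mask are assumptions made for illustration, and plain Python lists stand in for feature tensors.

```python
import random

def standardize_length(seq, target_len, rng=None):
    """Return (frames, mask) with exactly target_len entries.

    seq: list of feature vectors (each a list of floats).
    mask[i] is True for real frames and False for zero-padding.
    """
    rng = rng or random.Random(0)
    n = len(seq)
    dim = len(seq[0])
    if n > target_len:
        # Split [0, n) into target_len contiguous segments and draw
        # one random frame from each segment.
        bounds = [i * n // target_len for i in range(target_len + 1)]
        picked = [seq[rng.randrange(bounds[i], bounds[i + 1])]
                  for i in range(target_len)]
        return picked, [True] * target_len
    # n <= target_len: keep every frame and zero-pad the tail
    # (this is a no-op when n == target_len).
    pad = target_len - n
    padded = list(seq) + [[0.0] * dim for _ in range(pad)]
    return padded, [True] * n + [False] * pad
```

Returning the mask alongside the frames is a design choice of this sketch: it lets later attention layers ignore the padded positions.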
4.2. Architecture
Two distinct architectures are proposed in this work: one designed to extract temporal information from each sample, and another dedicated to multimodal fusion.
4.2.1. Unimodal Branch
The first architecture is illustrated in
Figure 2. This module consists of two components: a temporal pooling module and a final multilayer perceptron (MLP) that projects the encoder-specific feature dimension $D$ to a fixed latent dimension $d$.
Several temporal pooling strategies were explored. As a baseline, the approach proposed in the original work [
14] was adopted. This method performs a
Statistical Aggregation of all feature vectors by computing descriptive statistics for each feature dimension. Specifically, five percentiles (10th, 25th, 50th, 75th, and 90th) are concatenated with the mean and standard deviation. This results in a single vector representation with a dimensionality of seven times the feature dimension $D$.
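As an illustration of the Statistical Aggregation baseline, the sketch below computes the seven statistics per feature dimension in plain Python. The linear-interpolation percentile, the population standard deviation, and the per-dimension ordering of the output are assumptions; the source only states which statistics are concatenated.

```python
import math

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def statistical_aggregation(seq):
    """Map a sequence of D-dimensional vectors to a single 7*D vector."""
    out = []
    for dim_vals in zip(*seq):  # iterate over feature dimensions
        vals = sorted(dim_vals)
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        out.extend(percentile(vals, q) for q in (10, 25, 50, 75, 90))
        out.extend([mean, std])
    return out
```

Because the output size depends only on the feature dimension, this representation is independent of the sequence length and needs no padding.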
In addition to this baseline, two temporal modeling approaches were employed: a Bidirectional LSTM (Bi-LSTM) and an attention-based architecture. The LSTM approach employed a Bi-LSTM module to model the temporal sequence of each sample.
The latter approach, which is illustrated in
Figure 2, consists of two stages. First, temporal dependencies are modeled using self-attention through Transformer encoder layers. Second, the temporal sequence is condensed into a single representation vector. Two alternative strategies were explored for this aggregation. The first strategy is inspired by the method used in BERT [
44] and employs a [CLS] token. In this case, a classification token with the latent dimensionality $d$ is prepended to the input sequence of length $L$. After the Transformer encoder processes the sequence, the final representation corresponding to the [CLS] token is used as a global representation of the entire sequence. The second strategy, referred to as
Attention Pooling, computes temporal weights for each position in the sequence. These weights are obtained by passing the sequence through a linear projection layer of dimension $d \times 1$. The resulting weights are then used to compute the final sample representation via a weighted sum across all temporal positions. These two strategies are referred to collectively as
Transformer Pooling in
Figure 2, although only the
Attention Pooling variant is illustrated. In the
CLS approach, the linear layer and multiplication are omitted, and the representation of the [CLS] token is directly extracted as the sequence-level feature.
For both attention-based approaches, an attention mask was applied to the padded positions introduced during the temporal standardization step. This ensures that padded positions do not contribute to the attention computation.
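The Attention Pooling step with a padding mask can be sketched as follows. This is a simplified illustration rather than the authors' code: the softmax normalisation of the scores is an assumption (the source only states that the weights drive a weighted sum), and `w` and `b` stand for the parameters of a hypothetical linear layer mapping each position to one scalar. At least one unmasked position is assumed.

```python
import math

def attention_pool(seq, mask, w, b=0.0):
    """Weighted sum over the real positions of seq.

    seq: L vectors of size d; mask: L booleans (False = padding);
    w, b: weights and bias of a hypothetical d -> 1 scoring layer.
    """
    # Padded positions receive a score of -inf so their weight is zero.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b if keep else -math.inf
              for x, keep in zip(seq, mask)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(a * x[j] for a, x in zip(weights, seq))
              for j in range(len(w))]
    return pooled, weights
```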
4.2.2. Multimodal Architecture
In
Figure 3, the multimodal architecture proposed for this task is illustrated. Given an input composed of
k modalities, the architecture supports three different fusion strategies.
The first strategy, Concatenation, builds directly on the previously described unimodal branches. In this approach, the original classification head of the unimodal branches is eliminated and the features generated for each modality are concatenated into a single vector, which is then used to produce the final prediction.
The second strategy, referred to as Attention, departs from the unimodal architecture and instead performs fusion at the sequence level. Prior to fusion, each sequence is projected to a fixed dimensionality using a linear layer. A modality embedding is added to each sequence, via element-wise addition, and thus without altering the overall dimensionality. These embeddings play a similar role to positional encodings in typical transformer architectures but encode the modality of each feature rather than its temporal position. This allows the model to distinguish between modalities when learning cross-modal interactions, which is particularly important since informative patterns in different modalities may occur at different temporal positions.
After the modality embeddings are added, the sequences from all modalities are concatenated along the temporal dimension and processed by a set of transformer encoder layers. The resulting representation is then aggregated using the same attention pooling mechanism described for the unimodal branch, which passes the sequence through a linear layer and then uses the resulting weights to compute a weighted mean, emphasizing the most informative parts of the sequence.
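The input preparation for the Attention fusion strategy can be sketched as below. It covers only the modality tagging and the concatenation along the temporal axis; the Transformer layers and pooling are omitted. The function name `fuse_sequences` is hypothetical, the sequences are assumed to be already projected to a common dimensionality, and the embeddings are fixed lists here, whereas in a trained model they would be learnable parameters.

```python
def fuse_sequences(sequences, masks, modality_embeddings):
    """Add a per-modality embedding to every position, then join all
    modalities along the temporal axis.

    sequences: one list of d-dimensional vectors per modality.
    masks: one list of booleans per modality (False = padding).
    modality_embeddings: one d-dimensional vector per modality.
    """
    fused, fused_mask = [], []
    for seq, mask, emb in zip(sequences, masks, modality_embeddings):
        for x, keep in zip(seq, mask):
            # Element-wise addition keeps the dimensionality unchanged.
            fused.append([xi + ei for xi, ei in zip(x, emb)])
            fused_mask.append(keep)
    return fused, fused_mask
```

The fused sequence has a length equal to the sum of the per-modality lengths, and the joined mask allows padded positions from every modality to be ignored by the attention layers.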
Finally, the Mixed fusion approach combines the previous two strategies. Specifically, it concatenates the representations produced by the concatenation-based and attention-based branches, and applies an additional attention mechanism to emphasize the most informative features from each branch before producing the final output.
In all three approaches, the final output vector is divided by a temperature parameter
T before leaving the model. This parameter is a single trainable scalar that rescales the logits without altering the relation between them. Its role is to control the prediction scale, effectively normalizing the threshold used to compute the salience of each emotion. This architectural decision was motivated by the findings reported in [
45], where multiple calibration methods were evaluated on image and document classification tasks to improve the reliability of predicted probability distributions. Among these techniques, temperature scaling demonstrated the highest effectiveness and simplicity, motivating its use in this work to calibrate the confidence of the model’s final output probabilities.
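The effect of dividing the logits by a temperature can be illustrated with a small sketch. The softmax output layer is an assumption made for illustration, and the temperature is a plain argument here, whereas in the described architecture it is a trainable scalar.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_probs(logits, temperature):
    """Divide the logits by a positive temperature before the softmax."""
    return softmax([z / temperature for z in logits])

sharp = scaled_probs([2.0, 1.0, 0.0], 1.0)
soft = scaled_probs([2.0, 1.0, 0.0], 4.0)
```

A temperature above 1 flattens the distribution and a temperature below 1 sharpens it, while the ranking of the classes is unchanged in both cases. This is why the temperature affects the thresholds applied to the probabilities but not the order of the predicted emotions.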
It is important to note that the parameters of the unimodal branches remained frozen during the training of the multimodal architecture. Additionally, when employing the Mixed approach, the parameters of both the Concatenation and Attention branches were also kept frozen. By freezing the individual branches, the total number of learnable parameters is substantially reduced, allowing the model to focus its capacity on learning the fusion layers and thereby improving performance.
4.3. Evaluation Metrics
In order to compare with the original benchmarks, the two evaluation metrics used are Presence Accuracy ($\mathrm{ACC}_{\mathrm{presence}}$) and Salience Accuracy ($\mathrm{ACC}_{\mathrm{salience}}$).
As originally defined in [
14],
$\mathrm{ACC}_{\mathrm{presence}}$ measures the correctness of emotion detection without considering salience information. A prediction is considered correct only if all present emotions are correctly identified, while avoiding both false negatives (e.g., missing an emotion in a blend) and false positives (e.g., predicting blends when only one emotion is portrayed).
$\mathrm{ACC}_{\mathrm{salience}}$ extends $\mathrm{ACC}_{\mathrm{presence}}$ by additionally evaluating the relative prominence of emotions in blended expressions. This metric assesses whether the predicted salience rankings correctly reflect the ground truth distribution.
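To formalize these metrics, both the prediction $\mathbf{p}_k$ and the ground-truth label $\mathbf{d}_k$ of the $k$-th sample are represented as six-dimensional vectors such that:
$$\mathbf{p}_k = \sum_{i=1}^{6} a_{k,i}\,\mathbf{e}_i, \qquad \mathbf{d}_k = \sum_{i=1}^{6} b_{k,i}\,\mathbf{e}_i,$$
where $\mathbf{e}_1, \dots, \mathbf{e}_5$ correspond to the basic emotions (anger, disgust, fear, happiness, sadness), $\mathbf{e}_6$ represents the neutral state, and the coefficients $a_{k,i}$ and $b_{k,i}$ denote relative salience proportions.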
Two auxiliary functions are additionally defined. The first,
, standardizes predictions by discarding salience information, mapping any blended label to its presence–only equivalent. Specifically, it converts any blended configuration into a unified
representation for presence evaluation. In this expression,
denotes the salience value associated with a given emotion, constrained to
. Conversely,
acts as an indicator function, returning 1 when the predicted and target vectors match exactly and 0 otherwise. In these definitions,
represents a vector in
, either a prediction or a ground-truth label for a given sample, while
x denotes the norm of such a vector, yielding a scalar in the range
.
Using these definitions, the two metrics are computed as:
$\mathrm{ACC}_{\mathrm{presence}}$ therefore provides a general evaluation criterion for blended emotion recognition, enabling comparison with methods trained on alternative datasets. In contrast, $\mathrm{ACC}_{\mathrm{salience}}$ captures the salience dimension unique to this blended emotion dataset, distinguishing this benchmark from conventional multi-class or multi-label emotion recognition tasks.
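The two accuracies can be illustrated with a short sketch that follows the verbal definitions: presence is correct when the set of predicted emotions equals the set of portrayed emotions, and salience is correct when the salience proportions also match exactly. The dictionary label format and the averaging over all samples (rather than over blended samples only) are assumptions of this sketch, not details confirmed by the source.

```python
def presence_set(label):
    """Emotions with non-zero salience, ignoring the proportions."""
    return frozenset(e for e, s in label.items() if s > 0)

def presence_correct(pred, target):
    return presence_set(pred) == presence_set(target)

def salience_correct(pred, target, tol=1e-9):
    if not presence_correct(pred, target):
        return False
    return all(abs(pred[e] - target[e]) < tol for e in presence_set(target))

def accuracies(preds, targets):
    n = len(targets)
    acc_presence = sum(presence_correct(p, t) for p, t in zip(preds, targets)) / n
    acc_salience = sum(salience_correct(p, t) for p, t in zip(preds, targets)) / n
    return acc_presence, acc_salience
```

For example, predicting a 70/30 anger and fear blend for a 30/70 target counts as correct for presence but incorrect for salience.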
Finally, and only for training-monitoring purposes, a combined score metric was employed, defined as
4.4. Loss Function
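To train the models, a combination of three loss functions was employed, defined as
$$\mathcal{L} = \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{BCE}}\,\mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{rank}}\,\mathcal{L}_{\mathrm{rank}},$$
where each term targets a different aspect of the prediction task. $\mathcal{L}_{\mathrm{KL}}$ denotes the Kullback–Leibler (KL) divergence between the predicted and ground-truth salience distributions, $\mathcal{L}_{\mathrm{BCE}}$ is the binary cross-entropy (BCE) loss applied to the emotion-presence logits, and $\mathcal{L}_{\mathrm{rank}}$ is a margin-ranking (rank) loss encouraging correct salience ordering in blended expressions. The coefficients $\lambda_{\mathrm{KL}}$, $\lambda_{\mathrm{BCE}}$, and $\lambda_{\mathrm{rank}}$ control the relative contribution of each component; their values are shown in
Table 1.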
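The ranking loss $\mathcal{L}_{\mathrm{rank}}$ was applied only to blended samples with unequal salience ratios (70/30 and 30/70), and is given by
$$\mathcal{L}_{\mathrm{rank}} = \max\bigl(0,\; m - (s_{\mathrm{dom}} - s_{\mathrm{sub}})\bigr),$$
where $s_{\mathrm{dom}}$ and $s_{\mathrm{sub}}$ are the predicted scores for the dominant and subdominant emotions, respectively, and $m$ is the enforced margin.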
By leveraging these three complementary loss functions, multiple objectives can be optimized simultaneously. The BCE loss improves the independent detection of each emotion, thereby enhancing presence accuracy, while the KL divergence term, also employed in the original baseline, preserves the soft distribution associated with blended emotional expressions. Finally, the inclusion of the ranking loss explicitly enforces the correct ordering between dominant and secondary emotions, which is important for the asymmetric mixtures present in the dataset.
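A minimal sketch of how the three terms might be combined for a single sample is given below. Several details are assumptions made for illustration: the hinge form of the margin-ranking term, the use of predicted probabilities as ranking scores, the default weights and margin (the actual values are those of Table 1), and the argument names.

```python
import math

def kl_divergence(target, probs, eps=1e-8):
    """KL(target || probs) over the emotion classes."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, probs) if t > 0)

def bce_with_logits(presence, logits):
    """Numerically stable binary cross-entropy on raw logits."""
    total = 0.0
    for y, z in zip(presence, logits):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def rank_loss(s_dom, s_sub, margin):
    """Hinge penalty unless the dominant score leads by at least margin."""
    return max(0.0, margin - (s_dom - s_sub))

def combined_loss(target, probs, logits, dominant=None, subdominant=None,
                  lambdas=(1.0, 1.0, 1.0), margin=0.1):
    presence = [1.0 if t > 0 else 0.0 for t in target]
    l_kl = kl_divergence(target, probs)
    l_bce = bce_with_logits(presence, logits)
    l_rank = 0.0
    # The ranking term is only active for 70/30 and 30/70 blends.
    if dominant is not None and subdominant is not None:
        l_rank = rank_loss(probs[dominant], probs[subdominant], margin)
    lam_kl, lam_bce, lam_rank = lambdas
    return lam_kl * l_kl + lam_bce * l_bce + lam_rank * l_rank
```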
Finally, label smoothing, originally introduced in [
46] to facilitate the training of the Inception architecture by replacing one-hot vectors with softened target distributions, was applied to the target labels. This technique reduces overconfidence by subtracting a small factor ($\epsilon$) from the ground-truth class and distributing it among the remaining classes, thereby improving the stability and generalization of the model.
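The label smoothing step can be sketched as follows. The source describes the one-hot case; how the smoothing mass is handled for blended targets is not specified, so the extension below (scaling the present classes and sharing the removed mass equally among absent classes) is an assumption, as is the function name.

```python
def smooth_labels(target, epsilon):
    """Move a total mass of epsilon from present classes to absent ones."""
    absent = [i for i, t in enumerate(target) if t == 0]
    if not absent:
        return list(target)  # nothing to redistribute to
    share = epsilon / len(absent)
    return [share if t == 0 else (1.0 - epsilon) * t for t in target]
```

With six classes and a smoothing factor of 0.1, a one-hot target becomes 0.9 for the true class and 0.02 for each of the remaining five classes, so the result still sums to one.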
4.5. Experimental Setup and Post-Process
To ensure consistent model selection and evaluation, the training partition is further divided into five non-overlapping folds at the actor level. Unless otherwise specified, all reported results correspond to the mean and standard deviation obtained via 5-fold cross-validation on the training set.
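The actor-level constraint on the folds can be illustrated with a short sketch. The dataset ships with its own fold assignment, so the round-robin allocation below is purely hypothetical; it only demonstrates that all clips of one actor must land in the same fold.

```python
def actor_folds(sample_actors, n_folds=5):
    """Group sample indices into folds so that no actor spans two folds.

    sample_actors: one actor identifier per sample.
    """
    actors = sorted(set(sample_actors))
    fold_of_actor = {a: i % n_folds for i, a in enumerate(actors)}
    folds = [[] for _ in range(n_folds)]
    for idx, actor in enumerate(sample_actors):
        folds[fold_of_actor[actor]].append(idx)
    return folds
```

In each cross-validation round, one fold serves as the validation set and the remaining folds are used for training, so validation actors are never seen during training.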
During training, a dynamic learning-rate schedule was employed, starting at and gradually decaying to over the course of the optimization process. The batch size was set to 64, and training was performed for a maximum of 500 epochs with early stopping if the validation Score metric did not improve for 15 consecutive epochs, ensuring convergence. The Adam optimizer was used for all experiments.
To mitigate overfitting, three forms of dropout were incorporated: (i) layer dropout , applied to Transformer layers, LSTMs, and linear projections; (ii) modality dropout , applied at the multimodal level to randomly remove the contribution of an entire modality for a given sample, and (iii) temporal dropout , applied by randomly masking temporal positions within each sequence. The selected values for these hyperparameters were , , and .
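The modality and temporal dropout mechanisms can be sketched as below. The probabilities are plain arguments, and the safeguards that keep at least one modality and one real time step are assumptions of this sketch rather than details stated in the source.

```python
import random

def modality_dropout(features, p, rng):
    """Zero out whole modality vectors with probability p each."""
    keep = [rng.random() >= p for _ in features]
    if not any(keep):
        # Assumed safeguard: never drop every modality at once.
        keep[rng.randrange(len(features))] = True
    return [f if k else [0.0] * len(f) for f, k in zip(features, keep)]

def temporal_dropout(mask, p, rng):
    """Hide real positions with probability p; padding stays hidden."""
    new_mask = [m and rng.random() >= p for m in mask]
    if any(mask) and not any(new_mask):
        # Assumed safeguard: keep at least one real position visible.
        real = [i for i, m in enumerate(mask) if m]
        new_mask[rng.choice(real)] = True
    return new_mask
```

Both functions are intended for training only; at inference time all modalities and all real positions would be kept.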
A grid search was conducted to select configurations yielding strong performance. The final hyperparameter values used in all experiments are reported in
Table 1.
Finally, in order to convert the probability vector produced by the model into discrete predictions of emotion presence and salience, a post-processing step based on thresholding was applied. Two thresholds are required: (i) a presence threshold $\tau_p$, which determines whether an emotion is considered present, and (ii) a salience threshold $\tau_s$, which determines the salience configuration of the assigned blend.
For each output vector, the two most probable classes are selected. If the highest-probability emotion is below the presence threshold $\tau_p$, the sample is treated as a non-blended instance and the top emotion is returned with full salience. The same occurs if the top emotion exceeds $\tau_p$ but the second-ranked emotion does not. If both of the two most probable emotions exceed the presence threshold, the absolute difference between their probabilities is computed. When this difference is below $\tau_s$, the emotions are treated as a 50/50 blend. Otherwise, the blend is assigned as either 70/30 or 30/70 depending on which emotion has the higher predicted probability. This procedure is not applied when the top-ranked emotion is neutral, since neutral states do not form blended expressions in this dataset. In such cases, the sample is automatically classified as neutral with full salience. Instead of using fixed values, these threshold parameters were optimized on the training set and subsequently applied to the validation split.
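The thresholding procedure described above can be sketched as a small decoding function. The emotion ordering, the function name `decode`, and the dictionary output format are assumptions. The source does not state what happens when neutral is ranked second, so that case is left untreated here.

```python
EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness", "neutral")

def decode(probs, presence_thr, salience_thr):
    """Turn a probability vector into a {emotion: salience} prediction."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, second = ranked[0], ranked[1]
    single = {EMOTIONS[top]: 1.0}
    # Neutral never forms a blend, so it is returned with full salience.
    if EMOTIONS[top] == "neutral":
        return single
    # A blend requires both candidates to reach the presence threshold.
    if probs[top] < presence_thr or probs[second] < presence_thr:
        return single
    # A small gap means equal prominence; otherwise the top class dominates.
    if probs[top] - probs[second] < salience_thr:
        return {EMOTIONS[top]: 0.5, EMOTIONS[second]: 0.5}
    return {EMOTIONS[top]: 0.7, EMOTIONS[second]: 0.3}
```

Since the two thresholds are tuned on the training data, they would typically be selected by evaluating this function over a grid of candidate values and keeping the pair that maximizes the monitored score.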
5. Results and Discussion
5.1. Unimodal Results
Table 2 reports the results obtained when training the proposed unimodal models with each encoder on the 5-fold experiment. Among all encoders, HiCMAE achieves the best overall performance, outperforming the second-best encoder, ImageBind, by approximately 6–8% in $\mathrm{ACC}_{\mathrm{presence}}$
and 3–4% in $\mathrm{ACC}_{\mathrm{salience}}$
on all aggregation strategies. This advantage is expected, as HiCMAE is a multimodal encoder and therefore benefits from inherently richer representations, whereas the remaining encoders operate on a single modality. The other visual encoder, VideoMAE V2, yields performance comparable to ImageBind, ranking third and second, respectively. In contrast, WavLM and, to a greater extent, HuBERT are consistently outperformed by the visual encoders. These results suggest that, for this task, the visual modality provides the most informative cues, leading to notably stronger representations and higher performance.
Regarding the temporal aggregation strategies, Statistical Aggregation and the two Transformer-based approaches achieve similar overall results, with an average $\mathrm{ACC}_{\mathrm{presence}}$ of around 30% and an average $\mathrm{ACC}_{\mathrm{salience}}$ of approximately 14.4%. Performance differences between these three methods are noted only for encoders with very few temporal features, such as HiCMAE and VideoMAE V2. In these cases, Attention Pooling outperforms Statistical Aggregation by nearly 2% in for HiCMAE, and yields an improvement of roughly 1% in for VideoMAE V2 while maintaining its score. In contrast, the LSTM-based approach performs noticeably worse, with decreases of approximately 5% in and 3% in . Overall, the LSTM method ranks last among all temporal aggregation strategies.
Finally, it is worth noting the low performance achieved by the LSTM method when applied to the masked autoencoder models VideoMAE and HiCMAE. This outcome is a consequence of these architectures already condensing temporal information natively, producing output sequences with too few temporal units for the Bi-LSTM module to learn meaningful temporal dependencies and converge effectively. These two results were omitted when calculating the mean of the performances.
Based on these results, Attention Pooling was selected for use in the multimodal architecture.
5.2. Multimodal Results
The multimodal results obtained on each configuration of the architecture can be seen in
Table 3, corresponding to the mean and standard deviation over all five folds. The experiments began by fusing two encoders (ImageBind and WavLM) and progressively added encoders until all five were included. Results for the three fusion strategies employed are reported. Overall, the multimodal approaches substantially outperformed the unimodal ones: the difference between their respective best-performing configurations (
Attention Pooling with HiCMAE versus the five-encoder
Mixed approach) is approximately 6% in $\mathrm{ACC}_{\mathrm{presence}}$ and about 4% in $\mathrm{ACC}_{\mathrm{salience}}$.
It can also be seen that the Concatenation strategy is consistently outperformed by the Attention and Mixed strategies. Its average $\mathrm{ACC}_{\mathrm{presence}}$ and $\mathrm{ACC}_{\mathrm{salience}}$ scores are 37.67% and 15.10%, respectively, compared to approximately 39.5% and 19.1% for the other two methods.
Another trend is that performance generally improves as more encoders are incorporated into the fusion module, with the highest results obtained when all five encoders are used simultaneously. One exception arises with HuBERT, whose inclusion slightly degrades
for the ImageBind + WavLM fusion under the
Concatenation and, consequently, under the
Mixed configuration. This behaviour can be attributed to the non-attention fusion model being negatively influenced by the less informative HuBERT embeddings, an effect that does not occur in the
Attention approach. Finally, it is worth noting that the best reported $\mathrm{ACC}_{\mathrm{presence}}$ of 43.97% exceeds the 43% accuracy achieved by human judges, as reported by the authors in [
14].
5.3. Discussion
Figure 4 shows the confusion matrix obtained by the best-performing model, the
Mixed multimodal architecture on fold two, on the validation split. The overall $\mathrm{ACC}_{\mathrm{presence}}$ was 47.86% and the $\mathrm{ACC}_{\mathrm{salience}}$ was 27.92%. For this analysis, the blended emotions were treated as distinct classes independent of their salience levels. This representation enables a clearer examination of the strengths and limitations of the model.
Among the individual emotions, the highest accuracies are achieved for neutral and happiness, with correct detection rates of 92% and 62%, respectively. In contrast, fear is the most challenging class, being frequently misclassified as its blended categories fear + anger and fear + sadness. Another noteworthy confusion pattern occurs for happiness, which is incorrectly labeled as happiness + sadness in 24% of the cases.
For the blended categories, the model performs particularly well on mixtures involving happiness. Indeed, three of the best-recognized blended classes are anger + happiness, disgust + happiness, and happiness + sadness, each achieving more than 60% accuracy. Conversely, and consistent with the unimodal results, blends involving fear remain difficult for the model. All fear-based mixtures appear among the five worst-performing blended classes, together with the combination anger + sadness, which is often mistaken for other blends, such as anger + happiness or sadness + disgust (each predicted at a rate above 10%), and for the individual emotion sadness (12%). Similarly, anger + disgust is commonly mistaken for other disgust blends.
5.4. Limitations and Future Work
Two main limitations are identified in this study. First, there is difficulty in separating blended emotions from their constituent pure forms, particularly for the case of fear, as evidenced by the confusion matrix. Similar confusions are also observed for happiness and disgust. Second, a performance gap is observed between non-negative emotions, such as happiness and neutral, and negative emotions, such as disgust, sadness, and especially fear, all of which remain below the 45% accuracy threshold. This bias may originate either from the pre-trained encoders employed or from the inherent difficulty of the dataset, where human annotators themselves achieve only 43% performance in a comparable setting.
A possible direction for future work is the development of dataset-specific encoders for both modalities, either trained from scratch or fine-tuned using contrastive learning objectives such as triplet loss or angular margin losses. Such approaches could enhance inter-class separability, particularly between negative emotions and their blended counterparts, thereby addressing a key bottleneck that currently limits system performance.
6. Conclusions
The proposed method aimed to maximize the performance of the feature representations used in the baseline by focusing on multimodal fusion and placing particular emphasis on temporal modeling strategies. The experimental results demonstrated that multimodal processing substantially outperforms unimodal approaches, highlighting the complementary nature of audio and visual cues in the recognition of blended affective states. Unimodal experiments revealed that visual encoders consistently outperformed audio-based encoders, indicating that visual cues encapsulate more discriminative information for this task than the audio modality. As expected, the multimodal encoder HiCMAE further surpassed all single-modality encoders, benefiting from its ability to jointly integrate information across modalities during pretraining.
Among the evaluated fusion strategies, the Mixed architecture exhibited the strongest overall performance, while the simple Concatenation approach achieved the weakest results, particularly in the salience estimation task. This behaviour suggests that naive feature concatenation is insufficient to capture the complexity of the problem, whereas attention-based mechanisms are more effective at modeling the cross-modal dependencies required for accurate blended emotion recognition. This interpretation is further supported by the observation that the gap in $\mathrm{ACC}_{\mathrm{presence}}$ between the Mixed and Attention architectures is below 1%, whereas in $\mathrm{ACC}_{\mathrm{salience}}$ the difference exceeds 2%. This indicates that, for the presence prediction task, the Concatenation branch contributes useful information comparable to the attention-based component, while for salience estimation it becomes a limiting factor.
Finally, the results validate the effectiveness of multimodal attention-based fusion for the BlEmoRe dataset, demonstrating its ability to produce robust predictions across subject-independent splits and confirming its suitability for both emotion presence detection and salience estimation.