Attention-Based Multimodal Fusion for Salience-Aware Blended Emotion Recognition

Salas-Cáceres, José; Castrillón-Santana, Modesto; Santana, Oliverio J.; Hernández-Sosa, Daniel; Lorenzo-Navarro, Javier

doi:10.3390/mti10050056

Open Access Article

by José Salas-Cáceres *, Modesto Castrillón-Santana, Oliverio J. Santana, Daniel Hernández-Sosa and Javier Lorenzo-Navarro

Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI), Universidad de Las Palmas de Gran Canaria, 35017 Las Palmas, Spain

* Author to whom correspondence should be addressed.

Multimodal Technol. Interact. 2026, 10(5), 56; https://doi.org/10.3390/mti10050056

Submission received: 24 April 2026 / Revised: 12 May 2026 / Accepted: 15 May 2026 / Published: 20 May 2026

Abstract

Blended emotion recognition introduces the challenge of identifying not only which emotions are present in an expressive display but also their relative salience. The proposed methodology builds upon the pre-extracted features provided with the dataset and enhances performance through a combination of temporal modeling and multimodal fusion strategies. Unimodal experiments revealed that visual encoders consistently outperformed audio ones, with the multimodal HiCMAE encoder achieving the strongest single-encoder results with 34% presence accuracy and 18.23% salience accuracy. Multimodal fusion further improved performance, with the best validation results obtained using a combination of simple concatenation and attention-based fusion, reaching 47.86% in presence accuracy and 27.92% in salience accuracy. Overall, the proposed methodology surpasses the chosen baseline introduced in the original paper across a k-fold experiment, confirming the effectiveness of multimodal attention-based fusion for the accurate prediction of both emotion presence and salience in blended affective behaviour. The experimental results further indicate that multimodal expression recognition consistently outperforms unimodal approaches, highlighting the complementary nature of cross-modal information.

Keywords: biometry; multimodal emotion recognition; blended emotions; multimodal fusion; human–machine interaction

1. Introduction

Emotional cues and facial expressions play a fundamental role in human communication, as they modulate the dynamics of social interaction and directly influence individual behaviour [1]. The ability to perceive and interpret these affective signals is therefore essential for facilitating natural exchanges. Importantly, human emotions are not communicated solely through facial expressions; they can also be inferred from vocal characteristics, prosody, body movements, gestures, and even the semantic content of speech [2], underscoring the inherently multimodal nature of affective communication.

In contrast to humans, who possess an innate capacity for affect recognition, machines do not naturally exhibit this ability, although they can be trained to do so [3,4]. Equipping computational systems with emotion-recognition capabilities can significantly enhance the naturalness and fluidity of human–machine interactions (HMIs), improving user experience [5]. This is one of the motivations underlying the development of different expression-recognition tasks.

Depending on the modalities employed, this problem takes different forms: Facial Expression Recognition (FER) when only visual cues are used, Speech Emotion Recognition (SER) when relying on audio signals, Text Emotion Recognition (TER) when analysing textual content, and Multimodal Emotion Recognition (MER) when multiple modalities are considered jointly [2]. Most existing studies focus on the six basic emotions (anger, happiness, disgust, fear, surprise, and sadness) introduced by Ekman [3], typically supplemented with the neutral state [6,7]. However, such a formulation often fails to capture the subtleties and complexities of real-world affective displays [8].

Such a limitation often arises from the datasets commonly used in these tasks, which frequently annotate only the previously mentioned basic categories, such as MELD [9], AffectNet [10], and RAF-DB [11]. Other datasets include a few additional categories (e.g., calm in RAVDESS [12], or excitement and frustration in IEMOCAP [13]), yet these do not fully address the compositional nature of human emotions. Even when intensity annotations are provided, they are often discarded in favour of a standard multi-class classification formulation.

In this context, the BlEmoRe dataset [14] was introduced to encourage the development of models capable of recognising not only discrete emotions but also blended emotional states and their relative salience. This work presents a novel attention-based architecture for audio–video integration, along with a comprehensive analysis of the performance of the feature encoders provided with the BlEmoRe dataset.

This paper is structured as follows. Section 2 reviews relevant literature on feature representations for audio and visual modalities, multimodal fusion strategies, and common limitations of existing affective computing datasets. Section 3 outlines the characteristics of the BlEmoRe dataset and the specific considerations relevant to this work. Section 4 details the proposed methodology, including the temporal pooling mechanisms, multimodal fusion architectures, loss functions, evaluation metrics, and experimental setup. Finally, the results are presented and analyzed in Section 5, while Section 6 summarizes the main findings and discusses potential directions for future research.

The main contributions of this work are as follows:

A novel multimodal fusion architecture that surpasses the baseline introduced in the original BlEmoRe paper for both presence and salience prediction in blended emotion recognition.
A comprehensive analysis of the feature encoders provided in the BlEmoRe dataset, including an evaluation of their unimodal performance with different temporal pooling strategies.
The design and application of a combined loss function tailored for salience-aware blended emotion recognition.

2. Related Work

Emotion recognition systems rely heavily on the quality of the extracted features. Effective embeddings enable models to capture the perceptual cues and patterns that characterize the different emotional states [15]. Given their central importance, this section provides a comprehensive review of how the two modalities considered in this work, audio and video, are typically processed in the literature. Additionally, several strategies for multimodal fusion are discussed, followed by a brief overview of the main datasets commonly used for emotion recognition research.

2.1. Audio Representation

The audio modality often contains valuable cues about an individual’s emotional state, such as variations in tone, pitch, intensity, and non-verbal vocalizations, all of which exhibit acoustic patterns that differ across emotions [16,17,18,19]. Traditionally, two main strategies have been employed to extract this information: computing spectro-temporal features such as Mel-frequency cepstral coefficients (MFCCs), or transforming the waveform into a visual representation (e.g., Mel-spectrograms) to leverage image-based models, although some approaches directly analyze the raw waveform.

MFCCs are derived through a sequence of processing steps that include high-frequency pre-emphasis, temporal framing, computation of the discrete Fourier transform, application of Mel filterbanks [20], and final decorrelation using the discrete cosine transform [21]. Because MFCCs retain temporal structure, they are typically paired with recurrent architectures such as Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs). MFCC-based methods have been successfully applied in tasks such as active speaker detection [22], speaker recognition [23], and MER [24].

Spectrogram-based representations enable the use of vision-focused architectures such as 2D convolutional neural networks or Vision Transformers. For example, ref. [25] employed a Swin Transformer architecture to model spectrogram relationships, achieving state-of-the-art results on multiple speech emotion recognition benchmarks.

In MER, however, the dominant approach is to rely on pretrained audio encoders for representation learning [15]. In this work, two such encoders are used: HuBERT [26] and WavLM [27]. Both models are trained using self-supervised masked prediction objectives applied directly to the raw waveform. HuBERT employs cluster-based pseudo-labels derived from MFCC features, while WavLM extends this framework with denoising and speech–noise mixture prediction tasks. Both encoders rank among the top-performing methods in many speech-related benchmarks [28]. Preliminary experiments also explored Wav2Vec 2.0 [29], which employs a convolutional front-end and a Transformer encoder to process raw audio, but it was ultimately discarded due to inferior performance in the present task.

2.2. Video Extraction

Traditionally, the visual modality has been the primary cue for expression recognition, relying on representations derived from facial images to characterize subtle muscular movements [15]. Early efforts attempted to describe facial expressions through action units or predefined facial configurations [30,31], providing structured descriptions of the underlying anatomy of expressions.

Modern visual feature extraction methods typically follow one of three main strategies. The first group employs convolutional neural networks (CNNs), such as VGG or ResNet, combined with recurrent networks to model temporal dynamics. For example, in [32], frame-level CNN embeddings are fed into an LSTM, achieving state-of-the-art results on several dynamic emotion datasets. A second family of approaches relies on 3D convolutional architectures, such as ResNet3D variants or SlowFast networks [33]. Simone et al. [24] follow this direction, extracting visual features with a SlowFast architecture that processes video through one of the model's two pathways. These 3D models learn spatio-temporal features directly from raw video sequences, reducing the need for explicit sequence pooling. Finally, the introduction of Vision Transformers (ViTs) enabled the use of attention mechanisms to capture long-range dependencies across facial regions. In [34], a ViT-based model is used to extract facial features for assessing student engagement, demonstrating the effectiveness of attention-driven architectures in handling complex, unconstrained environments.

In this work, the visual encoders adopted are ImageBind [35] and VideoMAE V2 [36], both built upon ViT-style architectures for image extraction. In the preliminary testing phase, the Swin-base encoder SwinFace [37] was evaluated but ultimately discarded due to its inferior performance compared with the previously mentioned models.

2.3. Modality Fusion

Biometric and affect recognition systems that rely on a single modality often experience substantial performance degradation when exposed to external factors such as noise, occlusions, or poor signal quality [38]. A common strategy to mitigate these issues is to incorporate multiple modalities and fuse their complementary information. Several taxonomies have been proposed to categorize fusion strategies; however, the most widely adopted classification is based on the stage of the processing pipeline at which fusion occurs. In early fusion, modalities are combined before the inference stage, whereas late fusion integrates modality-specific predictions at the decision level. A third category, hybrid fusion, combines elements of both strategies [15,19].

Early fusion can be implemented in various ways, ranging from a simple concatenation of modality-specific feature vectors, as in [24], to more sophisticated approaches that incorporate cross-attention mechanisms to better model inter-modal relationships. For instance, Lei et al. [39] employed a fusion module based on multi-head attention layers that jointly process audio and visual sequences to derive the final prediction, enabling richer temporal and contextual interactions between modalities.

Late fusion methods, on the other hand, combine information at the decision level, allowing modular training of each modality but failing to exploit fine-grained temporal or contextual dependencies. Common strategies include weighting the output scores of each modality, as demonstrated in [40], or applying majority voting schemes when a sufficient number of modalities is available. These approaches are especially useful in large-scale or continual systems, where the flexibility of the modular scheme is key [15].

In this work, two different early fusion approaches are tested, one focused on simple concatenation and the other on applying attention across temporal sequences.

2.4. Datasets

Although the use of multiple modalities generally improves robustness and overall performance, multimodal systems are often constrained by the availability and quality of datasets. For instance, beyond audio and visual cues, textual information can also be incorporated by analysing the semantic content of spoken utterances [19]. However, many widely used emotion datasets, such as CREMA-D [41] or RAVDESS [12], contain predefined scripts that remain constant across emotions, making text-based features uninformative or unusable.

Two additional limitations hinder the generalization of MER systems in real-world applications. First, only a limited number of languages are represented in existing datasets, with most benchmarks predominantly focusing on English [42]. This leaves other widely spoken languages, such as Spanish or French, significantly underrepresented. Second, most datasets adhere to Ekman’s basic emotion categories, enforcing a discrete view of affect that restricts the modelling of nuanced or intermediate emotional expressions [14].

The dataset employed in this work addresses both limitations by relying on audio signals composed exclusively of non-verbal cues and by providing blended emotional expressions annotated with salience information. This structure enables a more flexible and realistic representation of human emotions.

3. Dataset

The BlEmoRe dataset [14] is an audiovisual dataset consisting of video recordings in which professional actors are instructed to portray either single or blended emotions through facial expressions, body movements, and non-linguistic sounds. The database comprises 58 actors and a total of 3050 video samples. Among these, 1390 clips depict one of six discrete emotional states: anger, disgust, fear, happiness, sadness, or neutral. The remaining 1660 clips represent pairwise blends between two non-neutral emotions.

For blended samples, the dataset provides additional annotations specifying the relative salience, or proportions, of each constituent emotion. Three salience configurations are defined: (i) 50/50, indicating equal prominence of both emotions; (ii) 70/30, where the first emotion in the label is more dominant than the second; and (iii) 30/70, where the second emotion is more dominant. It is important to emphasize that actors were instructed to express blended emotions simultaneously rather than sequentially, such that the entire clip conveys a stable emotional mixture instead of a temporal transition between emotions.

The dataset follows a subject-independent train/test split, comprising 43 actors for training and 15 actors for testing, although the test split is currently private. The training split is divided into five folders, each containing approximately 500 videos, with no overlapping identities between folders and a balanced distribution of emotions. Of these five folders, one is used for validation, while the remaining four are used for training. Several representative frames from the dataset are presented in Figure 1. The videos were recorded at 50 frames per second with a resolution of 1920 × 1080 pixels, and their duration ranges from 1 to 30 s, although the majority fall between 2 and 10 s. Finally, the dataset creators also provide pre-extracted features obtained from a set of visual, audio, and audiovisual encoders as a baseline.

4. Methodology

In this section, the data preprocessing is first described in detail. In addition, the unimodal and multimodal architectures proposed are introduced. Finally, the implemented loss function, the evaluation metrics, and the experimental setup are presented to ensure full reproducibility.

4.1. Data Preprocessing

This work focuses on multimodal fusion strategies and model architectures. Therefore, only the pre-extracted features were used rather than training feature encoders end-to-end. The selection of encoders was based on the unimodal benchmark results reported in [14]. Specifically, the two best-performing encoders for each modality in the unimodal evaluation and a single multimodal encoder were selected.

For the visual modality, the selected encoders were ImageBind [35] and VideoMAE V2 [36]. ImageBind is a model developed by Meta that learns a joint embedding space across six different modalities; in this work, however, only the image encoder was utilized. VideoMAE V2 is a video-based encoder designed to operate directly on the temporal dimension by processing spatio-temporal 3D patches. Specifically, the ViT-B/16 variant was employed, which produces one feature vector every 16 frames, thereby reducing temporal resolution while preserving coarse temporal information.

For the audio modality, WavLM [27] and HuBERT [26] were used. WavLM is a self-supervised learning model developed by Microsoft and trained on more than 90,000 h of audio for speech representation learning. HuBERT, developed by Meta, is trained using a masked prediction objective over k-means cluster assignments derived from the input audio for the same task.

In addition, HiCMAE [43], a multimodal encoder designed for audiovisual emotion recognition, was also considered. This model employs an asymmetric encoder-decoder architecture in which audio and visual encoders are jointly learned and shared across modality-specific decoders. It was included due to its superior performance in salience prediction as reported in the original benchmark.

Consequently, the experimental setup considers five encoders in total: two visual encoders (ImageBind and VideoMAE V2), two audio encoders (WavLM and HuBERT), and one multimodal encoder (HiCMAE). The five selected encoders represent well-established models that have achieved state-of-the-art performance across multiple affective computing benchmarks. As reported in [28], these encoders consistently rank among the top performers in unimodal evaluations across diverse databases. Selecting these models therefore ensures that the multimodal fusion process is built upon high-quality and reliable feature representations, rather than weaker or unstable embeddings that could hinder downstream performance.

Since the original samples consist of videos and not all encoders are designed to handle temporal information, a preprocessing step is required. For each video, a sequence of $n_{f}$ feature vectors was extracted, each one with dimensionality $e_{d}$. Because the videos vary in duration, the resulting value of $n_{f}$ differs across samples. To address this issue, a temporal pooling procedure was applied.

Let $s_{l}$ denote a fixed target sequence length. Three cases were considered. When $n_{f} > s_{l}$, the sequence was uniformly divided into $s_{l}$ segments of size $\lfloor n_{f} / s_{l} \rfloor$, and one random frame was selected from each segment as its single representative frame. Conversely, if $n_{f} < s_{l}$, zero-padding was applied to extend the sequence until the target length $s_{l}$ was reached. Finally, if $s_{l}$ matches the original sequence length $n_{f}$, no temporal preprocessing is applied to the input. Zero-padding was chosen over alternatives such as extrapolation because it preserves the original temporal structure without introducing synthetic data, ensuring that the architecture’s attention mechanisms focus exclusively on the real dynamics of the sequence.
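The three cases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `standardize_length`, the choice to append the padding at the end of the sequence, and the decision to ignore any frames left over after the last full segment are assumptions. The returned mask marks real versus padded positions so that later stages can ignore the padding.

```python
import random


def standardize_length(frames, target_len, rng=None):
    """Return (sequence, mask) with exactly target_len entries.

    frames: list of feature vectors (each a list of floats).
    mask[i] is True for real frames and False for zero-padding.
    """
    rng = rng or random.Random()
    n_f = len(frames)
    if n_f == target_len:
        # Lengths already match: no temporal preprocessing.
        return list(frames), [True] * n_f
    if n_f < target_len:
        # Zero-pad (assumed: at the end) up to the target length.
        dim = len(frames[0]) if frames else 0
        missing = target_len - n_f
        padding = [[0.0] * dim for _ in range(missing)]
        return list(frames) + padding, [True] * n_f + [False] * missing
    # n_f > target_len: target_len segments of floor(n_f / target_len) frames,
    # with one randomly chosen frame per segment.
    seg = n_f // target_len
    picked = [frames[rng.randrange(s * seg, (s + 1) * seg)]
              for s in range(target_len)]
    return picked, [True] * target_len


# Ten one-dimensional "frames" reduced to four, then three padded to five.
frames = [[float(i)] for i in range(10)]
short, short_mask = standardize_length(frames, 4, random.Random(0))
padded, padded_mask = standardize_length(frames[:3], 5)
```

Passing a seeded `random.Random` makes the frame selection reproducible, which can be convenient at validation time.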

This standardization step was particularly important because $n_{f}$ depends not only on the video duration but also on the encoder used. Among the five encoders utilized, ImageBind’s output frequency is frame-based (one vector per frame), while WavLM and HuBERT extract features at a temporal resolution of 20 ms. In contrast, VideoMAE V2 and HiCMAE are inherently designed to process video clips and therefore generate significantly fewer feature vectors, typically around 16 and 4 per video, respectively, which is more than an order of magnitude fewer than those produced by the other encoders.

4.2. Architecture

Two distinct architectures are proposed in this work: one designed to extract temporal information from each sample, and another dedicated to multimodal fusion.

4.2.1. Unimodal Branch

The first architecture is illustrated in Figure 2. This module consists of two components: a temporal pooling module and a final multilayer perceptron (MLP) that projects the encoder-specific feature dimension $e_{d}$ to a fixed latent dimension $u_{d}$.

Several temporal pooling strategies were explored. As a baseline, the approach proposed in the original work [14] was adopted. This method performs a Statistical Aggregation of all feature vectors by computing descriptive statistics for each feature dimension. Specifically, five percentiles (10th, 25th, 50th, 75th, and 90th) are concatenated with the mean and standard deviation. This results in a single vector representation with a dimensionality of seven times $e_{d}$.
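As a rough illustration of the Statistical Aggregation baseline, the sketch below collapses a variable-length sequence into a single vector of length seven times the feature dimension. The percentile interpolation rule (linear, between closest ranks), the use of the population standard deviation, and the order in which the statistics are concatenated are assumptions; the source only states which statistics are computed.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def statistical_aggregation(frames):
    """Collapse an (n_f x e_d) sequence into one vector of length 7 * e_d."""
    columns = list(zip(*frames))  # one tuple of n_f values per feature
    out = []
    for q in (10, 25, 50, 75, 90):
        out.extend(percentile(col, q) for col in columns)
    out.extend(statistics.fmean(col) for col in columns)
    out.extend(statistics.pstdev(col) for col in columns)
    return out


# Five frames with e_d = 2 give a 14-dimensional descriptor.
frames = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0], [4.0, 50.0]]
descriptor = statistical_aggregation(frames)
```

Because every statistic is computed per feature dimension, the output length depends only on the encoder dimensionality and not on the clip duration.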

In addition to this baseline, two temporal modeling approaches were employed: a Bidirectional LSTM (Bi-LSTM) and an attention-based architecture. The LSTM approach employed a Bi-LSTM module to model the temporal sequence of each sample.

The latter approach, which is illustrated in Figure 2, consists of two stages. First, temporal dependencies are modeled using self-attention through Transformer encoder layers. Second, the temporal sequence is condensed into a single representation vector. Two alternative strategies were explored for this aggregation. The first strategy is inspired by the method used in BERT [44] and employs a [CLS] token. In this case, a classification token of dimensionality $e_{d}$ is prepended to the input sequence of length $s_{l}$. After the Transformer encoder processes the sequence, the final representation corresponding to the [CLS] token is used as a global representation of the entire sequence. The second strategy, referred to as Attention Pooling, computes temporal weights for each position in the sequence. These weights are obtained by passing the sequence through a linear projection layer of dimension $u_{d}$. The resulting weights are then used to compute the final sample representation via a weighted sum across all temporal positions. These two strategies are referred to collectively as Transformer Pooling in Figure 2, although only the Attention Pooling variant is illustrated. In the CLS approach, the linear layer and multiplication are omitted, and the representation of the [CLS] token is directly extracted as the sequence-level feature.

For both attention-based approaches, an attention mask was applied to padded samples introduced during the temporal standardization step. This ensures that padded positions do not contribute to the attention computation.
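The interaction between Attention Pooling and the padding mask can be illustrated with a small, dependency-free sketch. It assumes that the linear layer maps each time step to a single scalar score and that the scores are normalised with a softmax; neither detail is spelled out in the text, and the parameters `w` and `b` stand in for learned weights. Padded positions receive a score of negative infinity, so their weight is exactly zero.

```python
import math


def attention_pool(sequence, mask, w, b=0.0):
    """Weighted sum over time with padded positions masked out.

    sequence: list of vectors, one per time step.
    mask: list of bools (True = real frame); at least one must be True.
    w, b: parameters of a hypothetical linear scoring layer (vector -> scalar).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in sequence]
    # Masked positions get -inf, hence a softmax weight of exactly 0.
    scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequence[0])
    pooled = [sum(wt * x[d] for wt, x in zip(weights, sequence))
              for d in range(dim)]
    return pooled, weights


# Two real frames and one padded frame; with zero scoring weights the two
# real frames are averaged and the padded frame is ignored.
sequence = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
pooled, weights = attention_pool(sequence, [True, True, False], w=[0.0, 0.0])
```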

4.2.2. Multimodal Architecture

In Figure 3, the multimodal architecture proposed for this task is illustrated. Given an input composed of k modalities, the architecture supports three different fusion strategies.

The first strategy, Concatenation, builds directly on the previously described unimodal branches. In this approach, the original classification head of the unimodal branches is eliminated and the features generated for each modality are concatenated into a single vector, which is then used to produce the final prediction.

The second strategy, referred to as Attention, departs from the unimodal architecture and instead performs fusion at the sequence level. Prior to fusion, each sequence is projected to a fixed dimensionality $u_{d}$ using a linear layer. A modality embedding $M_{e_{i}}$ is added to each sequence via element-wise addition, and thus without altering the overall dimensionality. These embeddings play a similar role to positional encodings in typical transformer architectures but encode the modality of each feature rather than its temporal position. This allows the model to distinguish between modalities when learning cross-modal interactions, which is particularly important since informative patterns in different modalities may occur at different temporal positions.

After the modality embeddings ($M_{e_{i}}$) are added, the sequences from all modalities are concatenated along the temporal dimension and processed by a set of transformer encoder layers. The resulting representation is then aggregated using the same attention pooling mechanism described for the unimodal branch, which involved passing the sequence through a linear layer and then using the resulting weights to compute a weighted mean, emphasizing the most important parts of the sequence.
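The input preparation for the Attention strategy can be pictured as follows. This is only a schematic sketch with plain Python lists: the function names are hypothetical, the embeddings are fixed numbers rather than learned parameters, and the projection to the shared dimensionality is assumed to have been applied already.

```python
def add_modality_embedding(sequence, embedding):
    """Element-wise addition of one modality embedding to every time step."""
    return [[x + e for x, e in zip(vec, embedding)] for vec in sequence]


def fuse_sequences(sequences, embeddings, masks):
    """Tag each projected sequence with its modality and join them in time."""
    fused, fused_mask = [], []
    for seq, emb, mask in zip(sequences, embeddings, masks):
        fused.extend(add_modality_embedding(seq, emb))
        fused_mask.extend(mask)
    return fused, fused_mask


# Audio (3 steps) and video (2 steps), both already projected to 2 dimensions.
audio = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
video = [[1.0, 1.0], [2.0, 2.0]]
fused, fused_mask = fuse_sequences(
    [audio, video],
    embeddings=[[10.0, 0.0], [0.0, 10.0]],  # illustrative, not learned
    masks=[[True, True, False], [True, True]],
)
# fused holds 5 time steps of dimension 2, ready for the encoder layers.
```

The concatenated mask keeps track of padded positions across modalities, so the same masking used in the unimodal branch can be applied to the joint sequence.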

Finally, the Mixed fusion approach combines the previous two strategies. Specifically, it concatenates the representations produced by the concatenation-based and attention-based branches, and applies an additional attention mechanism to emphasize the most informative features from each branch before producing the final output.

In all three approaches, the final output vector is divided by a temperature parameter T before leaving the model. This parameter is a single trainable scalar that rescales the logits without altering the relation between them. Its role is to control the prediction scale, effectively normalizing the threshold used to compute the salience of each emotion. This architectural decision was motivated by the findings reported in [45], where multiple calibration methods were evaluated on image and document classification tasks to improve the reliability of predicted probability distributions. Among these techniques, temperature scaling demonstrated the highest effectiveness and simplicity, motivating its use in this work to calibrate the confidence of the model’s final output probabilities.
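The effect of the temperature can be checked numerically. The sketch below assumes that the scaled logits are subsequently passed through a softmax, which is the usual setting for temperature scaling; in the model T would be a trained scalar, whereas here two fixed values are compared.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_by_temperature(logits, temperature):
    """Divide every logit by the same positive scalar T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [z / temperature for z in logits]


logits = [2.0, 1.0, -1.0]
sharp = softmax(scale_by_temperature(logits, 0.5))  # T < 1: peakier output
soft = softmax(scale_by_temperature(logits, 2.0))   # T > 1: flatter output
```

In both cases the ordering of the classes is unchanged; only the spread of the probabilities varies, which is what matters when fixed thresholds are later applied to them.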

It is important to note that the parameters of the unimodal branches remained frozen during the training of the multimodal architecture. Additionally, when employing the Mixed approach, the parameters of both the Concatenation and Attention branches were also kept frozen. By freezing the individual branches, the total number of learnable parameters is substantially reduced, allowing the model to focus its capacity on learning the fusion layers and thereby improving performance.

4.3. Evaluation Metrics

In order to compare with the original benchmarks, the two evaluation metrics used are Presence Accuracy ($\mathrm{Acc}_{\mathrm{presence}}$) and Salience Accuracy ($\mathrm{Acc}_{\mathrm{salience}}$).

As originally defined in [14], $\mathrm{Acc}_{\mathrm{presence}}$ measures the correctness of emotion detection without considering salience information. A prediction is considered correct only if all present emotions are correctly identified, while avoiding both false negatives (e.g., missing an emotion in a blend) and false positives (e.g., predicting blends when only one emotion is portrayed).

$\mathrm{Acc}_{\mathrm{salience}}$ extends $\mathrm{Acc}_{\mathrm{presence}}$ by additionally evaluating the relative prominence of emotions in blended expressions. This metric assesses whether the predicted salience rankings correctly reflect the ground truth distribution.

To formalize these metrics, both predictions ($p$) and ground-truth labels ($d$) of the $k$-th sample are represented as six-dimensional vectors such that:

$$
p^{(k)}, d^{(k)} \in \left\{ \gamma e_{i} + \delta e_{j} \;\middle|\; i, j \in \{1, \dots, 5\},\ i \neq j,\ (\gamma, \delta) \in \{(1, 0), (0.5, 0.5), (0.3, 0.7)\} \right\} \cup \{e_{6}\} \tag{1}
$$

where $e_{1}, \dots, e_{5}$ correspond to the basic emotions (anger, disgust, fear, happiness, sadness), $e_{6}$ represents the neutral state, and the coefficients denote relative salience proportions.
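To make the size of this label space concrete, the set in Equation (1) can be enumerated directly. The sketch below uses 0-based indices and hypothetical helper names; it only restates the definition above and makes no claim about which of these labels actually occur in the dataset.

```python
from itertools import permutations

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "neutral"]


def unit(i):
    """Six-dimensional basis vector for the emotion at 0-based index i."""
    return tuple(1.0 if k == i else 0.0 for k in range(6))


def label_space():
    labels = set()
    for i, j in permutations(range(5), 2):  # i != j, non-neutral only
        for gamma, delta in [(1.0, 0.0), (0.5, 0.5), (0.3, 0.7)]:
            v = [0.0] * 6
            v[i] += gamma
            v[j] += delta
            labels.add(tuple(v))
    labels.add(unit(5))  # the neutral state e_6
    return labels


labels = label_space()
print(len(labels))  # 36 = 5 single + 10 equal blends + 20 unequal blends + neutral
```

The 70/30 configuration does not need to be listed separately in Equation (1): it is obtained from the 30/70 one by swapping the roles of the two indices.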

Two auxiliary functions are additionally defined. The first, $\pi : \mathbb{R}^{6} \to \mathbb{R}^{6}$, standardizes predictions by discarding salience information, mapping any blended label to its presence-only equivalent. Specifically, it converts any blended configuration into a unified $0.5 / 0.5$ representation for presence evaluation. In this expression, $\lambda$ denotes the salience value associated with a given emotion, constrained to $\lambda \in \{0.3, 0.5, 0.7\}$. Conversely, $\sigma : \mathbb{R} \to \{0, 1\}$ acts as an indicator function, returning 1 when the predicted and target vectors match exactly and 0 otherwise. In these definitions, $V$ represents a vector in $\mathbb{R}^{6}$, either a prediction or ground-truth label for a given sample, while $x$ denotes the norm of such a vector, yielding a scalar in the range $[0, 1]$.

$$
\pi(V) = \begin{cases} e_{i} & \text{if } V = e_{i} \\ 0.5\, e_{i} + 0.5\, e_{j} & \text{if } V = \lambda e_{i} + (1 - \lambda) e_{j} \text{ and } i \neq j \\ 0 & \text{otherwise} \end{cases} \tag{2}
$$

$$
\sigma(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \tag{3}
$$

Using these definitions, the two metrics are computed as:

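$$
\mathrm{Acc}_{\mathrm{presence}} = \frac{1}{N} \sum_{k=1}^{N} \sigma\left( \left\| \pi(p^{(k)}) - \pi(d^{(k)}) \right\| \right) \tag{4}
$$

$$
\mathrm{Acc}_{\mathrm{salience}} = \frac{1}{N} \sum_{k=1}^{N} \sigma\left( \left\| p^{(k)} - d^{(k)} \right\| \right) \tag{5}
$$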

$\mathrm{Acc}_{\mathrm{presence}}$ therefore provides a general evaluation criterion for blended emotion recognition, enabling comparison with methods trained on alternative datasets. In contrast, $\mathrm{Acc}_{\mathrm{salience}}$ captures the salience dimension unique to this blended emotion dataset, distinguishing this benchmark from conventional multi-class or multi-label emotion recognition tasks.
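The two metrics are straightforward to compute once predictions have been discretised into the same six-dimensional format as the labels. The following sketch follows Equations (2) to (5); the function names and the numerical tolerance are illustrative choices, and the four example predictions are made up for demonstration.

```python
SALIENCE_LEVELS = (0.3, 0.5, 0.7)


def presence_only(v, tol=1e-9):
    """pi(V): map a label to its presence-only form (Equation (2))."""
    active = [i for i, x in enumerate(v) if x > tol]
    out = [0.0] * len(v)
    if len(active) == 1 and abs(v[active[0]] - 1.0) <= tol:
        out[active[0]] = 1.0
    elif len(active) == 2:
        a, b = (v[i] for i in active)
        valid = any(abs(a - lam) <= tol for lam in SALIENCE_LEVELS)
        if valid and abs(a + b - 1.0) <= tol:
            out[active[0]] = out[active[1]] = 0.5
    return out  # all zeros in the "otherwise" case


def matches(a, b, tol=1e-9):
    """sigma(||a - b||): 1 if the two vectors coincide, 0 otherwise."""
    return int(all(abs(x - y) <= tol for x, y in zip(a, b)))


def presence_accuracy(preds, labels):
    hits = sum(matches(presence_only(p), presence_only(d))
               for p, d in zip(preds, labels))
    return hits / len(labels)


def salience_accuracy(preds, labels):
    return sum(matches(p, d) for p, d in zip(preds, labels)) / len(labels)


# Order: anger, disgust, fear, happiness, sadness, neutral.
labels = [[0.7, 0.3, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0],
          [0, 0, 0.5, 0, 0.5, 0], [0, 0, 0, 0, 0, 1]]
preds = [[0.3, 0.7, 0, 0, 0, 0],   # right emotions, wrong salience
         [0, 0, 0, 1, 0, 0],       # exact match
         [0, 0, 1, 0, 0, 0],       # misses sadness
         [0, 0, 0, 0, 0, 1]]       # exact match
acc_presence = presence_accuracy(preds, labels)   # 0.75
acc_salience = salience_accuracy(preds, labels)   # 0.5
```

The first example shows why salience accuracy can never exceed presence accuracy: a prediction with the right emotions but the wrong proportions counts for the former metric only.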

Finally, and only for training-monitoring purposes, a combined score metric was employed, defined as

$$
\mathrm{Score} = 0.5\, \mathrm{Acc}_{\mathrm{presence}} + 0.5\, \mathrm{Acc}_{\mathrm{salience}}. \tag{6}
$$

4.4. Loss Function

To train the models, a combination of three loss functions was employed, defined as

$$
L = w_{\mathrm{KL}} L_{\mathrm{KL}} + w_{\mathrm{BCE}} L_{\mathrm{BCE}} + w_{\mathrm{rank}} L_{\mathrm{rank}}, \tag{7}
$$

where each term targets a different aspect of the prediction task. $L_{\mathrm{KL}}$ denotes the Kullback–Leibler (KL) divergence between the predicted and ground-truth salience distributions, $L_{\mathrm{BCE}}$ is the binary cross-entropy (BCE) loss applied to the emotion-presence logits, and $L_{\mathrm{rank}}$ is a margin-ranking (rank) loss encouraging correct salience ordering in blended expressions. The coefficients $w_{\mathrm{KL}}, w_{\mathrm{BCE}}, w_{\mathrm{rank}}$ control the relative contribution of each component; their values are shown in Table 1.

The ranking loss $L_{\mathrm{rank}}$ was applied only to blended samples with unequal salience ratios (70/30 and 30/70), and is given by

$$
L_{\mathrm{rank}} = \max\left(0,\ \delta - (p_{\mathrm{dom}} - p_{\mathrm{sub}})\right), \tag{8}
$$

where $p_{\mathrm{dom}}$ and $p_{\mathrm{sub}}$ are the predicted scores for the dominant and subdominant emotions, respectively, and $\delta$ is the enforced margin.

By leveraging these three complementary loss functions, multiple objectives can be optimized simultaneously. The BCE loss improves the independent detection of each emotion, thereby enhancing presence accuracy, while the KL divergence term, also employed in the original baseline, preserves the soft distribution associated with blended emotional expressions. Finally, the inclusion of the ranking loss explicitly enforces the correct ordering between dominant and secondary emotions, which is important for the asymmetric mixtures present in the dataset.
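A single-sample version of the combined objective in Equation (7) might look as follows. Several details are assumptions made for the sake of a runnable example: the predicted distribution is taken to be a softmax over the logits, the presence targets for the BCE term are derived from the non-zero entries of the salience label, the ranking term compares softmax probabilities, and the weights and margin are placeholder values rather than those of Table 1.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def combined_loss(logits, target, w_kl=1.0, w_bce=1.0, w_rank=1.0,
                  margin=0.1, eps=1e-12):
    """Return (total, kl, bce, rank) for one sample."""
    probs = softmax(logits)
    # KL(target || probs); entries with zero target mass contribute nothing.
    kl = sum(t * math.log(t / max(p, eps))
             for t, p in zip(target, probs) if t > 0)
    # BCE on presence: an emotion counts as present if its salience is > 0.
    bce = 0.0
    for z, t in zip(logits, target):
        q = min(max(sigmoid(z), eps), 1.0 - eps)
        y = 1.0 if t > 0 else 0.0
        bce -= y * math.log(q) + (1.0 - y) * math.log(1.0 - q)
    bce /= len(logits)
    # Margin ranking only for unequal blends (70/30 or 30/70), Equation (8).
    rank = 0.0
    present = sorted((t, i) for i, t in enumerate(target) if t > 0)
    if len(present) == 2 and present[0][0] != present[1][0]:
        sub, dom = present[0][1], present[1][1]
        rank = max(0.0, margin - (probs[dom] - probs[sub]))
    return w_kl * kl + w_bce * bce + w_rank * rank, kl, bce, rank


target = [0.7, 0.3, 0.0, 0.0, 0.0, 0.0]
good = combined_loss([2.0, 1.0, -3.0, -3.0, -3.0, -3.0], target)
swapped = combined_loss([1.0, 2.0, -3.0, -3.0, -3.0, -3.0], target)
```

In this example the BCE term is identical for both predictions, since the same two emotions are detected, whereas the KL and ranking terms penalise the prediction whose dominant and secondary emotions are swapped.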

Finally, label smoothing, originally introduced in [46] to facilitate the training of the Inception architecture by replacing one-hot vectors with softened target distributions, was applied to the target labels. This technique reduces overconfidence by subtracting a small factor ($ϵ$) from the ground-truth class and distributing it among the remaining classes, thereby improving the stability and generalization of the model.
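The text does not state how smoothing is applied to blended (non-one-hot) targets. The sketch below assumes the uniform formulation from [46], in which every target is mixed with a uniform distribution over the classes. The function name and the six-class example vector are hypothetical, while the value of $ϵ$ is taken from Table 1.

```python
def smooth_labels(target, eps=0.05):
    """Mix a target distribution with a uniform one (eps from Table 1).

    Assumption: the uniform formulation of label smoothing is used for
    blended targets as well; the paper does not specify this detail.
    """
    k = len(target)
    return [(1.0 - eps) * t + eps / k for t in target]


# Hypothetical 70/30 blend over six classes.
smoothed = smooth_labels([0.7, 0.3, 0.0, 0.0, 0.0, 0.0])
```

Under this assumption the smoothed vector still sums to one and preserves the ordering of the dominant and subdominant emotions, so it remains compatible with a ranking objective.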

4.5. Experimental Setup and Post-Process

To ensure consistent model selection and evaluation, the training partition is further divided into five non-overlapping folds at the actor level. Unless otherwise specified, all reported results correspond to the mean and standard deviation obtained via 5-fold cross-validation on the training set.
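A minimal sketch of an actor-level split is given below. The round-robin assignment of actors to folds and the function name are assumptions; the text only states that the five folds do not overlap at the actor level.

```python
from collections import defaultdict


def actor_level_folds(sample_actors, n_folds=5):
    """Return a list of folds (lists of sample indices) with disjoint actors.

    sample_actors[i] is the actor identifier of sample i. Actors are assigned
    to folds round-robin, which is an illustrative choice.
    """
    actors = sorted(set(sample_actors))
    fold_of_actor = {actor: i % n_folds for i, actor in enumerate(actors)}
    folds = defaultdict(list)
    for idx, actor in enumerate(sample_actors):
        folds[fold_of_actor[actor]].append(idx)
    return [folds[f] for f in range(n_folds)]
```

Because every sample of a given actor lands in the same fold, validation scores measure generalization to unseen speakers rather than to unseen clips of known speakers.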

During training, a dynamic learning-rate schedule was employed, starting at $3 \times 10^{-4}$ and gradually decaying to $5 \times 10^{-6}$ over the course of the optimization process. The batch size was set to 64, and training was performed for a maximum of 500 epochs with early stopping if the validation Score metric did not improve for 15 consecutive epochs, ensuring convergence. The Adam optimizer was used for all experiments.
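The training loop described above can be sketched as follows. The start and end learning rates, the epoch limit, the patience, and the Score definition come from the text. The cosine shape of the decay is an assumption, since the schedule type is not specified, and `evaluate_epoch` is a hypothetical callback that stands in for one epoch of training plus validation.

```python
import math

LR_START, LR_END = 3e-4, 5e-6    # from the text
MAX_EPOCHS, PATIENCE = 500, 15   # from the text


def combined_score(acc_presence, acc_salience):
    """Equation (6): equally weighted monitoring score."""
    return 0.5 * acc_presence + 0.5 * acc_salience


def learning_rate(epoch, max_epochs=MAX_EPOCHS):
    """Cosine decay from LR_START to LR_END (the decay shape is an assumption)."""
    steps = max(max_epochs - 1, 1)
    progress = min(epoch, steps) / steps
    return LR_END + 0.5 * (LR_START - LR_END) * (1.0 + math.cos(math.pi * progress))


def train(evaluate_epoch, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Run until max_epochs or until Score stalls for `patience` epochs.

    evaluate_epoch(epoch, lr) is a hypothetical callback returning
    (acc_presence, acc_salience) on the validation fold.
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        acc_p, acc_s = evaluate_epoch(epoch, learning_rate(epoch, max_epochs))
        score = combined_score(acc_p, acc_s)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score
```

The optimizer itself is left out of the sketch; in practice the learning rate returned by `learning_rate` would be passed to Adam at the start of each epoch.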

To mitigate overfitting, three forms of dropout were incorporated: (i) layer dropout $d_{l}$, applied to Transformer layers, LSTMs, and linear projections; (ii) modality dropout $d_{m}$, applied at the multimodal level to randomly remove the contribution of an entire modality for a given sample; and (iii) temporal dropout $d_{t}$, applied by randomly masking temporal positions within each sequence. The selected values for these hyperparameters were $d_{l} = 0.1$, $d_{m} = 0.3$, and $d_{t} = 0.1$.
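Modality and temporal dropout can be illustrated with the sketch below, which operates on plain Python lists. The probabilities are those given in the text. Zeroing the affected vectors (rather than, for example, using attention masks), always keeping at least one modality, and the function names are assumptions made for the example.

```python
import random

D_M, D_T = 0.3, 0.1  # modality and temporal dropout rates from the text


def modality_dropout(features, p=D_M, rng=random):
    """Zero out whole modalities with probability p each.

    features maps a modality name to a sequence of frame vectors.
    At least one modality is always kept (assumption).
    """
    names = list(features)
    dropped = [n for n in names if rng.random() < p]
    if len(dropped) == len(names):
        dropped.remove(rng.choice(dropped))
    return {
        n: [[0.0] * len(v) for v in seq] if n in dropped else seq
        for n, seq in features.items()
    }


def temporal_dropout(sequence, p=D_T, rng=random):
    """Zero out individual temporal positions with probability p each."""
    return [[0.0] * len(v) if rng.random() < p else v for v in sequence]
```

Both functions would be applied only during training; at inference time the features are passed through unchanged.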

A grid search was conducted to select configurations yielding strong performance. The final hyperparameter values used in all experiments are reported in Table 1.

Finally, in order to convert the probability vector produced by the model into discrete predictions of emotion presence and salience, a post-processing step based on thresholding was applied. Two thresholds are required: (i) a presence threshold $α$, which determines whether an emotion is considered present, and (ii) a salience threshold $β$, which determines the salience configuration of the assigned blend.

For each output vector, the two most probable classes are selected. If the highest-probability emotion is below the presence threshold $α$, the sample is treated as a non-blended instance and the top emotion is returned with full salience. The same occurs if the top emotion exceeds $α$ but the second-ranked emotion does not. If both of the two most probable emotions exceed the presence threshold, the absolute difference between their probabilities is computed. When this difference is below $β$, the emotions are treated as a 50/50 blend. Otherwise, the blend is assigned as either 70/30 or 30/70 depending on which emotion has the higher predicted probability. This procedure is not applied when the top-ranked emotion is neutral, since neutral states do not form blended expressions in this dataset. In such cases, the sample is automatically classified as neutral with full salience. Instead of using fixed values, these threshold parameters were optimized on the training set and subsequently applied to the validation split.
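The decision rule described above can be sketched as a small function. The label order and the threshold values in the usage example are hypothetical, since the optimized thresholds are not reported. The case in which neutral is the second-ranked class is not covered by the text, so the sketch falls back to a single emotion there; that choice is an assumption.

```python
def postprocess(probs, labels, alpha, beta, neutral="neutral"):
    """Map a probability vector to a {emotion: salience in percent} dict."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, second = ranked[0], ranked[1]

    # Neutral never forms a blend in this dataset.
    if labels[top] == neutral:
        return {neutral: 100}

    # Single emotion unless both candidates exceed the presence threshold.
    both_present = probs[top] > alpha and probs[second] > alpha
    if not both_present or labels[second] == neutral:  # neutral case: assumption
        return {labels[top]: 100}

    # Near-equal probabilities give a 50/50 blend, otherwise 70/30.
    if abs(probs[top] - probs[second]) < beta:
        return {labels[top]: 50, labels[second]: 50}
    return {labels[top]: 70, labels[second]: 30}


# Hypothetical label order and thresholds, for illustration only.
LABELS = ["neutral", "anger", "disgust", "fear", "happiness", "sadness"]
example = postprocess([0.05, 0.50, 0.05, 0.05, 0.30, 0.05], LABELS, alpha=0.2, beta=0.1)
```

Because the dominant emotion is always the one with the higher probability, the 70/30 and 30/70 configurations are represented by a single return statement with the top-ranked emotion receiving 70.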

5. Results and Discussion

5.1. Unimodal Results

Table 2 reports the results obtained when training the proposed unimodal models with each encoder on the 5-fold experiment. Among all encoders, HiCMAE achieves the best overall performance, outperforming the second-best encoder, ImageBind, by approximately 6–8% in $\mathrm{Acc}_{presence}$ and 3–4% in $\mathrm{Acc}_{salience}$ across all aggregation strategies except LSTM. This advantage is expected, as HiCMAE is a multimodal encoder and therefore benefits from inherently richer representations, whereas the remaining encoders operate on a single modality. The other visual encoder, VideoMAE V2, yields performance comparable to ImageBind, ranking third and second, respectively. In contrast, WavLM and, to a greater extent, HuBERT are consistently outperformed by the visual encoders. These results suggest that, for this task, the visual modality provides the most informative cues, leading to notably stronger representations and higher performance.

Regarding the temporal aggregation strategies, Statistical Aggregation and the two Transformer-based approaches achieve similar overall results, with an average $\mathrm{Acc}_{presence}$ of around 30% and an average $\mathrm{Acc}_{salience}$ of approximately 14.4%. Performance differences between these three methods are noted only for encoders with very few temporal features, such as HiCMAE and VideoMAE V2. In these cases, Attention Pooling outperforms Statistical Aggregation by nearly 2% in $\mathrm{Acc}_{presence}$ for HiCMAE, and yields an improvement of roughly 1% in $\mathrm{Acc}_{salience}$ for VideoMAE V2 while maintaining its $\mathrm{Acc}_{presence}$ score. In contrast, the LSTM-based approach performs noticeably worse, with decreases of approximately 5% in $\mathrm{Acc}_{presence}$ and 3% in $\mathrm{Acc}_{salience}$. Overall, the LSTM method ranks last among all temporal aggregation strategies.

Finally, it is worth noting the low performance achieved by the LSTM method when applied to the masked autoencoder models VideoMAE V2 and HiCMAE. This outcome is a consequence of these architectures already condensing temporal information natively, producing output sequences with too few temporal units for the Bi-LSTM module to learn meaningful temporal dependencies and converge effectively. These two results were omitted when calculating the mean performance of the LSTM method.
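The effect of this exclusion on the LSTM column of Table 2 can be checked with a few lines of arithmetic. The values below are the presence accuracies from that table; the variable and function names are illustrative.

```python
# Presence accuracies (%) from the LSTM column of Table 2.
lstm_presence = {
    "ImageBind": 28.84,
    "WavLM": 25.18,
    "HuBERT": 22.45,
    "HiCMAE": 6.53,
    "VideoMAE V2": 8.74,
}
excluded = {"HiCMAE", "VideoMAE V2"}  # masked autoencoder encoders


def mean(values):
    values = list(values)
    return sum(values) / len(values)


mean_all = mean(lstm_presence.values())
mean_kept = mean(v for k, v in lstm_presence.items() if k not in excluded)
print(f"all five encoders: {mean_all:.2f}")
print(f"without masked autoencoders: {mean_kept:.2f}")
```

The three-encoder mean comes out at roughly 25.5, in line with the value given in the Means row of Table 2 (the small difference presumably stems from rounding of the per-encoder entries), whereas including the two masked autoencoder results would lower it to roughly 18.3.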

Based on these results, Attention Pooling was chosen as the temporal aggregation method for the multimodal architecture.

5.2. Multimodal Results

Table 3 reports the multimodal results obtained with each configuration of the architecture, given as the mean and standard deviation over the five folds. The experiments began by fusing two encoders (ImageBind and WavLM) and progressively added encoders until all five were included. Results for the three fusion strategies employed are reported. Overall, the multimodal approaches substantially outperformed the unimodal ones: the difference between their respective best-performing configurations (Attention Pooling with HiCMAE versus the five-encoder Mixed approach) is approximately 6% in $\mathrm{Acc}_{presence}$ and about 4% in $\mathrm{Acc}_{salience}$.

It can also be seen that the Concatenation strategy is consistently outperformed by the Attention and Mixed strategies. Its average $\mathrm{Acc}_{presence}$ and $\mathrm{Acc}_{salience}$ scores are 37.67% and 15.10%, respectively, compared to approximately 39.5% and 19.1% for the other two methods.

Another trend is that performance generally improves as more encoders are incorporated into the fusion module, with the highest results obtained when all five encoders are used simultaneously. One exception arises with HuBERT, whose inclusion slightly degrades $\mathrm{Acc}_{presence}$ for the ImageBind + WavLM fusion under the Concatenation configuration and, consequently, under the Mixed configuration. This behaviour can be attributed to the non-attention fusion model being negatively influenced by the less informative HuBERT embeddings, an effect that does not occur in the Attention approach. Finally, it is important to mention that the best reported $\mathrm{Acc}_{presence}$ of 43.97% exceeds the 43% accuracy achieved by human judges, as reported by the authors in [14].

5.3. Discussion

Figure 4 shows the confusion matrix obtained by the best-performing model, the Mixed multimodal architecture on fold two, on the validation split. The overall $\mathrm{Acc}_{presence}$ was 47.86% and the $\mathrm{Acc}_{salience}$ was 27.92%. For this analysis, the blended emotions were treated as distinct classes independent of their salience levels. This representation enables a clearer examination of the strengths and limitations of the model.
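One way to obtain such salience-independent classes is to reduce each label to the sorted set of emotions it contains, as in the sketch below. The dictionary representation of a label and the function names are assumptions made for the example.

```python
from collections import Counter


def class_key(label):
    """Collapse {emotion: salience} into a salience-independent class name."""
    return " + ".join(sorted(label))


def confusion_counts(true_labels, predicted_labels):
    """Count (true class, predicted class) pairs for a confusion matrix."""
    return Counter(
        (class_key(t), class_key(p)) for t, p in zip(true_labels, predicted_labels)
    )
```

With this mapping, a 70/30 and a 30/70 blend of the same two emotions fall into the same class, so the confusion matrix reflects presence errors only.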

Among the individual emotions, the highest accuracies are achieved for neutral and happiness, with correct detection rates of 92% and 62%, respectively. In contrast, fear is the most challenging class, being frequently misclassified as its blended categories fear + anger and fear + sadness. Another noteworthy confusion pattern occurs for happiness, which is incorrectly labeled as happiness + sadness in 24% of the cases.

For the blended categories, the model performs particularly well on mixtures involving happiness. Indeed, three of the best-recognized blended classes are anger + happiness, disgust + happiness, and happiness + sadness, each achieving more than 60% accuracy. Conversely, and consistent with the unimodal results, blends involving fear remain difficult for the model. All fear-based mixtures appear among the five worst-performing blended classes, together with the combination anger + sadness, which is often mistaken for other blends, such as anger + happiness or sadness + disgust (each predicted in more than 10% of cases), and for the individual emotion sadness (12%). Something similar happens with anger + disgust, which is commonly mistaken for other disgust blends.

5.4. Limitations and Future Work

Two main limitations are identified in this study. First, there is difficulty in separating blended emotions from their constituent pure forms, particularly for the case of fear, as evidenced by the confusion matrix. Similar confusions are also observed for happiness and disgust. Second, a performance gap is observed between non-negative emotions, such as happiness and neutral, and negative emotions, such as disgust, sadness, and especially fear, all of which remain below the 45% accuracy threshold. This bias may originate either from the pre-trained encoders employed or from the inherent difficulty of the dataset, where human annotators themselves achieve only 43% performance in a comparable setting.

A possible direction for future work is the development of dataset-specific encoders for both modalities, either trained from scratch or fine-tuned using contrastive learning objectives such as triplet loss or angular margin losses. Such approaches could enhance inter-class separability, particularly between negative emotions and their blended counterparts, thereby addressing a key bottleneck that currently limits system performance.

6. Conclusions

The proposed method aimed to maximize the performance of the feature representations used in the baseline by focusing on multimodal fusion and placing particular emphasis on temporal modeling strategies. The experimental results demonstrated that multimodal processing substantially outperforms unimodal approaches, highlighting the complementary nature of audio and visual cues in the recognition of blended affective states. Unimodal experiments revealed that visual encoders consistently outperformed audio-based encoders, indicating that visual cues encapsulate more discriminative information for this task than the audio modality. As expected, the multimodal encoder HiCMAE further surpassed all single-modality encoders, benefiting from its ability to jointly integrate information across modalities during pretraining.

Among the evaluated fusion strategies, the Mixed architecture exhibited the strongest overall performance, while the simple Concatenation approach achieved the weakest results, particularly in the salience estimation task. This behaviour suggests that naive feature concatenation is insufficient to capture the complexity of the problem, whereas attention-based mechanisms are more effective at modeling the cross-modal dependencies required for accurate blended emotion recognition. This interpretation is further supported by the observation that the gap in $\mathrm{Acc}_{salience}$ between the Mixed and Attention architectures is below 1%, whereas in $\mathrm{Acc}_{presence}$ the difference is close to 2% (five-encoder configuration in Table 3). This indicates that, for the presence prediction task, the Concatenation branch contributes useful information comparable to the attention-based component, while for salience estimation it becomes a limiting factor.

Finally, the results validate the effectiveness of multimodal attention-based fusion for the BlEmoRe dataset, demonstrating its ability to produce robust predictions across subject-independent splits and confirming its suitability for both emotion presence detection and salience estimation.

Author Contributions

Conceptualization, J.S.-C., J.L.-N. and M.C.-S.; Formal analysis, J.S.-C.; investigation, J.S.-C.; Methodology, J.S.-C.; Project administration, J.L.-N. and M.C.-S.; Supervision, J.L.-N. and M.C.-S.; Validation, J.L.-N., M.C.-S., O.J.S. and D.H.-S.; Visualization, J.S.-C.; Writing—Original draft, J.S.-C.; Writing—Review and editing, J.L.-N., M.C.-S., O.J.S. and D.H.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the research involved the use of an existing, publicly available dataset. No new data were collected, and all experimental protocols, consent procedures, and ethical approvals were handled by the original dataset providers.

Informed Consent Statement

Informed consent was obtained from all participants by the original dataset providers. No new data were collected for this study.

Data Availability Statement

The data is available at Zenodo upon request at https://zenodo.org/records/17787362 (accessed on 12 March 2026).

Acknowledgments

This publication is part of the project PID2021-122402OB-C22, funded by MCIN/AEI/10.13039/501100011033/FEDER, EU, the ACIISI-Gobierno de Canarias and FEDER under project ULPGC Facilities Net and Grant EIS 2021 04, and by the Consejería de Universidades, Ciencia e Innovación y Cultura (Gobierno de Canarias) and the European Social Fund Plus (FSE+) under the funding framework for doctoral research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 1997.
2. Ezzameli, K.; Mahersia, H. Emotion recognition from unimodal to multimodal analysis: A review. Inf. Fusion 2023, 99, 101847.
3. Ekman, P. Basic emotions. In Handbook of Cognition and Emotion; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 1999; Volume 98, p. 16.
4. Russell, J.A. Pancultural aspects of the human conceptual organization of emotions. J. Personal. Soc. Psychol. 1983, 45, 1281.
5. Picard, R.W. Toward computers that recognize and respond to user emotion. IBM Syst. J. 2000, 39, 705–719.
6. Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019.
7. Kollias, D. Multi-Label Compound Expression Recognition: C-EXPR Database & Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 5589–5598.
8. Martínez, B.; Valstar, M.F. Advances, Challenges, and Opportunities in Automatic Facial Expression Recognition. In Advances in Face Detection and Facial Image Analysis; Springer: Cham, Switzerland, 2016.
9. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 527–536.
10. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31.
11. Li, S.; Deng, W. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Unconstrained Facial Expression Recognition. IEEE Trans. Image Process. 2019, 28, 356–370.
12. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
13. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower Provost, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
14. Lachmann, T.; Israelsson, A.; Tornberg, C.; Saghinadze, T.; Balazia, M.; Müller, P.; Laukka, P. Not all Blends are Equal: The BlEmoRe Dataset of Blended Emotion Expressions with Relative Salience Annotations. arXiv 2026, arXiv:2601.13225.
15. Moon, A.S.; Kim, H.; Park, Y.C.; Lee, J. A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions. Comput. Mater. Contin. 2026, 87, 1.
16. Portnova, G.V.; Liaukovich, K.; Sharaev, M. The acoustic features of sincere joyful, sad, and fearful human non-verbal vocalizations and its effect on the emotional valence of cat’s meowing. Cogn. Neurodyn. 2025, 19, 181.
17. Juslin, P.N.; Laukka, P.; Harmat, L.; Ovsiannikow, M. Spontaneous vocal expressions from everyday life convey discrete emotions to listeners. Emotion 2021, 21, 1281.
18. Israelsson, A.; Seiger, A.; Laukka, P. Blended Emotions can be Accurately Recognized from Dynamic Facial and Vocal Expressions. J. Nonverbal Behav. 2023, 47, 267–284.
19. Wu, Y.; Mi, Q.; Gao, T. A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions. Biomimetics 2025, 10, 418.
20. Yin, H.; Hohmann, V.; Nadeu, C. Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency. Speech Commun. 2011, 53, 707–715.
21. Abdul, Z.K.; Al-Talabani, A.K. Mel Frequency Cepstral Coefficient and its Applications: A Review. IEEE Access 2022, 10, 122136–122158.
22. Liao, J.; Duan, H.; Feng, K.; Zhao, W.; Yang, Y.; Chen, L.; Chen, Y. LR-ASD: Lightweight and Robust Network for Active Speaker Detection. Int. J. Comput. Vis. 2025, 133, 4749–4769.
23. Arshid, M.; Azam, M.R.; Danyal, M.; Gulfama, S.M. Text-Independent Speaker Recognition and Audio Integrity Verification in Next-Generation Communication Networks using MFCCs and Machine Learning. In Proceedings of the 2025 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 15–16 December 2025; pp. 1–6.
24. De Simone, G.; Greco, L.; Saggese, A.; Vento, M. Multimodal Audio-Visual Emotion Recognition for Social Robotics. In Computer Analysis of Images and Patterns; Castrillón-Santana, M., Travieso-González, C.M., Deniz Suarez, O., Freire-Obregón, D., Hernández-Sosa, D., Lorenzo-Navarro, J., Santana, O.J., Eds.; Springer: Cham, Switzerland, 2026; pp. 245–255.
25. Liu, T.; Wang, M.; Yang, B.; Liu, H.; Yi, S. ESERNet: Learning spectrogram structure relationship for effective speech emotion recognition with swin transformer in classroom discourse analysis. Neurocomputing 2025, 612, 128711.
26. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
27. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
28. Wu, X.; Xue, J.; Yin, X.; Shi, Y.; Fu, L.; Huang, D.; Wang, Y.; Zhang, J.; Nie, J.; Wang, J. Scalable Audiovisual Masked Autoencoders for Efficient Affective Video Facial Analysis. Intell. Comput. 2026, 5, 0246.
29. Hsu, W.; Sriram, A.; Baevski, A.; Likhomanenko, T.; Xu, Q.; Pratap, V.; Kahn, J.; Lee, A.; Collobert, R.; Synnaeve, G.; et al. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. arXiv 2021, arXiv:2104.01027.
30. Friesen, W.V.; Ekman, P. EMFACS-7: Emotional Facial Action Coding System. 1984; Unpublished Manual.
31. Zhi, R.; Liu, M.; Zhang, D. A comprehensive survey on automatic facial action unit analysis. Vis. Comput. 2020, 36, 1067–1093.
32. Salas-Cáceres, J.; Lorenzo-Navarro, J.; Freire-Obregón, D.; Castrillón-Santana, M. Multimodal emotion recognition based on a fusion of audiovisual information with temporal dynamics. Multimed. Tools Appl. 2025, 84, 27327–27343.
33. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210.
34. Ran, K.; Cheng, F.; Li, X. ViT-RF: A Two-Stage Transformer-Based Framework for Robust Classroom Learning Emotion Recognition. In Proceedings of the 2025 8th International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 23–26 May 2025; pp. 513–518.
35. Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. ImageBind: One Embedding Space To Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 15180–15190.
36. Wang, L.; Huang, B.; Zhao, Z.; Tong, Z.; He, Y.; Wang, Y.; Wang, Y.; Qiao, Y. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14549–14560.
37. Qin, L.; Wang, M.; Deng, C.; Wang, K.; Chen, X.; Hu, J.; Deng, W. SwinFace: A Multi-Task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2223–2234.
38. Singh, M.; Singh, R.; Ross, A. A comprehensive overview of biometric fusion. Inf. Fusion 2019, 52, 187–205.
39. Lei, J.; Ye, K.; Wang, Y. Cross-attention fusion for audio-visual emotion recognition with shared transformer. Speech Commun. 2026, 178, 103356.
40. Boitel, E.; Mohasseb, A.; Haig, E. MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis. Expert Syst. Appl. 2025, 270, 126236.
41. Keutmann, M.K.; Moore, S.L.; Savitt, A.; Gur, R.C. Generating an item pool for translational social cognition research: Methodology and initial validation. Behav. Res. Methods 2015, 47, 228–234.
42. Ortega-Beltrán, E.; Cabacas-Maso, J.; Benito-Altamirano, I.; Ventura, C. Overlooked romance languages: In-the-wild emotion recognition in italian and spanish speakers. Pattern Recognit. Lett. 2026, 202, 170–175.
43. Sun, L.; Lian, Z.; Liu, B.; Tao, J. HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition. Inf. Fusion 2024, 108, 102382.
44. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
45. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70; ICML’17; JMLR.org: Norfolk, MA, USA, 2017; pp. 1321–1330.
46. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.

Figure 1. Exemplary frames of the dataset. Extracted from [14].

Figure 2. Unimodal branch used to extract temporal features.

Figure 3. Mixed fusion multimodal architecture, integrating both the Attention and Concatenation branches. When used independently, each branch includes an MLP that produces its corresponding output before fusion.

Figure 4. Confusion matrix of the best-performing model on the validation set.

Table 1. Hyperparameters used in the experiments.

Hyperparameter	Symbol	Value
Fixed modality dimension	$u_{d}$	512
Fixed number of temporal units	$s_{l}$	64
Number of LSTM layers	$n_{LSTM}$	3
Number of Transformer layers	$n_{t}$	3
Number of heads in Transformer layers	$n_{h}$	16
Weight of KL divergence	$w_{KL}$	0.5
Weight of BCE loss	$w_{BCE}$	0.5
Weight of rank loss	$w_{rank}$	0.3
Enforced margin of rank loss	$δ$	0.1
Label Smoothing	$ϵ$	0.05

Table 2. Unimodal results (%) obtained on the 5-fold experiment by the encoders. Best performance on each category is in bold. LSTM results were excluded from the mean performance calculation of VideoMAE and HiCMAE.

Encoder	Statistical Aggregation		LSTM		CLS Token		Attention Pooling
Encoder	Presence	Salience	Presence	Salience	Presence	Salience	Presence	Salience
ImageBind	30.54 ± 1.9	14.33 ± 1.5	28.84 ± 1.2	12.59 ± 1.2	29.94 ± 1.6	14.45 ± 1.5	30.08 ± 1.2	14.35 ± 1.5
WavLM	28.49 ± 3.1	13.73 ± 1.5	25.18 ± 1.8	11.55 ± 1.2	27.11 ± 2.7	13.62 ± 1.0	27.48 ± 1.7	12.88 ± 1.5
HuBERT	25.94 ± 2.7	13.15 ± 1.0	22.45 ± 1.9	11.27 ± 0.9	26.35 ± 2.3	12.98 ± 0.9	25.60 ± 2.3	12.91 ± 1.6
HiCMAE	36.69 ± 2.5	17.76 ± 1.4	06.53 ± 1.2	02.88 ± 1.5	37.17 ± 2.5	18.05 ± 0.6	38.34 ± 3.0	18.23 ± 1.7
VideoMAE V2	29.23 ± 2.1	12.91 ± 1.1	08.74 ± 4.1	03.28 ± 1.8	29.88 ± 1.8	12.92 ± 1.4	29.63 ± 2.3	13.80 ± 1.9
Means	30.18	14.36	25.48	11.80	30.09	14.40	30.23	14.43

Table 3. Multimodal results (%) obtained on the 5-fold experiment by the different aggregation of encoders. Best performance on each category is in bold.

Encoders	Concatenation		Attention		Mixed
Encoders	Presence	Salience	Presence	Salience	Presence	Salience
ImageBind + WavLM	35.15 ± 2.0	14.30 ± 2.4	37.23 ± 3.1	17.41 ± 2.9	37.54 ± 2.0	17.02 ± 3.3
ImageBind + WavLM + HuBERT	34.87 ± 2.0	13.38 ± 0.7	37.27 ± 1.8	18.27 ± 2.3	36.06 ± 2.2	17.72 ± 1.5
ImageBind + WavLM + HuBERT + HiCMAE	39.34 ± 2.9	15.40 ± 1.9	41.41 ± 3.5	19.76 ± 2.3	42.00 ± 3.4	20.63 ± 2.7
ImageBind + WavLM + HuBERT + HiCMAE + VideoMAE	41.30 ± 2.3	17.31 ± 1.5	42.17 ± 2.3	21.02 ± 2.7	43.97 ± 3.8	21.95 ± 4.0
Means	37.67	15.10	39.52	19.12	39.89	19.33

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
