Article

Context_Driven Emotion Recognition: Integrating Multi_Cue Fusion and Attention Mechanisms for Enhanced Accuracy on the NCAER_S Dataset

by
Merieme Elkorchi
1,2,*,
Boutaina Hdioud
1,
Rachid Oulad Haj Thami
1 and
Safae Merzouk
2
1
Advanced digital enterprise modeling and information retrieval (ADMIR) laboratory, Rabat IT Center, Information retrieval and data analytics team (IRDA), ENSIAS, Mohammed V University in Rabat, Rabat 11000, Morocco
2
SMARTiLab Laboratory, Moroccan School of Engineering Sciences (EMSI Rabat/SMARTILAB), Rabat 11000, Morocco
*
Author to whom correspondence should be addressed.
Information 2025, 16(10), 834; https://doi.org/10.3390/info16100834
Submission received: 10 August 2025 / Revised: 10 September 2025 / Accepted: 14 September 2025 / Published: 26 September 2025
(This article belongs to the Special Issue Multimodal Human-Computer Interaction)

Abstract

In recent years, most conventional emotion recognition approaches have concentrated primarily on facial cues, often overlooking complementary sources of information such as body posture and contextual background. This limitation reduces their effectiveness in complex, real-world environments. In this work, we present a multi-branch emotion recognition framework that separately processes facial, bodily, and contextual information using three dedicated neural networks. To better capture contextual cues, we intentionally mask the face and body of the main subject within the scene, prompting the model to explore alternative visual elements that may convey emotional states. To further enhance the quality of the extracted features, we integrate both channel and spatial attention mechanisms into the network architecture. Evaluated on the challenging NCAER-S dataset, our model achieves an accuracy of 56.42%, surpassing the state-of-the-art GLAMOR-Net. These results highlight the effectiveness of combining multi-cue representation and attention-guided feature extraction for robust emotion recognition in unconstrained settings. The findings also underscore the importance of accurate emotion recognition for human–computer interaction, where affect detection enables systems to adapt to users and deliver more effective experiences.

1. Introduction

Humans interpret emotions through multiple signals, including facial expressions, physical motion and gait [1,2], body pose [3], environmental context, and speech [4], going far beyond the single-cue approach typically used by machines. Despite considerable advances in research, machines still struggle to match our natural ability to recognize nuanced, adaptive emotions. Nevertheless, extracting human emotions from visual data has become a source of significant interest in computer vision research, leading to advanced applications in sectors such as surveillance, driver monitoring, healthcare, and human–computer interaction systems [5,6,7,8,9].
Emotion recognition also plays a critical role in human–computer interaction (HCI). By detecting emotions from the face, body, and surrounding context, interactive systems can adapt their responses to user needs in real time. For instance, a tutoring application can recognize when a learner shows signs of frustration and provide additional guidance, or a virtual assistant can identify positive engagement and maintain the current dialogue. Strengthening emotion recognition methods, therefore, contributes directly to improving adaptability, usability, and the overall quality of user experience in human–computer interaction.
Earlier methods [10,11,12,13] focus only on facial expressions and consider them as the main form of non-verbal communication. In this context, datasets like AFEW [14] and FER2013 [15] provide only cropped and aligned facial images. However, these approaches struggle in real-world situations because they overlook crucial contextual details beyond the face.
In an experimental psychology study, Piana et al. [16] found that body language conveys affective information that facial expressions alone cannot fully capture. Subsequent emotion recognition methods demonstrated the importance of combining facial expressions with body language, movements, postures, and gestures. However, these cues alone are often insufficient in real-world scenarios, where contextual elements also influence emotions. Context includes background information, interactions, and situational cues that provide a more complete, human-like perception of emotions. Several studies [17,18,19] have incorporated this context to improve emotion recognition; for example, [17] combined features extracted from the face with features from the context, defined as the entire image excluding the face. Despite these advances, generalizing across diverse environments with heterogeneous elements remains a significant challenge. Similarly, [18] combined face, body, and context features to capture different aspects of emotional expression; however, they kept the body within the context cue, which introduces redundancy. This overlap causes the model to learn similar information from both cues, leading it to focus on repetitive features and ultimately reducing the network's overall performance.
To deal with the above problems, we remove both the body and the face of the main subject from the context image, forcing the model to focus on more relevant surrounding features. Our goal is to enable the model to learn the key elements necessary for emotion recognition. Additionally, we apply spatial and channel attention mechanisms in the context- and face-encoding streams to guide the model's focus toward critical regions, thereby enhancing its ability to extract meaningful information and improving overall performance.
In this paper, we present a multi-source emotion recognition model that detects human emotions using image data. In our approach, we identified three distinct streams that are exploited to improve classification performance in various emotion categories. These streams are the face-encoding stream, which focuses on capturing facial cues, the body-encoding stream, which analyzes posture, and the context-encoding stream, which extracts scene-level information. The key idea in our approach is the removal of the face and body from the context images, enabling the model to focus more on background details and surrounding elements, which may offer additional emotional cues. Moreover, spatial and channel attention mechanisms are incorporated after each convolutional block in the face- and context-encoding streams, enabling the model to prioritize the most pertinent features for emotion recognition. We considered the work of Willams Costa et al. [18] as a baseline and developed an enhanced pipeline that improves emotion recognition accuracy by integrating these three streams. We summarize our main contributions as follows:
  • We employ YOLOv8 to extract the face and body of the main actor with greater precision.
  • We enhance the accuracy of the combined model by applying spatial and channel attention mechanisms after each convolution.
  • We address class imbalance using focal loss, which significantly enhances the model's ability to learn from underrepresented classes.
  • The context image is constructed by removing the face and body of the primary subject, encouraging the model to explore other visual elements within the scene.
  • The proposed framework leverages complementary cues from face, body, and context to address the limitations of single-modality approaches.
  • Contextual understanding is enhanced by directing the model's attention toward background objects and secondary actors in the scene.
  • Emotion recognition performance is improved by removing redundant features and forcing the network to learn from diverse visual signals.
To facilitate readability, the main notations and abbreviations used in this paper are summarized in Table 1.

2. Related Work

2.1. Emotion Recognition

As human–machine interactions continue to evolve, the aim of emotion recognition is to accurately understand human emotion. Most approaches to human emotion recognition began with efforts to analyze facial expressions as the primary source of emotional cues. However, facial expressions alone often fail to capture the complexity of emotions in real-world scenarios, even though methods for extracting meaningful features from the face have advanced considerably [10,13,20,21].
To address this limitation, methods that use body language have been proposed [22,23]. Mei Si et al. [22] used features representing posture and body movements to automatically detect people's emotions in non-acted scenarios, focusing on four emotions often observed when people play video games: triumph, frustration, defeat, and concentration. Ahmed et al. [23] proposed a two-layer method to select emotion-specific features from a comprehensive list of body movement descriptors; this approach addressed the challenge of identifying relevant features from a vast set of human body movements and achieved a very high emotion recognition rate.
Combinations of these cues have also been applied to enhance emotion recognition, and multimodal approaches often achieve better classification performance than uni-modal methods. David Griol et al. [24] introduced an emotion recognition system that integrates both speech and facial cues, combining the two modalities with a late fusion strategy. Caridakis et al. [25] improved the recognition of eight emotions in ten subjects by fusing features extracted from facial expressions, body movement and gestures, and speech. Recently, much work has focused on exploring context-sensitive information for emotion recognition. A first step in this direction was the creation of the Emotions in Context Database (EMOTIC) [26], a dataset of images showing people in non-controlled, real-world environments; it motivated further research [27] that extracts meaningful information from both the person and the surrounding scene, which provides contextual information, and then combines the extracted features to enhance emotion recognition. Lee et al. [17] introduced a novel benchmark for context-aware emotion recognition, called CAER, together with a deep learning architecture called CAER-Net, which integrates facial expression analysis with contextual information by hiding faces in the scene so that attention mechanisms focus on other contextual elements. Yang et al. [28] identified four contexts to improve emotion recognition: (1) multimodal context combines facial expressions, landmarks, gestures, and gait for representation; (2) scene context uses attention modules to analyze surroundings and extract emotion semantics; (3) surrounding-agent context explores emotional influence among agents in the same scene; (4) agent–object context examines interactions between agents and objects. Le et al. [29] proposed a global–local attention mechanism that extracts features from facial and contextual regions independently and then learns them simultaneously using the attention module. Willams Costa et al. [18] proposed a new direction for emotion recognition based on multiple cues, combining face and context information with body poses. A comparative summary of the related studies discussed above is presented in Table 2.

2.2. Attention Mechanisms

Attention mechanisms originated in neural machine translation [30]. Researchers soon generalized this concept, integrating attention into transformer architectures that revolutionized natural language processing [31], and later adapting it to enhance image classification [32,33,34] and image segmentation [35,36] in computer vision. In recent years, attention has become an important component of neural networks for emotion recognition: Aminbeidokhti et al. [37] applied spatial attention by generating masks that highlight the most informative facial regions. Building on this, Squeeze-and-Excitation blocks later introduced channel-wise recalibration to weight informative feature maps [38]. More recently, researchers have embedded combined channel and spatial attention modules within convolutional backbones to refine feature selection, enabling models to emphasize both the most informative filters and the most informative spatial locations simultaneously, and thus bolster performance under occlusion and complex backgrounds [39].

3. Methodology

3.1. Proposed Network Architecture

We describe our proposed method in this section. As illustrated in Figure 1, it uses a single dataset, NCAER-S, and for each sample three images are extracted. The method is structured into three streams: the face-encoding stream, the body-encoding stream, and the context-encoding stream. The features extracted from these streams are combined and processed through two dense layers; a softmax function then classifies the output into seven emotions. The following subsections present a detailed explanation of each encoding stream.

3.2. Preprocessing Pipeline

Before we begin training, we first apply a preprocessing step to add some variability between epochs while keeping the important features intact. Each input image, whether from the face-, context-, or body-encoding streams, is resized to a consistent 112 × 112 resolution.

3.3. Face-Encoding Stream

Our approach aims to identify the main actor’s face, which is present in each image. We assume that each scene corresponds to a single emotion; however, the NCAER-S dataset [29] does not clarify which actor is associated with the annotated emotion, making the task particularly complex. The challenge arises from the fact that several faces can appear in the same scene, each with a different emotion.
To address this, we first apply YOLOv8m_200e, an intermediate version of YOLOv8 [40] trained for 200 epochs. The model effectively identifies and localizes faces in complex scenes within the NCAER-S dataset, which includes multiple people, varying lighting conditions, angles, and occlusions. It generates bounding boxes around the detected faces, denoted B_j = [x_j, y_j, w_j, h_j] for j = 1, …, N, where N is the number of detected faces in the image. We assume that the face of the main actor is the one with the largest bounding box. The area A_j of each box is computed using Equation (1), and the index index_max of the face with the largest area is obtained from Equation (2). We use this index to retrieve the corresponding bounding box B_f via Equation (3), which is then used to extract the face region from the original image I_original, as shown in Equation (4). Finally, we associate the extracted face with the scene's emotion annotation. The overall process of detecting and extracting the main actor's face is illustrated in Figure 2.
A_j = w_j × h_j.    (1)
index_max = argmax_j(A_j).    (2)
B_f = B_{index_max}.    (3)
I_f = I_original[y_f : y_f + h_f, x_f : x_f + w_f].    (4)
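For illustration, the selection rule of Equations (1)–(4) can be sketched with the Ultralytics YOLOv8 API as follows. This is a minimal sketch, not the authors' released code: the face-detection weight file name is a placeholder, and the boxes are taken in corner format, which carries the same information as the top-left/width/height form used above.

```python
# Sketch of the main-actor face selection (Equations (1)-(4)).
# "yolov8m_200e.pt" is a placeholder for the face-detection checkpoint described above.
import cv2
from ultralytics import YOLO

face_detector = YOLO("yolov8m_200e.pt")  # hypothetical face-detection weights

def extract_main_face(image_path):
    image = cv2.imread(image_path)                              # I_original
    boxes = face_detector(image)[0].boxes.xyxy.cpu().numpy()    # one (x1, y1, x2, y2) per face
    if len(boxes) == 0:
        return None
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # A_j       (Eq. 1)
    j = int(areas.argmax())                                     # index_max         (Eq. 2)
    x1, y1, x2, y2 = boxes[j].astype(int)                       # B_f               (Eq. 3)
    return image[y1:y2, x1:x2]                                  # I_f, face crop    (Eq. 4)
```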
The cropped face of the main actor, denoted I_f, is then used as input to the face-encoding stream, illustrated in Figure 3. This stream consists of five 5 × 5 convolutional layers, each followed by batch normalization and a rectified linear unit (ReLU) activation. Motivated by [39], the face-encoding stream applies a CBAM block after each convolution and batch normalization to direct the model toward the most relevant features at every stage and to eliminate irrelevant information before the next layer processes it, improving both local and global feature extraction.
The CBAM block operates by sequentially applying channel attention and spatial attention modules. It first applies channel attention to the input feature map, denoted F, to refine features based on their importance. The channel attention map M_C is computed by first applying average pooling and maximum pooling operations to the input feature map F to obtain two feature vectors. These vectors are then passed through a shared multi-layer perceptron (MLP) [41]. Finally, the outputs MLP(AvgPool(F)) and MLP(MaxPool(F)) are summed element-wise, and the Sigmoid activation function σ is applied to normalize the values to a range between 0 and 1. In short, the channel attention is computed as follows:
M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))).    (5)
where σ represents the Sigmoid activation function.
The refined feature map F′ is obtained by multiplying the channel attention map M_C(F) with the input feature map F, enhancing the most important channels and suppressing less relevant ones. The specific calculation is as follows:
F′ = M_C(F) · F.    (6)
Next, the block applies spatial attention to emphasize key regions in the image. The spatial attention map M_S(F′) is computed by first applying average pooling and maximum pooling operations along the channel axis of the feature map F′. The two pooled feature maps are then concatenated, and a convolution with a 7 × 7 kernel is applied. Finally, the Sigmoid activation function σ normalizes the values to a range between 0 and 1. The specific calculation is as follows:
M_S(F′) = σ(Conv2D_{7×7}(Concat(AvgPool(F′), MaxPool(F′)))).    (7)
where σ represents the Sigmoid activation function.
Finally, F″ results from the element-wise multiplication of the spatial attention map M_S(F′) and F′. This process allows the network to focus on significant spatial regions.
F″ = M_S(F′) · F′.    (8)
As shown in Figure 3, we then apply max pooling with a stride of 2 and, finally, dropout to avoid overfitting.
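To make the attention-guided face stream concrete, the following Keras sketch implements Equations (5)–(8) and a face encoder in the spirit of Figure 3. It is an illustrative sketch under stated assumptions: the MLP reduction ratio, the filter counts of the five 5 × 5 convolutions, and the exact placement of ReLU, pooling, and dropout relative to the CBAM block are not specified in the text and are chosen here for concreteness.

```python
# Illustrative Keras sketch of CBAM (Equations (5)-(8)) and of a face-encoding stream
# built from five 5x5 convolutions; reduction ratio, filter counts, and the placement
# of ReLU/pooling/dropout around CBAM are assumptions made for this sketch.
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, ratio=8):
    # M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))                    (Eq. 5)
    ch = x.shape[-1]
    mlp = tf.keras.Sequential([layers.Dense(ch // ratio, activation="relu"),
                               layers.Dense(ch)])                 # shared MLP
    avg = mlp(layers.GlobalAveragePooling2D()(x))
    mx = mlp(layers.GlobalMaxPooling2D()(x))
    scale = layers.Reshape((1, 1, ch))(tf.sigmoid(avg + mx))
    return x * scale                                              # F' = M_C(F) . F   (Eq. 6)

def spatial_attention(x):
    # M_S(F') = sigmoid(Conv2D_7x7([AvgPool(F'); MaxPool(F')]))              (Eq. 7)
    avg = tf.reduce_mean(x, axis=-1, keepdims=True)
    mx = tf.reduce_max(x, axis=-1, keepdims=True)
    scale = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        tf.concat([avg, mx], axis=-1))
    return x * scale                                              # F'' = M_S(F') . F'  (Eq. 8)

def cbam(x):
    return spatial_attention(channel_attention(x))

def face_encoder(input_shape=(112, 112, 3), filters=(32, 64, 128, 128, 256)):
    # Five blocks: Conv(5x5) -> BN -> CBAM -> ReLU -> MaxPool(stride 2) -> Dropout.
    inp = layers.Input(input_shape)
    x = inp
    for f in filters:                  # filter counts are illustrative, not from the paper
        x = layers.Conv2D(f, 5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = cbam(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
        x = layers.Dropout(0.3)(x)
    return tf.keras.Model(inp, layers.Flatten()(x), name="face_encoder")
```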

3.4. Body-Encoding Stream

Figure 4 illustrates the overall procedure for detecting and extracting the main actor’s body using YOLOv8 [40]. In this process, and in order to incorporate body cues for non-verbal emotion recognition, we relied on the pre-trained YOLOv8 [40] model, which was trained on diverse benchmark datasets to automatically detect and segment human bodies within each scene.
As can be seen in Figure 2, once the main actor’s face and its bounding box B_f have been identified, we move on to extracting the body. We use YOLOv8 [40] to detect and segment the bodies in the image I_original, each with its own bounding box expressed by Equation (9). Each bounding box is defined by the coordinates of its top-left corner and its width and height.
B_i = [x_i, y_i, w_i, h_i],  for i = 1, …, N.    (9)
where N is the number of detected bodies in the image I_original.
Thus, to obtain the main actor's body, we calculate the intersection area between the main actor's face bounding box B_f and each detected body bounding box B_i using Equation (10). The body bounding box with the largest intersection area is selected as the box of the primary actor's body, B_body.
A_inter(B_f, B_i) = max(0, min(x_f + w_f, x_i + w_i) − max(x_f, x_i)) × max(0, min(y_f + h_f, y_i + h_i) − max(y_f, y_i)).    (10)
where A_inter(B_f, B_i) represents the area of intersection between the face bounding box B_f and the body bounding box B_i.
Each bounding box is associated with a segmentation mask M_i, a binary mask indicating the presence of the detected object within the bounding box. After identifying the bounding box of the primary actor's body, denoted B_body, the corresponding segmentation mask M_body is applied to the original image I_original, resulting in the image I_b, in which only the primary actor's body is isolated and the background is set to black. In short, this can be expressed as:
I_b = I_original × M_body.    (11)
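A hedged sketch of this body-selection step, Equations (9)–(11), is given below using a standard pre-trained YOLOv8 segmentation checkpoint; the weight file name, the person-class filtering, and the mask resizing step are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of Equations (9)-(11): choose the detected person whose box overlaps the
# main actor's face box the most, then keep only that person's pixels.
import cv2
import numpy as np
from ultralytics import YOLO

body_model = YOLO("yolov8m-seg.pt")  # standard pre-trained segmentation weights (assumed)

def intersection_area(face_box, body_box):
    # A_inter(B_f, B_i) with boxes given as (x, y, w, h)                     (Eq. 10)
    xf, yf, wf, hf = face_box
    xi, yi, wi, hi = body_box
    iw = max(0.0, min(xf + wf, xi + wi) - max(xf, xi))
    ih = max(0.0, min(yf + hf, yi + hi) - max(yf, yi))
    return iw * ih

def extract_main_body(image, face_box):
    res = body_model(image)[0]
    if res.masks is None:
        return None
    keep = (res.boxes.cls == 0).cpu().numpy()             # person detections only
    if not keep.any():
        return None
    boxes = res.boxes.xywh.cpu().numpy()[keep]            # B_i, centre (x, y, w, h)   (Eq. 9)
    masks = res.masks.data.cpu().numpy()[keep]            # M_i, one binary mask each
    boxes[:, :2] -= boxes[:, 2:] / 2                      # convert centres to top-left corners
    k = int(np.argmax([intersection_area(face_box, b) for b in boxes]))
    m = cv2.resize(masks[k], (image.shape[1], image.shape[0]))   # M_body at image resolution
    return image * (m[..., None] > 0.5).astype(image.dtype)      # I_b = I_original x M_body (Eq. 11)
```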
We use the main actor's body image, denoted I_b, as input for the body-encoding stream, as illustrated in the architecture in Figure 1.
As shown in Figure 5, the body-encoding module consists of a six-layer convolutional network with filter sizes of 128, 32, 128, 128, 96, and 192, each convolution followed by batch normalization and a rectified linear unit (ReLU) activation, with three max-pooling layers of kernel size 2. Finally, the output is flattened to prepare it for downstream processing.
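As a rough illustration of this stream, the following Keras sketch builds a six-block encoder with the stated filter sizes; the convolution kernel size and the exact placement of the three max-pooling layers are not given in the text and are assumed here.

```python
# Rough Keras sketch of the body-encoding stream; kernel size and pooling placement
# are assumptions, only the filter sizes and layer counts come from the text.
import tensorflow as tf
from tensorflow.keras import layers

def body_encoder(input_shape=(112, 112, 3)):
    inp = layers.Input(input_shape)
    x = inp
    for i, f in enumerate((128, 32, 128, 128, 96, 192)):
        x = layers.Conv2D(f, 3, padding="same")(x)     # 3x3 kernel assumed
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        if i % 2 == 1:                                 # three max-pooling layers in total
            x = layers.MaxPooling2D(pool_size=2)(x)
    return tf.keras.Model(inp, layers.Flatten()(x), name="body_encoder")
```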

3.5. Context-Encoding Stream

Figure 6 illustrates the procedure for generating the context image used in our model. To extract the context image, we utilized the previously obtained segmentation mask M_body of the primary actor's body. The segmented body region was then blacked out in the original image, which can be expressed mathematically as follows:
I_c = I_original × (1 − M_body).    (12)
This operation removes the actor and leaves only the surrounding environment. The final image, labeled I_c, keeps important details such as objects, background elements, and other people, providing useful context for the scene. This step ensures that the context stream does not learn information already captured by the body-encoding stream.
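Continuing the body-extraction sketch above, Equation (12) amounts to inverting the binary mask before multiplying, as in this minimal (assumed) helper:

```python
# Equation (12): invert the body mask so only the surroundings remain.
import numpy as np

def make_context_image(image, body_mask):
    # body_mask: {0, 1} array with the same spatial size as the image (M_body).
    inv = 1.0 - body_mask.astype(np.float32)                                  # 1 - M_body
    return (image.astype(np.float32) * inv[..., None]).astype(image.dtype)    # I_c
```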
To extract meaningful contextual information, we removed the body region from the images, and the resulting image, denoted I_c, is used as input to the context-encoding stream. As shown in Figure 7, this stream consists of seven convolutional layers with filter sizes of 32, 32, 64, 64, 128, 128, and 256. Channel attention [39] follows every two convolutional layers except the last one, allowing the model to focus on important feature channels and assign higher weights to the most relevant ones. The network thus captures the relationships between different feature maps and ensures that significant details contribute more to the final representation. After this, we integrate spatial attention [39] to detect crucial regions within the image and direct the model's focus to the most informative areas. After the attention mechanisms, we add rectified linear unit activations, average pooling operations to refine feature extraction, and dropout to prevent overfitting.
Both attention mechanisms [39] play a key role in improving accuracy as they help the model prioritize relevant features. The model applies them to filter out irrelevant information and retain critical scene details.

3.6. Adaptive Fusion Networks

According to our proposed architecture, before the final classification, the features extracted by the three encoding streams must be combined, as shown in Figure 1. The facial feature vector V_f, the body feature vector V_b, and the context feature vector V_c, defined in Equation (13), are concatenated into a single vector V_Concat, shown in Equation (14), which is then fed into two dense layers; a Softmax layer finally produces the classification.
V_f = (f_1, f_2, …, f_{n_f}) ∈ ℝ^{n_f},  V_b = (b_1, b_2, …, b_{n_b}) ∈ ℝ^{n_b},  V_c = (c_1, c_2, …, c_{n_c}) ∈ ℝ^{n_c}.    (13)
V_Concat = (f_1, …, f_{n_f}, b_1, …, b_{n_b}, c_1, …, c_{n_c}) ∈ ℝ^{n_f + n_b + n_c}.    (14)
This means that a new vector is created by sequentially placing the components of V_f, followed by those of V_b, and finally those of V_c; the concatenation operation merges the three vectors into a single vector.
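A minimal Keras sketch of this fusion head is shown below; the widths of the two dense layers are illustrative assumptions, since only the concatenation, the two dense layers, and the seven-way Softmax are specified in the text.

```python
# Sketch of the fusion head: concatenation (Eq. 14), two dense layers, Softmax over 7 emotions.
# The encoders are the three stream models (e.g., the sketches above); dense widths are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def fusion_head(face_enc, body_enc, ctx_enc, num_classes=7):
    face_in = layers.Input(face_enc.input_shape[1:])
    body_in = layers.Input(body_enc.input_shape[1:])
    ctx_in = layers.Input(ctx_enc.input_shape[1:])
    v = layers.Concatenate()([face_enc(face_in),       # V_f
                              body_enc(body_in),       # V_b
                              ctx_enc(ctx_in)])        # V_c  ->  V_Concat
    v = layers.Dense(512, activation="relu")(v)
    v = layers.Dense(256, activation="relu")(v)
    out = layers.Dense(num_classes, activation="softmax")(v)
    return tf.keras.Model([face_in, body_in, ctx_in], out, name="emotion_net")
```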

4. Experimental Setup

4.1. Datasets

We evaluate our proposed approach on the NCAER-S dataset. Le et al. [29] extracted NCAER-S from CAER-S [17] to improve its robustness: CAER-S [17] lacked generalization because many images in its training and test sets came from the same videos, which made models less effective at handling new data. The new dataset ensures a better separation between training and test images, which allows for more reliable evaluations. As Figure 8 shows, some classes do not have enough images in the test set; we therefore combined the training, test, and validation sets and split them again to obtain a more balanced distribution.

4.2. Visualization

Figure 8 shows the distribution of images for each category in the NCAER-S dataset. The “Neutral” class has the largest number of images and “Happy” the smallest, so the original NCAER-S dataset is unbalanced and some classes have very few images. Moreover, some test-set classes contain less than 10 percent of the total dataset, which makes evaluation unreliable. To address this, we merged the training, validation, and test sets and then split the dataset again, allocating 10 percent of the total images to the test set. This ensured a sufficient number of evaluation samples but did not fully resolve the imbalance across emotion categories. To mitigate this limitation, we applied data augmentation to the face, body, and context images. After resizing each image to 112 × 112 pixels, we introduced variability through random horizontal flipping, which generates mirrored views of body, context, and facial orientations to account for natural variations in posture and head direction; brightness adjustments (up to a maximum delta of 0.1) to emulate different lighting conditions; and contrast adjustments (within a range of 0.9–1.1) to capture variations in visual intensity. These transformations enriched the underrepresented classes while preserving their emotional content. Figure 9 presents the new dataset distribution after this change.
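For concreteness, the resizing and augmentation described above can be sketched with tf.image as follows; applying the same random flip consistently to the face, body, and context crops of a sample is a design consideration left out of this minimal sketch.

```python
# Sketch of the per-image preprocessing: resize to 112x112 and, during training,
# random flip, brightness (delta <= 0.1), and contrast (0.9-1.1) augmentation.
import tensorflow as tf

def preprocess(image, training=True):
    image = tf.image.resize(image, (112, 112))
    image = tf.cast(image, tf.float32) / 255.0
    if training:
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.1)
        image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
        image = tf.clip_by_value(image, 0.0, 1.0)
    return image
```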

4.3. Implementation Details

We implemented our networks using the TensorFlow 2.10 framework [42] and trained them with the Adam optimizer [43]. All experiments were performed on a workstation equipped with an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM), an Intel Core i7-13620H CPU, and 16 GB of RAM, running Windows 10. The learning rate is modulated by a cyclic schedule based on a cosine function to enhance training dynamics; it oscillates smoothly between a minimum value of 0.0001 and a maximum of 0.01 over cycles of 60 epochs. We also use early stopping to avoid overfitting, monitoring the validation loss and terminating training when no improvement is observed within a specified number of epochs. To deal with the class imbalance in our dataset, we use the focal loss function expressed in Equation (15), introduced by Lin et al. [44]. Unlike standard cross-entropy loss, focal loss reduces the weight of easy-to-classify examples and pushes the model to focus on harder, misclassified ones.
FocalLoss = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} α (1 − p_ic)^γ y_ic ln(p_ic).    (15)
where:
  • N is the number of samples.
  • C is the number of classes.
  • p_ic is the predicted probability for sample i and class c.
  • y_ic is the ground-truth label for sample i and class c.
  • α is a weighting factor that adjusts the importance of each class, and γ controls the degree of emphasis placed on hard-to-classify examples. In our implementation, we set α = 0.25 to address the class imbalance and γ = 2 to direct the model's attention towards more difficult examples.
The term (1 − p_ic)^γ increases the emphasis on samples for which the predicted probability p_ic of the correct class is low, while α balances the influence of each class.
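The training configuration of this section can be sketched as follows: a focal loss implementing Equation (15) with α = 0.25 and γ = 2, a cosine-based cyclic learning-rate schedule oscillating between 0.0001 and 0.01 over 60-epoch cycles, and early stopping on the validation loss. The exact cycle shape and the early-stopping patience are assumptions, as they are not fully specified in the text.

```python
# Sketch of the training configuration: focal loss (Eq. 15, alpha = 0.25, gamma = 2),
# a cosine-based cyclic learning rate in [1e-4, 1e-2] over 60-epoch cycles, and early
# stopping on the validation loss (the patience value here is an assumption).
import math
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0, eps=1e-7):
    def loss(y_true, y_pred):
        # y_true: one-hot labels, y_pred: softmax probabilities.
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        per_class = -alpha * tf.pow(1.0 - p, gamma) * y_true * tf.math.log(p)
        return tf.reduce_sum(per_class, axis=-1)      # Keras averages over the batch (1/N)
    return loss

def cosine_cyclic_lr(epoch, lr_min=1e-4, lr_max=1e-2, cycle=60):
    # Oscillates smoothly between lr_max (start of cycle) and lr_min (mid-cycle).
    t = (epoch % cycle) / cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(2.0 * math.pi * t))

# Typical wiring (illustrative):
# model.compile(optimizer=tf.keras.optimizers.Adam(), loss=focal_loss(), metrics=["accuracy"])
# callbacks = [tf.keras.callbacks.LearningRateScheduler(cosine_cyclic_lr),
#              tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
#                                               restore_best_weights=True)]
```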

5. Results and Discussion

In our experiments on the NCAER-S dataset, we found that integrating YOLOv8 into the face-selector algorithm described in Section 3.3 significantly improves the identification of the main actor compared to Dlib. Dlib, an open-source toolkit widely used for face detection and landmark localization [45], is known for its efficiency but often struggles in complex scenarios with multiple faces, occlusions, or extreme head poses. This limitation arises because Dlib relies on traditional regression-tree methods for face alignment rather than deep learning-based architectures, which restricts its robustness in unconstrained environments. In contrast, YOLOv8, as a modern deep learning detector, reliably captured every visible face in the complex scenes and dynamic images of the NCAER-S dataset, which made it much easier to isolate the main actor's face. This robustness to occlusions and extreme poses directly strengthened our emotion classification pipeline by ensuring that the correct facial region was always selected. Similarly, since the body-selector algorithm is based on the face-selector algorithm, we can conclude that it also effectively segments the correct body of the main actor.
Table 3 presents the quantitative results of our proposed experiment, which are examined in the subsequent discussion. Following standard practice in the literature [17,46], we evaluated all experiments using the accuracy metric.
Our first proposed experiment focuses on the fusion of three cues: face, body, and context. This combination is essential for achieving the best performance. Each cue plays a unique role in enhancing the results. For example, the body-encoding stream becomes important in complex situations where the context alone may not provide sufficient information. Likewise, the context stream improves the model’s accuracy when the body pose is unavailable.
In addition to the multi-cue fusion, we integrated attention mechanisms into the model by applying channel and spatial attention modules after each convolutional block in the face- and context-encoding streams, while leaving the body-encoding stream unchanged. This architecture yielded a significant improvement in classification accuracy. To assess the specific impact of the attention modules on the fusion of these cues, we trained an identical model without them, keeping all other parameters and training conditions constant. As shown in Table 3, the model achieved an accuracy of 56.42% with the attention mechanisms, compared to 54.29% without them. This demonstrates the importance of attention in refining the model's ability to focus on the most relevant features and further enhancing the fusion of face, body, and context cues. For the context-encoding stream, we deliberately removed the body information and added attention modules to reduce feature redundancy and to encourage the model to extract complementary cues from surrounding actors and objects. As illustrated in Figure 10, our method allows the context-encoding stream to focus on essential contextual details, such as the faces and bodies of other actors, which might be overshadowed when the body is present. This approach was especially beneficial in cases where multiple individuals appeared in the same scene and exhibited similar emotional expressions, as shown in Figure 11. In contrast, GLAMOR-Net [29] leaves the body inside the context image, which might distract the model from focusing on important background details.
Another experiment focused on addressing class imbalance using focal loss. We observed that its use resulted in a more rapid decrease in loss compared to the standard cross-entropy loss function. This behavior is primarily due to the inherent characteristics of focal loss [44], which places greater emphasis on hard-to-classify examples while downweighting easy ones. In contrast, standard cross-entropy loss, which is used in most existing work in the field of emotion recognition [18,29,47], treats all examples equally. This can cause the model to converge more slowly, especially when it is confident about easy examples.
The confusion matrix in Figure 12 shows that the model performs best on Fear, which reaches 85% accuracy. It also performs well on Disgust (75%) and Sad (61%). In contrast, it struggles to correctly classify Happy (35%), Surprise (35%), Neutral (49%), and Anger (38%). These four emotions are frequently confused with one another, especially with Neutral: the model misclassifies Happy as Neutral in 22% of cases, Surprise as Neutral in 21% of cases, and Anger as Neutral in 25% of cases. This pattern suggests both feature overlap and a strong class imbalance. The training set contains over 2500 Neutral examples, as shown in Figure 9, while the Happy and Surprise classes each contain almost half that number. This imbalance probably explains why the model defaults to Neutral in cases of uncertainty. Anger, which shares certain visual cues with Disgust and Neutral, also suffers from this effect.
The results show that our model achieves the best performance among all evaluated architectures.
Our proposed method was compared with several existing baselines, including MCF-Net [48] and GLAMOR-Net [29]. The results, shown in Table 4, indicate that our model outperforms all baseline architectures. Specifically, our network increases classification accuracy by 8.02 percentage points compared to GLAMOR-Net [29]. The reasons for the high performance of our proposed method are twofold: (1) MCF-Net [48] and GLAMOR-Net [29] process the entire scene, including body information, through a single context module, without explicitly isolating the body cues. In contrast, our architecture introduces a dedicated body-encoding stream that receives segmented body inputs. This directs the model's attention to gestures, stance, and motion, rather than expecting the context encoder to infer these features indirectly. As a result, the network learns pose-specific representations and subtle body movements that a holistic representation of the context might not capture. (2) Our architecture removes the body from the scene context because it has already been extracted as input for the body-encoding stream. Moreover, our approach applies channel and spatial attention mechanisms after every convolutional layer to direct both the context- and face-encoding streams toward the most relevant features at every stage.

Summary of Achieved Results

The experiments on the NCAER-S dataset demonstrate the effectiveness of the proposed multi-cue emotion recognition architecture. The main findings can be summarized as follows:
  • Face and Body Selection: Replacing Dlib with YOLOv8m_200e improved the robustness of actor detection, ensuring accurate face localization even under occlusion and complex poses. This enhancement also benefited the body-extraction stage, as the body region was derived from the detected face.
  • Multimodal Fusion: Combining face, body, and context features led to higher recognition accuracy than any unimodal configuration. Each modality contributed complementary information: body cues improved recognition in challenging scenarios, while context compensated for missing or ambiguous body cues.
  • Attention Mechanisms: The integration of channel and spatial attention (CBAM) within the face and context streams refined feature representation. The ablation study (Table 3) showed an accuracy gain from 54.29% (without attention) to 56.42% (with attention).
  • Comparison with State-of-the-Art: As shown in Table 4, the proposed model surpassed existing multimodal baselines such as MCF-Net and GLAMOR-Net. It achieved an accuracy of 56.42%, an improvement of 8.02 percentage points over GLAMOR-Net.
  • Per-Class Performance: The confusion matrix (Figure 12) revealed strong recognition of emotions such as Fear (85%) and Disgust (75%), but persistent confusion among visually similar categories like Happy, Surprise, and Neutral, largely due to dataset imbalance.
  • Loss Function: Employing focal loss instead of cross-entropy accelerated convergence and improved classification for minority classes.
These results confirm that robust actor selection, attention-guided feature refinement, and the adaptive fusion of face, body, and context streams jointly enhance recognition accuracy and robustness. The comparative findings also demonstrate that the proposed architecture provides a significant improvement over prior multimodal approaches.

6. Conclusions

This paper demonstrates that combining face, body, and context cues improves emotion recognition. Each element contributes uniquely: facial expressions reveal detailed emotions, body posture offers additional context, and the surrounding scene helps determine the emotional context. We enhanced feature extraction through spatial and channel attention mechanisms and designed the context stream to exclude the body region, forcing the model to focus on non-redundant scene elements. These design choices led to a clear improvement in classification accuracy. Our approach outperforms existing methods and demonstrates the effectiveness of integrating multiple visual cues with attention-based strategies for robust and scalable emotion recognition in real-world settings. Beyond classification, the proposed framework strengthens the role of emotion recognition in human–computer interaction, where robust emotion detection enables more adaptive, natural, and user-centered experiences.
Our method only applies to static images, which restricts the range of emotional information the model can capture. In the future, we intend to use temporal information coming from video sequences. Also, audio signals are another source of useful information, especially when there is uncertainty in the visual input. In addition, we intend to evaluate our method on a more complex dataset that includes multiple faces and complex scenes. This setup better reflects real-world conditions and enables the model to interpret a wider set of emotional cues that are closer to how humans recognize emotions.

Author Contributions

Conceptualization, M.E. and B.H.; methodology, M.E. and B.H.; software, S.M.; validation, B.H. and R.O.H.T.; formal analysis, B.H.; investigation, S.M.; resources, S.M.; writing—original draft preparation, M.E.; writing—review and editing, B.H.; supervision, B.H. and R.O.H.T.; project administration, B.H. and R.O.H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The NCAER-S dataset used in this study is publicly available at [29].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Randhavane, T.; Bhattacharya, U.; Kapsaskis, K.; Gray, K.; Bera, A.; Manocha, D. Identifying emotions from walking using affective and deep features. arXiv 2019, arXiv:1906.11884. [Google Scholar]
  2. Stathopoulou, I.O.; Tsihrintzis, G.A. Emotion recognition from body movements and gestures. In Intelligent Interactive Multimedia Systems and Services, Proceedings of the 4th International Conference on Intelligent Interactive Multimedia Systems and Services (IIMSS 2011), Piraeus, Greece, 20–22 July 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 295–303. [Google Scholar]
  3. Schindler, K.; Van Gool, L.; De Gelder, B. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Netw. 2008, 21, 1238–1246. [Google Scholar] [CrossRef]
  4. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014. [Google Scholar]
  5. Clavel, C.; Vasilescu, I.; Devillers, L.; Richard, G.; Ehrette, T. Fear-type emotion recognition for future audio-based surveillance systems. Speech Commun. 2008, 50, 487–503. [Google Scholar] [CrossRef]
  6. Yang, D.; Huang, S.; Xu, Z.; Li, Z.; Wang, S.; Li, M.; Wang, Y.; Liu, Y.; Yang, K.; Chen, Z.; et al. Aide: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20459–20470. [Google Scholar]
  7. Ali, M.; Mosa, A.H.; Al Machot, F.; Kyamakya, K. EEG-based emotion recognition approach for e-healthcare applications. In Proceedings of the 2016 Eighth International Conference on Ubiquitous and Future Networks (ICUFN), Vienna, Austria, 5–8 July 2016; pp. 946–950. [Google Scholar]
  8. Fragopanagos, N.; Taylor, J.G. Emotion recognition in human–computer interaction. Neural Netw. 2005, 18, 389–405. [Google Scholar] [CrossRef]
  9. Fukui, K.; Yamaguchi, O. Face recognition using multi-viewpoint patterns for robot vision. In Proceedings of the Robotics Research. The Eleventh International Symposium: With 303 Figures; Springer: Berlin/Heidelberg, Germany, 2005; pp. 192–201. [Google Scholar]
  10. Ma, X.; Lin, W.; Huang, D.; Dong, M.; Li, H. Facial emotion recognition. In Proceedings of the 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), Singapore, 4–6 August 2017; pp. 77–81. [Google Scholar] [CrossRef]
  11. Jain, D.K.; Shamsolmoali, P.; Sehdev, P. Extended deep neural network for facial emotion recognition. Pattern Recognit. Lett. 2019, 120, 69–74. [Google Scholar] [CrossRef]
  12. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef] [PubMed]
  13. Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion Aware Facial Expression Recognition Using CNN with Attention Mechanism. IEEE Trans. Image Process. 2019, 28, 2439–2450. [Google Scholar] [CrossRef] [PubMed]
  14. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Acted Facial Expressions in the Wild Database; ANU Computer Science Technical Report Series; TR-CS-11-02; The Australian National University: Canberra, Australia, 2011; Volume 2. [Google Scholar]
  15. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, 3–7 November 2013. Proceedings, Part III 20; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124. [Google Scholar]
  16. Piana, S.; Staglianò, A.; Odone, F.; Verri, A.; Camurri, A. Real-time Automatic Emotion Recognition from Body Gestures. arXiv 2014, arXiv:1402.5047. [Google Scholar] [CrossRef]
  17. Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10143–10152. [Google Scholar]
  18. Costa, W.L.; Macêdo, D.; Zanchettin, C.; Figueiredo, L.S.; Teichrieb, V. Multi-Cue Adaptive Emotion Recognition Network. arXiv 2021, arXiv:2111.02273. [Google Scholar] [CrossRef]
  19. Wang, Z.; Lao, L.; Zhang, X.; Li, Y.; Zhang, T.; Cui, Z. Context-dependent emotion recognition. J. Vis. Commun. Image Represent. 2022, 89, 103679. [Google Scholar] [CrossRef]
  20. Meng, D.; Peng, X.; Wang, K.; Qiao, Y. Frame attention networks for facial expression recognition in videos. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3866–3870. [Google Scholar]
  21. Georgescu, M.I.; Ionescu, R.T.; Popescu, M. Local learning with deep and handcrafted features for facial expression recognition. IEEE Access 2019, 7, 64827–64836. [Google Scholar] [CrossRef]
  22. Garber-Barron, M.; Si, M. Using body movement and posture for emotion detection in non-acted scenarios. In Proceedings of the 2012 IEEE International Conference on Fuzzy Systems, Brisbane, Australia, 10–15 June 2012; pp. 1–8. [Google Scholar]
  23. Ahmed, F.; Bari, A.H.; Gavrilova, M.L. Emotion recognition from body movement. IEEE Access 2019, 8, 11761–11781. [Google Scholar] [CrossRef]
  24. Luna-Jiménez, C.; Griol, D.; Callejas, Z.; Kleinlein, R.; Montero, J.M.; Fernández-Martínez, F. Multimodal emotion recognition on RAVDESS dataset using transfer learning. Sensors 2021, 21, 7665. [Google Scholar] [CrossRef]
  25. Caridakis, G.; Castellano, G.; Kessous, L.; Raouzaiou, A.; Malatesta, L.; Asteriadis, S.; Karpouzis, K. Multimodal emotion recognition from expressive faces, body gestures and speech. In Artificial Intelligence and Innovations 2007: From Theory to Applications, Proceedings of the 4th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI 2007), Athens, Greece, 19–21 September 2007; Springer: New York, NY, USA, 2007; pp. 375–388. [Google Scholar]
  26. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Emotion recognition in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1667–1675. [Google Scholar]
  27. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context based emotion recognition using emotic dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2755–2766. [Google Scholar] [CrossRef] [PubMed]
  28. Yang, D.; Huang, S.; Wang, S.; Liu, Y.; Zhai, P.; Su, L.; Li, M.; Zhang, L. Emotion recognition for multiple context awareness. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 144–162. [Google Scholar]
  29. Le, N.; Nguyen, K.; Nguyen, A.; Le, B. Global-local attention for emotion recognition. Neural Comput. Appl. 2022, 34, 21625–21639. [Google Scholar] [CrossRef]
  30. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  32. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on visual transformer. arXiv 2020, arXiv:2012.12556. [Google Scholar]
  33. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3286–3295. [Google Scholar]
  34. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  35. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  36. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  37. Aminbeidokhti, M.; Pedersoli, M.; Cardinal, P.; Granger, E. Emotion Recognition with Spatial Attention and Temporal Softmax Pooling. In Proceedings of the Image Analysis and Recognition, Waterloo, ON, Canada, 27–29 August 2019; pp. 323–331. [Google Scholar]
  38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  39. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  40. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 14 January 2024).
  41. Ramchoun, H.; Ghanou, Y.; Ettaouil, M.; Janati Idrissi, M.A. Multilayer perceptron: Architecture optimization and training. Int. J. Interact. Multimed. Artif. Intell. 2016, 4, 26–30. [Google Scholar] [CrossRef]
  42. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  44. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  45. King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
  46. Dhall, A.; Goecke, R.; Joshi, J.; Hoey, J.; Gedeon, T. Emotiw 2016: Video and group-level emotion recognition challenges. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 427–432. [Google Scholar]
  47. Do, N.T.; Kim, S.H.; Yang, H.J.; Lee, G.S.; Yeom, S. Context-aware emotion recognition in the wild using spatio-temporal and temporal-pyramid models. Sensors 2021, 21, 2344. [Google Scholar] [CrossRef]
  48. Xu, H.; Kong, J.; Kong, X.; Li, J.; Wang, J. MCF-Net: Fusion Network of Facial and Scene Features for Expression Recognition in the Wild. Appl. Sci. 2022, 12, 10251. [Google Scholar] [CrossRef]
Figure 1. Our proposed architecture for emotion recognition on the NCAER-S dataset.
Figure 2. Face detection and extraction of the main actor using YOLOv8.
Figure 3. Architecture of the CNN-based face-feature extractor.
Figure 4. Segmentation and extraction of the main actor’s body using YOLOv8.
Figure 5. Architecture of the CNN-based body-feature extractor.
Figure 6. Context image generation by masking the body of the main actor using the segmentation mask, preserving only the surrounding scene.
Figure 7. Architecture of the CNN-based context-feature extractor.
Figure 8. Distribution of images across classes (training and test) of NCAER-S.
Figure 9. Distribution of images across classes (training and test) of NCAER-S after splitting.
Figure 10. Visualization of attention heat maps from the last convolutional layer of the context-encoding stream (trained with CNN, channel, and spatial attention), highlighting the most crucial environmental cues in the NCAER-S dataset.
Figure 11. Example from the “happy” class in the NCAER-S dataset. The main actor’s body is masked to construct the context input, while another visible individual in the scene expresses a similar emotion. This highlights the relevance of contextual cues for emotion recognition, especially when other people in the environment reflect the same emotional state.
Figure 12. The confusion matrix of our proposed model on the NCAER-S dataset test set.
Table 1. Notations and their meanings, as used in the proposed method.

Notation | Meaning
N | Number of detected faces in an image.
B_j = [x_j, y_j, w_j, h_j] | Bounding box of the j-th detected face, defined by top-left corner (x_j, y_j), width w_j, and height h_j.
w_j, h_j | Width and height of the j-th bounding box.
A_j | Area of the j-th face bounding box.
index_max | Index of the face with the maximum bounding box area.
B_f | Bounding box of the main actor's face (largest detected face).
I_original | Original input image from the dataset.
I_f | Cropped face region of the main actor extracted from I_original.
B_i = [x_i, y_i, w_i, h_i] | Bounding box of the i-th detected body.
B_body | Bounding box of the main actor's body (largest overlap with B_f).
A_inter(B_f, B_i) | Intersection area between the face bounding box B_f and a body bounding box B_i.
M_i | Binary segmentation mask of the i-th detected body.
M_body | Segmentation mask corresponding to the main actor's body.
I_b | Body image of the main actor.
I_c | Context image obtained by masking the main actor's body.
F | Input feature map of the attention module.
M_C(F) | Channel attention map.
F′ | Refined feature map after applying channel attention.
M_S(F′) | Spatial attention map.
F″ | Final refined feature map after applying spatial attention to F′.
V_f = (f_1, f_2, …, f_{n_f}) | Feature vector extracted from the face-encoding stream, where f_k denotes the k-th component.
V_b = (b_1, b_2, …, b_{n_b}) | Feature vector extracted from the body-encoding stream, where b_k denotes the k-th component.
V_c = (c_1, c_2, …, c_{n_c}) | Feature vector extracted from the context-encoding stream, where c_k denotes the k-th component.
V_Concat | Concatenated vector of V_f, V_b, and V_c used for fusion.
n_f, n_b, n_c | Dimensions of the face, body, and context feature vectors.
σ | Sigmoid activation function used in channel and spatial attention.
Table 2. Summary of related work on approaches to emotion recognition.

References | Authors | Problems | Solving Methods
[22] | Mei Si et al. | Limited emotion recognition when using only facial expressions. | Used posture and body movement features to detect four emotions (triumph, frustration, defeat, concentration).
[23] | Ahmed et al. | Challenge of selecting relevant features from a large set of body movement descriptors. | Two-layer method to extract emotion-specific features of the body.
[24,25] | David Griol et al.; Caridakis et al. | Facial cues alone are insufficient for robust recognition of multiple emotions. | (1) Combined speech and facial cues using a late fusion strategy. (2) Fused facial expressions, body movements, gestures, and speech to improve recognition.
[17,18,26,27] | Kosti et al.; Lee et al.; Willams Costa et al. | Lack of datasets with contextual information in real-world scenarios. | (1) Created the EMOTIC dataset and extracted features from person and context for improved recognition. (2) Proposed the CAER benchmark and CAER-Net, integrating facial and contextual features with attention to non-facial elements. (3) Introduced a multimodal approach combining face, context, and body pose cues.
[28,29] | Yang et al.; Le et al. | Context in emotion recognition is inherently vast and multifaceted, making it difficult to model consistently across different environments and scenarios. | (1) Defined four context types: multimodal, scene, surrounding agent, and agent–object; used attention modules. (2) Proposed global–local attention to extract facial and contextual features separately, then learn them jointly.
Table 3. Ablation experiment results of our proposed emotion recognition network. ✓ indicates that the component is included; x indicates that the component is excluded.

Method | Body & Face Selector Algorithm (YOLOv8) | Body Enc. (Ch + Sp Att.) ¹ | Context Enc. (Ch + Sp Att.) ² | Face Enc. (Ch + Sp Att.) ³ | Acc (%)
Our method | ✓ | x | x | x | 54.28
Our method | ✓ | x | ✓ | x | 55.92
Our method | ✓ | x | ✓ | ✓ | 56.62

¹ Body-encoding stream with channel and spatial attention. ² Context-encoding stream with channel and spatial attention. ³ Face-encoding stream with channel and spatial attention.
Table 4. Comparison of our network with recent state-of-the-art methods on the NCAER-S dataset.

Methods | Accuracy (%)
MCF-Net [48] (Face + Context) | 45.59
GLAMOR-Net (Original) [29] (Face + Context) | 46.91
GLAMOR-Net (MobileNetV2) [29] (Face + Context) | 47.52
GLAMOR-Net (ResNet-18) [29] (Face + Context) | 48.40
Our Method (Body + Context) | 49.89
Our Method (Face + Context + Body) | 56.42