1. Introduction
Emotion is a fundamental aspect of human communication, playing a pivotal role in interpersonal interactions, decision-making, and social understanding. With the rapid growth of artificial intelligence (AI) and human–computer interaction (HCI), accurate emotion recognition from visual data has become an essential component in applications such as mental health monitoring, driver assistance systems, social robotics, and affective computing. As machines increasingly interact with humans in naturalistic settings, the demand for robust and interpretable emotion recognition systems has intensified.
Most existing facial expression recognition approaches operate under controlled laboratory conditions where lighting, pose, and background remain uniform. Handcrafted descriptors, including Local Binary Patterns (LBPs), Histogram of Oriented Gradients (HOG), and Gabor filters, offer computational efficiency but often perform poorly in real-world scenarios characterized by occlusion, variable lighting, and cultural differences. Deep learning, particularly Convolutional Neural Networks (CNNs), has transformed the field by enabling end-to-end hierarchical feature learning directly from raw pixel data [1,2]. Recent advances demonstrate that CNNs extract progressively more abstract representations, from low-level edges in shallow layers to high-level semantic concepts in deeper layers, facilitating robust visual recognition across diverse conditions [3]. However, CNNs emphasizing global representations often overlook critical localized facial cues and relevant contextual information, both of which are essential for nuanced emotion perception.
Emotion interpretation is inherently context-dependent. A single expression can convey disparate meanings depending on the situation; for instance, a smile at a celebration carries a different valence than a smile at a protest. This highlights the need to move beyond facial analysis and incorporate scene, object, and environmental semantics into emotion recognition frameworks. To address this, recent multi-level feature fusion architectures integrate local, mid-level, and high-level cues, leading to more resilient and interpretable analysis in unconstrained conditions.
This study presents a hierarchical multi-level feature fusion framework for visual emotion recognition in unconstrained environments. The framework systematically integrates three complementary information sources: (1) low-level features, where LBP descriptors capture micro-textural details of facial expressions; (2) mid-level features, where Facial Action Units (FAUs) derived from facial landmarks encode muscle movement and structural information; and (3) high-level features, where scene semantics are represented using Places365, complemented by global image embeddings from ResNet-50.
This hierarchical integration fuses facial detail with contextual understanding. The result is optimized emotion classification that accounts for both what a person’s face expresses and the environment in which they are expressing it.
The proposed framework is evaluated on EmoSet-3.3M, a comprehensive visual emotion dataset containing 3.3 million images from eight discrete emotion categories, annotated with rich facial and contextual metadata. Experiments show that combining LBPs, FAUs, scene semantics, and ResNet-50 features yields substantial improvements in accuracy and macro-averaged F1 scores compared to single-stream CNN baselines. This outcome affirms the value of hierarchical fusion strategies.
The primary contributions of this work are as follows:
- (1) Novel Hierarchical Fusion Framework: Unlike existing approaches that concatenate features ad hoc, we propose a principled three-level hierarchy (texture → muscle dynamics → scene) that mirrors cognitive emotion processing, with explicit feature alignment mechanisms addressing cross-modal heterogeneity.
- (2) First Large-Scale Evaluation on EmoSet-3.3M: This represents the first comprehensive multi-level fusion study on a dataset of this scale (3.3 million images), providing statistically robust validation unavailable in smaller benchmark evaluations.
- (3) Interpretable FAU Integration: We introduce geometric rule-based FAU computation from facial landmarks, providing transparent muscle activation features traceable to the Facial Action Coding System (FACS), enhancing model interpretability compared to black-box deep learning approaches.
- (4) Systematic Ablation Analysis: Comprehensive evaluation of five fusion configurations provides empirical evidence for optimal feature combinations, establishing design principles for future multi-modal emotion recognition systems.
Visual Emotion Analysis (VEA) remains a focal point in Human–Computer Interaction (HCI), psychological diagnostics, and intelligent surveillance. Traditional models extracted facial features from static images and often neglected contextual signals. Early deep learning models created holistic visual representations but lacked sensitivity to localized and structural patterns. Recent advances have introduced context-aware and hierarchical architectures that integrate spatial, semantic, and temporal information for richer emotion representation [4,5,6].
Traditional machine learning techniques played a foundational role in VEA by leveraging handcrafted features such as color, texture, and composition, paired with classifiers including support vector machines (SVMs), decision trees, random forests, and K-nearest neighbors (KNN). Approaches such as webly supervised and curriculum-guided training have been effective in addressing dataset bias, but deep architectures now dominate for handling unstructured data [7,8,9,10].
Deep learning enables extraction of high-level semantic and emotional features. Networks such as MldrNet, hybrid CNN–RNN models, and attention-based graph frameworks illustrate the benefits of this approach [8,9,10]. Multimodal fusion has further improved robustness and accuracy by combining visual, audio, and physiological signals, with applications ranging from EEG-based systems to audiovisual classifiers [11,12,13,14,15,16,17,18,19].
LBP remains a widely used descriptor due to its computational efficiency and discriminative capacity [20,21,22,23]. FAU-based models provide interpretable structural analysis of facial muscle dynamics [24,25,26]. Progress in facial landmark detection, hybrid fusion designs, and efficient real-time frameworks like MediaPipe Face Mesh continues to enhance practical usability [27,28,29,30,31,32,33,34,35]. Multi-level and hierarchical representations have also benefited applications across building segmentation, point cloud classification, and occlusion-robust facial recognition.
Image preprocessing and enhancement techniques have demonstrated significant impact on visual recognition tasks. Recent advances in image dehazing and restoration can improve input quality for emotion recognition in challenging environmental conditions. Liu et al. proposed efficient wavelet-based methods for real-time image enhancement with reduced computational overhead, potentially applicable as preprocessing stages for emotion recognition in degraded imagery [36]. Self-supervised learning approaches for image restoration offer promising directions for handling noisy or low-quality inputs without requiring paired training data [37].
Hierarchical feature learning improves generalization and interpretability in vision tasks, supporting organized representation spaces and advanced class integration [38,39,40,41]. Jiang et al. demonstrated that hierarchical dense recursive networks effectively exploit coarse-and-fine features throughout the network architecture, achieving superior image reconstruction through systematic multi-level feature aggregation [42]. Such hierarchical representations enable models to capture both local details and global context, essential for nuanced visual understanding tasks including emotion recognition. Scene semantics from datasets such as Places365 and robust global feature extractors like ResNet-50 promote comprehensive context-aware emotion analysis [43,44,45,46,47]. Real-time pipelines, exemplified by the YOLO series, further streamline object detection in affective computing [48,49,50,51].
Work published in 2025 has introduced attention-enhanced emotion recognition frameworks combining transformer architectures with multimodal fusion. Wu et al. demonstrated that cross-modal transformers achieve state-of-the-art performance by jointly learning representations from multiple modalities [52], while Paz-Arbaizar et al. showed that attention mechanisms improve interpretability and capture long-term dependencies in real-time emotion forecasting [53]. Concurrently, self-improved privilege learning paradigms have proven effective for image restoration; Wu et al. introduced a framework that extends privileged information utility to the inference stage, enabling iterative self-refinement that could serve as a robust preprocessing step for degraded emotional imagery [54].
The EmoSet dataset [55,56,57] sets a benchmark for visual emotion analysis, offering scale and attribute depth essential for multi-level fusion studies. Recent research that blends handcrafted and deep descriptors demonstrates a transition toward context-rich hybrid models [58,59,60,61,62]. Building on these advances, the presented framework integrates micro-texture, muscle activation, and scene semantics, raising the standard for unconstrained visual emotion recognition.
2. Materials and Methods
2.1. Overview
This study proposes a hierarchical multi-level feature fusion framework designed for context-aware visual emotion recognition in unconstrained environments. The system integrates complementary visual information across three semantic levels: low-level micro-textural features, mid-level facial geometry and muscle dynamics, and high-level scene semantics, combined with global image representations. This modular architecture enables simultaneous extraction of diverse emotional cues while maintaining computational efficiency suitable for real-world applications. The complete processing pipeline consists of parallel feature extraction pathways that converge into a unified classification module for final emotion prediction, as illustrated in Figure 1.
2.2. Dataset
This study is based on EmoSet-3.3M, a large-scale image corpus comprising 3.3 million photographs collected from diverse online sources and public repositories, specifically designed for visual emotion recognition in authentic, unconstrained environments. The dataset spans exceptionally broad spectra of settings, demographics, and affective contexts, with each image annotated into one of eight discrete emotion categories: amusement, awe, contentment, excitement, anger, disgust, fear, and sadness. These categories reflect both high- and low-arousal emotional states grounded in psychological emotion theory, capturing the full spectrum of human affective experience.
A critical strength of EmoSet-3.3M lies in its preservation of natural variability in lighting, backgrounds, occlusions, and viewpoints, making it particularly suitable for developing models that generalize to real-world conditions. Unlike controlled laboratory datasets, this corpus reflects authentic emotional expressions as they naturally occur across diverse contexts and individuals. Example images from the EmoSet dataset illustrating all eight emotion categories are presented in Figure 2.
Beyond primary emotion labels, the dataset provides exceptionally rich auxiliary annotations, including facial landmarks, bounding boxes, scene category predictions, and object-level metadata. These detailed annotations enable multi-level feature extraction spanning local facial micro-textures, structural cues, and high-level semantic context. This annotation depth combined with large scale and diversity creates an ideal foundation for evaluating hierarchical fusion architectures and investigating how complementary feature channels ranging from micro-expressions to environmental semantics contribute to robust emotion recognition.
For experimental validation, the dataset was partitioned using stratified sampling to maintain balanced class distributions: 70% for training, 20% for validation, and 10% for testing. All images were resized to 256 × 256 pixels and cropped to 224 × 224 for uniform input dimensions. A standardized augmentation pipeline, comprising random horizontal flips, rotation (±10°), and random crops with padding, was applied to enhance robustness against variations in pose, viewpoint, and illumination while preserving critical facial and contextual details.
Input images are standardized to 224 × 224 pixels following established conventions for CNN-based visual recognition. While this resolution may lose fine-grained details present in higher-resolution inputs, it provides an optimal balance between computational efficiency and feature preservation for the employed ResNet-50 architecture. The 224 × 224 resolution retains sufficient information for LBP micro-texture extraction (computed on the facial region) and FAU landmark detection, as validated by our experimental results. Future work may explore multi-scale processing or super-resolution preprocessing for applications requiring finer detail preservation.
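The stratified 70/20/10 partition described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the fixed seed, and the toy label counts are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split sample indices into train/val/test while keeping class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy example: 8 emotion classes with 100 samples each (hypothetical counts).
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]
labels = [emotion for emotion in EMOTIONS for _ in range(100)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 560 160 80
```

Splitting inside each class, rather than over the pooled index list, is what keeps the class proportions identical across the three subsets.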
2.3. Feature Extraction
To leverage complementary visual cues at multiple semantic levels, we designed a modular feature extraction strategy that processes emotion images through specialized channels. Each feature type contributes uniquely to emotion understanding, ranging from micro-textures and facial geometry to contextual scene semantics.
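Let the input image be denoted as $I$. The final fused feature vector $F_{\text{fused}}$ is defined as the concatenation of selected feature subsets:
$$F_{\text{fused}} = \big\Vert_{m \in S} F_m, \qquad S \subseteq \{\text{LBP}, \text{LM}, \text{FAU}, \text{scene}, \text{obj}, \text{ResNet}\}$$
where $\Vert$ denotes concatenation and $S$ is the set of feature streams (LBP, landmarks, FAUs, scene, object, and global ResNet-50 features) used in a given configuration.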
2.3.1. Low-Level Features: LBPs
LBP was selected as the primary low-level texture descriptor following a systematic comparative analysis against alternative descriptors including Histogram of Oriented Gradients (HOG), Gabor filters, and Local Phase Quantization (LPQ). LBP demonstrated superior performance across three critical criteria: (1) computational efficiency (256-dimensional feature vector versus 3780 for HOG), (2) robustness to monotonic illumination changes, which is essential in unconstrained environments, and (3) proven effectiveness in capturing micro-textural facial patterns [19,20,21,22]. Unlike HOG, which emphasizes edge orientation and struggles with subtle skin texture variations, LBP encodes fine-grained textural details such as crow’s feet, nasolabial folds, and forehead furrows that are imperceptible to global feature extractors but critical for distinguishing nuanced emotional expressions. In preliminary experiments, LBP achieved 2.3% higher accuracy than HOG and 1.8% higher than Gabor filters on the validation set. LBP encodes local spatial structure by thresholding the intensities of neighboring pixels relative to a center pixel, summarizing local texture patterns into compact binary codes.
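The standard LBP operator over a 3 × 3 neighborhood compares each surrounding pixel $g_p$ to the center pixel $g_c$, assigning binary value 1 if $g_p \geq g_c$ and 0 otherwise. The resulting binary sequence is interpreted as a decimal number representing the local pattern:
$$\mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{7} s(g_p - g_c)\, 2^{p}, \qquad s(u) = \begin{cases} 1, & u \geq 0 \\ 0, & u < 0 \end{cases}$$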
In this work, we generate a 256-dimensional LBP histogram vector $F_{\text{LBP}} \in \mathbb{R}^{256}$ representing the normalized distribution of local texture patterns across the facial region, serving as the low-level feature descriptor.
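As a rough illustration of the descriptor described above, the following sketch computes 3 × 3 LBP codes and the normalized 256-bin histogram for a grayscale image given as nested lists. The clockwise neighbor ordering and the function names are assumptions; the paper does not state which bit ordering or implementation was used.

```python
def lbp_code(image, r, c):
    """8-bit LBP code for pixel (r, c) using its 3x3 neighborhood."""
    center = image[r][c]
    # Neighbors visited clockwise from the top-left corner (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit  # neighbor >= center contributes 2**bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(image, r, c)] += 1
    total = sum(hist)
    return [count / total for count in hist]


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(patch, 1, 1))      # 241
print(len(lbp_histogram(patch)))  # 256
```

Because each code depends only on the sign of the intensity differences, adding a constant to every pixel or applying any monotonic brightness change leaves the histogram unchanged, which is the illumination robustness referred to in the text.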
2.3.2. Mid-Level Features: Facial Landmarks and FAUs
Mid-level features bridge low-level texture cues and high-level semantic information by encoding both geometric structure and dynamic facial movements. This study employs two complementary mid-level descriptors: Facial Landmarks and FAUs. Facial landmarks represent spatial arrangement of key facial components providing stable reference points under varying imaging conditions, while FAUs abstract this structure into meaningful motion-based emotion indicators grounded in facial physiology.
Facial Landmarks
Facial landmarks provide anatomically interpretable representations of facial geometry. Landmarks are extracted using MediaPipe Face Mesh, which predicts 468 predefined 3D coordinates corresponding to critical facial regions including eyes, eyebrows, nose, mouth, cheeks, and jawline. Each landmark $l_i$ is defined as a 3D point $l_i = (x_i, y_i, z_i)$, where $x_i$ and $y_i$ are image-plane coordinates and $z_i$ denotes relative depth. The complete facial configuration is represented as follows:
$$L = \{ l_1, l_2, \ldots, l_{468} \}$$
These points are concatenated into a 1404-dimensional feature vector. To ensure geometric consistency and invariance to pose and scale, landmark coordinates undergo normalization: (1) facial alignment using affine transformation based on inner eye corners and nose tip, and (2) scaling by inter-ocular distance $d_{\text{io}}$ and translation such that the origin corresponds to the nose tip $l_{\text{nose}}$.
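The translation and scaling step of this normalization can be sketched as follows. The affine alignment step is omitted, and the landmark indices passed in are placeholders rather than actual MediaPipe Face Mesh indices; both the function names and the three-point toy face are assumptions for illustration.

```python
import math


def normalize_landmarks(landmarks, left_eye_idx, right_eye_idx, nose_idx):
    """Move the nose tip to the origin and scale by inter-ocular distance."""
    nose = landmarks[nose_idx]
    d_io = math.dist(landmarks[left_eye_idx], landmarks[right_eye_idx])
    if d_io == 0:
        raise ValueError("inter-ocular distance is zero")
    return [tuple((coord - origin) / d_io for coord, origin in zip(point, nose))
            for point in landmarks]


def flatten(landmarks):
    """Concatenate (x, y, z) triples into one flat feature vector."""
    return [coord for point in landmarks for coord in point]


# Toy face with three points: two inner eye corners and the nose tip.
landmarks = [(1.0, 2.0, 0.0), (3.0, 2.0, 0.0), (2.0, 3.0, 0.5)]
normalized = normalize_landmarks(landmarks, left_eye_idx=0, right_eye_idx=1, nose_idx=2)
print(normalized[2])             # (0.0, 0.0, 0.0)
print(len(flatten(normalized)))  # 9
```

With the full 468-point mesh, the same `flatten` step yields the 468 × 3 = 1404 values mentioned in the text.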
FAUs
FAUs represent psychologically grounded abstractions of facial muscle movements rooted in the FACS. Rather than relying on pretrained black-box emotion decoders, we implement geometric rule-based mapping to infer muscle activations directly from landmark displacements, enhancing model transparency and interpretability. Each FAU is computed based on relative distance or angular relationships between landmark pairs. Given a relevant landmark pair $(l_i, l_j)$, the activation score $a_k$ for the k-th action unit is calculated using a normalized Euclidean difference:
$$a_k = \frac{d_{ij} - \mu_k}{\sigma_k}$$
where $d_{ij} = \lVert l_i - l_j \rVert_2$ is the Euclidean distance between key points, and $\mu_k$ and $\sigma_k$ represent mean and standard deviation under neutral baseline, ensuring scale-invariance and cross-subject comparability. The framework derives 25 FAUs through bilateral differences and angular measurements, producing $F_{\text{FAU}} \in \mathbb{R}^{25}$, a compact, semantically rich representation with reduced dimensionality that retains discriminative power.
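A minimal sketch of the distance-based activation score is given below. The rule table, the baseline statistics, and the landmark coordinates are hypothetical values chosen only to show the computation; the paper's actual landmark pairs and neutral baselines are not reproduced here, and angular rules are not covered.

```python
import math


def fau_activations(landmarks, rules):
    """Z-score each landmark-pair distance against its neutral baseline.

    Each rule is (i, j, mu, sigma): two landmark indices plus the mean and
    standard deviation of their distance on neutral faces.
    """
    activations = []
    for i, j, mu, sigma in rules:
        distance = math.dist(landmarks[i], landmarks[j])
        activations.append((distance - mu) / sigma)
    return activations


# Hypothetical normalized landmarks and two illustrative rules.
landmarks = [(-0.3, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, -0.1, 0.0)]
rules = [
    (0, 1, 0.50, 0.05),  # e.g. lip-corner spread, wider than neutral here
    (2, 3, 0.20, 0.04),  # e.g. lip opening, equal to neutral here
]
scores = fau_activations(landmarks, rules)
print([round(score, 3) for score in scores])
```

A positive score means the measured distance exceeds the neutral mean by that many standard deviations, so scores from different subjects or image scales are directly comparable.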
2.3.3. High-Level Features: Scene Context and Object Semantics
Scene Features (Places365)
Emotional interpretation relies fundamentally on environmental context, extending beyond facial expressions alone. We employ scene features extracted from the Places365 CNN, a model specifically designed to recognize and encode visual environments across 365 scene categories including diverse indoor and outdoor locations. Each input image is passed through the Places365-ResNet model pretrained on over 10 million images, and we extract the activations of the final average pooling layer, yielding a 2048-dimensional feature vector:
$$F_{\text{scene}} \in \mathbb{R}^{2048}$$
These scene vectors capture structural layout, spatial depth, lighting conditions, and compositional elements indicative of scene type, providing crucial contextual grounding for interpreting ambiguous emotional instances and enabling disambiguation of emotional cues based on situational narrative.
The semantic disparity between Places365 (scene-level) and facial features presents a fundamental challenge addressed through our hierarchical architecture. Rather than directly fusing these heterogeneous representations, we employ FAUs as a semantic bridge. FAUs encode psychologically grounded facial dynamics that relate to both local muscle textures (connecting to LBP) and contextual emotional expectations (connecting to scene features).
Specifically, scene context informs emotion interpretation by providing situational priors. A smile in a “cemetery” scene (Places365) carries different emotional valence than a smile in a “party” scene. The fusion network learns these context-dependent interpretation rules through end-to-end training, where scene features modulate facial feature interpretation rather than being directly compared.
Additionally, the separate projection layers (Section 2.5) transform scene and facial features into a shared semantic space before fusion, enabling the network to learn cross-modal correlations without requiring identical semantic granularity in the raw representations.
Object-Level Features (YOLOv5)
Beyond abstract scene features, specific objects within images significantly inform emotional interpretation. To capture object-level environmental context, we optionally incorporate object-level features using the YOLOv5 object detection model. We extract the top N = 5 detected objects per image, ranked by confidence score. For each object, we extract a 6-dimensional feature vector including bounding box coordinates (center x, y, width, height), class probability, and detection confidence. Vectors for the N most confident detections are concatenated, resulting in a fixed-length 30-dimensional object feature vector:
$$F_{\text{obj}} \in \mathbb{R}^{30}$$
This feature vector provides a structured summary of object-level visual context, capturing relationships between objects and the emotional themes of images.
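The construction of the fixed-length object vector can be sketched as below. The detection dictionaries and their field names are hypothetical stand-ins for detector output, and zero-padding when fewer than five objects are found is an assumption, since the text does not say how such images are handled.

```python
def object_feature_vector(detections, top_n=5):
    """Concatenate 6 values per detection for the top-N most confident objects.

    Each detection is a dict with keys cx, cy, w, h, class_prob, confidence
    (hypothetical field names). Unused slots are zero-padded (an assumption).
    """
    ranked = sorted(detections, key=lambda det: det["confidence"], reverse=True)
    features = []
    for det in ranked[:top_n]:
        features += [det["cx"], det["cy"], det["w"], det["h"],
                     det["class_prob"], det["confidence"]]
    features += [0.0] * (6 * top_n - len(features))
    return features


detections = [
    {"cx": 0.5, "cy": 0.5, "w": 0.2, "h": 0.3, "class_prob": 0.90, "confidence": 0.80},
    {"cx": 0.1, "cy": 0.7, "w": 0.1, "h": 0.1, "class_prob": 0.85, "confidence": 0.95},
]
vector = object_feature_vector(detections)
print(len(vector))  # 30
```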
2.3.4. Global Image Features: ResNet-50
To capture holistic visual characteristics, including overall visual structure, color patterns, spatial configurations, object relations, and the semantic constructs that determine affective quality, we use the ResNet-50 architecture, recognized for its representational capacity, computational efficiency, and transfer learning effectiveness. Pretrained on ImageNet with over 1.2 million labeled images, it learns hierarchical representations from low-level textures in shallow layers to high-level semantic features in deeper layers. Each input image is processed through the complete ResNet-50 network, with output extracted from the final global average pooling layer:
$$F_{\text{ResNet}} \in \mathbb{R}^{2048}$$
This 2048-dimensional feature vector provides compact representation integrating multi-scale information and hierarchical visual cues. ResNet-50 was selected over alternative architectures due to its advantageous trade-off between expressive capability and computational cost, making it particularly suitable for large-scale datasets like EmoSet-3.3M. Its residual architecture ensures stable feature extraction under diverse real-world challenges including noise, occlusion, and lighting variations.
The framework employs ResNet-50 as the backbone for global feature extraction while Places365 utilizes a separate ResNet architecture pretrained on scene recognition. This design choice reflects three considerations:
- (1) Domain-Specific Pretraining: Places365-ResNet was trained on 10 million scene images across 365 categories, encoding scene-specific knowledge unavailable in ImageNet-pretrained ResNet-50. Sharing weights would dilute domain-specific representations.
- (2) Feature Complementarity: ImageNet-pretrained ResNet-50 captures object-centric features, while Places365-ResNet encodes spatial layout, lighting, and environmental context. These complementary representations jointly improve emotion recognition.
- (3) Ablation Evidence: Experiments replacing Places365 features with additional ImageNet-ResNet features showed 2.1% accuracy reduction, confirming that scene-specific pretraining provides non-redundant information.
The computational overhead of parallel ResNets remains acceptable (4.14 GFLOPs total) while enabling specialized feature extraction for different semantic levels.
2.4. Fusion Strategy and Classification
The five fusion configurations tested in this study combine different feature subsets while sharing a ResNet-50 classification backbone, allowing a systematic comparison of how feature combinations affect accuracy and generalization.
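Configuration 1 (Baseline): Uses only ResNet-50 features: $F_1 = F_{\text{ResNet}}$;
Configuration 2 (LBP-ResNet): Concatenates LBP features and ResNet-50 features: $F_2 = [F_{\text{LBP}} \,\Vert\, F_{\text{ResNet}}]$;
Configuration 3 (LBP-Landmarks-Places365-ResNet): Concatenates LBP features, Landmarks, Places365 scene features, and ResNet-50 features: $F_3 = [F_{\text{LBP}} \,\Vert\, F_{\text{LM}} \,\Vert\, F_{\text{scene}} \,\Vert\, F_{\text{ResNet}}]$;
Configuration 4 (LBP-Landmarks-YOLO-ResNet): Concatenates LBP features, Landmarks, YOLO object features, and ResNet-50 features: $F_4 = [F_{\text{LBP}} \,\Vert\, F_{\text{LM}} \,\Vert\, F_{\text{obj}} \,\Vert\, F_{\text{ResNet}}]$;
Configuration 5 (LBP-FAUs-Places365-ResNet): Concatenates LBP features, FAUs, Places365 scene features, and ResNet-50 features: $F_5 = [F_{\text{LBP}} \,\Vert\, F_{\text{FAU}} \,\Vert\, F_{\text{scene}} \,\Vert\, F_{\text{ResNet}}]$.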
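To make the five configurations concrete, the sketch below sums the per-stream dimensionalities stated in Section 2.3 to obtain the raw concatenated length of each configuration. Only the 2304-dimensional LBP-ResNet case is stated explicitly in the paper; the other totals are simple arithmetic on the stated stream sizes and refer to the raw features before any projection. The dictionary and function names are illustrative.

```python
# Per-stream feature sizes as stated in Section 2.3.
FEATURE_DIMS = {
    "LBP": 256,
    "Landmarks": 1404,
    "FAU": 25,
    "Places365": 2048,
    "YOLO": 30,
    "ResNet50": 2048,
}

# Feature streams used by each of the five fusion configurations.
CONFIGS = {
    1: ["ResNet50"],
    2: ["LBP", "ResNet50"],
    3: ["LBP", "Landmarks", "Places365", "ResNet50"],
    4: ["LBP", "Landmarks", "YOLO", "ResNet50"],
    5: ["LBP", "FAU", "Places365", "ResNet50"],
}


def fused_dim(config_id):
    """Length of the raw concatenated feature vector for one configuration."""
    return sum(FEATURE_DIMS[name] for name in CONFIGS[config_id])


for config_id in sorted(CONFIGS):
    print(config_id, fused_dim(config_id))
# 1 2048
# 2 2304
# 3 5756
# 4 3738
# 5 4377
```

Replacing the 1404-dimensional landmark vector with the 25-dimensional FAU vector (Configuration 3 versus Configuration 5) shrinks the fused input by 1379 values.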
Each fused feature vector is passed through a fully connected classification layer with softmax activation producing probability distributions across eight emotion classes:
$$P(y = c \mid F_{\text{fused}}) = \frac{\exp(z_c)}{\sum_{j=1}^{8} \exp(z_j)}$$
where $z_c$ is the logit for emotion class $c$.
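The softmax step can be sketched in a few lines. The logits below are made-up numbers for the eight classes, and subtracting the maximum logit is a standard numerical-stability measure rather than something stated in the paper.

```python
import math


def softmax(logits):
    """Convert raw class logits into a probability distribution."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [value / total for value in exps]


EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]
logits = [1.2, 0.3, -0.5, 2.1, 0.0, -1.0, 0.4, -0.2]  # hypothetical values
probs = softmax(logits)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted)  # excitement
```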
2.5. Fusion Mechanism
The feature fusion strategy employs a concatenation-based approach followed by learnable dense layers for adaptive feature weighting. Given heterogeneous feature vectors from different modalities, the fusion process operates as follows:
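First, each feature type undergoes dimensionality normalization through projection layers:
$$\tilde{F}_m = W_m F_m + b_m$$
where $W_m$ and $b_m$ represent learnable weight matrices and bias vectors for feature stream $m$.
The normalized features are concatenated and processed through two fully connected layers with ReLU activation:
$$h^{(1)} = \mathrm{ReLU}\big(W^{(1)} [\tilde{F}_1 \,\Vert\, \cdots \,\Vert\, \tilde{F}_M] + b^{(1)}\big), \qquad h^{(2)} = \mathrm{ReLU}\big(W^{(2)} h^{(1)} + b^{(2)}\big)$$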
This architecture allows the network to learn optimal feature combinations through gradient-based optimization, rather than relying on fixed attention weights. The dense layers serve as implicit attention mechanisms, adaptively weighting feature contributions based on their discriminative value for emotion classification.
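The projection, concatenation, and two-layer dense stage can be sketched as a plain forward pass. This is an illustrative toy, not the trained model: the weights are random, the stream and hidden sizes are tiny stand-ins for the 512-dimensional embeddings described in Section 2.6, and all function names are assumptions.

```python
import random


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, out_dim, in_dim):
    """Random weights and zero bias, standing in for learned parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def fuse(features, projections, dense1, dense2):
    """Project each stream, concatenate, then apply two ReLU dense layers."""
    projected = []
    for name, x in features.items():
        weights, bias = projections[name]
        projected += linear(weights, bias, x)  # per-stream linear projection
    hidden = relu(linear(*dense1, projected))
    return relu(linear(*dense2, hidden))


rng = random.Random(0)
stream_dims = {"LBP": 6, "FAU": 4, "scene": 8}  # toy sizes, not the real ones
embed_dim, hidden_dim = 3, 5
features = {name: [rng.random() for _ in range(dim)] for name, dim in stream_dims.items()}
projections = {name: init_layer(rng, embed_dim, dim) for name, dim in stream_dims.items()}
dense1 = init_layer(rng, hidden_dim, embed_dim * len(stream_dims))
dense2 = init_layer(rng, hidden_dim, hidden_dim)
fused = fuse(features, projections, dense1, dense2)
print(len(fused))  # 5
```

Projecting every stream to the same width before concatenation means a 2048-dimensional stream and a 25-dimensional stream occupy equal shares of the fused vector, so the dense layers weight them by learned relevance rather than by raw size.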
2.6. Heterogeneous Feature Alignment
To address potential misalignments between micro-textural cues (LBP: 256-dimensional) and global scene semantics (Places365: 2048-dimensional), we implement a multi-stage alignment strategy:
Dimensionality Harmonization: Each feature stream is projected to a common 512-dimensional embedding space through learned linear transformations, ensuring comparable feature magnitudes across modalities.
Feature Normalization: Batch normalization is applied independently to each feature type before concatenation, preventing scale disparities from dominating the fusion process.
Gradient Balancing: During training, gradient magnitudes are monitored across feature branches, with learning rate scaling applied to prevent any single modality from dominating optimization.
Semantic Bridging: Mid-level FAU features serve as a semantic bridge between low-level texture (LBP) and high-level context (Places365), as FAUs encode interpretable facial dynamics that relate both to local muscle textures and contextual emotional states.
This alignment strategy ensures that fine-grained facial details are not overwhelmed by high-dimensional scene representations during joint optimization.
3. Results
3.1. Experimental Setup and Configuration
The EmoSet-3.3M dataset underwent systematic stratified sampling to preserve balanced class distributions across the eight emotion categories defined by the Mikels model: amusement, anger, awe, contentment, disgust, excitement, fear, and sadness. With 3.3 million images distributed across these eight classes (approximately 10,660–19,828 images per category), the partitioning ensured balanced representation: 70% training, 20% validation, and 10% testing. All experiments were conducted on an NVIDIA RTX 4090 GPU (24 GB VRAM).
All input images underwent standardized preprocessing: resizing to 256 × 256 pixels followed by center-cropping to 224 × 224 pixels, producing 150,528 float values per sample (a standard input format compatible with contemporary CNN architectures). A comprehensive data augmentation strategy enhanced robustness against real-world imaging variations while preserving emotional cues. The augmentation pipeline incorporated: (1) random horizontal flips (50% probability) to increase pose diversity, (2) random rotations (±10°) simulating natural head movements without disrupting facial structure, and (3) random crops with padding followed by resizing to maintain spatial consistency. Subsequently, all images underwent ImageNet normalization using channel-specific means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225].
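The arithmetic behind the center crop and the ImageNet normalization can be checked with a short sketch. Scaling 8-bit pixel values to [0, 1] before subtracting the channel means is the usual convention for these statistics and is assumed here; the helper names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(size=256, crop=224):
    """(left, top, right, bottom) of a centered square crop."""
    offset = (size - crop) // 2
    return offset, offset, offset + crop, offset + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to ImageNet-normalized floats."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


print(center_crop_box())  # (16, 16, 240, 240)
print(3 * 224 * 224)      # 150528
print([round(v, 3) for v in normalize_pixel((128, 128, 128))])
```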
Training was executed for 100 epochs using the Adam optimizer with a learning rate of $1 \times 10^{-4}$; Adam was selected for its adaptive gradient properties and stable convergence characteristics. Cross-entropy loss served as the objective function, mathematically defined as follows:
$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{8} y_c \log(\hat{y}_c)$$
where $y_c$ denotes the ground-truth label and $\hat{y}_c$ represents the predicted probability for emotion class $c$ ($c = 1, \ldots, 8$). Batch sizes were chosen based on model complexity: 32 for baseline and LBP-only models, and 64 for the more complex fusion architectures to maximize GPU utilization while maintaining memory efficiency. All experiments were executed in a multi-GPU PyTorch (version 2.5.1+cu121) Distributed Data Parallel (DDP) environment using the NCCL backend with cuDNN autotuning enabled, ensuring scalability and reproducibility across runs.
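For a one-hot ground-truth label, the cross-entropy sum reduces to the negative log-probability of the true class. The sketch below computes it directly from logits using the log-sum-exp form; the logits are made-up values and the function name is illustrative.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample given raw logits and the true class index."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]  # equals -log(softmax(logits)[target])


uniform_loss = cross_entropy([0.0] * 8, target=3)
confident_loss = cross_entropy([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0], target=3)
print(round(uniform_loss, 4))  # 2.0794, i.e. log(8) for a uniform guess
print(confident_loss < uniform_loss)  # True
```

The value log(8) ≈ 2.08 is a useful reference point: a training loss near it indicates the eight-class model is doing no better than guessing uniformly.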
Feature configurations were systematically determined through grid search and ablation studies rather than heuristic selection. The optimization process followed three stages:
Feature Dimensionality Selection: LBP histogram bins (128, 256, 512) were evaluated, with 256 bins achieving optimal accuracy-efficiency balance (71.2% accuracy versus 71.0% for 128 and 71.3% for 512 bins at 2× computational cost).
FAU Configuration: The number of action units (15, 20, 25, 30) was optimized through validation set performance, with 25 FAUs providing the best discriminative power without redundancy.
Fusion Architecture: Dense layer configurations (single layer, two layers, three layers) and hidden dimensions (256, 512, 1024) were systematically compared. Two layers with 512 hidden units achieved 74% accuracy, matching three-layer performance while reducing parameters by 23%.
Learning rate was optimized through logarithmic grid search ($1 \times 10^{-5}$ to $1 \times 10^{-3}$), with $1 \times 10^{-4}$ selected based on convergence stability and final validation accuracy. Batch sizes were adjusted based on GPU memory utilization, with larger batches (64) for fusion models improving gradient stability.
3.2. Baseline Model Performance
3.2.1. Quantitative Performance Assessment
The baseline ResNet-50 architecture, trained exclusively on raw RGB inputs without auxiliary micro-features or contextual information, established the fundamental performance benchmark. This configuration utilized standard convolutional feature extraction, global average pooling, and fully connected classification layers (representing conventional single-stream deep learning approaches to emotion recognition).
On the EmoSet-3.3M test set, the baseline achieved 69% overall accuracy with macro-averaged precision, recall, and F1-scores of 0.70 (Table 1). While these metrics confirm that deep CNNs capture general affective representations through hierarchical feature learning, they simultaneously reveal significant limitations in distinguishing subtle emotional categories and managing inter-class ambiguity inherent in real-world scenarios.
3.2.2. Class-Specific Performance Insights
The baseline demonstrates strong performance for high-arousal emotions characterized by visually distinctive features, specifically anger (F1: 0.77), awe (F1: 0.75), disgust (F1: 0.74), and excitement (F1: 0.75). These emotions typically manifest through pronounced facial expressions (furrowed brows, widened eyes, exaggerated muscle activations) effectively captured by convolutional filters through hierarchical feature learning.
Conversely, the model exhibits notable difficulties with low-intensity and visually ambiguous emotion categories. Contentment (F1: 0.58) represents the most challenging category, frequently misclassified as sadness or amusement due to subtle expressiveness and lack of distinctive visual markers. Amusement (F1: 0.64) suffers from low recall (0.59), often confused with excitement due to shared positive valence and similar smiling patterns, highlighting the challenge of differentiating emotions with overlapping affective characteristics.
The confusion matrix (Figure 3) reveals clear diagonal dominance for anger, awe, disgust, and excitement, indicating consistent recognition patterns aligned with their superior F1-scores. Systematic misclassification patterns emerge: amusement frequently migrates toward excitement, while contentment shows substantial confusion with both sadness and amusement. The grouped bar chart (Figure 4) demonstrates that anger achieves highest precision (0.81) with minimal false positives, while excitement records strongest recall (0.79), successfully identifying most genuine instances. Precision–recall imbalances for amusement and contentment underscore their ambiguous nature and the baseline model’s limitations in capturing fine-grained emotional distinctions.
3.3. Multi-Level Feature Fusion Model Performance
3.3.1. LBP-ResNet Fusion: Micro-Textural Enhancement
The LBP-ResNet fusion architecture strategically combines LBP histogram features (256-dimensional) with ResNet-50 global embeddings (2048-dimensional), creating a 2304-dimensional hybrid feature space. This integration mitigates the baseline CNN’s tendency to overlook fine-grained textural variations (subtle wrinkles, localized contrast changes, micro-muscle activations) essential for identifying low-intensity emotional expressions.
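The 256 + 2048 = 2304 dimensional fusion described above can be sketched as follows. This is an illustrative sketch only: it assumes the basic 8-neighbour LBP operator with a normalized 256-bin histogram and plain concatenation, and it uses a zero vector as a stand-in for the ResNet-50 pooled embedding. The paper does not specify the LBP variant or any normalization, and the function names are hypothetical.

```python
def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes.

    `gray` is a 2D list of pixel intensities; border pixels are skipped.
    """
    h, w = len(gray), len(gray[0])
    # Neighbour offsets, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [c / total for c in hist] if total else [0.0] * 256


def fuse(lbp_hist, cnn_embedding):
    """Concatenate the texture histogram with the global CNN embedding."""
    return list(lbp_hist) + list(cnn_embedding)


# Toy 8x8 image and a placeholder 2048-dimensional embedding (assumptions).
image = [[(x * y) % 7 for x in range(8)] for y in range(8)]
lbp = lbp_histogram(image)
embedding = [0.0] * 2048
fused = fuse(lbp, embedding)
print(len(lbp), len(fused))
```

In practice the two streams usually differ in scale, so some per-stream normalization before concatenation is a common design choice; whether the authors applied one is not stated.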
The LBP-ResNet fusion achieved 71% overall accuracy (Table 2), representing a consistent 2-percentage-point improvement over baseline. The enhancement is particularly pronounced for high-arousal emotions such as anger (F1: 0.79, +0.02), disgust (F1: 0.76, +0.02), and excitement (F1: 0.76, +0.01), where LBP’s fine-texture encoding effectively captures distinctive facial muscle activations invisible to global feature extractors.
Despite textural enhancement, contentment (F1: 0.59) remains problematic, continuing to exhibit confusion with sadness and amusement. Fear (F1: 0.69) shows moderate improvement, suggesting that while LBP enhances local texture sensitivity, it cannot fully address the complexity of context-dependent emotional expressions requiring broader semantic understanding.
The confusion matrix in Figure 5 shows strong diagonal dominance for high-arousal emotions, indicating reliable recognition. However, systematic errors persist: amusement instances are frequently misclassified as excitement due to similar expressive patterns (broad smiles, positive cues), while contentment is often confused with sadness or amusement.
Figure 6 highlights that anger achieved highest precision (0.82) with minimal misclassification, while excitement showed strongest recall (0.80). The precision–recall discrepancies for amusement and contentment reflect their ambiguous nature, confirming that while LBP-ResNet improves performance for expressive categories, its effectiveness remains limited for context-dependent or low-intensity emotions.
3.3.2. Multi-Level Hierarchical Integration: LBP-Landmarks-Places365-ResNet
This sophisticated fusion architecture represents comprehensive multi-level integration combining (1) LBP micro-textural features, (2) facial landmark geometric cues (1404-dimensional normalized coordinates), (3) Places365 scene context (2048-dimensional semantic embeddings), and (4) ResNet-50 global features. The resulting high-dimensional feature space captures complementary information across semantic levels, from fine-grained texture to environmental context.
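The bookkeeping for the four streams can be sketched as below. The stream dimensions are those stated in the text; the assumption that fusion is a plain concatenation (giving 5756 dimensions in total), the stream ordering, and all names are hypothetical, since the paper does not report the fused dimensionality.

```python
# Stream dimensions as stated in the text; the ordering is an assumption.
STREAM_DIMS = {
    "lbp": 256,          # LBP histogram
    "landmarks": 1404,   # normalized landmark coordinates
    "places365": 2048,   # scene embedding
    "resnet50": 2048,    # global CNN embedding
}


def stream_slices(dims):
    """Map each stream name to its slice in the fused vector; also return the total size."""
    slices, start = {}, 0
    for name, d in dims.items():
        slices[name] = slice(start, start + d)
        start += d
    return slices, start


def concatenate(features, dims):
    """Concatenate per-stream vectors in a fixed order, checking each length."""
    fused = []
    for name, d in dims.items():
        vec = features[name]
        if len(vec) != d:
            raise ValueError(f"{name}: expected {d} values, got {len(vec)}")
        fused.extend(vec)
    return fused


slices, total_dim = stream_slices(STREAM_DIMS)
# Placeholder zero vectors stand in for real extracted features.
dummy = {name: [0.0] * d for name, d in STREAM_DIMS.items()}
fused = concatenate(dummy, STREAM_DIMS)
print(total_dim, slices["landmarks"])
```

Keeping explicit slices per stream makes it straightforward to inspect or mask an individual modality later.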
This configuration achieved 74% overall accuracy with macro-averaged F1-score of 0.75 (Table 3), representing a substantial 5-percentage-point improvement over baseline and 3-percentage-point enhancement over LBP-only fusion. The integration demonstrates synergistic effects where contextual scene information from Places365 effectively disambiguates emotions with similar facial expressions.
Key performance improvements include the following:
Disgust (F1: 0.84): Achieved the highest performance gain (+0.10 versus baseline), benefiting from combined facial geometry and environmental context.
Anger (F1: 0.81): Maintained strong performance with enhanced precision (0.84).
Contentment (F1: 0.63): Showed meaningful improvement (+0.05) through contextual disambiguation, though remaining challenging.
Awe (F1: 0.78): Substantial enhancement attributable to scene context distinguishing it from similar facial expressions.
The confusion matrix in Figure 7 illustrates improved prediction concentration along the diagonal compared to earlier models, indicating enhanced alignment between predicted and actual labels. Cross-class confusion is noticeably reduced, particularly between amusement and excitement (previously overlapping due to shared positive valence) and between contentment and sadness, where contextual cues from Places365 enhance differentiation of low-arousal emotions. However, some ambiguity persists between contentment and amusement, underscoring the inherent subtlety of these expressions.
Figure 8 compares class-level precision, recall, and F1-scores, highlighting consistent gains for emotions with distinct facial or structural cues. Anger and disgust achieve highest precision (0.84 and 0.83), while excitement attains strongest recall (0.83), confirming the advantage of integrating low-, mid-, and high-level features for high-arousal recognition. Amusement and contentment exhibit recall–precision imbalance, reflecting their ambiguous nature and signaling that further refinement is needed for subtle affective states.
3.3.3. LBP-Landmarks-YOLO-ResNet: Object-Level Context Integration
This configuration replaces Places365 scene features with YOLOv5-derived object-level semantics (30-dimensional concatenated features from top 5 detected objects), investigating whether specific object recognition enhances emotional context understanding compared to abstract scene representation. The performance metrics for the LBP-Landmarks-YOLO-ResNet configuration are presented in Table 4.
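One way a 30-dimensional vector could be assembled from the top 5 detections is sketched below. The text states only the total dimensionality and the number of objects; the per-object layout used here (class id, confidence, and four box values, i.e., 6 values per object), the zero padding, and the detection dictionary format are assumptions made for illustration and do not reflect a confirmed design or the YOLOv5 output API.

```python
def object_context_vector(detections, top_k=5):
    """Build a fixed-length vector from the `top_k` most confident detections.

    Each detection is a dict with hypothetical keys:
      "cls"  - integer class id
      "conf" - detection confidence in [0, 1]
      "box"  - (x_center, y_center, width, height), normalized to [0, 1]
    Images with fewer than `top_k` detections are zero-padded.
    """
    ranked = sorted(detections, key=lambda d: d["conf"], reverse=True)[:top_k]
    vec = []
    for d in ranked:
        x, y, w, h = d["box"]
        vec.extend([float(d["cls"]), d["conf"], x, y, w, h])
    vec.extend([0.0] * (6 * top_k - len(vec)))  # pad to 6 * top_k values
    return vec


# Two hypothetical detections; the result is still 30-dimensional.
detections = [
    {"cls": 0, "conf": 0.4, "box": (0.1, 0.2, 0.3, 0.4)},
    {"cls": 16, "conf": 0.9, "box": (0.5, 0.5, 0.2, 0.2)},
]
vec = object_context_vector(detections)
print(len(vec))
```

A compact encoding of this kind carries far less information than a 2048-dimensional scene embedding, which is consistent with the smaller gain reported for object-level context.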
The YOLO-integrated model achieved 72% accuracy, surpassing baseline (69%) and LBP-only fusion (71%) but underperforming relative to Places365 integration (74%). This 2-percentage-point deficit suggests that abstract scene semantics provide more effective emotional context than discrete object detection for emotion recognition tasks.
While object-level features enhance recognition of high-intensity emotions (anger, excitement, disgust), they prove less effective for subtle affective states (contentment, amusement) that benefit more from holistic environmental understanding. The confusion matrix in Figure 9 reveals strong diagonal clustering for anger, awe, disgust, and excitement, but persistent misclassification patterns indicate limited effectiveness for nuanced categories.
Figure 10’s grouped bar visualization shows anger achieved highest precision (0.84) with minimal false positives, while excitement recorded best recall (0.78). The visualization exposes that amusement and contentment exhibit metric disparities, with elevated precision compared to recall, suggesting the model exercises caution when assigning these labels yet overlooks many genuine cases. This emphasizes their vulnerability to classification errors and the persistent difficulty of modeling understated affective conditions where micro-expressions and situational context remain inadequately captured.
3.3.4. Optimal Fusion: LBP-FAUs-Places365-ResNet
This architecture represents the most sophisticated integration, combining FAUs (25-dimensional muscle activation features) with LBP textures, Places365 scene context, and ResNet-50 global embeddings. FAUs provide psychologically grounded abstractions of facial muscle movements, enabling finer differentiation between visually similar emotional states through interpretable facial dynamics.
Table 5 summarizes the optimal LBP-FAUs-Places365-ResNet fusion results.
This configuration achieved 74% accuracy with a macro-averaged F1-score of 0.75, matching the best-performing LBP-Landmarks-Places365-ResNet configuration while providing enhanced interpretability through FAU integration. The model also attains high specificity (0.95), indicating low false-positive rates across emotion categories.
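For reference, specificity in a multi-class setting is typically computed one-vs-rest per class and then averaged. The sketch below assumes that convention; the confusion matrix and function name are hypothetical, and the paper does not state how its 0.95 figure was aggregated.

```python
def macro_specificity(cm):
    """Mean one-vs-rest specificity TN / (TN + FP) over all classes.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    values = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        tn = total - tp - fp - fn  # everything not involving class k
        values.append(tn / (tn + fp) if tn + fp else 0.0)
    return sum(values) / n


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [[8, 2, 0],
          [1, 7, 2],
          [1, 1, 8]]
print(round(macro_specificity(toy_cm), 4))
```

Because every sample from the other classes counts as a negative, one-vs-rest specificity is usually much higher than recall in multi-class problems, so it is best read alongside the per-class recall values.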
The key benefits of FAU integration include (1) Enhanced Discriminability: FAUs enable fine-grained distinction between emotions with similar visual patterns (e.g., amusement vs. excitement); (2) Psychological Grounding: Muscle activation features align with established facial expression theory (FACS); (3) Interpretability: Predictions can be traced to specific facial muscle movements and environmental contexts; and (4) Robust High-Intensity Recognition: Disgust (0.83) and anger (0.82) achieve peak F1-scores through combined muscle activation and contextual analysis.
The confusion matrix in Figure 11 exhibits a strong diagonal pattern across most categories, indicating reliable recognition for anger, awe, disgust, and excitement. This diagonal alignment demonstrates that combining FAU features, contextual scene information, and deep learning representations enables accurate classification of high-intensity emotional states. However, contentment is frequently misclassified as sadness, suggesting difficulty in distinguishing low-intensity emotions characterized by subtle visual cues. Amusement is often confused with excitement due to shared positive-valence characteristics, particularly smiling expressions. These patterns indicate that while FAUs provide valuable fine-grained details, they alone cannot fully resolve ambiguities between emotions with similar visual features or limited contextual markers.
Figure 12 presents grouped bar charts comparing precision, recall, and F1-scores. Anger achieves highest precision (0.85), reflecting strong discriminative capability with minimal false positives, while disgust attains strongest recall (0.85), indicating successful identification of nearly all instances. Both excitement and awe demonstrate balanced performance across all metrics. However, ongoing discrepancies for amusement and contentment reveal precision exceeding recall, suggesting these categories depend on subtle micro-expressions and refined contextual cues incompletely captured by the current approach.
3.4. Computational Efficiency and Scalability Analysis
To assess practical feasibility and deployment suitability, comprehensive computational performance was evaluated on critical metrics including parameter count, model size, Floating Point Operations (FLOPs), inference speed, memory consumption, and training duration per epoch. These metrics reveal trade-offs between computational cost and predictive performance, crucial for assessing applicability in real-time or resource-constrained scenarios.
All experiments were conducted on an NVIDIA RTX 4090 GPU (24 GB VRAM) using consistent training configurations across all model variants. Computational efficiency metrics across all configurations are compared in Table 6.
The baseline ResNet-50 has the lowest computational requirements (~23.5 M parameters, 89.7 MB, 4.13 GFLOPs), with 2.01 ms per-image inference latency. Adding LBP features marginally increases model size to 94 MB and latency to 2.20 ms, a reasonable trade-off for improved texture discrimination. More sophisticated fusion approaches (integrating landmarks, FAUs, and scene representations) expand parameters to ~30 M while keeping inference times under 2.3 ms and memory below 820 MB. Although LBP-FAUs-Places365-ResNet shows the highest computational demand, training time remains 18–21 min per epoch, indicating modest overhead relative to the performance improvements.
Key efficiency findings are as follows:
Parameter Efficiency: Maximum parameter increase of 23% (baseline to optimal) yields a 5-percentage-point accuracy improvement.
Inference Speed: All configurations maintain <2.3 ms per-image latency, suitable for real-time applications.
Memory Efficiency: Peak memory remains <820 MB across all variants, compatible with edge devices.
Scalability: Linear scaling with dataset size; framework sustains efficiency at EmoSet-3.3M scale.
These results demonstrate that proposed multi-level fusion models provide enhanced feature representations without excessive resource requirements, establishing an efficient and scalable framework for emotion recognition suitable for real-world applications.
To contextualize deployment viability, we compare our model’s computational profile against established lightweight architectures.
Table 7 presents parameter counts and theoretical FLOPs based on published specifications [57,58,59].
Direct accuracy comparison on EmoSet-3.3M requires retraining these architectures, which remains future work. However, our framework achieves real-time inference (2.2 ms, >450 FPS) while the additional parameters (29 M vs. 2–5 M) enable the 5-point accuracy improvement demonstrated in our ablation studies.
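The headline efficiency figures follow from simple arithmetic on the reported numbers, as the short sketch below shows. The function names are hypothetical, and the throughput conversion assumes sequential single-image inference with no batching or preprocessing overhead, which the text does not specify.

```python
def relative_increase(base, new):
    """Fractional growth from `base` to `new`."""
    return (new - base) / base


def throughput_fps(latency_ms):
    """Images per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


# Reported values: 23.5 M -> 29 M parameters, 2.2 ms per image.
param_growth = relative_increase(23.5, 29.0)
fps = throughput_fps(2.2)
print(f"{param_growth:.1%} more parameters, {fps:.0f} images/s")
```

This gives roughly a 23.4% parameter increase and about 455 images per second, in line with the 23% and >450 FPS figures quoted in the text.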
3.5. Comprehensive Fusion Strategy Comparison
A unified performance comparison was conducted across all models to determine which feature combinations contribute most to improving classification capability. Each model was tested under consistent conditions, with results analyzed using standard performance indicators to ensure fair benchmarking. The comparison across all fusion configurations is provided in Table 8.
Model performance improves systematically with increasingly sophisticated features, where fusion-based architectures consistently surpass baseline ResNet-50. The baseline achieves macro and weighted F1-scores of 0.70 and 0.69, while models integrating LBP, facial landmarks, FAUs, and contextual scene information elevate these metrics to 0.75.
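The difference between macro and weighted F1 can be made concrete with a small sketch. The per-class scores and supports below are invented for illustration and are not the paper's values; the function names are hypothetical.

```python
def macro_f1(f1_scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)


def weighted_f1(f1_scores, supports):
    """Mean of per-class F1 scores weighted by the number of true samples per class."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total


# Hypothetical two-class example: the stronger class has more samples.
f1_scores = [0.8, 0.6]
supports = [300, 100]
print(macro_f1(f1_scores), weighted_f1(f1_scores, supports))
```

When the two averages are close, as with the baseline's 0.70 and 0.69, per-class performance is not strongly tied to class frequency; a weighted score slightly below the macro score suggests that some larger classes score somewhat lower.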
Figure 13 illustrates the performance progression, where architectures leveraging multi-modal data yield substantial gains across the majority of emotion classes, particularly high-intensity emotions (anger, disgust).
Key comparative insights are as follows:
Scene vs. Object Context: Places365 integration (74% accuracy) outperforms YOLO object detection (72%) by 2 percentage points, indicating abstract scene semantics better capture emotional context than discrete objects;
FAU Advantage: FAU-based configurations match the best Places365-based performance while providing superior interpretability through grounding predictions in facial physiology;
Incremental vs. Cumulative Gains: While individual features yield improvements of 1–2 percentage points, their strategic combination produces a cumulative 5-percentage-point boost in overall accuracy;
Consistency Across Classes: Multi-level fusion demonstrates more balanced F1-scores across emotion categories, reducing variance between easiest and hardest classes.
The observed trends suggest that although certain improvements are incremental, advanced fusion frameworks maintain superior consistency and equilibrium across emotional categories, validating the effectiveness of integrated feature representations in emotion recognition systems.
3.6. Cross-Dataset Validation and Robustness Assessment
The model’s generalization capability was evaluated through cross-validation and external benchmark testing. A 5-fold stratified cross-validation protocol applied to the EmoSet-3.3M dataset produced consistent macro-averaged F1 scores of 0.74 ± 0.02 across folds, indicating stable performance that does not depend on a particular data partition.
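A stratified 5-fold protocol and the mean ± standard deviation summary can be sketched as follows. This is a generic illustration, not the authors' pipeline: the round-robin fold assignment, the function names, and the five example scores are assumptions, and the scores are invented rather than taken from the experiments.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal each class's samples round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return mean(scores), stdev(scores)


labels = ["anger"] * 10 + ["awe"] * 5
folds = stratified_folds(labels)

# Invented per-fold macro-F1 values, for illustration only.
example_scores = [0.72, 0.74, 0.76, 0.73, 0.75]
avg, spread = summarize(example_scores)
print(f"{avg:.2f} ± {spread:.2f}")
```

Each fold serves once as the validation set while the remaining four are used for training; whether the reported ± 0.02 is a sample or population standard deviation is not stated.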
To further assess generalization beyond the training distribution, the LBP-FAUs-Places365-ResNet model was independently evaluated on the FER2013 benchmark dataset, a widely used external benchmark in facial emotion recognition. The model achieved 62% accuracy on this dataset, which differs substantially from EmoSet-3.3M: it is smaller in scale and consists of low-resolution, tightly cropped grayscale face images with little surrounding scene context, in contrast to the naturalistic, highly diverse in-the-wild imagery of EmoSet-3.3M. While the performance decrease reflects an expected domain shift, the retained accuracy suggests that the hierarchical feature fusion approach captures emotion-relevant patterns that transfer across heterogeneous data distributions, supporting the framework’s applicability in deployment scenarios where training and operational data distributions diverge.
4. Discussion
The experimental results demonstrate that hierarchical multi-level feature fusion substantially improves emotion recognition beyond single-stream CNN approaches. The systematic progression from baseline (69%) through increasingly sophisticated configurations to optimal performance (74%) reveals that while individual features contribute improvements of 1–2 percentage points, strategic combination produces cumulative gains approaching 5 percentage points.
4.1. Synergistic Effects of Multi-Level Feature Integration
The baseline ResNet-50 captures general affective patterns but struggles with nuanced emotional distinctions. The optimal LBP-FAUs-Places365-ResNet configuration (74% accuracy, 0.75 macro-F1) succeeds by integrating three complementary information layers: fine-grained textural features capturing micro-expressions, psychologically grounded FAUs, and environmental scene context. This configuration balances maximal accuracy with enhanced interpretability through facial physiology grounding.
4.2. Feature Modality Comparison
Scene versus Object Context: Places365 semantics (74% accuracy) outperform YOLOv5 object detection (72%) by 2 percentage points, indicating that holistic environmental representations better capture emotional context than discrete objects. Scenes encode spatial depth, lighting, and ambiance, all psychologically relevant to emotion interpretation.
FAUs versus Landmarks: FAU-based configurations match Places365 performance while providing superior interpretability. By translating geometric facial structure into psychologically meaningful muscle activations (FACS), FAUs enhance discriminability between visually similar emotions while grounding predictions in facial physiology.
Texture Integration: LBP features consistently improve performance by 2 percentage points across all configurations, demonstrating that fine-grained textural information (micro-expressions, wrinkles, muscle tension) provides independent discrimination value complementary to global CNN features.
4.3. Challenging Emotion Categories
The framework achieves particular improvements for difficult emotion classes:
Disgust: Largest gain (+0.10 F1), benefiting from distinctive facial geometry captured by landmarks and contextual contamination cues.
Contentment: Improved from 0.58 to 0.63 F1, though remaining challenging due to subtle visual manifestation requiring contextual interpretation.
Amusement: FAU integration reduces confusion with excitement by capturing subtle muscle activation differences (Duchenne markers, eyebrow height) unavailable in raw pixels.
4.4. Computational Efficiency
The framework maintains practical deployment feasibility: only a 23% parameter increase (23.5 M to 29 M) for a 5-percentage-point accuracy gain, 2.2 ms per-image inference latency enabling real-time processing at 450+ fps, and under 820 MB peak memory compatible with edge devices. Linear scaling on the 3.3M-image corpus supports deployment viability across diverse platforms.
4.5. Mechanism of Performance Improvement
The proposed method achieves high performance through three synergistic mechanisms:
Multi-Scale Information Integration: LBP captures micro-expressions invisible to global CNNs, FAUs encode interpretable muscle dynamics, and Places365 provides contextual disambiguation. Each modality contributes unique discriminative information, as evidenced by ablation studies showing consistent gains when adding each feature type.
Context-Dependent Interpretation: Scene features enable the model to resolve facial expression ambiguity. For instance, contentment and sadness share similar low-arousal facial patterns but occur in distinct environmental contexts (peaceful parks versus dimly lit spaces), allowing for scene-guided disambiguation.
Hierarchical Feature Abstraction: The three-level hierarchy (texture → muscle dynamics → scene) mirrors human emotional perception, which integrates facial details with situational understanding. This cognitively aligned architecture facilitates more human-consistent emotion classification.
5. Conclusions
This research demonstrates that hierarchical integration of complementary visual features substantially advances emotion recognition beyond individual modalities. The framework bridges theoretical multimodal emotion processing with practical systems deployable at scale. By combining texture analysis, psychologically grounded facial dynamics, and environmental semantics with efficient deep learning, it achieves human-aligned emotion understanding suitable for real-world human–computer interaction and affective computing applications.
The work validates that complex perceptual tasks benefit from multi-level feature hierarchies mirroring human cognitive processing. As emotion recognition systems advance toward practical deployment, maintaining balanced focus on accuracy and interpretability, performance and efficiency, and technical capability with ethical responsibility remains essential.
The framework particularly benefits previously challenging categories: contentment (+0.05 F1), fear (+0.06), and amusement (+0.06). The LBP-FAUs-Places365-ResNet configuration achieves the most favorable performance profile, combining maximal accuracy with enhanced interpretability through psychologically grounded FAUs, enabling predictions traceable to specific muscle activations and environmental contexts.
The results further indicate that systematic hierarchical fusion across abstraction levels outperforms individual modalities or ad hoc concatenation. A substantial 5-percentage-point accuracy improvement requires only a 23% parameter increase with 2.2 ms inference latency, enabling real-time deployment. The study shows that psychologically valid Facial Action Unit abstractions match the accuracy of raw landmark coordinates while enhancing interpretability, and provides empirical evidence that scene-level semantics outperform object detection by 2 percentage points.
The framework’s efficiency positions it for deployment across affective computing (emotion-aware interfaces, conversational agents), behavioral analysis (educational and clinical settings), mental health monitoring (supplementary assessment tools), and social robotics (emotion-aware interaction systems).
Linear scaling on EmoSet-3.3M with maintained efficiency (less than 2.3 ms latency, less than 820 MB memory) ensures applicability to increasingly large datasets without proportional resource demands, supporting deployment on edge devices, cloud infrastructure, and heterogeneous environments.
Promising avenues include video-based temporal dynamics modeling, multimodal fusion with audio and physiological signals, robust domain generalization ensuring cross-cultural and cross-dataset performance, interpretability enhancement through attention mechanisms, and lightweight architecture optimization for edge deployment through neural architecture search and quantization.
The current framework, trained on EmoSet-3.3M, may exhibit limited generalization to domains with significant distribution shift. Future work will explore the following:
- (1) Self-Supervised Pretraining: Contrastive learning approaches could enable learning from unlabeled emotion-relevant imagery, reducing dependence on annotated datasets and improving cross-domain generalization.
- (2) Domain Adaptation: Techniques such as adversarial domain adaptation could bridge gaps between training and deployment domains, particularly for cultural variations in emotional expression.
- (3) Generative Model Priors: Leveraging priors from large-scale generative models (e.g., diffusion models) could provide regularization improving robustness to out-of-distribution inputs.
Future work will focus on validating the model’s internal decision-making through comprehensive interpretability analysis. We plan to utilize t-SNE to visualize feature space topology, expecting the fused representation to demonstrate tighter intra-class clustering and distinct boundaries for high-arousal emotions compared to the baseline. To quantify the specific impact of each modality, we will conduct feature ablation studies measuring accuracy degradation when individual streams are suppressed, hypothesizing that while global features remain dominant, contextual and facial cues provide essential complementary information. Additionally, we intend to employ Grad-CAM visualization to confirm that the network correctly attends to both facial landmarks and relevant environmental elements, ensuring that performance gains stem from meaningful multi-level integration rather than dataset biases.
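The planned stream-suppression ablation could take a form like the sketch below, which zeroes one stream of the fused vector at inference time and records the resulting accuracy drop. Everything here is hypothetical: the toy two-stream layout, the rule-based `toy_predict` stand-in for a trained classifier, and the choice of zero-masking rather than retraining without the stream.

```python
def ablate(vector, stream_slice):
    """Return a copy of `vector` with one stream's entries set to zero."""
    out = list(vector)
    for i in range(stream_slice.start, stream_slice.stop):
        out[i] = 0.0
    return out


def accuracy(predict, X, y):
    """Fraction of samples for which `predict` returns the true label."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)


def ablation_drops(predict, X, y, slices):
    """Accuracy lost when each stream is suppressed in turn."""
    base = accuracy(predict, X, y)
    return {
        name: base - accuracy(predict, [ablate(x, s) for x in X], y)
        for name, s in slices.items()
    }


# Toy setup: a 4-dimensional fused vector with two 2-dimensional streams.
toy_slices = {"face": slice(0, 2), "scene": slice(2, 4)}
X = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
y = [1, 0]


def toy_predict(v):
    # Stand-in classifier that relies only on the "face" stream.
    return 1 if sum(v[:2]) > 0 else 0


drops = ablation_drops(toy_predict, X, y, toy_slices)
print(drops)
```

A larger drop for a stream indicates that the classifier depends on it more heavily; zero-masking measures reliance of an already trained model, whereas retraining without a stream measures how much information that stream adds.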