IAE-Net: Incremental Learning-Based Attention-Enhanced DenseNet for Robust Facial Emotion Recognition

Khan, Haseeb Ali; Lee, Jong-Ha

doi:10.3390/math14061023

by Haseeb Ali Khan and Jong-Ha Lee *

Department of Biomedical Engineering, Keimyung University, Daegu 42601, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(6), 1023; https://doi.org/10.3390/math14061023

Submission received: 2 February 2026 / Revised: 13 March 2026 / Accepted: 16 March 2026 / Published: 18 March 2026

(This article belongs to the Special Issue Recent Advances and Applications of Artificial Neural Networks)

Abstract

Facial emotion recognition (FER) is an important component of human–computer interaction and healthcare-oriented affective computing. However, reliable deployment remains difficult in unconstrained settings due to appearance and geometric variability (e.g., pose, illumination, and occlusion), demographic imbalance, and dataset bias. In practice, two additional constraints frequently limit real-world FER systems: the computational overhead of heavy architectures and limited adaptability when data evolve over time, where sequential updates can cause catastrophic forgetting. To address these challenges, we propose the Incremental Attention-Enhanced Network (IAE-Net), a compact single-branch framework built on a DenseNet121 backbone and a cascaded refinement pipeline. The model incorporates Channel Attention (CA) to emphasize expression-relevant feature channels and suppress less informative responses, followed by a deformable attention module (DA) that reduces feature misalignment caused by non-rigid facial motion and pose shifts, thereby improving robustness under geometric variability. For continual deployment, IAE-Net supports class-incremental updates via weight transfer, exemplar replay, and knowledge distillation to improve retention during sequential learning. We evaluate IAE-Net on four widely used benchmarks, FER2013, FERPlus, KDEF, and AffectNet, covering both controlled and in-the-wild conditions under a unified training protocol. The proposed approach achieves accuracies of 79.15%, 92.03%, 99.48%, and 74.20% on FER2013, FERPlus, KDEF, and AffectNet, respectively, with balanced precision, recall, and F1-score trends. These results indicate that IAE-Net provides an efficient and extensible FER framework with potential utility in dynamic real-world and longitudinal healthcare-oriented applications.

Keywords: facial emotion recognition; DenseNet121; channel attention; deformable attention; incremental learning

MSC: 68T07; 68T10; 62H35

1. Introduction

Facial expressions convey a substantial share of human intent, affect, and mental state, shaping both social interaction and healthcare-oriented human–computer communication. Facial emotion recognition (FER) has advanced rapidly with the rise of deep learning (DL), enabling applications in online education [1], biometric authentication [2], driver vigilance and safety [3], and healthcare and neurocognitive rehabilitation [4,5]. Psychological and behavioral studies suggest that nonverbal cues—especially facial movements—encode a substantial portion of communicative content [6,7,8]. This motivates FER systems that remain accurate and dependable under real-world conditions, and that can be updated as new data becomes available. Recent emotion-aware interactive systems have also explored both emotion detection and emotion induction in immersive environments, further highlighting the importance of robust affective perception modules in human-centered applications [9]. In clinical contexts, FER and facial-expression analysis support screening and monitoring of neurological and neurodegenerative disorders (NDs), including Parkinson’s disease and post-stroke sequelae [10,11,12]. Patients frequently exhibit hypomimia (reduced range of facial muscle movement), which can impair expressivity and nonverbal social communication [13,14]. Non-invasive FER pipelines may, therefore, assist longitudinal assessment and personalized therapy [7,15]. In such longitudinal settings, models often need continual adaptation to changes in subjects, recording conditions, and symptom progression, which makes incremental or continual learning an important practical requirement alongside recognition accuracy. Early studies leveraged facial mimicry and expression dynamics to track dysfunction during therapy [16,17]. 
More recent work has compared machine learning (ML) and DL approaches for Parkinson’s detection from facial videos [18], and targeted emotion-training programs have been used to improve social responsiveness in specific cohorts [19]. These developments highlight FER not only as a recognition problem but also as a clinically relevant measurement tool.

Classical FER methods relied on hand-crafted descriptors such as scale-invariant feature transforms, local binary patterns, and histograms of oriented gradients, typically combined with shallow classifiers [20,21,22]. Although these approaches achieved reasonable performance in controlled laboratory settings, they often struggled under pose variation, occlusion, illumination changes, and demographic diversity. Deep convolutional neural networks (CNNs) have largely superseded hand-crafted pipelines by learning hierarchical features and nonlinear decision boundaries. Representative milestones include DenseNet architectures with dense connectivity [23], the Inception family [24,25], margin-based classifiers that replace conventional softmax heads [26], and very deep VGG-style networks [27]. Despite their success, standard CNNs can remain sensitive to background clutter and geometric variability, particularly on in-the-wild datasets. To improve robustness and interpretability, recent FER systems increasingly incorporate attention mechanisms, temporal modeling, and ensemble strategies. Hybrid CNN–BiLSTM architectures capture temporal dynamics in facial muscle movements [28,29], while optical-flow-assisted models explicitly leverage motion cues for dynamic FER [30]. Cascaded spatiotemporal attention mechanisms further enhance dynamic emotion recognition by focusing on spatially and temporally salient regions [31]. On the efficiency side, lightweight FER models based on depthwise separable convolutions [32] and few-shot relation networks [33] have been proposed to support deployment under resource constraints [34]. Ensemble and optimization-based strategies, including landmark-aware transfer learning [35], particle-swarm-assisted model selection [36], and attention-enhanced deep ensemble FER frameworks [37], have also been explored to further improve recognition performance. 
Beyond categorical FER, affective computing is increasingly studied for mental-health applications such as automatic depression detection and depression severity estimation from multimodal behavioral signals. While such systems often incorporate speech and temporal dynamics, robust facial representations remain an important cue under unconstrained conditions. This broader context further motivates FER models that generalize under acquisition variability and can be updated over time in longitudinal settings [38,39,40].

Despite this progress, several limitations in existing FER algorithms remain: (i) Computational overhead: many high-accuracy systems are computationally demanding, which limits deployment under resource constraints; (ii) Generalization under acquisition variability: performance can be fragile under dataset bias and acquisition variability, particularly when moving from controlled laboratory settings to in-the-wild images [41,42,43]; (iii) Limited continual adaptation: many systems do not support continual updates, and sequential fine-tuning can lead to catastrophic forgetting as new subjects or conditions appear; similar challenges have also been recognized in related class-incremental vision settings with evolving data streams [44]; (iv) Challenging visual conditions: low-resolution inputs and partial occlusions can suppress subtle expression cues; and (v) Imbalance and fairness: demographic imbalance and long-tailed emotion distributions can cause uneven performance across classes and capture conditions. These factors underscore the need for architectures that reduce feature misalignment under pose- and expression-related geometric changes and emphasize discriminative facial cues without overfitting to background-correlated evidence.

Motivated by these challenges, this work proposes an attention-enhanced FER architecture built on a single DenseNet121 backbone augmented with two complementary mechanisms: Channel Attention (CA) and Deformable Attention (DA). DenseNet121 is adopted as the core feature extractor because its dense connectivity encourages feature reuse and facilitates gradient flow, which is beneficial for capturing fine-grained facial cues. The CA module re-weights backbone channels using global average and max pooling followed by a shared multilayer perceptron, thereby amplifying expression-relevant cues (for example, around the eyes, eyebrows, nasolabial folds, and mouth corners) while suppressing less informative responses. Building on this channel selection, the DA module predicts spatial offset fields via a lightweight convolutional layer and applies differentiable warping to reduce feature misalignment. This design helps the network accommodate non-rigid facial deformations, head-pose variations, and imperfect face localization, producing more geometrically consistent representations before global pooling and classification. To support long-term deployment where new data become available over time, we further introduce a continual learning extension referred to as the Incremental Attention-Enhanced Network (IAE-Net). This extension combines exemplar rehearsal with knowledge distillation to reduce catastrophic forgetting under sequential updates. Importantly, the proposed approach retains a single-branch DenseNet121 design, meaning no multi-backbone ensemble, to preserve architectural simplicity. We evaluate the proposed model on four widely used FER benchmarks, FER2013, FERPlus, KDEF, and AffectNet, which collectively span controlled laboratory settings and in-the-wild conditions with diverse poses, lighting, and demographics. Under a unified training protocol, the results show consistent improvements over strong CNN baselines. 
Unlike prior CNN-attention FER pipelines that primarily use attention for feature reweighting or multi-branch fusion, our refinement is explicitly cascaded: CA first selects expression-relevant channels, and the subsequent DA estimates dense offsets to reduce pose- and non-rigid deformation-induced feature misalignment. This sequential design addresses complementary failure modes while preserving a single-backbone structure suitable for continual updates.

The main contributions of this study are as follows:

IAE-Net is presented as a compact FER framework that integrates a DenseNet121 backbone with CA-based channel re-weighting and DA-based feature alignment to improve discriminative emphasis and geometric consistency without relying on heavy ensembles.
We extend the proposed pipeline to an incremental learning setting using weight transfer, a class-balanced exemplar rehearsal buffer, and knowledge distillation, enabling sequential learning of new emotion classes while reducing catastrophic forgetting.
Class-incremental performance is reported across multiple stages and retention is quantified using forgetting rate (FR) and total knowledge loss (TKL), providing evidence of stability under sequential updates in addition to offline accuracy.
Extensive evaluation is conducted on FER2013, FERPlus, KDEF, and AffectNet under standard within-dataset protocols, including backbone and attention ablations under a unified training setup, along with Grad-CAM visualizations to inspect expression-relevant regions that drive predictions.

The remainder of this paper is organized as follows: Section 2 describes the proposed IAE-Net framework and the continual learning strategy. Section 3 presents the experimental setup, quantitative results, and discussion. Section 4 concludes the paper and outlines future directions.

2. Methodology

This section presents IAE-Net, a compact FER framework that integrates a single-branch DenseNet121 backbone with a cascaded attention refinement block comprising CA and DA. The processing pipeline consists of four stages: data preprocessing and augmentation, deep feature extraction using an ImageNet-pretrained backbone, attention-based feature refinement via CA followed by DA, and a lightweight classification head optimized using sparse categorical cross-entropy. To support continual deployment, the same network can be updated under a class-incremental learning protocol, where the classifier is expanded as new classes become available and previously learned knowledge is preserved using exemplar replay and weight transfer. All baseline backbones and ablation variants follow the same preprocessing and optimization protocol to ensure fair comparisons. Figure 1 summarizes the overall workflow, while Figure 2 shows the model-level architecture and module ordering.

2.1. Preprocessing and Data Preparation

IAE-Net is evaluated on four widely used FER benchmarks: FER2013, FERPlus, KDEF, and AffectNet. FER2013 and FERPlus images are used as provided and resized to $48 \times 48$; because these two datasets are grayscale, each image is replicated across three channels to match the CNN input format. KDEF and the AffectNet Kaggle subset are used as provided by the dataset source and resized to $224 \times 224$ for training and evaluation. No additional face detection, landmark alignment, or manual cropping is applied beyond the dataset-provided facial images. Each image is normalized to $[0, 1]$. This unified normalization reduces differences in dynamic range across datasets and helps maintain stable optimization during training. To reduce overfitting and improve robustness, online data augmentation is applied during training. For each input $x_{i}$, an augmented sample $\tilde{x}_{i}$ is generated as:

$$\tilde{x}_{i} = A(x_{i}; \theta_{A}) \tag{1}$$

where $A(\cdot)$ denotes a stochastic augmentation operator including random horizontal flips, small in-plane rotations ($\pm 15^{\circ}$), moderate translation and zoom, and mild brightness and contrast perturbations, and $\theta_{A}$ denotes augmentation parameters sampled independently for each minibatch. When official training, validation, and test splits are provided, as in FER2013 and FERPlus, they are used directly. For datasets without predefined subject-independent splits, such as KDEF, identities are partitioned into training, validation, and test sets with no overlap to prevent subject leakage. Unless otherwise stated, all ablation experiments use the same splits across backbones and attention configurations.
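The subject-disjoint partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 70/15/15 ratios, the seed, the function name `subject_disjoint_split`, and the sample naming scheme are all assumptions, since the text does not state them.

```python
import random

def subject_disjoint_split(subject_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split unique subject IDs into disjoint train/val/test sets.

    The 70/15/15 ratios and the seed are illustrative assumptions.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_train = int(ratios[0] * len(subjects))
    n_val = int(ratios[1] * len(subjects))
    train = set(subjects[:n_train])
    val = set(subjects[n_train:n_train + n_val])
    test = set(subjects[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Hypothetical sample list: (subject_id, image_name) pairs for 70 subjects.
samples = [(f"S{s:02d}", f"img_{s:02d}_{k}.jpg") for s in range(70) for k in range(5)]
train_ids, val_ids, test_ids = subject_disjoint_split([s for s, _ in samples])

# Route every image by its subject, so no identity spans two splits.
train_set = [x for x in samples if x[0] in train_ids]
val_set = [x for x in samples if x[0] in val_ids]
test_set = [x for x in samples if x[0] in test_ids]
```

Splitting on identities rather than on images is what prevents subject leakage: all images of one person land in exactly one partition.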

2.2. DenseNet121 Backbone for Feature Extraction

The feature extractor is DenseNet121 initialized with ImageNet-pretrained weights and used with include_top = False. Dense connectivity promotes feature reuse and stable gradient propagation, which is well-suited for capturing subtle facial deformations. Given an input image $x \in \mathbb{R}^{H \times W \times 3}$, the backbone produces a convolutional feature tensor:

$$f_{base} = F_{DN}(x; \Theta_{DN}), \quad f_{base} \in \mathbb{R}^{H' \times W' \times C} \tag{2}$$

where $\Theta_{DN}$ denotes trainable parameters, $C$ is the number of output channels, and $H' \times W'$ is the downsampled spatial resolution. For the alternative-backbone baselines (e.g., InceptionV3, ResNet50, Xception), we replace $F_{DN}$ while keeping the attention and classifier blocks unchanged, which isolates the effect of the backbone from that of the CA and DA modules. To validate the choice of DenseNet121 under the same training protocol, we include representative baselines spanning different complexity regimes, including a mobile-oriented CNN (MobileNetV2), widely used CNN backbones (InceptionV3, ResNet50, and Xception), and recent architectures (ConvNeXt and ViT). We report recognition performance for these backbones under the unified protocol (Section 3.4) and provide deployment-oriented efficiency profiling for representative backbones under the same workstation setting, including parameter count, FLOPs, measured inference latency, numerical precision, and CPU/GPU throughput (Section 3.7). These results enable a direct accuracy–efficiency comparison across models.
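To make the downsampled resolution $H' \times W'$ in Equation (2) concrete, the sketch below estimates the backbone output size for the two input resolutions used in Section 2.1. It assumes the standard DenseNet121 layout with a total stride of 32 (five stride-2 stages) and 1024 output channels; the text does not report these values or say whether the $48 \times 48$ images are upsampled before the backbone, so the numbers are indicative only. The helper name `feature_map_side` is hypothetical.

```python
def feature_map_side(input_side, num_stride2_stages=5):
    """Approximate the backbone output side length by repeated halving.

    Assumes a total stride of 32 (five stride-2 stages), as in the
    standard DenseNet121; floor division matches the two sizes used here.
    """
    side = input_side
    for _ in range(num_stride2_stages):
        side //= 2
    return side

for side in (48, 224):
    out = feature_map_side(side)
    print(f"{side}x{side} input -> {out}x{out} feature map")
# prints:
# 48x48 input -> 1x1 feature map
# 224x224 input -> 7x7 feature map
```

Under these assumptions, the attention modules would see a $7 \times 7 \times 1024$ tensor for $224 \times 224$ inputs and a $1 \times 1 \times 1024$ tensor for $48 \times 48$ inputs.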

2.3. Channel Attention (CA)

Not all channels in $f_{base}$ contribute equally to emotion recognition. CA recalibrates channel responses by emphasizing expression-relevant activations and suppressing redundant background responses. Two global channel descriptors are computed using global average pooling (GAP) and global max pooling (GMP):

$$v_{avg} = \mathrm{GAP}(f_{base}), \quad v_{\max} = \mathrm{GMP}(f_{base}), \quad v_{avg}, v_{\max} \in \mathbb{R}^{C} \tag{3}$$

Both descriptors are processed by a shared two-layer MLP with reduction ratio $r$:

$$s_{avg} = W_{2}\, \delta(W_{1} v_{avg}) \tag{4}$$

$$s_{\max} = W_{2}\, \delta(W_{1} v_{\max}) \tag{5}$$

where $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$, and $\delta(\cdot)$ denotes the ReLU activation. The final CA mask is obtained by summing the two responses and applying a sigmoid gate:

$$M_{C} = \sigma(s_{avg} + s_{\max}), \quad M_{C} \in (0, 1)^{C} \tag{6}$$

where $\sigma(\cdot)$ denotes the sigmoid function. Finally, the channel-refined feature map is:

$$f_{C} = M_{C} \otimes f_{base} \tag{7}$$

where $\otimes$ denotes channel-wise multiplication with broadcasting over spatial dimensions. The CA design used in IAE-Net is illustrated in Figure 3.
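Equations (3)–(7) can be traced on a toy tensor with the following plain-Python sketch. It is an illustration under stated assumptions rather than the authors' Keras implementation: the feature map is a nested list `f[h][w][c]`, the weights are random stand-ins for learned parameters, bias terms are omitted as in Equations (4) and (5), and the sizes ($C = 4$, $r = 2$) are chosen only to keep the example small.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def channel_attention(f, W1, W2):
    """Apply Equations (3)-(7) to a feature map stored as f[h][w][c]."""
    pixels = [px for row in f for px in row]
    C = len(pixels[0])
    v_avg = [sum(px[c] for px in pixels) / len(pixels) for c in range(C)]  # GAP
    v_max = [max(px[c] for px in pixels) for c in range(C)]                # GMP

    def shared_mlp(v):
        hidden = [max(0.0, h) for h in matvec(W1, v)]  # ReLU on the reduced vector
        return matvec(W2, hidden)

    s_avg, s_max = shared_mlp(v_avg), shared_mlp(v_max)
    mask = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(s_avg, s_max)]
    # Broadcast the per-channel mask over all spatial positions.
    f_c = [[[m * x for m, x in zip(mask, px)] for px in row] for row in f]
    return mask, f_c

rng = random.Random(0)
H, W, C, r = 2, 3, 4, 2  # toy sizes (assumptions)
f_base = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
W1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]  # (C/r) x C
W2 = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]  # C x (C/r)
mask, f_c = channel_attention(f_base, W1, W2)
```

Because the same `W1` and `W2` process both descriptors, the module adds only $2C^{2}/r$ weights regardless of spatial resolution.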

2.4. Deformable Attention (DA)

Facial expressions involve non-rigid geometric deformations (e.g., eyebrow raise, cheek lift, mouth opening) that can cause spatial misalignment in fixed-grid feature sampling. DA introduces a deformation alignment step by learning spatial offsets and applying differentiable warping to the feature map. Given $f_{C} \in \mathbb{R}^{H' \times W' \times C}$, which has the same shape as $f_{base}$ in Equation (2), we predict a 2D offset field using a lightweight $3 \times 3$ convolution with bounded activation:

$$\Delta = \tanh(\mathrm{Conv}_{3 \times 3}(f_{C})), \quad \Delta \in \mathbb{R}^{H' \times W' \times 2} \tag{8}$$

Let $\Delta(u, v) = (\Delta u(u, v), \Delta v(u, v))$ denote the offset at location $(u, v)$. The aligned feature map is obtained by bilinear sampling:

$$\hat{f}(u, v) = \mathrm{BilinearSample}\left(f_{C}, (u + \Delta u(u, v),\ v + \Delta v(u, v))\right) \tag{9}$$

In practice, this is implemented using a standard differentiable warping operator (dense image warp). The aligned features are fused with the original features via a residual connection, followed by a $1 \times 1$ convolution and batch normalization:

$$f_{D} = \mathrm{BN}(\phi_{1 \times 1}(f_{C} + \hat{f})) \tag{10}$$

This residual formulation preserves semantic content while incorporating deformation-corrected representations. Figure 4 illustrates the DA process and the residual fusion used to obtain $f_{D}$.
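The warping step in Equations (8)–(10) can be illustrated on a single-channel toy map. The sketch below is a simplified stand-in, not the authors' implementation: the raw offsets are hard-coded in place of the $3 \times 3$ convolution output, offsets are taken to be in feature-map pixel units, out-of-range coordinates are clamped to the border, and the $1 \times 1$ convolution and batch normalization of Equation (10) are omitted, so only the residual sum $f_{C} + \hat{f}$ is shown. All of these are assumptions.

```python
import math

def bilinear_sample(f, u, v):
    """Sample 2D map f at continuous (row u, col v), clamping to the border."""
    H, W = len(f), len(f[0])
    u = min(max(u, 0.0), H - 1.0)
    v = min(max(v, 0.0), W - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, H - 1), min(v0 + 1, W - 1)
    du, dv = u - u0, v - v0
    top = (1 - dv) * f[u0][v0] + dv * f[u0][v1]
    bottom = (1 - dv) * f[u1][v0] + dv * f[u1][v1]
    return (1 - du) * top + du * bottom

def deformable_align(f, offsets):
    """Warp f by per-location offsets (Eq. 9) and form the residual sum f + f_hat."""
    H, W = len(f), len(f[0])
    f_hat = [[bilinear_sample(f, u + offsets[u][v][0], v + offsets[u][v][1])
              for v in range(W)] for u in range(H)]
    fused = [[f[u][v] + f_hat[u][v] for v in range(W)] for u in range(H)]
    return f_hat, fused

# Toy single-channel map with f[u][v] = 3u + v.
f_c = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
# Hypothetical raw predictions standing in for Conv_3x3(f_C); tanh bounds them (Eq. 8).
raw = [[(0.0, 0.55) for _ in range(3)] for _ in range(3)]
offsets = [[(math.tanh(a), math.tanh(b)) for a, b in row] for row in raw]
f_hat, fused = deformable_align(f_c, offsets)
```

With all offsets at zero the warp is the identity, so the residual branch can start from the unaligned features and learn only the corrections it needs.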

The proposed DA block operates as a lightweight local alignment module at the feature level. Unlike STNs, which typically estimate a global spatial transformation, DA predicts a dense offset field directly on the refined feature map and performs a single differentiable residual warping step for local refinement. Unlike deformable convolution, which modifies sampling positions within the convolution kernel itself, DA preserves the original backbone convolutions and performs alignment only after channel refinement. A direct comparison with STN and DeformConv is presented in Section 3.4.

2.5. Classification Head and Training Objective

The refined feature tensor $f_{D}$ is summarized by global average pooling:

$$h = \mathrm{GAP}(f_{D}) \tag{11}$$

The embedding is passed through two fully connected layers with dropout:

$$h_{1} = \delta(W_{1} h + b_{1}), \quad h_{1} \in \mathbb{R}^{256} \tag{12}$$

$$h_{2} = \delta(W_{2} h_{1} + b_{2}), \quad h_{2} \in \mathbb{R}^{64} \tag{13}$$

Here $W_{1}$, $W_{2}$, $b_{1}$, and $b_{2}$ denote the weights and biases of the two dense layers; they are distinct from the CA projection matrices in Equations (4) and (5). A softmax classifier with $K$ outputs (dataset-dependent number of classes) produces class probabilities:

$$p(y = k \mid x) = \frac{\exp(z_{k})}{\sum_{j = 1}^{K} \exp(z_{j})}, \quad z = W_{cls} h_{2} + b_{cls} \tag{14}$$

The network is trained end-to-end using sparse categorical cross-entropy:

$$L_{CE} = -\log p(y \mid x) \tag{15}$$

Optimization is performed using SGD with momentum:

$$\mathrm{SGD}(\mathrm{lr} = 10^{-4},\ \mathrm{momentum} = 0.9) \tag{16}$$

A dropout rate of 0.3 is applied between dense layers. Early stopping based on validation loss is used to select the final model checkpoint per dataset.
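As a worked illustration of Equations (14)–(16), the sketch below computes a softmax, the sparse categorical cross-entropy for an integer label, and one momentum-SGD update on a scalar weight. The logits, label, and gradient value are hypothetical, and the update rule shown is one common momentum formulation, assumed here rather than taken from the text.

```python
import math

def softmax(z):
    """Equation (14), computed stably by subtracting the maximum logit."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cce(z, y):
    """Equation (15): negative log-probability of the integer label y."""
    return -math.log(softmax(z)[y])

def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.9):
    """One update with the hyperparameters of Equation (16)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

logits = [0.2, 1.5, -0.3, 0.0, 2.1, -1.0, 0.4]  # hypothetical logits, K = 7
probs = softmax(logits)
loss = sparse_cce(logits, y=4)
w, vel = sgd_momentum_step(w=1.0, grad=2.0, velocity=0.0)
```

The "sparse" variant takes the class index directly, so labels do not need to be one-hot encoded; for uniform logits the loss equals $\log K$.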

2.6. Continual Learning Under Sequential Updates

To support deployment settings where classes become available over time, we evaluate IAE-Net under a class-incremental protocol over $T$ stages. At stage $t$, the model observes a subset of classes and is updated to accommodate newly introduced classes. Let $\Theta_{t}$ denote model parameters after stage $t$. Each stage is initialized using weight transfer from the previous stage ($\Theta_{t-1}$).

2.6.1. Exemplar Replay and Weight Transfer

To reduce catastrophic forgetting, we maintain a bounded rehearsal buffer $P$ containing a class-balanced subset of exemplars from previously seen classes. During training at stage $t$, the effective training set is the union of current-stage data and replay samples:

$$D_{train} = D_{t} \cup P \tag{17}$$

In our implementation, the rehearsal memory uses a fixed budget of 20 exemplars per class. Stage 1 is initialized with two classes, and each subsequent stage introduces one additional class. Consequently, the total stored exemplar memory after each stage is 40, 60, 80, 100, 120, 140, and 160 samples for 2, 3, 4, 5, 6, 7, and 8 learned classes, respectively. During each update, replay samples are drawn from previously learned classes only. Exemplars are selected by random sampling without replacement from the training set of each class, and the buffer is refreshed at the end of every stage to maintain balanced coverage while avoiding duplication.
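The buffer schedule described above can be reproduced with a short simulation. This is a sketch under assumptions: each class is given 100 hypothetical image names, the helper `build_buffer` and the seed are illustrative, and the real training step is replaced by simply assembling $D_{train}$.

```python
import random

def build_buffer(train_by_class, per_class=20, seed=0):
    """Pick a class-balanced exemplar set by sampling without replacement."""
    rng = random.Random(seed)
    return {c: rng.sample(items, min(per_class, len(items)))
            for c, items in train_by_class.items()}

# Hypothetical training pools: 8 classes with 100 image names each.
data = {c: [f"class{c}_img{i}" for i in range(100)] for c in range(8)}
stages = [[0, 1]] + [[c] for c in range(2, 8)]  # two classes first, then one per stage

seen, buffer, sizes, train_sizes = [], {}, [], []
for new_classes in stages:
    replay = [x for items in buffer.values() for x in items]  # previously learned classes only
    current = [x for c in new_classes for x in data[c]]
    d_train = current + replay                                # Equation (17)
    train_sizes.append(len(d_train))
    seen += new_classes
    buffer = build_buffer({c: data[c] for c in seen})         # refresh at end of stage
    sizes.append(sum(len(v) for v in buffer.values()))

print(sizes)  # [40, 60, 80, 100, 120, 140, 160]
```

The stored memory grows linearly at 20 samples per learned class, which matches the 40 to 160 sample totals listed in the text.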

2.6.2. Incremental Training Objective

The default incremental objective optimizes cross-entropy on the mixed set $D_{train}$. Additionally, a distillation regularizer can be included to stabilize outputs on previously learned classes by matching softened predictions from a frozen teacher $M_{t-1}$. When enabled, the total loss is:

$$L_{total} = L_{CE}(y, \hat{y}) + \lambda L_{KD}(z^{t-1}, z^{t}) \tag{18}$$

where $\lambda$ controls the stability–plasticity trade-off. The distillation term is defined using KL divergence between temperature-softened output distributions:

$$L_{KD} = \sum_{c} T^{2} \cdot \mathrm{KL}\left(\mathrm{softmax}\left(\frac{z_{c}^{t-1}}{T}\right), \mathrm{softmax}\left(\frac{z_{c}^{t}}{T}\right)\right) \tag{19}$$

In Equation (19), $T$ denotes the distillation temperature, not the number of stages introduced in Section 2.6. We use $T = 2$ and loss weight $\lambda = 1.0$ to balance stability and plasticity: $T$ controls the softness of the teacher distribution, and $\lambda$ controls how strongly previously learned outputs are preserved during updates. If distillation is not activated, Equation (18) reduces to the cross-entropy objective on $D_{train}$.
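A minimal numerical sketch of the distillation objective in Equations (18) and (19) follows. It assumes the common convention of comparing the teacher's logits with the student's logits for the same previously learned classes (the student's expanded classifier has extra outputs that are excluded here); the text does not spell out this detail. The logit values and the cross-entropy value are hypothetical.

```python
import math

def softmax_t(z, T):
    """Temperature-softened softmax."""
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(z_teacher, z_student_old, T=2.0):
    """T^2-scaled KL divergence from the teacher to the student distribution."""
    p = softmax_t(z_teacher, T)
    q = softmax_t(z_student_old, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def total_loss(ce, kd, lam=1.0):
    """Equation (18) with loss weight lambda."""
    return ce + lam * kd

z_teacher = [2.0, 0.5, -1.0]        # frozen model from stage t-1 (3 old classes)
z_student = [1.6, 0.9, -0.7, 0.3]   # current model, one new class appended
kd = kd_loss(z_teacher, z_student[:len(z_teacher)])
loss = total_loss(ce=0.85, kd=kd)   # ce is a hypothetical cross-entropy value
```

The $T^{2}$ factor keeps the gradient magnitude of the softened term comparable to that of the cross-entropy term as the temperature changes.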

3. Results and Discussion

This section reports a comprehensive evaluation of IAE-Net, defined as a DenseNet121 backbone augmented with the proposed cascaded CA and DA blocks, followed by a lightweight classification head. All results below are produced under the training protocol described in Section 2 and the experimental setup summarized in Section 3.1; any deviation is stated explicitly.

3.1. Experimental Setup

All experiments were conducted on a workstation running Windows 10 equipped with an Intel Core i9 (12th generation) CPU, 32 GB RAM, and an NVIDIA RTX 3080 GPU with 12 GB VRAM. The proposed IAE-Net was implemented in Python (Version 3.7.13) using the Keras library as the frontend and TensorFlow (2.4.0) as the backend. Unless otherwise stated, all models were trained for 50 epochs with a batch size of 64 using stochastic gradient descent (SGD) with a learning rate of 0.0001 and momentum of 0.9. The final checkpoint for each run was selected using early stopping based on validation loss.

3.2. Datasets

The proposed model was evaluated on four widely used facial-expression datasets: FER2013 [45], FERPlus [46], KDEF [47], and AffectNet [48]. FER2013 was introduced as part of the ICML 2013 Challenges in Representation Learning and contains $48 \times 48$ grayscale face images collected via web search; the dataset comprises 35,887 images split into training, validation, and test sets, and is labeled with seven categorical emotions. FERPlus provides crowdsourced re-annotations of the FER2013 images, producing improved label distributions and an expanded emotion set (commonly eight classes including Contempt), which reduces label noise and is useful when more reliable labels are required. KDEF (Karolinska Directed Emotional Faces) is a lab-controlled dataset of 4900 high-resolution posed images from 70 actors (35 female, 35 male) showing seven prototypical expressions; KDEF is appropriate for experiments that require well-lit, posed stimuli and subject-disjoint evaluation. AffectNet is originally a large in-the-wild collection of facial images gathered from the web (on the order of one million images) annotated for both categorical emotions and continuous valence–arousal values [48]. In this study, we use a publicly available Kaggle subset of AffectNet containing 28,175 images annotated into 8 emotion categories (anger, contempt, disgust, fear, happy, neutral, sad, and surprise) [49]. Performance was measured using precision, recall, F1-score, and accuracy. Dataset characteristics and split configurations are summarized in Table 1.

3.3. Comparative Analysis of the Proposed Model Against SOTA Methods

To contextualize the results, Table 2 summarizes reported accuracies of representative CNN-, attention-, and transformer/graph-based architectures on the four datasets considered. Direct comparisons across studies should be interpreted cautiously because evaluation protocols often differ in class definitions, face detection/alignment and cropping, split strategy, data augmentation, and the use of external data. For this reason, Table 2 is used primarily for contextual positioning, while the controlled comparisons under our unified protocol are emphasized in the ablation study (Table 3).

Classical CNN and transfer-learning baselines such as VGG [27], the deep CNN of Mollahosseini [41], DenseNet201 [23], InceptionV3 [24], and Inception-ResNetV2 [25] provide strong reference points. On FER2013, these models commonly report accuracies in the 65% to 70% range (for example, 65.80% for VGG, 66.40% for Mollahosseini, 68.52% for DenseNet201, and 68.86% for InceptionV3), while reported performance on KDEF typically ranges from 86.75% to 94.70%. Transfer-learning variants [50,51] can improve training stability, but their reported FER2013 results remain in the mid-50% to low-60% range (55.80% for [51] and 62.30% for [50]), and the reported AffectNet accuracy (59.20% for [51]) is below most of the other AffectNet entries in Table 2.

Several enhanced CNN and fusion-based approaches aim to improve robustness through feature fusion, auxiliary modules, or hybrid representations. FaceLiveNet and Dense_FaceLiveNet [52,53] report notable gains on controlled datasets, with Dense_FaceLiveNet reaching 95.89% accuracy on KDEF. Deep Fusion based on im-cGAN [54] increases KDEF accuracy to 98.30%, and GA-Dense-FaceLiveNet [55] reports 99.17%. Hybrid pipelines that combine hand-crafted descriptors with learned heads, including FMA+MLP/LD/SVM [43], DCNN [50], DBN [56], and CBiLSTM [28], also achieve strong performance on KDEF (between 89.54% and 94.23% in the reported settings). However, on more challenging benchmarks such as FER2013, many approaches remain below the upper 70% range in reported protocols; for instance, PDREP [57] reaches 73.50% on FER2013 and only 76.33% on KDEF, highlighting sensitivity to dataset shift and unconstrained imaging conditions. GA [58] reports 77.40% on FER2013, indicating that bridging the gap on in-the-wild data remains challenging without increasing model complexity.

More recent FER systems increasingly incorporate explicit attention modeling or transformer-style components. ECAN [59] reports 58.21% on FER2013 and 51.84% on AffectNet, illustrating the difficulty of maintaining robustness under large appearance variability. AU-ViT [60] and VTFF [61] report strong performance on FERPlus (90.15% and 88.81%, respectively). ESSRN [62] and the Novel CNN of [63] improve performance on specific datasets but do not fully bridge the gap between controlled and in-the-wild scenarios. More complex architectures, including LCANet [64], TAN+OLC [65], MFER [66], SSFER [67], FERMixNet [68], FMR-CapsNet [69], and EmotionLens [70], achieve high reported accuracies on FERPlus and/or AffectNet, but these gains are often accompanied by increased architectural complexity, multiple models, or specialized ensemble strategies, which can raise computational cost and reduce deployment simplicity.

Within this landscape, IAE-Net achieves 79.15% on FER2013, 92.03% on FERPlus, 99.48% on KDEF, and 74.20% on AffectNet (Table 2). On FER2013, IAE-Net improves upon strong reported baselines such as GA [58] (77.40%) and PDREP [57] (73.50%), while using a single backbone and a compact attention refinement block. On FERPlus, IAE-Net surpasses several single-model attention/transformer and capsule-based methods (e.g., AU-ViT [60], VTFF [61], LCANet [64], MFER [66], FERMixNet [68], and FMR-CapsNet [69]). On KDEF, IAE-Net attains a near-ceiling accuracy of 99.48%, exceeding GA-Dense-FaceLiveNet [55] by approximately 0.31 percentage points. On AffectNet, IAE-Net reaches 74.20%, outperforming VGG-based and ECAN baselines and remaining competitive with EmotionLens [70] and FMR-CapsNet [69], while still using a single-backbone design.

Overall, these comparisons indicate that IAE-Net offers a favorable accuracy–efficiency trade-off, achieving competitive performance across heterogeneous benchmarks without relying on multi-backbone ensembles or heavyweight transformer stacks. Reported accuracies from representative prior methods are summarized in Table 2.

Table 2. Comparison of recent FER methods across multiple datasets. The last row reports the proposed model; “–” denotes that the method was not evaluated on the corresponding dataset.

Method	FER2013 (%)	FERPlus (%)	KDEF (%)	AffectNet (%)
VGG [27]	65.80	–	86.75	–
Mollahosseini [41]	66.40	–	–	–
InceptionV3 [24]	68.86	–	90.25	–
Inception-ResNetV2 [25]	69.72	–	94.70	–
DenseNet201 [23]	68.52	–	92.52	–
Transfer Learning DCNN [50]	62.30	–	–	–
VGG16 Transfer Learning [51]	55.80	68.40	–	59.20
FaceLiveNet [52]	68.60	–	–	–
Dense_FaceLiveNet [53]	69.99	–	95.89	–
Deep Fusion [54]	–	–	98.30	–
FMA+MLP [43]	59.77	–	92.28	–
FMA+LD [43]	66.60	–	93.67	–
FMA+SVM [43]	61.11	–	92.05	–
DCNN [50]	63.80	–	89.54	–
DBN [56]	–	–	90.22	–
CBiLSTM [28]	58.09	–	94.23	–
GA-Dense-FaceLiveNet [55]	–	–	99.17	–
PDREP [57]	73.50	–	76.33	–
GA [58]	77.40	–	–	–
iVABL [19]	69.60	–	95.63	–
VGG (tuned) [71]	69.65	–	95.92	–
ECAN [59]	58.21	–	86.49	51.84
AU-ViT [60]	–	90.15	–	65.59
VTFF [61]	–	88.81	–	61.85
ESSRN [62]	50.98	–	80.83	–
Novel CNN [63]	72.16	–	89.93	–
LCANet [64]	–	91.43	–	64.43
TAN + OLC [65]	–	90.67	–	65.17
MFER [66]	–	91.09	–	67.06
SSFER [67]	–	85.82	–	65.37
FERMixNet [68]	–	90.58	–	66.40
FMR-CapsNet [69]	–	91.82	–	71.12
EmotionLens [70]	–	–	–	73.96
IAE-Net (Ours)	79.15	92.03	99.48	74.20

3.4. Ablation Study on Channel and Deformable Attention

To quantify the contribution of the proposed attention mechanisms, an ablation study was conducted across four backbones (InceptionV3, ResNet50, Xception, and DenseNet121) and four datasets. Representative configurations per backbone are reported to isolate the effects of enabling CA and DA under a controlled training protocol. All variants share the same preprocessing and training settings; therefore, observed performance differences can be attributed primarily to the inclusion of CA and/or DA rather than to changes in data handling or optimization. Table 3 reports precision, recall, F1-score, and accuracy for each configuration. For runs in which per-class outputs were not available, precision, recall, and F1-score are omitted, and only accuracy is reported. Before discussing individual cases, we note a clear trend across the controlled comparisons in Table 3: adding attention (CA and/or DA) tends to improve FER performance regardless of the chosen backbone, indicating that the gain is not tied to a specific feature extractor. However, the strongest and most consistent improvements are obtained when both modules are combined with DenseNet121. This suggests a synergy where DenseNet121 provides rich reusable facial cues, CA amplifies expression-relevant channels while suppressing background-dominated responses, and DA further stabilizes the representation by reducing spatial misalignment caused by pose variation and non-rigid facial motion. As a result, DenseNet121+CA+DA achieves the best overall scores across the evaluated datasets.

The results in Table 3 show several consistent patterns.

First, introducing attention mechanisms generally improves performance across most backbone–dataset combinations. Enabling CA and/or DA tends to increase accuracy and, when available, improves the precision, recall, and F1-score profile relative to configurations without one of the modules. Because Table 3 does not contain a complete (CA×, DA×) baseline for every backbone–dataset pair, conclusions are drawn primarily from controlled comparisons among the reported configurations within each backbone. For example, on FER2013, the Inception configuration with DA only achieves 61.25% accuracy, which increases to 74.31% when both CA and DA are enabled. Similar gains are observed for Xception and DenseNet121 across FERPlus, KDEF, and AffectNet.

Second, DenseNet121 with both modules enabled provides the strongest overall performance under the unified protocol. On FER2013, accuracy increases from 73.23% (DenseNet121 with CA only) to 79.15% with CA+DA, together with P = 0.789, R = 0.8015, and F1 = 0.7952. On FERPlus, the same configuration reaches 92.03% accuracy with P = 0.908, R = 0.934, and F1 = 0.9208. On KDEF, DenseNet121+CA+DA achieves 99.48% accuracy with near-ceiling precision and recall, consistent with the controlled capture conditions of that dataset. On AffectNet, which is more variable and includes larger appearance and annotation noise, DenseNet121+CA+DA attains 74.20% accuracy and yields the best balance of precision and recall among the DenseNet121 variants evaluated.

Third, CA and DA exhibit complementary behavior. Single-module variants often provide partial gains, whereas enabling both modules consistently yields the strongest performance for DenseNet121 across all evaluated datasets. This pattern suggests that CA primarily improves channel-wise feature selectivity, while DA improves spatial consistency by reducing feature misalignment caused by pose changes and non-rigid facial motion.
Overall, the ablation results indicate that the improvements are not solely attributable to the DenseNet121 backbone, but also to the interaction between dense feature reuse and the combined channel re-weighting and deformation alignment mechanisms.
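As a quick sanity check on the reported metrics, the short sketch below recomputes F1 from the precision and recall values quoted for DenseNet121+CA+DA. It assumes that the reported F1 is the harmonic mean of the reported P and R; the text does not state the averaging scheme, so this is an assumption that the quoted values happen to be consistent with. The function name `f1_from_pr` is illustrative only.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for DenseNet121+CA+DA, copied from Table 3.
reported = {
    "FER2013": (0.7890, 0.8015, 0.7952),
    "FERPlus": (0.9080, 0.9340, 0.9208),
    "AffectNet": (0.7363, 0.7533, 0.7447),
}

for name, (p, r, f1) in reported.items():
    # Assumption: F1 is derived from the aggregate P and R, not averaged per class.
    print(f"{name}: recomputed F1 = {f1_from_pr(p, r):.4f}, reported = {f1:.4f}")
```

If the F1 values were instead macro-averaged per class, small deviations from this recomputation would be expected.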

Table 3. Ablation/backbone comparison of CA and DA across four datasets and backbones. ✓ indicates that the module is enabled; × indicates that it is disabled; “–” denotes unavailable metrics; bold values indicate the best performance within each dataset.

Dataset	Method	CA	DA	P	R	F1	Acc (%)
FER2013	Inception	×	✓	0.6045	0.6416	0.6225	61.25
	ResNet50	×	✓	–	–	–	64.92
	Xception	×	✓	0.6733	0.6831	0.6781	67.73
	DenseNet121	✓	×	0.7177	0.7514	0.7341	73.23
	Inception	✓	✓	0.7291	0.7618	0.7451	74.31
	MobileNetV2	✓	✓	0.7512	0.5793	0.6541	66.00
	ConvNeXt	✓	✓	0.7313	0.7213	0.7263	72.30
	ViT	✓	✓	0.6738	0.7005	0.6869	68.55
	ResNet50	✓	×	0.7051	0.7185	0.7117	71.00
	Xception	✓	✓	0.7059	0.7276	0.7166	71.90
	DenseNet121	✓	✓	0.7890	0.8015	0.7952	79.15
FERPlus	Inception	×	✓	0.8826	0.8795	0.8810	88.08
	ResNet50	✓	✓	0.8983	0.8942	0.8962	89.58
	Xception	×	✓	0.8452	0.8401	0.8427	84.25
	DenseNet121	×	✓	0.8667	0.8770	0.8718	87.10
	Inception	✓	×	0.8135	0.8249	0.8191	81.75
	MobileNetV2	✓	✓	0.8700	0.8599	0.8650	86.10
	ConvNeXt	✓	✓	0.8585	0.8746	0.8665	87.10
	ViT	✓	✓	0.8130	0.8356	0.8241	82.65
	ResNet50	×	✓	0.7890	0.7933	0.7912	79.02
	Xception	✓	×	0.8382	0.8286	0.8334	83.23
	DenseNet121	✓	✓	0.9080	0.9340	0.9208	92.03
KDEF	Inception	×	✓	0.9328	0.9204	0.9266	92.63
	ResNet50	✓	✓	0.8981	0.8866	0.8923	89.15
	Xception	✓	×	0.9278	0.9361	0.9320	93.15
	DenseNet121	×	✓	0.9545	0.9597	0.9570	95.86
	Inception	✓	×	0.9746	0.9712	0.9729	97.28
	MobileNetV2	✓	✓	0.8913	0.8939	0.8926	88.95
	ConvNeXt	✓	✓	0.9520	0.9520	0.9502	95.10
	ViT	✓	✓	0.9215	0.9313	0.9264	92.35
	ResNet50	×	✓	0.8642	0.8780	0.8710	87.00
	Xception	×	✓	0.9322	0.9458	0.9400	93.95
	DenseNet121	✓	✓	0.9935	0.9960	0.9948	99.48
AffectNet	Inception	×	×	0.6193	0.6596	0.6388	64.78
	ResNet50	✓	×	0.6344	0.6436	0.6390	63.53
	Xception	✓	×	0.6541	0.6638	0.6589	66.33
	DenseNet121	✓	×	0.7261	0.7427	0.7343	72.90
	Inception	×	✓	0.7132	0.7302	0.7216	71.55
	MobileNetV2	✓	✓	0.6223	0.6416	0.6318	62.65
	ConvNeXt	✓	✓	0.6912	0.6744	0.6827	69.00
	ViT	✓	✓	0.6456	0.6584	0.6519	64.60
	ResNet50	×	×	0.6057	0.5906	0.5981	59.48
	Xception	×	✓	0.6789	0.6994	0.6890	68.73
	DenseNet121	✓	✓	0.7363	0.7533	0.7447	74.20

Table 4 compares DenseNet121 variants equipped with no alignment module, STN, Deformable Convolution, and the proposed DA under the same training and evaluation protocol on KDEF. The DA-equipped model achieves the strongest performance among the compared alignment variants under the present experimental setting, reaching 96.35% accuracy and 96.13% F1-score. This exceeds both STN (91.25% accuracy) and DeformConv (92.50% accuracy). In addition, DA requires fewer parameters than DeformConv (8.44 M vs. 10.07 M), indicating a more favorable accuracy–complexity trade-off for FER.

For readability, Figure 5 visualizes the final precision, recall, F1-score, and accuracy of the best DenseNet121+CA+DA setting reported in Table 3 across the four datasets.

The ablation trends support the intended mechanism of the proposed cascade. CA improves channel selectivity by amplifying expression-relevant cues and suppressing redundant or background-dominated responses, strengthening discrimination under appearance variability. The DA module improves geometric consistency by reducing feature misalignment caused by non-rigid facial motion and pose shifts. Their cascaded combination, therefore, addresses two distinct failure modes and explains the consistent gains observed across datasets.
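The channel re-weighting idea can be illustrated with a minimal pure-Python sketch. It follows the structure given in the Figure 3 caption (GAP and GMP descriptors passed through a shared two-layer MLP, producing a channel mask), but the ReLU activation, the summation of the two MLP outputs, the sigmoid gate, and all weights below are assumptions made for illustration, not the authors' implementation.

```python
import math


def channel_attention(features, w1, w2):
    """Re-weight the channels of `features` (C x H x W nested lists).

    w1 (hidden x C) and w2 (C x hidden) are the weights of the shared MLP.
    """
    # Channel descriptors: global average pooling and global max pooling.
    flat = [[v for row in ch for v in row] for ch in features]
    gap = [sum(ch) / len(ch) for ch in flat]
    gmp = [max(ch) for ch in flat]

    def mlp(x):
        # Assumed hidden activation: ReLU.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # Assumed fusion: sum both MLP outputs, then squash with a sigmoid.
    logits = [a + b for a, b in zip(mlp(gap), mlp(gmp))]
    mask = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    out = [[[v * m for v in row] for row in ch] for ch, m in zip(features, mask)]
    return out, mask


# Hypothetical 2-channel, 2x2 feature map and hand-picked MLP weights.
features = [
    [[1.0, 3.0], [1.0, 3.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
w1 = [[1.0, -1.0]]      # hidden size 1, input size 2
w2 = [[1.0], [-1.0]]    # output size 2, hidden size 1
out, mask = channel_attention(features, w1, w2)
```

With these toy weights the first channel receives a mask value close to 1 and the second a value close to 0, which is the amplification and suppression behavior described above.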

3.5. Incremental Learning Evaluation

In addition to conventional offline training, we evaluate the proposed model under a continual learning setting with sequential updates, where training proceeds over multiple stages and the model is updated incrementally using newly available data together with a small rehearsal buffer. The model is initialized from the previous stage (weight transfer) and optimized using replay and knowledge distillation as described in Section 2. After each stage, we report precision (P), recall (R), F1-score (F1), and accuracy (Acc) on the evaluation targets used in the current experiment. The full stage-wise results and forgetting statistics are reported in Table 5. For compact presentation, the table reports generic stage labels. Rows Stage 1 to Stage 6 summarize the stage-wise incremental results, whereas the Final evaluation row reports performance after completion of the full incremental sequence for the corresponding dataset. TKL is computed from the stage-level FR values only.
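To make the stage-wise objective more concrete, the following sketch combines a cross-entropy term with a knowledge-distillation term computed against the previous-stage model. This is a generic formulation, not the loss from Section 2: the temperature, the weight `lam`, the T^2 scaling, and the restriction of distillation to old-class outputs are all assumptions, and every function name is hypothetical. In a replay setting, the same loss would be applied to batches that mix new samples with exemplars drawn from the rehearsal buffer.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def incremental_loss(student_logits, label, teacher_logits, lam=1.0, temperature=2.0):
    # The previous-stage model (teacher) only covers the old classes, so the
    # distillation term uses the first len(teacher_logits) student outputs.
    old = student_logits[: len(teacher_logits)]
    return cross_entropy(student_logits, label) + lam * distillation_loss(
        old, teacher_logits, temperature
    )


# Hypothetical logits: two old classes plus one newly added class (label 2).
student = [2.0, 0.5, 1.0]
teacher = [1.8, 0.4]
loss = incremental_loss(student, label=2, teacher_logits=teacher)
```

The distillation term vanishes when the student reproduces the teacher's old-class outputs exactly, so it penalizes only drift away from previously learned behavior.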

To quantify catastrophic forgetting, we report the Forgetting Rate (FR) at each incremental stage as the non-negative drop in previously achieved performance. Let a_{c}^{(t)} denote the accuracy for class c measured after stage t. The forgetting for class c at stage t is defined as:

f_{c}^{(t)} = \max (0, a_{c}^{(t - 1)} - a_{c}^{(t)})

(20)

and the stage-level forgetting rate is obtained by averaging over the set of previously evaluated classes:

{FR}^{(t)} = \frac{1}{|C_{t - 1}|} \sum_{c \in C_{t - 1}} f_{c}^{(t)}

(21)

We further summarize forgetting across the full incremental sequence using Total Knowledge Loss (TKL), defined as the mean FR over the stages reported for a given dataset:

TKL = \frac{1}{T} \sum_{t = 1}^{T} {FR}^{(t)}

(22)

where T is the number of stages. Lower FR and TKL values indicate better retention of previously acquired knowledge under sequential updates.
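Equations (20)–(22) can be traced with a small worked example. The per-class accuracies below are hypothetical values chosen for illustration and are not taken from Table 5; the function names are likewise illustrative. Stage 1 has no previously learned classes, so its FR is set to 0, matching the convention used in Table 5.

```python
def forgetting_rate(prev_acc, curr_acc):
    """Eq. (21): mean non-negative accuracy drop over previously seen classes."""
    # Eq. (20): per-class forgetting is clipped at zero, so gains do not offset losses.
    drops = [max(0.0, prev_acc[c] - curr_acc[c]) for c in prev_acc]
    return sum(drops) / len(drops) if drops else 0.0


def total_knowledge_loss(stage_fr):
    """Eq. (22): mean of the stage-level forgetting rates."""
    return sum(stage_fr) / len(stage_fr)


# Hypothetical per-class accuracies (%) after each stage; not from the paper.
stages = [
    {"happy": 95.0, "sad": 90.0},
    {"happy": 94.0, "sad": 91.0, "angry": 88.0},
    {"happy": 93.5, "sad": 89.0, "angry": 88.5, "fear": 80.0},
]

fr = [0.0]  # stage 1: nothing to forget yet
for prev, curr in zip(stages, stages[1:]):
    fr.append(forgetting_rate(prev, curr))
tkl = total_knowledge_loss(fr)

print("FR per stage:", [round(x, 4) for x in fr])
print("TKL:", round(tkl, 4))
```

In this example the stage-2 FR is 0.5 (a 1.0-point drop on one of two old classes), the stage-3 FR is 2.5/3, and TKL is their mean together with the zero from stage 1.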

Table 6 compares the proposed continual-learning strategy with naive fine-tuning and Learning without Forgetting (LwF) under the same class-incremental protocol. The proposed method achieves the strongest overall performance across all three datasets. On KDEF, the proposed network reaches 99.48% accuracy, compared with 96.10% for naive fine-tuning and 94.35% for LwF. On FER2013, it achieves 79.15% accuracy, exceeding naive fine-tuning (76.75%) and LwF (71.25%). On AffectNet, the proposed method attains 74.20% accuracy, compared with 69.85% for naive fine-tuning and 70.25% for LwF. These results indicate that rehearsal, weight transfer, and distillation jointly improve knowledge retention under sequential updates.

3.6. Qualitative Evaluation of Visual Results

To complement the quantitative results, Grad-CAM is used to visualize the regions that contribute most to the predicted emotion. The comparison is made against a backbone-only baseline that uses the same DenseNet121 feature extractor and classifier head but excludes CA and DA (referred to as the baseline). Figure 6 reports representative examples from controlled and in-the-wild datasets, where warmer colors indicate higher contribution to the predicted class. Incorrect predictions are highlighted in red. Across datasets, the proposed model produces more face-centered and semantically meaningful activations than the baseline. The most consistent evidence is observed around the periocular region (eyebrow and eyelid dynamics) and the perioral region (mouth opening and lip-corner movement), which align with facial action patterns commonly associated with expression changes. In Figure 6a (KDEF and FER2013), the baseline often exhibits diffuse activations or attends to less informative regions, which is consistent with confusions among visually similar categories. In contrast, the proposed model shows tighter localization under pose and illumination variation, indicating that DA helps reduce feature misalignment and preserves discriminative facial cues. In Figure 6b (AffectNet and FERPlus), the proposed model remains more stable under background clutter, age-related appearance variation, and non-frontal views. Several cases show corrected baseline confusions, particularly between weak negative expressions and neutral, or between anger and disgust. Remaining errors mainly occur for borderline expressions where muscle activation is subtle or overlapping across classes (for example, disgust versus anger, or neutral versus mild anger), even when the activation remains centered on the face. 
Overall, the visual evidence is consistent with the quantitative improvements and supports the contribution of CA and DA to better feature selectivity with reduced reliance on contextual cues.

3.7. Efficiency and Complexity Analysis

Representative backbone models and the proposed network were benchmarked under the same workstation setting described in Section 3.1. All efficiency measurements were obtained on KDEF using an input resolution of 224 × 224, FP32 inference, and a batch size of 64. Throughput and latency were measured during the inference stage only, excluding data loading and preprocessing. Latency is reported as the average inference time per image, computed from the batch latency (batch size = 64). Before benchmarking, several warm-up iterations were performed to stabilize GPU execution, and the reported latency corresponds to the average over multiple inference runs. In addition to CPU/GPU throughput, we report FLOPs, measured inference latency, parameter count, and recognition accuracy. The benchmarking hardware was an NVIDIA RTX 3080 GPU with 12 GB VRAM.
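The measurement procedure described above (untimed warm-up, averaging over repeated runs, and deriving per-image latency from batch latency) can be sketched as follows. The helper `benchmark`, the warm-up and run counts, and the stand-in workload `dummy_infer` are hypothetical; the actual benchmarking script is not given in the text.

```python
import time


def benchmark(infer, batch, batch_size, warmup=5, runs=20):
    """Return (per-image latency in ms, throughput in images/s) for `infer`."""
    for _ in range(warmup):  # warm-up iterations are excluded from timing
        infer(batch)
    start = time.perf_counter()
    for _ in range(runs):
        infer(batch)
    elapsed = time.perf_counter() - start
    batch_latency = elapsed / runs  # average seconds per batch
    per_image_ms = 1000.0 * batch_latency / batch_size
    throughput = batch_size / batch_latency
    return per_image_ms, throughput


# Stand-in workload; in practice this would be the model's forward pass.
def dummy_infer(batch):
    return [sum(x) for x in batch]


batch = [[0.1] * 1000 for _ in range(64)]
latency_ms, fps = benchmark(dummy_infer, batch, batch_size=64)
print(f"latency: {latency_ms:.4f} ms/image, throughput: {fps:.1f} images/s")
```

When timing on a GPU, kernel launches are asynchronous, so the device should be synchronized before each timer reading (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the measured latency can be underestimated.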

Table 7 shows that the proposed IAE-Net achieves the best recognition accuracy (99.48%) while maintaining moderate computational complexity. Specifically, IAE-Net requires 2.94 GFLOPs and 8.7 M parameters, which are substantially lower than InceptionV3 (5.72 GFLOPs, 27.7 M), ResNet50 (4.09 GFLOPs, 29.5 M), ConvNeXtTiny (4.49 GFLOPs, 28.8 M), and Xception (8.40 GFLOPs, 26.8 M), while also outperforming them in recognition accuracy. Although MobileNetV2 is lighter in terms of parameters (4.9 M) and FLOPs (0.57 G), its accuracy is markedly lower (88.95%). IAE-Net also maintains competitive inference speed, reaching 54–58 FPS on GPU and 11–14 FPS on CPU, with measured latency in the range of 17.2–18.5 ms under the stated benchmark setting. Overall, these results indicate that IAE-Net provides a strong accuracy–efficiency trade-off relative to both lightweight and heavier comparison models.

4. Conclusions

This work presented IAE-Net, a compact FER framework that combines a DenseNet121 backbone with cascaded CA and DA to improve robustness under geometric variability and to support sequential updates. Across four benchmarks (FER2013, FERPlus, KDEF, and AffectNet), IAE-Net achieved accuracies of 79.15%, 92.03%, 99.48%, and 74.20%, respectively, with balanced precision–recall behavior, indicating stable discrimination across both controlled and in-the-wild conditions. Ablation results show that CA and DA provide complementary benefits, and their joint use consistently yields the strongest performance, supporting the design choice of cascaded refinement rather than single-module enhancement. Beyond offline training, IAE-Net was evaluated under a class-incremental protocol using weight transfer, exemplar replay, and knowledge distillation. The model retained previously learned knowledge with controlled forgetting, reflected by TKL = 0.0871 on KDEF, 0.5975 on FER2013, and 0.5750 on AffectNet, and a gradual performance decline across stages rather than abrupt degradation. These findings suggest that IAE-Net can support sequential updates without full retraining under the evaluated experimental setting, which is promising for practical deployments where data distributions evolve. The efficiency analysis further shows that IAE-Net achieves 2.94 GFLOPs, 17.2–18.5 ms latency, and 8.7 M parameters while maintaining 99.48% accuracy on KDEF under the unified benchmark setting. Future work will extend evaluation to cross-dataset generalization and robustness testing under controlled occlusion and corruption protocols.

Author Contributions

H.A.K.: Conceptualization, Methodology, Software, Validation, Writing (original draft); J.-H.L.: Supervision, Project administration, Funding acquisition, Writing (review and editing). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant numbers: HI23C0942, RS-2024-00433896); a Korea Basic Science Institute (National Research Facilities and Equipment Center) grant funded by the Ministry of Education (grant no. 2020R1A6C101B189); the Digital Innovation Hub project supervised by the Daegu Digital Innovation Promotion Agency (DIP), funded by the Korea government (MSIT and Daegu Metropolitan City) in 2023 (25DIH-17); and the Regional Innovation System & Education (RISE) program through the Daegu RISE Center, funded by the Ministry of Education (MOE) and Daegu, Republic of Korea (2025-RISE-03-002).

Data Availability Statement

The data presented in this study were derived from publicly available third-party datasets, including FER2013, FERPlus, KDEF, and the publicly available Kaggle subset of AffectNet. These datasets are available from their respective sources as cited in the manuscript.

Acknowledgments

AI-assisted language polishing (grammar and formatting) was used during manuscript preparation. The authors reviewed and approved all final content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FER	Facial Emotion Recognition
DL	Deep Learning
ML	Machine Learning
CNN	Convolutional Neural Network
ND	Neurological and neurodegenerative disorder
IAE-Net	Incremental Attention-Enhanced Network
CA	Channel Attention
DA	Deformable Attention
GAP	Global Average Pooling
GMP	Global Max Pooling
SGD	Stochastic Gradient Descent

References

Mannepalli, K.; Sastry, P.N.; Suman, M. A novel adaptive fractional deep belief networks for speaker emotion recognition. Alex. Eng. J. 2017, 56, 485–497.
Nan, Y.; Ju, J.; Hua, Q.; Zhang, H.; Wang, B. A-MobileNet: An approach of facial expression recognition. Alex. Eng. J. 2022, 61, 4435–4444.
Jeong, M.; Ko, B.C. Driver’s facial expression recognition in real-time for safe driving. Sensors 2018, 18, 4270.
Shen, F.; Dai, G.; Lin, G.; Zhang, J.; Kong, W.; Zeng, H. EEG-based emotion recognition using 4D convolutional recurrent neural network. Cogn. Neurodyn. 2020, 14, 815–828.
Yun, S.S.; Choi, J.; Park, S.K.; Bong, G.Y.; Yoo, H. Social skills training for children with autism spectrum disorder using a robotic behavioral intervention system. Autism Res. 2017, 10, 1306–1323.
Kaulard, K.; Cunningham, D.W.; Bülthoff, H.H.; Wallraven, C. The MPI facial expression database—A validated database of emotional and conversational facial expressions. PLoS ONE 2012, 7, e32321. [Google Scholar]
Canal, F.Z.; Müller, T.R.; Matias, J.C.; Scotton, G.G.; de Sa Junior, A.R.; Pozzebon, E.; Sobieranski, A.C. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf. Sci. 2022, 582, 593–617.
Mellouk, W.; Handouzi, W. Facial emotion recognition using deep learning: Review and insights. Procedia Comput. Sci. 2020, 175, 689–694.
Afzal, S.; Khan, H.A.; Ali, S.; Lee, J.W. Virtual reality environment: Detecting and inducing emotions. In Proceedings of the 2025 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 11–14 January 2025; pp. 1–4.
Ricciardi, L.; Visco-Comandini, F.; Erro, R.; Morgante, F.; Bologna, M.; Fasano, A.; Ricciardi, D.; Edwards, M.J.; Kilner, J. Facial emotion recognition and expression in Parkinson’s disease: An emotional mirror mechanism? PLoS ONE 2017, 12, e0169110.
Lin, J.; Chen, Y.; Wen, H.; Yang, Z.; Zeng, J. Weakness of eye closure with central facial paralysis after unilateral hemispheric stroke predicts a worse outcome. J. Stroke Cerebrovasc. Dis. 2017, 26, 834–841.
Ferrari, C.; Berretti, S.; Pala, P.; Del Bimbo, A. Measuring 3D face deformations from RGB images of expression rehabilitation exercises. Virtual Real. Intell. Hardw. 2022, 4, 306–323.
Gomez, L.F.; Morales, A.; Orozco-Arroyave, J.R.; Daza, R.; Fierrez, J. Improving Parkinson Detection Using Dynamic Features From Evoked Expressions in Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 20–25 June 2021; pp. 1562–1570.
Rakshith, D.; Kenchannavar, H. Hybrid deep optimal network for recognizing emotions using facial expressions at real time. Int. J. Intell. Syst. Appl. 2024, 16, 47–58.
Tanaka, H.; Umeda, R.; Kurogi, T.; Nagata, Y.; Ishimaru, D.; Fukuhara, K.; Nakai, S.; Tenjin, M.; Nishikawa, T. Clinical utility of an assessment scale for engagement in activities for patients with moderate-to-severe dementia: Additional analysis. Psychogeriatrics 2022, 22, 433–444.
Bevilacqua, V.; D’Ambruoso, D.; Mandolino, G.; Suma, M. A new tool to support diagnosis of neurological disorders by means of facial expressions. In Proceedings of the IEEE International Symposium on Medical Measurements and Applications, Bari, Italy, 30–31 May 2011; pp. 544–549.
Dantcheva, A.; Bilinski, P.; Nguyen, H.T.; Broutart, J.C.; Bremond, F. Expression recognition for severely demented patients in music reminiscence-therapy. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 783–787.
Jin, B.; Qu, Y.; Zhang, L.; Gao, Z. Diagnosing Parkinson disease through facial expression recognition: Video analysis. J. Med. Internet Res. 2020, 22, e18697.
Kerr-Gaffney, J.; Mason, L.; Jones, E.; Hayward, H.; Ahmad, J.; Harrison, A.; Loth, E.; Murphy, D.; Tchanturia, K. Emotion recognition abilities in adults with anorexia nervosa are associated with autistic traits. J. Clin. Med. 2020, 9, 1057.
Carcagnì, P.; Del Coco, M.; Leo, M.; Distante, C. Facial expression recognition and histograms of oriented gradients: A comprehensive study. SpringerPlus 2015, 4, 645.
Soyel, H.; Demirel, H. Improved SIFT matching for pose robust facial expression recognition. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011.
Chen, L.; Zhou, C.; Shen, L. Facial expression recognition based on SVM in E-learning. IERI Procedia 2012, 2, 781–787.
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proc. AAAI Conf. Artif. Intell. 2017, 31, 4278–4284.
Tang, Y. Deep learning using linear support vector machines. arXiv 2013, arXiv:1306.0239.
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
Liang, D.; Liang, H.; Yu, Z.; Zhang, Y. Deep convolutional BiLSTM fusion network for facial expression recognition. Vis. Comput. 2020, 36, 499–508.
Pan, X.; Ying, G.; Chen, G.; Li, H.; Li, W. A deep spatial and temporal aggregation framework for video-based facial expression recognition. IEEE Access 2019, 7, 48807–48815.
Sun, N.; Li, Q.; Huan, R.; Liu, J.; Han, G. Deep spatial-temporal feature fusion for facial expression recognition in static images. Pattern Recognit. Lett. 2019, 119, 49–61.
Ye, Y.; Pan, Y.; Liang, Y.; Pan, J. A cascaded spatiotemporal attention network for dynamic facial expression recognition. Appl. Intell. 2023, 53, 5402–5415.
Huo, H.; Yu, Y.; Liu, Z. Facial expression recognition based on improved depthwise separable convolutional network. Multimed. Tools Appl. 2023, 82, 18635–18652.
Zhu, Q.; Mao, Q.; Jia, H.; Noi, O.E.N.; Tu, J. Convolutional relation network for facial expression recognition in the wild with few-shot learning. Expert Syst. Appl. 2022, 189, 116046.
Khan, T.; Choi, G.; Lee, S. EFFNet-CA: An efficient driver distraction detection based on multiscale features extractions and channel attention mechanism. Sensors 2023, 23, 3835.
Wadhawan, R.; Gandhi, T.K. Landmark-aware and part-based ensemble transfer learning network for static facial expression recognition from images. IEEE Trans. Artif. Intell. 2022, 4, 349–361.
Li, R.; Ren, C.; Zhang, X.; Hu, B. A novel ensemble learning method using multiple objective particle swarm optimization for subject-independent EEG-based emotion recognition. Comput. Methods Programs Biomed. 2022, 140, 105080.
Khan, T.; Yasir, M.; Choi, C. Attention-enhanced optimized deep ensemble network for effective facial emotion recognition. Alex. Eng. J. 2025, 119, 111–123.
Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic depression detection via sentimental annotation and multi-expert ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858.
Wang, Y.; Lin, Z.; Yang, C.; Zhou, Y.; Yang, Y. Automatic depression recognition with an ensemble of multimodal spatio-temporal routing features. IEEE Trans. Affect. Comput. 2025, 16, 1855–1872.
Afzal, S.; Khan, H.A.; Piran, M.J.; Lee, J.W. A comprehensive survey on affective computing: Challenges, trends, applications, and future directions. IEEE Access 2024, 12, 96150–96168.
Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; IEEE: Piscataway, NJ, USA, 2016.
Lasri, I.; Riadsolh, A.; Elbelkacemi, M. Facial emotion recognition of deaf and hard-of-hearing students for engagement detection using deep learning. Educ. Inf. Technol. 2023, 28, 4069–4092.
Solis-Arrazola, M.A.; Sanchez-Yañez, R.E.; Garcia-Capulin, C.H.; Rostro-Gonzalez, H. Enhancing image-based facial expression recognition through muscle activation-based facial feature extraction. Comput. Vis. Image Underst. 2024, 240, 103927.
Hussain, A.; Ullah, W.; Khan, N.; Khan, Z.A.; Yar, H.; Baik, S.W. Class-incremental learning network for real-time anomaly recognition in surveillance environments. Pattern Recognit. 2026, 170, 112064.
Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in Representation Learning: A report on three machine learning contests. arXiv 2013, arXiv:1307.0414.
Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan, 12–16 November 2016; pp. 279–283.
Calvo, M.G.; Lundqvist, D. Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behav. Res. Methods 2008, 40, 109–115.
Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31.
Kaggle. AffectNet Dataset (Kaggle Subset). 2024. Available online: https://www.kaggle.com/datasets/mstjebashazida/affectnet (accessed on 1 March 2026).
Akhand, M.A.H.; Roy, S.; Siddique, N.; Kamal, M.A.S.; Shimamura, T. Facial emotion recognition using transfer learning in the deep CNN. Electronics 2021, 10, 1036.
Avcı, S.O.; Akay, O. Employment and Investigation of Various CNN Models and Datasets for Facial Expression Recognition and Classification. In Proceedings of the 2023 14th International Conference on Electrical and Electronics Engineering (ELECO), Bursa, Turkiye, 30 November–2 December 2023; pp. 1–5.
Ming, Z.; Chazalon, J.; Luqman, M.M.; Visani, M.; Burie, J.C. FaceLiveNet: End-to-end networks combining face verification with interactive facial expression-based liveness detection. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3507–3512.
Hung, J.C.; Lin, K.C.; Lai, N.X. Recognizing learning emotion based on convolutional neural networks and transfer learning. Appl. Soft Comput. 2019, 84, 105724.
Sun, Z.; Zhang, H.; Bai, J.; Liu, M.; Hu, Z. A discriminatively deep fusion approach with improved conditional GAN (im-cGAN) for facial expression recognition. Pattern Recognit. 2023, 135, 109157.
Aghabeigi, F.; Nazari, S.; Osati Eraghi, N. An optimized facial emotion recognition architecture based on a deep convolutional neural network and genetic algorithm. Signal Image Video Process. 2024, 18, 1119–1129.
Vedantham, R.; Reddy, E.S. A robust feature extraction with optimized DBN-SMO for facial expression recognition. Multimed. Tools Appl. 2020, 79, 21487–21512.
Chen, X.; Li, D.; Tang, Y.; Huang, S.; Wu, Y.; Wu, Y. Pairwise dependency-based robust ensemble pruning for facial expression recognition. Multimed. Tools Appl. 2023, 83, 37089–37117.
Nida, N.; Yousaf, M.H.; Irtaza, A.; Javed, S.; Velastin, S.A. Spatial deep feature augmentation technique for FER using genetic algorithm. Neural Comput. Appl. 2024, 36, 4563–4581.
Li, S.; Deng, W. A deeper look at facial expression dataset bias. IEEE Trans. Affect. Comput. 2020, 13, 881–893.
Mao, S.; Li, X.; Wu, Q.; Peng, X. Au-aware vision transformers for biased facial expression recognition. arXiv 2022, arXiv:2211.06609.
Ma, F.; Sun, B.; Li, S. Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion. IEEE Trans. Affect. Comput. 2023, 14, 1236–1248.
Xu, X.; Zong, Y.; Lu, C.; Jiang, X. Enhanced sample self-revised network for cross-dataset facial expression recognition. Entropy 2022, 24, 1475.
Khan, N.; Singh, A.V.; Agrawal, R. Novel Approach of Facial Expression Recognition for Cross-Datasets. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–7.
Hu, P.; Tang, X.; Yang, L.; Kong, C.; Xia, D. LCANet: A model for analysis of students real-time sentiment by integrating attention mechanism and joint loss function. Complex Intell. Syst. 2025, 11, 27.
Ma, F.; Sun, B.; Li, S. Transformer-Augmented Network with Online Label Correction for Facial Expression Recognition. IEEE Trans. Affect. Comput. 2024, 15, 593–605.
Xu, J.; Li, Y.; Yang, G.; He, L.; Luo, K. Multiscale Facial Expression Recognition Based on Dynamic Global and Static Local Attention. IEEE Trans. Affect. Comput. 2025, 16, 683–696.
Oadud, M.A.; HaiYu, H.; Kayes, A. Facial Expression Recognition with Limited Labels: A Self-Supervised and Semi-Supervised Approach: Overcoming Label Scarcity in FER. In Proceedings of the 2025 10th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 16–18 May 2025; pp. 1171–1178.
Huang, Y.; Peng, J.; Zhang, W.; Zhao, T.; Chen, G.; Tan, S.; Yi, F.; Wang, L. FERMixNet: An Occlusion Robust Facial Expression Recognition Model with Facial Mixing Augmentation and Mid-Level Representation Learning. IEEE Trans. Affect. Comput. 2025, 16, 639–654.
Verma, B. In-the-wild facial emotion recognition using relation-aware geometric features and CapsNet. Comput. Electr. Eng. 2025, 128, 110685.
Singh, M.; Bhargava, K.; Natarajan, K. EmotionLens: Optimizing FER with a Lightweight Residual CNN Architecture. In Proceedings of the 2025 3rd International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 6–8 August 2025; pp. 1–6.
Khaireddin, Y.; Chen, Z. Facial emotion recognition: State of the art performance on FER2013. arXiv 2021, arXiv:2105.03588.

Figure 1. Workflow diagram of the proposed IAE-Net methodology.

Figure 2. High-level diagram of the proposed IAE-Net framework. The pipeline consists of dataset-specific preprocessing and augmentation, DenseNet121 feature extraction, cascaded CA and DA refinement, and a lightweight classifier.

Figure 3. CA module. GAP and GMP generate channel descriptors that are passed through a shared two-layer MLP. The resulting channel mask re-weights backbone feature maps to emphasize expression-relevant activations.

Figure 4. DA module. A lightweight convolution predicts a bounded offset field that is applied through differentiable warping to reduce feature misalignment caused by non-rigid facial motion and pose variation.
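
The warping step named in the Figure 4 caption can be sketched as follows for a single-channel map. This is an illustrative sketch under stated assumptions, not the module used in the paper: the offset-predicting convolution is omitted and the raw offsets are passed in directly, the bound is implemented with `tanh` scaled by a hypothetical `max_offset` parameter, and out-of-range sampling locations are clamped to the border. The caption itself only specifies a bounded offset field applied through differentiable warping.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample a 2-D map at a fractional location, clamping to the border."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def warp_with_bounded_offsets(fmap, raw_dy, raw_dx, max_offset=1.0):
    """Warp fmap with per-pixel offsets squashed into [-max_offset, max_offset]."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Assumption: tanh bounds the raw offsets predicted upstream.
            dy = max_offset * math.tanh(raw_dy[i][j])
            dx = max_offset * math.tanh(raw_dx[i][j])
            row.append(bilinear_sample(fmap, i + dy, j + dx))
        out.append(row)
    return out
```

With all raw offsets at zero the warp reduces to the identity, so a module of this kind can start from an unmodified feature map and learn only the displacements that help alignment.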

Figure 5. Performance visualization of the proposed network on each dataset.

Figure 6. Grad-CAM visualization on KDEF, FER2013, AffectNet, and FERPlus. (a) Representative examples from KDEF and FER2013. (b) Representative examples from AffectNet and FERPlus. In each representative set, the upper row shows input images and the lower row shows activation maps. Labels show ground truth and predictions (baseline vs. proposed); errors are highlighted in red.

Table 1. Dataset summary and split configuration used for evaluation.

Dataset	Images	Classes	Train	Val	Test	Input Size	Reference
FER2013	35,887	7	28,709	3589	3589	48 × 48	[45]
FERPlus	35,887	8	28,709	3589	3589	48 × 48	[46]
KDEF	4900	7	3920	490	490	224 × 224	[47]
AffectNet (Kaggle subset)	28,175	8	22,540	2817	2818	224 × 224	[48,49]

Table 4. Direct comparison of alternative spatial alignment modules on KDEF under the same DenseNet121 backbone and training protocol.

Backbone	Method	P (%)	R (%)	F1 (%)	Acc (%)	Params (M)
DenseNet121	Baseline	87.88	88.96	88.41	88.59	7.31
DenseNet121	+ STN	91.84	90.45	91.14	91.25	7.33
DenseNet121	+ DeformConv	93.80	91.41	92.59	92.50	10.07
DenseNet121	+ DA	97.21	95.07	96.13	96.35	8.44

Table 5. Class-incremental evaluation across incremental stages. P, R, F1, and Acc are reported in %. FR denotes the stage-level forgetting rate (average non-negative drop on previously learned classes), and TKL denotes the mean FR across the full incremental sequence for each dataset. “–” marks entries that are not applicable, and bold values indicate the summarized final retention measure for each dataset.

Incremental Step	KDEF P	KDEF R	KDEF F1	KDEF Acc	KDEF FR	FER2013 P	FER2013 R	FER2013 F1	FER2013 Acc	FER2013 FR	AffectNet P	AffectNet R	AffectNet F1	AffectNet Acc	AffectNet FR
Stage 1	99.80	99.70	99.75	99.75	0	85.20	84.40	84.80	84.80	0	86.10	84.80	85.45	85.40	0
Stage 2	99.74	99.65	99.69	99.70	0.05	83.90	83.10	83.50	83.60	0.40	84.20	83.10	83.65	83.60	0.32
Stage 3	99.67	99.59	99.63	99.63	0.11	82.70	82.00	82.30	82.40	0.65	82.90	81.70	82.30	82.20	0.58
Stage 4	99.62	99.54	99.58	99.58	0.09	81.80	81.00	81.40	81.50	0.58	81.30	80.50	80.90	80.80	0.47
Stage 5	99.57	99.49	99.53	99.53	0.16	80.90	80.10	80.50	80.60	0.92	79.80	79.00	79.40	79.30	0.93
Stage 6	99.54	99.46	99.50	99.48	0.20	80.20	79.40	79.80	79.90	0.88	78.60	77.70	78.15	78.10	0.88
Final evaluation	–	–	–	–	–	79.50	78.80	79.10	79.15	1.35	77.00	75.90	76.45	74.20	1.42
TKL					0.0871					0.5975					0.5750

Stage 1 is initialized with two classes, and each subsequent stage introduces one additional class during the incremental learning process. Rows Stage 1 to Stage 6 report the stage-wise incremental results, whereas the Final evaluation row reports performance after completion of the full incremental sequence for the corresponding dataset. TKL is computed from the stage-level FR values only. Because the class order differs across datasets, stage labels are reported generically.
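
The stage layout and the forgetting rate described for Table 5 can be sketched in a few lines. This is a hedged illustration rather than the evaluation code behind the table: the function names, the use of per-class accuracy as the quantity whose drop is measured, and the dictionary-based interface are assumptions, and the class names and accuracy values in the usage comment are invented placeholders, not results from the paper.

```python
def class_schedule(num_classes, initial=2):
    """Number of classes seen at each stage: two at Stage 1, one more per stage."""
    return list(range(initial, num_classes + 1))

def forgetting_rate(prev_acc, curr_acc):
    """Average non-negative drop on previously learned classes.

    prev_acc: class name -> accuracy (%) before the incremental update.
    curr_acc: class name -> accuracy (%) after the update (may hold new classes).
    Assumption: the drop is measured on per-class accuracy.
    """
    if not prev_acc:
        return 0.0
    # Improvements on old classes count as zero drop rather than negative forgetting.
    drops = [max(0.0, prev_acc[c] - curr_acc[c]) for c in prev_acc]
    return sum(drops) / len(drops)

# Hypothetical usage with placeholder numbers:
# forgetting_rate({"happy": 99.0, "sad": 98.0},
#                 {"happy": 98.5, "sad": 98.4, "fear": 97.0})
```

For a seven-class dataset, `class_schedule(7)` yields six entries, which matches the six stage rows of the table. Clipping each drop at zero keeps a class that improves after an update from masking forgetting on another class.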

Table 6. Comparison of incremental-learning strategies under the identical class-incremental protocol.

Dataset	Method	P (%)	R (%)	F1 (%)	Acc (%)
KDEF	Naive fine-tuning	97.15	95.51	96.32	96.10
KDEF	LwF	93.69	95.10	94.39	94.35
KDEF	Proposed network	99.54	99.46	99.50	99.48
FER2013	Naive fine-tuning	75.55	78.96	77.22	76.75
FER2013	LwF	70.67	69.72	70.19	71.25
FER2013	Proposed network	79.50	78.80	79.10	79.15
AffectNet	Naive fine-tuning	68.37	69.57	68.97	69.85
AffectNet	LwF	71.61	69.59	70.59	70.25
AffectNet	Proposed network	77.00	75.90	76.45	74.20

Table 7. Efficiency and performance comparison on KDEF under the same implementation and hardware (FP32, batch size 64).

Method	CPU FPS	GPU FPS	Batch Size	FLOPs (G)	Latency (ms)	Params (M)	Accuracy (%)
MobileNetV2	16–19	61–63	64	0.57	15.9–16.4	4.9	88.95
InceptionV3	5–6	48–51	64	5.72	19.6–20.8	27.7	97.28
ResNet50	6–8	36–49	64	4.09	20.4–27.8	29.5	89.15
ConvNeXtTiny	6–9	49–52	64	4.49	19.2–20.4	28.8	95.10
Xception	5–6	44–47	64	8.40	21.3–22.7	26.8	93.95
Proposed Network	11–14	54–58	64	2.94	17.2–18.5	8.7	99.48
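
The GPU FPS and latency columns of Table 7 appear to be related by a simple reciprocal: dividing 1000 by the per-image latency in milliseconds and rounding reproduces the reported GPU FPS range for every row. The short check below makes that relationship explicit. The helper name `fps_from_latency_ms` is illustrative, and the reading of the latency column as a per-image GPU latency is an inference from the numbers, not a statement from the paper.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms

# (method, latency range in ms, reported GPU FPS range), copied from Table 7.
rows = [
    ("MobileNetV2", (15.9, 16.4), (61, 63)),
    ("InceptionV3", (19.6, 20.8), (48, 51)),
    ("ResNet50", (20.4, 27.8), (36, 49)),
    ("ConvNeXtTiny", (19.2, 20.4), (49, 52)),
    ("Xception", (21.3, 22.7), (44, 47)),
    ("Proposed Network", (17.2, 18.5), (54, 58)),
]

derived = {}
for name, (lat_lo, lat_hi), reported in rows:
    # The highest latency gives the lowest FPS and vice versa.
    derived[name] = (round(fps_from_latency_ms(lat_hi)),
                     round(fps_from_latency_ms(lat_lo)))
    print(name, "derived:", derived[name], "reported:", reported)
```

Read this way, the two columns carry the same information, so the efficiency comparison rests mainly on latency, FLOPs, and parameter count.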

Share and Cite

MDPI and ACS Style

Khan, H.A.; Lee, J.-H. IAE-Net: Incremental Learning-Based Attention-Enhanced DenseNet for Robust Facial Emotion Recognition. Mathematics 2026, 14, 1023. https://doi.org/10.3390/math14061023
