Article

Joint Learning for Mask-Aware Facial Expression Recognition Based on Exposed Feature Analysis and Occlusion Feature Enhancement

School of Measurement & Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10433; https://doi.org/10.3390/app151910433
Submission received: 5 August 2025 / Revised: 10 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Facial expression recognition (FER), applied in fields such as interaction and intelligent security, has developed rapidly with the advancement of machine vision technology. However, in natural environments, faces are often obscured by masks, pose, and body parts, leading to incomplete features and poor accuracy from existing FER algorithms. Apart from extreme scenarios where facial features are completely blocked, the key information of facial expressions is largely preserved, yet insufficient parsing of these features leads to poor recognition results. To address this, we propose a novel joint learning framework that integrates explicit occlusion parsing and feature enhancement. Our model consists of three core modules: a Facial Occlusion Parsing Module (FOPM) for real-time occlusion estimation, an Expression Feature Fusion Module (EFFM) for integrating appearance and geometric features, and a Facial Expression Recognition Module (FERM) for final classification. Extensive experiments under a rigorous and reproducible protocol demonstrate that our approach achieves significant improvements. On masked versions of the RAF-DB and FER+ facial expression datasets, our model achieves accuracies of 91.24% and 90.18%, surpassing previous state-of-the-art methods by 2.62% and 0.96%, respectively. Additional evaluation on a real-world masked dataset with diverse mask types further confirms the robustness and generalizability of our method, which attains an accuracy of 89.75%. Moreover, the model maintains high computational efficiency with an inference time of 12.4 ms per image. By effectively parsing and integrating partially obscured facial features, our approach enables more accurate and robust expression recognition, which is essential for real-world applications in interaction and intelligent security systems.

1. Introduction

Enhancing the quality of interaction is a crucial goal, and within this pursuit, research in facial expression recognition (FER) plays a pivotal role. As a vital sub-field of computer vision, this discipline is dedicated to discerning human emotional states from facial expressions captured in images or videos. Traditional facial expression models mainly rely on hand-crafted features and machine learning techniques such as HOG [1] and Haar [2]. With the rapid development of deep learning, methods based on convolutional neural networks including CNN [3] and VGG [4] have significantly improved recognition performance.
Recent approaches have further advanced the field by incorporating structural and contextual facial information. For instance, the High-Definition Neural Network (HDNN) [5] excels at capturing relational features within facial images, integrating multi-scale networks to better represent emotional attributes. Similarly, vision transformer-based frameworks have been introduced to model long-range dependencies beneficial for FER. These methods achieve impressive accuracy under controlled conditions with fully visible faces. However, in real-world scenarios, facial expressions are often partially occluded due to obstacles such as masks, sunglasses, or head pose variations, which significantly degrades recognition performance. Figure 1 illustrates a comparison of facial expression recognition (FER) under ideal conditions versus common occlusion factors in natural environments (e.g., masks, poses, objects). Occlusions obscure key features, causing significant drops in recognition accuracy for traditional models—precisely the core challenge addressed by this research.
Several recent works have attempted to address occlusion-aware FER. The Face-mask-aware Face Parsing Model (FFPM) introduces a face parsing module combined with a transformer architecture to improve recognition under mask occlusion [6]. Li et al. proposed a Multi-Angle Feature Extraction (MAFE) method aimed at improving recognition accuracy under occlusion conditions by incorporating multi-scale global features, local fine-grained features, and salient region features [7]. Kim et al. employ a Spatial Transformer Network (STN) with an attention mechanism to leverage the specific facial regions that contribute most to particular facial expressions [8]. Devasena et al. proposed a novel Twinned-Attention Network (Twinned-Att) for efficient FER in occluded images [9]. Liang et al. proposed a Convolutional-Transformer Dual-Branch Network (CT-DBN) that leverages both local and global contextual information to achieve robust FER against real-world occlusion and head pose variations [10].
While these methods demonstrate progress, they exhibit three primary limitations that our work aims to address. First, methods like FFPM [6] and MVT [11] are primarily designed for and evaluated on standardized mask occlusions, lacking demonstrated generalization to other common, unpredictable occlusions (e.g., hands, hair, accessories). Second, approaches relying solely on implicit attention mechanisms (e.g., STN [8], Twinned-Att [9]) or multi-branch feature extraction (e.g., MAFE [7], CT-DBN [10]) do not explicitly model the occlusion structure itself. They lack a dedicated mechanism to estimate the occlusion mask and use it as a structural prior to guide feature learning and recovery. Third, current methods lack a unified framework that explicitly models occlusion as a first-class structural entity and leverages this explicit understanding to dynamically guide feature extraction and recovery across diverse, unpredictable occlusion patterns. Specifically, reconstruction-based methods become highly speculative when major facial regions are obscured; holistic methods do not adequately model partial occlusion; and sub-region/attention-based approaches rely on implicit, often unreliable, feature re-weighting without explicit structural understanding of the occlusion itself. This fundamental limitation—the inability to explicitly parse the occlusion configuration and use it as a structural prior—forces models to treat occluded and non-occluded regions indistinctly, leading to suboptimal feature utilization, poor generalization beyond trained occlusion types, and significant performance degradation in real-world scenarios where occlusion type, size, and location are highly variable.
In contrast to these works, our proposed framework introduces a novel Facial Occlusion Parsing Module (FOPM) that explicitly estimates a generic occlusion mask in real-time, providing a clear structural prior of the obstruction. This explicit parsing enables dynamic, adaptive guidance for subsequent feature extraction—focusing computation on salient visible regions while utilizing contextual reasoning to infer cues from occluded areas. Furthermore, unlike multi-branch models that process regions independently, our efficient Expression Feature Fusion Module (EFFM) integrates appearance features with geometric landmarks through a gated fusion mechanism, enabling robust and interpretable feature recombination under diverse occlusion patterns. This joint learning strategy of explicit occlusion parsing and guided feature enhancement differentiates our approach by offering a lightweight, plug-and-play solution that is not limited to pre-defined occlusion types and maintains high computational efficiency.
To bridge this gap, we propose a novel and highly practical facial expression recognition framework that emphasizes engineering applicability through its lightweight, plug-and-play occlusion parsing module and efficient feature fusion mechanism. Our approach introduces a Facial Occlusion Parsing Module (FOPM) that explicitly estimates the facial occlusion mask and guides the feature extraction process to focus on salient visible regions while recovering cues from occluded areas via contextual inference. Unlike multi-branch models that process different facial regions independently or attention-based methods that rely solely on implicit feature re-weighting, our FOPM provides an explicit structural prior of occlusion patterns, which is jointly optimized with the expression recognition task. This enables more accurate and interpretable guidance for feature extraction, especially under diverse and unpredictable occlusion scenarios. The FOPM module is computationally efficient and can be readily integrated into existing FER pipelines without significant architectural changes. The framework further incorporates local feature enhancement and global expression representation through a unified fusion mechanism, enabling efficient aggregation of both fine-grained details and high-level semantic information without the need for complex multi-branch coordination or heavy attention computation. Through careful design of the feature fusion components, our method maintains low computational overhead while significantly improving recognition accuracy under various occlusion conditions.
In summary, the main contributions of this work are as follows:
(1)
Our method proposes a lightweight Facial Occlusion Parsing Module (FOPM) that explicitly estimates occlusion patterns and integrates this structural prior to guide feature extraction. This plug-and-play component can be easily deployed in existing FER systems, significantly enhancing their robustness to partial occlusions with low computational cost.
(2)
Our study introduces an efficient end-to-end trainable framework that performs robust occlusion-aware FER by synergistically analyzing both exposed and occluded facial regions. The system employs innovative attention mechanisms that enable practical feature transfer and enhancement without compromising inference speed.
(3)
We conduct a thorough evaluation on publicly available datasets with synthetic and real-world mask occlusions, showcasing the superior performance of our proposed method in comparison to various recent methods.

2. Related Works

To better position our contribution within the existing literature and highlight the methodological evolution in this field, this section reviews related work through a structured categorical framework. We organize the discussion into four distinct methodological paradigms that represent the chronological and conceptual progression in occlusion-aware FER research.

2.1. Occlusion Problem in FER

Approaches to tackle occlusion in computer vision are generally categorized into three main types: reconstruction-based, holistic-based, and sub-region-based methods. The majority of FER research to date has concentrated on evaluating these approaches with non-occluded facial images [12,13,14,15,16]. However, there is a growing body of research addressing partial occlusion issues in both two-dimensional and three-dimensional FER [17,18,19].
  • Reconstruction-Based Paradigm: The earliest approaches to handling occlusion in FER followed a reconstruction-based paradigm, aiming to restore occluded facial regions before recognition. These methods typically employ generative techniques such as Generative Adversarial Networks (GANs) [20] or autoencoders to reconstruct complete facial images from partially occluded inputs. The underlying assumption is that once reconstructed, standard FER methods can be applied effectively. While conceptually straightforward, these methods face significant limitations: they become highly speculative when major facial regions (e.g., over 50% of the face with masks) are obscured, often introducing artifacts that may mislead subsequent recognition. Furthermore, the reconstruction quality heavily depends on the occlusion pattern and extent, making this approach unreliable for real-world applications with diverse occlusion types.
  • Holistic Representation Paradigm: The second paradigm shift moved toward holistic representation methods that leverage sparse signal processing and global feature representations inherently robust to partial occlusion [21,22]. These approaches posit that discriminative expression information is distributed across the face rather than concentrated in specific regions. By learning robust global representations through techniques like sparse coding or deep global descriptors, these methods avoid the need for explicit occlusion detection or reconstruction. However, while demonstrating resilience in occluded object classification generally, these methods often fail to capture the subtle local variations crucial for fine-grained expression discrimination, particularly when critical expression regions (e.g., mouth for happiness, eyes for surprise) are obscured.
  • Local Feature Emphasis Paradigm: Recognizing the limitations of global approaches, the field evolved toward local feature emphasis methods that explicitly handle occlusion through region-based processing. This paradigm encompasses several sub-categories: Patch-based Methods segment facial images into smaller patches (overlapping or non-overlapping) and employ attention mechanisms or weighting schemes to emphasize non-occluded regions [23]. The ACNN framework, for instance, combines patch-based and global-local attention to mitigate occlusion effects. Attention-based Methods use learned attention weights to dynamically focus on semantically important and visible regions. The RAN framework [24] introduced relation-attention modules to adaptively capture vital facial regions, while OADN [25] combined landmark-guided attention with regional features to identify non-occluded areas. Landmark-based Methods utilize facial landmarks to guide feature extraction from specific semantic regions, providing structural priors for handling occlusion. While representing significant advancement, these methods still fundamentally treat occlusion as a nuisance to be avoided rather than a structural element to be explicitly modeled.
  • Joint Learning Paradigm (Emerging): The most recent evolution in occlusion-aware FER moves toward joint learning frameworks that explicitly model occlusion as an integral part of the recognition process. Rather than merely avoiding or compensating for occlusion, these methods aim to understand the occlusion configuration and leverage this understanding to guide feature extraction and fusion [26]. This emerging paradigm includes: Mask-aware Specialized Methods that specifically address predictable occlusions like face masks [6,27], often using mask detection as an auxiliary task. Geometry-Appearance Fusion Methods that combine geometric facial structure information with appearance features for more robust recognition under occlusion [10]. Explicit Occlusion Modeling Approaches that directly estimate occlusion patterns and use this information to guide adaptive feature processing—the direction our work advances.

2.2. Mask-Aware FER as a Specialized Domain

Face-mask-aware FER represents a specialized and practically important sub-domain within occlusion-aware FER, where the occlusion is more predictable due to the consistent shape and size of face masks [11,27,28,29,30,31,32,33,34,35]. The methodological evolution in this sub-domain mirrors the broader progression described above, from early attempts at reconstruction [20] to recent joint learning approaches that explicitly model mask presence [6,11,27,34,35].
A study that considered face masks in FER focused solely on recognizing emotions from the eye region, evaluating this method on the mask-aware FER-2013 dataset, which encompasses seven distinct emotions [28]. However, this approach neglects other crucial facial areas, such as the forehead, when isolating the eye region using landmark detection methods, thereby reducing the accuracy of the FER system. We have observed that existing landmark detection tools, including Openface2.0 [29], which is utilized in current studies, are less accurate when dealing with facial images that include face masks, further diminishing the detection accuracy of the FER system [30,31,32]. Although some studies have attempted to address the robustness of landmark detection, there remain unresolved issues: there is still a significant gap in landmark detection accuracy between masked and unmasked faces, and the landmarks can only roughly differentiate between the covered and uncovered facial regions.
In prior research, Wang et al. [33] introduced a Self-Cure Network (SCN) to mitigate uncertainties in FER through a straightforward yet effective approach. SCN addresses uncertainty in two key ways: by employing a self-attention mechanism over the FER dataset to assign weights to each training sample based on a ranking regularization, and through a meticulous relabeling mechanism that adjusts the labels of the lowest-ranked samples. This framework effectively realized FER. Subsequently, Yang et al. [27] developed a two-stage, attention-based deep network designed to manage three emotions—positive, negative, and neutral—addressing face mask related challenges in FER. In the initial masked/unmasked binary classification phase, the attention mechanism was integrated into the classifier to generate attention heatmaps for masked areas and reverse attention heatmaps for unmasked regions. In the subsequent mask-aware FER classification stage, the attention mechanism directed the model to focus on the most critical facial areas for FER classification, prioritizing the unmasked region over the masked one. Ma et al. [34] proposed Convolutional Visual Transformers (CVT) to handle FER in uncontrolled environments through two primary steps. First, CVT introduced an attentional selective fusion (ASF) to leverage feature maps produced by dual-branch CNNs. ASF captured discriminative information by fusing multiple features with a combination of global and local attention. The fused feature maps were then flattened and projected into sequences of visual words. Second, drawing inspiration from the success of Transformers in natural language processing, CVT modeled relationships between these visual words using global self-attention. This local and global fused learning approach benefited mask-aware FER. Li et al. [11] were the first to propose a novel pure transformer-based mask vision transformer (MVT) for FER in the wild, comprising two modules: a transformer-based mask generation network (MGN) to create a mask that filters out complex backgrounds and occlusions in face images, and a dynamic relabeling module to correct erroneous labels in wild FER datasets. In 2023, Liu et al. [35] introduced a Patch Attention Convolutional Vision Transformer (PACVT) to address the occlusion FER problem. The backbone convolutional neural network extracted facial feature maps, which were divided into multiple regional patches to capture both local and global features. The PACVT framework achieved commendable performance in mask-aware FER.
Despite the superior performance of these approaches compared with other partial-occlusion FER methods, two main deficiencies remain to be resolved: the attention mechanism achieves only relatively high accuracy in distinguishing the face mask region from other image areas, and the re-weighting of the detected masked and unmasked regions is not adaptive. An additional limitation of the earlier two-stage work [27] is that it evaluated only three emotions.

3. Methods

This section provides a detailed description of our proposed joint learning framework for mask-aware FER. The overall architecture is first outlined, followed by in-depth explanations of its three core components and the joint loss function that enables end-to-end training.

3.1. Framework

As illustrated in Figure 2a, our facial expression recognition framework is designed to accurately interpret and analyze human emotions from facial expressions. It is composed of three integral, cascaded modules that work in concert: the Facial Occlusion Parsing Module (FOPM), the Expression Feature Fusion Module (EFFM), and the Facial Expression Recognition Module (FERM). Each plays a crucial role in the overall process of expression analysis.

3.2. Facial Occlusion Parsing Module

The facial occlusion parsing module includes a facial occlusion parsing component and a facial expression feature component under occlusion constraints.
This module is the first step of our framework and is responsible for detecting and parsing any occlusions that may be present in the facial region. Occlusions can be anything from masks to scarves or even hands that might be covering parts of the face, which can impede the accuracy of expression recognition. The facial occlusion parsing module uses advanced algorithms to identify these obstructions and either adjust the analysis to account for them or flag the input as potentially unreliable for further processing. This ensures that our method can handle real-world scenarios where faces might not be fully visible.
The architecture of the FOPM, depicted in Figure 2b, is a lightweight convolutional network designed for efficient feature pre-processing. The input I (with dimensions 3 × 224 × 224) is processed sequentially through three convolutional groups to output the refined feature representation P. The detailed configuration of each group is as follows:
Group 1: This group consists of two convolutional layers. The first layer has a kernel size of 3 × 3, a stride of 2, and padding of 1, converting the input to 64 channels. This is followed by a ReLU activation. The second layer has a kernel size of 3 × 3, a stride of 1, and padding of 1, maintaining 64 channels, and is also followed by a ReLU activation. The output feature map dimensions are 64 × 112 × 112.
Group 2: This group mirrors the structure of Group 1 but increases the channel dimension. It contains two convolutional layers with kernel sizes of 3 × 3, stride of 1, and padding of 1. The first layer in this group increases the channels from 64 to 128, and the second layer maintains 128 channels. Each layer is followed by a ReLU activation. A max-pooling layer (kernel size = 2 × 2, stride = 2) is applied after the activations, resulting in an output feature map size of 128 × 56 × 56.
Group 3: This final group is configured to further refine and prepare the features for subsequent modules. It contains three convolutional layers. The first two layers have kernel sizes of 3 × 3, stride of 1, and padding of 1, increasing channels from 128 to 256 and then maintaining 256 channels. The third layer uses a kernel size of 1 × 1 with a stride of 1 to project the features to the final output dimension of 512 channels. Each layer is followed by a ReLU activation. The output P of this group is a feature map with dimensions 512 × 56 × 56.
The operation of the FOPM can be summarized by the function:
$P = F_{OP}(I),$
where I denotes the input occluded face image, and $F_{OP}(\cdot)$ represents the function of the FOPM, which performs precise pre-processing of the facial features to output the refined feature representation P.
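The FOPM described above can be expressed compactly in PyTorch. The following is a minimal sketch based solely on the layer configuration given for Groups 1–3; the module names and grouping are our own illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FOPM(nn.Module):
    """Facial Occlusion Parsing Module sketch: 3 x 224 x 224 input -> 512 x 56 x 56 output P."""
    def __init__(self):
        super().__init__()
        # Group 1: 3x3 conv (stride 2) to 64 channels, then 3x3 conv (stride 1); output 64 x 112 x 112
        self.group1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # Group 2: two 3x3 convs (64 -> 128 -> 128) followed by 2x2 max-pooling; output 128 x 56 x 56
        self.group2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Group 3: two 3x3 convs (128 -> 256 -> 256) and a 1x1 projection to 512 channels; output 512 x 56 x 56
        self.group3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, kernel_size=1, stride=1), nn.ReLU(inplace=True),
        )

    def forward(self, i):                                  # i: B x 3 x 224 x 224
        return self.group3(self.group2(self.group1(i)))    # P: B x 512 x 56 x 56
```

Feeding a 3 × 224 × 224 image tensor through this stack yields the 512 × 56 × 56 feature map P consumed by the EFFM.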

3.3. Expression Feature Fusion Module

The expression feature fusion module comprises local feature expression and global feature expression. Local feature expression handles the situation where partial occlusion affects the face while key expression characteristics are still preserved, representing them through the locally extracted features. Global feature expression is achieved by parsing the facial features as a whole to form a comprehensive representation.
Once the facial occlusions have been addressed by the FOPM, the Expression Feature Fusion Module (EFFM) is tasked with integrating the two primary types of facial information: appearance features (texture and color information from the CNN-extracted feature maps) and geometric features (structural information from the predicted facial landmarks). The integration of these complementary information streams is achieved through a dedicated two-branch process within the EFFM, culminating in a gated fusion mechanism.
The EFFM is composed of two sub-networks: the Feature Decomposition Network (FDN) and the Feature Recomposition Network (FRN). The FDN operates on the appearance feature stream. It takes the global feature map P (512 × 56 × 56) from the FOPM and decomposes it into M = 4 distinct regional feature vectors. Each branch passes through a Patch-Gated Unit (PGUnit), which acts as an attention mechanism to generate region-specific weight masks, effectively highlighting salient non-occluded areas.
Concurrently, the geometric feature stream is processed. The facial landmark heat maps (e.g., 68 × 56 × 56), predicted by the U-Net, are average-pooled and projected into a 128-dimensional geometric prior vector. This vector encodes the spatial structure and configuration of the face.
The FRN module performs the explicit integration. Each appearance-based feature vector from the FDN is first modulated by its corresponding attention weights from the PGUnit. The modulated vectors are then processed through a two-layer residual structure. Crucially, the 128D geometric prior vector is concatenated with each of these processed appearance vectors. This combined vector, representing both the local appearance and the global geometric context, is then passed through a gating layer (Sigmoid activation) that learns to dynamically control the flow of information from each branch based on the current geometric configuration. The final output feature map is obtained via a weighted summation of all gated branch features, effectively fusing appearance and geometric information to form a robust and occlusion-invariant representation.
This design ensures that the model does not rely on appearance features alone. The geometric landmarks provide a structural prior that guides the attention mechanism (PGUnit) to focus on semantically important and visible regions (e.g., the eyes and eyebrows when the mouth is occluded), leading to a more informed and robust fusion of features.
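To make the fusion path concrete, the sketch below implements a simplified version of the gated appearance-geometry fusion described above. The quadrant-based decomposition standing in for the FDN, the single-linear-layer PGUnit, and the 68-to-128 projection of the pooled landmark heatmaps are all assumptions made for illustration; the released model may organize these steps differently.

```python
import torch
import torch.nn as nn

class EFFMFusion(nn.Module):
    """Simplified gated fusion of M = 4 regional appearance vectors with a 128-D geometric prior."""
    def __init__(self, app_dim=512, geo_dim=128, num_landmarks=68):
        super().__init__()
        # PGUnit stand-in: per-region scalar attention weight derived from the regional vector
        self.pg_units = nn.ModuleList(
            [nn.Sequential(nn.Linear(app_dim, 1), nn.Sigmoid()) for _ in range(4)])
        # Two-layer residual refinement of each attention-modulated regional vector
        self.refine = nn.Sequential(nn.Linear(app_dim, app_dim), nn.ReLU(inplace=True),
                                    nn.Linear(app_dim, app_dim))
        self.geo_proj = nn.Linear(num_landmarks, geo_dim)   # pooled heatmaps -> 128-D geometric prior
        # Gating layer conditioned on the concatenated appearance and geometric features
        self.gate = nn.Sequential(nn.Linear(app_dim + geo_dim, app_dim), nn.Sigmoid())

    def forward(self, P, landmark_heatmaps):
        # P: B x 512 x 56 x 56 from the FOPM; landmark_heatmaps: B x 68 x 56 x 56 from the U-Net
        B, C, H, W = P.shape
        quadrants = [P[:, :, :H // 2, :W // 2], P[:, :, :H // 2, W // 2:],
                     P[:, :, H // 2:, :W // 2], P[:, :, H // 2:, W // 2:]]
        regions = [q.mean(dim=(2, 3)) for q in quadrants]              # four B x 512 descriptors
        geo = self.geo_proj(landmark_heatmaps.mean(dim=(2, 3)))        # B x 128 geometric prior
        fused = 0.0
        for r, pg in zip(regions, self.pg_units):
            r = self.refine(r * pg(r)) + r                             # modulate, then residual refinement
            g = self.gate(torch.cat([r, geo], dim=1))                  # geometry-conditioned gate
            fused = fused + g * r                                      # weighted summation over branches
        return fused                                                   # B x 512 occlusion-robust feature
```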
As depicted in Figure 2a, our approach firstly predicts the facial masks M from the pre-processing outputs P. In Figure 2c, the image P is processed through a U-Net architecture to generate the facial mask M. This mask serves as the facial mesh, effectively enhancing the preservation of the deformable aspects during the motion process.
In order to achieve precise registration of facial information amidst motion and to effectively capture facial features within the constraints of facial masks like M1 and M2, we have developed two U-Net architectures. These are designed to extract the necessary characteristic features. By individually processing the face frames P1 and P2, our method is capable of extracting the matching kernel estimator and the associated offset estimator. This enables the accurate alignment of facial information across multiple frames. Additionally, both masks M1 and M2 are input into a separate U-Net to generate a mask estimator, which is instrumental in the expression of moving faces.

3.4. Facial Expression Recognition Module

The facial expression recognition module can achieve facial expression recognition by thoroughly parsing feature information and integrating the features.
The final component of our framework is the facial expression recognition module. This is where the actual classification of facial expressions takes place. It takes the enriched feature set produced by the feature fusion module and applies a series of classification algorithms to determine the most likely emotional state of the individual. This could range from basic emotions such as happiness, sadness, anger, surprise, fear, and disgust, to more complex and nuanced expressions. The module is designed to be adaptable and can be trained on diverse datasets to improve its ability to recognize expressions across different demographics and cultural contexts.
The Facial Expression Recognition Module (FERM) consists of one 3-layer convolutional encoder, three stacked ResNet blocks, followed by three additional CNN layers and a ReLU activation layer. The output is passed through a global average pooling (GAP) layer to produce a 512-dimensional feature vector. This vector is then fed into the final classification head, which is implemented as a sequential fully connected (FC) layer that maps the 512-dimensional input to a 7-dimensional output vector (corresponding to the 7 emotion classes), followed by a Softmax activation function to produce the final class probabilities.
This architecture balances depth and efficiency. The residual design ensures stability in training, and the deep feature representations allow the model to differentiate nuanced emotional cues even under occlusion.
As depicted in Figure 2d, the FERM maps the multiple aligned results {A} to the final multi-frame facial emotion recognition result O. In this module, the inputs A1, A2, and A3 are fed into the network DNet to produce the recognition output. The inference process can be expressed as:
$O = D_{Net}(\{A\}),$
where {A} = {A1, A2, A3} represents the set of multiple aligned feature maps from the previous module, $D_{Net}(\cdot)$ denotes the function of the recognition network DNet, and O is the final output of the facial emotion recognition result.
This network is organized into groups. As depicted in Figure 2d, the architecture begins with a CNN structure, followed by three residual blocks. Three additional CNN structures are then sequentially concatenated, ending with a ReLU module. The input features are combined with the learned residual features to produce the output features. In this way, the module derives the final emotion features from the aligned results.
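The following sketch assembles the FERM head as described: a three-layer convolutional encoder, three stacked residual blocks, three further convolutions with ReLU, global average pooling to a 512-dimensional vector, and a 7-way classifier. The channel widths and the use of torchvision's BasicBlock are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

def conv_relu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class FERM(nn.Module):
    """Facial Expression Recognition Module sketch with an illustrative 512-channel width."""
    def __init__(self, in_ch=512, num_classes=7):
        super().__init__()
        self.encoder = nn.Sequential(conv_relu(in_ch, 512), conv_relu(512, 512), conv_relu(512, 512))
        self.res_blocks = nn.Sequential(*[BasicBlock(512, 512) for _ in range(3)])  # three residual blocks
        self.tail = nn.Sequential(conv_relu(512, 512), conv_relu(512, 512), conv_relu(512, 512))
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling -> 512-D vector
        self.fc = nn.Linear(512, num_classes)         # maps to the 7 emotion classes

    def forward(self, a):                             # a: B x 512 x H x W aligned feature map
        x = self.tail(self.res_blocks(self.encoder(a)))
        logits = self.fc(self.gap(x).flatten(1))      # during training, feed these logits to the loss
        return torch.softmax(logits, dim=1)           # final class probabilities
```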

3.5. Loss Functions

The design of the loss function follows the network structure and comprises three parts: the loss function for the FOPM, the loss function for the EFFM, and the loss function for the FERM.
The overall loss function is a weighted sum of the losses from the three core modules: FOPM, EFFM, and FERM, enabling joint end-to-end training:
$L_t = L_p + \beta L_m + \gamma L_d,$
where Lp, Lm, and Ld are the loss functions for the FOPM, EFFM, and FERM, respectively. Specifically, Lp is a Mean Squared Error (MSE) loss between the predicted and ground-truth occlusion maps, Lm is an MSE loss for the landmark heatmap regression task within the EFFM, and Ld is the Categorical Cross-Entropy loss applied to the Softmax output of the FERM, calculating the divergence between the predicted emotion probability distribution and the ground-truth one-hot label.
The terms β and γ are balance hyper-parameters that harmonize the contribution of each module’s loss to the total training objective Lt. To determine their optimal values, we conducted a grid search over a range of [0.1, 0.3, 0.5, 0.7, 0.9] for both parameters on the validation set. The combination that yielded the best performance was β = 0.6 and γ = 0.4. This weighting scheme indicates that the occlusion parsing (Lp) and feature fusion (Lm) tasks, which provide crucial structural priors and enriched features, are assigned a higher relative importance during the initial phases of learning compared to the final classification loss (Ld). The average ratios of Lp, Lm, and Ld to the total loss Lt at the end of training were approximately 38%, 25%, and 37%, respectively, demonstrating that all three components contributed significantly to the overall optimization objective.
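Assembled in code, the joint objective amounts to a weighted sum of the three per-module losses. The sketch below uses the grid-searched weights β = 0.6 and γ = 0.4 reported above; the tensor names are chosen for illustration.

```python
import torch.nn.functional as F

def joint_loss(pred_occ, gt_occ, pred_heatmaps, gt_heatmaps, logits, labels,
               beta=0.6, gamma=0.4):
    """L_t = L_p + beta * L_m + gamma * L_d, with the weights found by grid search."""
    l_p = F.mse_loss(pred_occ, gt_occ)              # FOPM: occlusion-map regression (MSE)
    l_m = F.mse_loss(pred_heatmaps, gt_heatmaps)    # EFFM: landmark heatmap regression (MSE)
    # F.cross_entropy applies log-softmax internally, so it takes the pre-softmax logits
    l_d = F.cross_entropy(logits, labels)           # FERM: categorical cross-entropy
    return l_p + beta * l_m + gamma * l_d
```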

4. Experiments

To validate the effectiveness and practicality of the proposed model for Facial Expression Recognition (FER) under occlusion settings, we conducted systematic experiments on widely used masked facial expression datasets. This section details the experimental setup, datasets, evaluation metrics, ablation studies, comparisons with state-of-the-art methods, and provides visualizations to further interpret the model’s behavior.

4.1. Datasets

Experiments were conducted on the following two public datasets:
(1)
RAF-DB (Real-world Affective Faces Database) [36] comprises 29,672 facial images, each annotated with one of seven basic expressions: Surprise, Fear, Disgust, Happiness, Sadness, Anger, and Neutral. We strictly adhered to the official dataset split: 12,271 images for training and 3068 images for testing.
(2)
FER+ (Face Emotion Recognition Plus) [37], an enhanced version of FER2013, contains 28,315 training images and 3589 testing images, also labeled with seven expression categories.
To simulate real-world mask occlusions, synthetic mask overlays were generated using the Dlib toolkit [38] and applied only to the detected facial regions. The composite mask overlay is presented in Figure 3. A subset of 10% of the official training set was randomly selected and held out as a validation set for hyperparameter tuning and early stopping in both cases; this validation set was never used for the final evaluation reported in the results. Crucially, all synthetic masks were generated after the official train/test/validation split. Furthermore, we ensured that no identity (if present across different splits) shared the same synthetic mask pattern between the training and test sets, preventing a potential source of bias and data leakage.
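As a hedged sketch of the kind of Dlib-based mask synthesis described above, the snippet below uses the standard 68-point landmark model to locate the jawline and nose bridge and fills a mask-like polygon over the lower face. The exact mask geometry, texture, and blending used for the benchmarks are not specified here, so this is only an approximation; the predictor file path is an assumption about the local setup.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard 68-point model

def overlay_synthetic_mask(image_bgr, color=(210, 210, 210)):
    """Draw an opaque mask-like polygon over the lower face of each detected face."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):
        shape = predictor(gray, face)
        pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
        # Jawline points 0-16 plus nose-bridge point 29 enclose an approximate mask region
        polygon = np.vstack([pts[0:17], pts[29][None, :]]).astype(np.int32)
        cv2.fillPoly(image_bgr, [polygon], color)
    return image_bgr
```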
Additionally, to validate the model’s generalization capability under real-world occlusion conditions, we curated a RealMask-FER dataset. It consists of 5832 real-world masked facial images collected from public sources, encompassing a wide variety of mask types (medical, cloth, patterned) and wearing styles (e.g., below the nose, fully covered).

4.2. Implementation Details

4.2.1. Training Configuration

(1)
Framework & Hardware: Implemented in PyTorch 1.12 [39], running on Ubuntu 18.04 with 2 × NVIDIA GeForce RTX 3090 GPUs.
(2)
Optimizer: AdamW [40] was used with decay rates β1 = 0.9 and β2 = 0.99, and a weight decay of 1 × 10−4.
(3)
Learning Rate Schedule: The initial learning rate was set to 0.002, and was reduced by a factor of 0.2 every 6 epochs.
(4)
Training Strategy: Models were trained for a maximum of 300 epochs with a batch size of 16. Early stopping was employed with a patience of 20 epochs, monitoring the validation accuracy. The model with the best validation performance was restored for final evaluation.
(5)
Input Resolution: All input images were resized to 3 × 224 × 224 pixels.
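These settings map directly onto standard PyTorch components. The sketch below shows the corresponding optimizer, learning-rate schedule, and early-stopping loop; `model`, the data loaders, and the `train_one_epoch`/`evaluate` helpers are assumed to exist and are not part of the paper's released code.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=0.002,
                              betas=(0.9, 0.99), weight_decay=1e-4)
# Reduce the learning rate by a factor of 0.2 every 6 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.2)

best_acc, patience, bad_epochs = 0.0, 20, 0
for epoch in range(300):                               # at most 300 epochs; loaders use batch size 16
    train_one_epoch(model, train_loader, optimizer)    # assumed helper
    scheduler.step()
    val_acc = evaluate(model, val_loader)              # assumed helper, returns validation accuracy
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
        torch.save(model.state_dict(), "best.pt")      # keep the best validation checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                     # early stopping with patience 20
            break
```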

4.2.2. Model Complexity and Efficiency

The complete model has approximately 9.8 million parameters:
(1)
The Facial Occlusion Parsing Module (FOPM including the U-Net): ~3.5 M parameters.
(2)
The Expression Feature Fusion and Recognition Modules (EFFM + FERM): ~6.3 M parameters.
Inference performance was evaluated on a single NVIDIA RTX 3090 GPU with a batch size of 16 and an input resolution of 224 × 224. The average inference time per image is 12.4 ms, enabling a near real-time throughput of ~80 frames per second (FPS). The GPU memory consumption during inference is approximately 2.1 GB. For context, the recent PACVT [35] model reports an average inference time of 16.8 ms under similar conditions, demonstrating our model’s superior efficiency.
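A per-image latency figure of this kind can be measured with a short GPU benchmark such as the one below, assuming a `model` and batches of 16 RGB images at 224 × 224; the warm-up iterations and `torch.cuda.synchronize` calls avoid counting asynchronous kernel-launch overhead.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, batch_size=16, iters=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):                    # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size) * 1000.0   # milliseconds per image
```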

4.3. Evaluation Metrics

We adopt four widely used metrics for a comprehensive evaluation: Accuracy, Precision, Recall, and F1-Score.
(1)
Accuracy measures the overall correctness of predictions:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$
(2)
Precision assesses the correctness of positive predictions:
$\text{Precision} = \frac{TP}{TP + FP},$
(3)
Recall evaluates the ability to detect positive samples:
$\text{Recall} = \frac{TP}{TP + FN},$
(4)
F1-Score is the harmonic mean of Precision and Recall, providing a balanced measure especially under class imbalance:
$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$
where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.
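These four metrics can be computed directly from predicted and ground-truth labels with scikit-learn, as in the sketch below; macro averaging over the seven classes is an assumption here, since the averaging scheme is not stated explicitly.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)   # macro over the 7 classes (assumed)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```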

4.4. Ablation Study

To validate the contribution of each proposed module (FOPM, EFFM, FERM), we conducted ablation experiments on both the RAF-DB and FER+ datasets. The results are detailed in Table 1.
The following observations can be derived:
(1)
Efficacy of FOPM: The FOPM demonstrates the most significant individual impact, improving accuracy by 2.33% and 3.05% on RAF-DB and FER+, respectively, compared to the model without it. This underscores the critical importance of explicit occlusion parsing for recovering discriminative features under occlusion.
(2)
Role of EFFM and FERM: While the individual improvements from EFFM and FERM are smaller, their presence is essential. The cascaded integration of all three modules yields a synergistic effect, where the combined accuracy gain exceeds the sum of the individual improvements. This demonstrates how the modules work in concert: FOPM identifies usable regions, EFFM performs in-depth analysis of expressive details within these regions, and FERM strategically enhances features potentially weakened by occlusion.

4.5. Comparison with State-of-the-Art Methods

4.5.1. Implementation Details for Compared Methods

To ensure a verifiable, detailed, and fair comparison, we retrained the following representative baseline methods using a standardized and reproducible protocol—RAN [23], SCN [33], CVT [34], MVT [11], and PACVT [35]. Each model was implemented using its official publicly released code, strictly adhering to the architectural details and loss functions specified in the original publications. To adapt these methods to the mask-aware FER task, we replaced the original image inputs with our synthetically masked versions of the RAF-DB and FER+ datasets. All models were trained from scratch under identical hardware and software environments. Key training settings were unified: batch size of 16, AdamW optimizer (β1 = 0.9, β2 = 0.99), weight decay of 1 × 10−4. The initial learning rate was tuned individually for each model via grid search over {0.001, 0.002, 0.005} using the same validation split. Each method was trained for a maximum of 300 epochs with early stopping (patience = 20 epochs) based on validation accuracy. A fixed random seed (42) was set for all experiments to ensure reproducibility.
The same synthetic mask generator (Dlib [38]) was applied uniformly to all methods during both training and testing. Evaluation was performed on the identical masked test sets of RAF-DB and FER+, using the same metrics.
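For reference, the kind of seeding used to fix randomness across runs (seed 42, as stated above) can be set up as in the sketch below; the cuDNN determinism flags are an additional assumption beyond what the text specifies.

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # assumption: trade speed for determinism
    torch.backends.cudnn.benchmark = False
```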

4.5.2. Quantitative Results

The results under this standardized protocol are presented in Table 2 and Table 3. The following discussion can be derived from these results:
(1)
Baseline Performance: From Table 2, early influential methods like RAN and SCN achieved accuracies of 86.90% and 87.03%, respectively. These established a foundational performance level but highlighted the difficulty of occlusion handling. From Table 3, the same early methods, RAN and SCN, achieved accuracies of 87.85% and 88.01%, respectively, again establishing a foundational level while indicating room for improvement in handling occlusion.
(2)
Progressive Improvements: From Table 2, subsequent advancements, represented by CVT (88.14%) and the 2021 state-of-the-art MVT (88.62%), demonstrated significant progress. The improvement from roughly 87% to 88.6% underscores the research community's ongoing efforts to tackle occlusion complexity. PACVT further refined these approaches, achieving 88.21%. From Table 3, the subsequent methods CVT (88.81%) and the then state-of-the-art MVT (89.22%) demonstrated similar progress, pushing accuracy closer to 89%, while PACVT achieved 88.72%.
(3)
Our Model: Our proposed model achieves a remarkable accuracy of 91.24%, surpassing the previous best (MVT at 88.62%) by a substantial margin of 2.62 percentage points. This result is highlighted as the highest in Table 2. Our proposed model achieves a notable accuracy of 90.18%, surpassing the previous best (MVT at 89.22%) by 0.96 percentage points. This result is highlighted as the highest in Table 3.
(4)
Statistical Significance Analysis: To rigorously validate that the performance improvement of our model is statistically significant and not due to random chance, we conducted McNemar's test [41], a widely used non-parametric test for comparing the proportions of matched pairs in classification tasks. The null hypothesis posits that there is no difference in the error rates between our proposed method and the compared model (MVT). The tests were performed on the predictions of our model and the best previous model (MVT) on both the RAF-DB and FER+ datasets. The resulting p-values were extremely small (p < 0.001 for both datasets), allowing us to confidently reject the null hypothesis. This provides strong statistical evidence that the superior accuracy achieved by our proposed model is not due to chance.
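A paired test of this kind can be run as in the sketch below, assuming boolean arrays of per-sample correctness for the two models; statsmodels provides the test statistic and p-value.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_ours, correct_mvt):
    """correct_ours / correct_mvt: boolean arrays with one entry per test image."""
    correct_ours = np.asarray(correct_ours, dtype=bool)
    correct_mvt = np.asarray(correct_mvt, dtype=bool)
    # 2x2 contingency table of agreement/disagreement between the two models
    table = [[np.sum(correct_ours & correct_mvt),  np.sum(correct_ours & ~correct_mvt)],
             [np.sum(~correct_ours & correct_mvt), np.sum(~correct_ours & ~correct_mvt)]]
    result = mcnemar(table, exact=False, correction=True)  # chi-square variant with continuity correction
    return result.statistic, result.pvalue
```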

4.5.3. Generalization Evaluation on Real-World Occlusions

To rigorously validate the model’s generalization capability beyond controlled synthetic patterns and address its real-world applicability, we conducted a dedicated quantitative evaluation on the RealMask-FER dataset. This dataset, introduced in Section 4.1, presents a significantly more challenging and realistic scenario due to its diverse spectrum of mask types, wearing styles, and uncontrolled environmental conditions.
Results and Analysis: The quantitative comparison on this challenging real-world set is paramount and the results are presented in Table 4. The following discussion can be derived from these results:
(1)
Performance Gap Highlights Real-World Challenge: A consistent performance drop is observed across all models when compared to their results on synthetic masked benchmarks (Table 2 and Table 3). This drop is not a failure but a validation of the RealMask-FER dataset’s complexity and its success in capturing the challenging domain gap between idealized synthetic occlusions and messy real-world scenarios.
(2)
Superior Generalization of Our Model: Our model demonstrates remarkable robustness, maintaining a commanding lead with an accuracy of 89.75%. It outperforms the best baseline (MVT [11]) by a significant margin of 1.83%. This substantial gap underscores that our framework’s core innovation—explicit occlusion parsing and guided feature enhancement—is effective at handling the inherent complexity and variety of real-world occlusions, proving its generalization is superior to methods reliant on implicit attention or multi-branch features alone.
(3)
Effectiveness of Explicit Parsing: The fact that our model, trained solely on synthetically generated data, generalizes effectively to real and varied masks indicates that the Facial Occlusion Parsing Module (FOPM) learns a fundamental, generic representation of “occlusion” as a structural entity, rather than memorizing the texture or pattern of a synthetic mask. This explicit structural prior is crucial for adapting to unseen occlusion patterns.
Conclusion of the Evaluation: This evaluation moves beyond standard benchmarks and provides compelling evidence for the practical viability of our proposed method. The results substantiate that our model offers not only higher accuracy but also superior generalization and robustness, making it highly suitable for real-world applications where occlusion types are unpredictable and diverse. The performance on RealMask-FER is a direct validation of our model’s ability to fulfill its design purpose: performing robust facial expression recognition in natural, unconstrained environments.

4.6. Visualization and Discussion

4.6.1. Confusion Matrix Analysis

Figure 4 and Figure 5 show the confusion matrices of our model on the RAF-DB and FER+ datasets, respectively [42]. The following discussion can be derived from these results:
(1)
High Overall Accuracy: Strong diagonal values across all expressions indicate high per-class recognition accuracy.
(2)
Negative Valence Confusion: A consistent pattern of confusion exists among negative valence emotions (e.g., Sad, Fear, Angry, Disgust), as indicated by significant off-diagonal values between these classes. This is a common challenge in FER due to similar facial muscle activations.
(3)
Performance Extremes: The Happy expression consistently shows the strongest diagonal intensity (highest accuracy), confirming its robust recognition even with masks. Conversely, the Disgust expression shows the weakest diagonal intensity and the most dispersed confusion, often being misclassified as Anger or Sadness.
(4)
Specific Neutral vs. Sad Confusion: Figure 5 highlights a distinct, reciprocal confusion pattern between Neutral and Sad. This suggests that occlusion masks critically obscure key lower-face discriminators for Sadness (e.g., mouth corners drawn downwards), making it difficult to distinguish from a Neutral expression based primarily on upper-face features.

4.6.2. Intermediate Feature Visualization

Figure 6 provides a visual breakdown of our model’s processing pipeline for a partially occluded face. The following discussion can be derived from these results:
(1)
Targeted Attention Allocation: The attention map (Figure 6c) exhibits a clear intensity contrast. Significantly higher weights (visually darker) are concentrated on visible, expression-salient regions (e.g., eyes, forehead), while occluded regions (e.g., mouth covered by the mask) are effectively suppressed (visually lighter).
(2)
Salient Feature Fusion: The fused feature map (Figure 6d) visually reflects this modulation by the attention weights. Stronger activation patterns are predominantly over the unoccluded, high-attention regions identified in Figure 6c, while activations over the occluded areas are minimal. This confirms the model’s ability to dynamically focus on reliable visual cues and suppress noisy information from occluded areas.

4.7. Additional Analysis on Neutral vs. Sad Confusion

The observed confusion between Neutral and Sad expressions is a recognized challenge in occlusion-aware FER. To diagnose this, we conducted a post hoc analysis by visualizing the attention weights for misclassified samples. We found that:
(1)
When a Sad expression was misclassified as Neutral, the model’s attention was often incorrectly focused on parts of the upper face that appeared neutral rather than on the subtle cues (e.g., slight eyebrow lowering, eyelid tension) that might persist even with masks.
(2)
When a Neutral expression was misclassified as Sad, the model sometimes over-interpreted mild upper-face features or the overall context as indicative of negative valence.
This specific confusion highlights a limitation in parsing very subtle geometric and appearance features in the upper face under heavy occlusion. To mitigate this in future work, we propose:
(1)
Incorporating more granular, attention-guided landmark detection specifically around the eyes and eyebrows.
(2)
Employing contrastive learning to explicitly minimize the intra-class variation in Neutral and Sad while maximizing the inter-class separation in the feature space.

4.8. Conclusion of Experiments

The experimental results comprehensively validate the efficacy of the proposed joint learning framework. The model achieves state-of-the-art performance on standard benchmarks (RAF-DB, FER+) under synthetic occlusion and, more importantly, demonstrates strong generalization capability on a real-world masked dataset (RealMask-FER). The ablation studies confirm the critical contribution of each module, and the visualizations provide interpretable evidence of the model’s ability to focus on non-occluded, salient regions. The analysis also pinpoints specific challenges (e.g., Neutral-Sad confusion) that guide future research directions. The model’s high computational efficiency makes it suitable for practical, real-world applications.

5. Conclusions

Our research tackles the challenge of facial expression recognition in natural settings, where occlusions such as masks, posture variations, and other obstructions often lead to incomplete facial information and reduced recognition accuracy. We argue that essential expressive cues remain even under partial occlusion, yet prevailing methods struggle to parse these features effectively. To address this, we propose a novel model that comprehensively extracts and integrates facial expression features alongside occlusion context, significantly enhancing recognition performance. Experimental results validate the efficacy of our approach and demonstrate superior performance compared to existing methods.
Our model demonstrates superior performance, achieving state-of-the-art accuracy on the widely used RAF-DB (91.24%) and FER+ (90.18%) benchmarks with synthetic masks. Critically, to address concerns regarding real-world applicability, an additional quantitative evaluation was conducted on the RealMask-FER dataset, which contains a diverse variety of real-world mask types. Our model attained a leading accuracy of 89.75% on this challenging set, outperforming other methods by a significant margin and substantiating its strong generalization capability beyond controlled synthetic patterns. All comparisons were conducted under a verifiable and detailed protocol to ensure fairness and reproducibility.
Limitations and Future Work: Despite the demonstrated robustness, this study has limitations that point to valuable future research directions. While the evaluation on RealMask-FER confirms performance on real masks, the current framework is primarily optimized for mask-like occlusions. Its performance on other common obstructions (e.g., hands, hair, accessories, or extreme head poses) warrants further investigation. Furthermore, the model’s efficacy across an even broader spectrum of environmental conditions, cultural contexts, and spontaneous occlusions remains to be fully explored.
Building on this work, we plan to extend the framework to explicitly handle a more comprehensive range of partial occlusions. This will involve developing more flexible occlusion parsing modules that can dynamically adapt to various obstruction types without retraining. Additionally, rather than focusing solely on generalization from synthetic data, a crucial future direction involves curating and benchmarking on larger-scale, publicly available real-world occluded FER datasets to drive progress in the field. Finally, we will pursue cross-cultural validation and investigate the influence of factors such as lighting and camera angles to foster more adaptable and inclusive models.

Author Contributions

Conceptualization, H.H. and X.S.; methodology, H.H. and X.S.; formal analysis, H.H.; investigation, H.H.; writing—original draft preparation, H.H.; writing—review and editing, X.S.; supervision, X.S.; project administration, X.S.; funding acquisition, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The RAF-DB dataset is publicly available at http://www.whdeng.cn/RAF/model1.html (accessed on 2 October 2024), and the FER+ dataset can be accessed at https://github.com/Microsoft/FERPlus (accessed on 2 October 2024). To ensure the reproducibility of our results, the complete source code, model configurations, and pre-trained weights for our proposed framework will be made publicly available upon publication at: https://github.com/houhuanyu/FOPM (accessed on 2 October 2024). The repository includes detailed instructions for data preprocessing, model training, and evaluation, along with scripts to reproduce all experiments reported in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Amiri, Z.; Hassanpour, H.; Beghdadi, A. Combining deep features and hand-crafted features for abnormality detection in WCE images. Multimed. Tools Appl. 2024, 83, 5837–5870. [Google Scholar] [CrossRef]
  2. Khanbebin, S.N.; Mehrdad, V. Improved convolutional neural network-based approach using hand-crafted features for facial expression recognition. Multimed. Tools Appl. 2024, 82, 11489–11505. [Google Scholar] [CrossRef]
  3. Hu, M.; Wang, H.; Wang, X.; Yang, J.; Wang, R. Video facial emotion recognition based on local enhanced motion history image and CNN-CTS LSTM networks. J. Vis. Commun. Image Represent. 2019, 59, 176–185. [Google Scholar] [CrossRef]
  4. Duncan, D.; Shine, G.; English, C. Facial emotion recognition in real time. Comput. Sci. 2016, 10, 1–7. [Google Scholar]
  5. Jain, N.; Kumar, S.; Kumar, A.; Shamsolmoali, P.; Zareapoor, M. Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 2018, 115, 101–106. [Google Scholar] [CrossRef]
  6. Yang, B.; Wu, J.; Ikeda, K.; Hattori, G.; Sugano, M.; Iwasawa, Y.; Matsuo, Y. Face-mask-aware facial expression recognition based on face parsing and vision transformer. Pattern Recognit. Lett. 2022, 164, 173–182. [Google Scholar] [CrossRef]
  7. Li, Y.; Liu, H.; Liang, J.; Jiang, D. Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Appl. Sci. 2025, 15, 5139. [Google Scholar] [CrossRef]
  8. Kim, J.; Lee, D. Facial Expression Recognition Robust to Occlusion and to Intra-Similarity Problem Using Relevant Subsampling. Sensors 2023, 23, 2619. [Google Scholar] [CrossRef]
  9. Devasena, G.; Vidhya, V. Twinned attention network for occlusion-aware facial expression recognition. Mach. Vis. Appl. 2025, 36, 23. [Google Scholar]
  10. Liang, X.; Xu, L.; Zhang, W.; Zhang, Y.; Liu, J.; Liu, Z. A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition. Vis. Comput. 2023, 39, 2277–2290. [Google Scholar] [CrossRef]
  11. Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. MVT: Mask vision transformer for facial expression recognition in the wild. arXiv 2021, arXiv:2106.04520. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Li, Z.; Shen, D.; Wang, K.; Li, J.; Xia, C. Information gap based knowledge distillation for occluded facial expression recognition. Image Vis. Comput. 2015, 154, 105365. [Google Scholar] [CrossRef]
  13. Wang, H.-T.; Lyu, J.-L.; Chien, S.H.-L. Dynamic Emotion Recognition and Expression Imitation in Neurotypical Adults and Their Associations with Autistic Traits. Sensors 2024, 24, 8133. [Google Scholar] [CrossRef] [PubMed]
  14. Souza, J.M.S.; Alves, C.d.S.M.; Cerqueira, J.d.J.F.; Oliveira, W.L.A.d.; Pires, O.M.; Santos, N.S.B.d.; Wyzykowski, A.B.V.; Pinheiro, O.R.; Almeida Filho, D.G.d.; da Silva, M.O.; et al. Facial Biosignals Time–Series Dataset (FBioT): A Visual–Temporal Facial Expression Recognition (VT-FER) Approach. Electronics 2024, 13, 4867. [Google Scholar] [CrossRef]
  15. Qi, Y.; Zhuang, L.; Chen, H.; Han, X.; Liang, A. Evaluation of Students’ Learning Engagement in Online Classes Based on Multimodal Vision Perspective. Electronics 2024, 13, 149. [Google Scholar]
  16. Zhi, R.; Flierl, M.; Ruan, Q.; Kleijn, W.B. Graph-Preserving Sparse Nonnegative Matrix Factorization with Application to Facial Expression Recognition. IEEE Trans. Syst. Man Cybern. Part B 2010, 41, 38–52. [Google Scholar]
  17. Shen, L.; Jin, X. VaBTFER: An Effective Variant Binary Transformer for Facial Expression Recognition. Sensors 2024, 24, 147. [Google Scholar] [CrossRef]
  18. Xia, B.; Wang, S. Occluded facial expression recognition with stepwise assistance from unpaired non-occluded images. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2927–2935. [Google Scholar]
  19. Liu, S.; Agaian, S.; Grigoryan, A. PortraitEmotion3D: A Novel Dataset and 3D Emotion Estimation Method for Artistic Portraiture Analysis. Appl. Sci. 2024, 14, 11235. [Google Scholar] [CrossRef]
  20. Lu, Y.; Wang, S.; Zhao, W.; Zhao, Y. WGAN-based robust occluded facial expression recognition. IEEE Access 2019, 7, 93594–93610. [Google Scholar] [CrossRef]
  21. Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Ma, Y. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 210–227. [Google Scholar] [CrossRef]
  22. Selma, T.; Masud, M.M.; Bentaleb, A.; Harous, S. Inference Analysis of Video Quality of Experience in Relation with Face Emotion, Video Advertisement, and ITU-T P.1203. Technologies 2024, 12, 62. [Google Scholar] [CrossRef]
  23. Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 2019, 28, 2439–2450. [Google Scholar] [CrossRef]
  24. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef]
  25. Ding, H.; Zhou, P.; Chellappa, R. Occlusion-Adaptive Deep Network for Robust Facial Expression Recognition. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–9. [Google Scholar]
  26. Zhu, M.; Shi, D.; Zheng, M.; Sadiq, M. Robust facial landmark detection via occlusion-adaptive deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 3486–3496. [Google Scholar]
  27. Yang, B.; Jianming, W.; Hattori, G. Face mask aware robust facial expression recognition during the COVID-19 pandemic. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 240–244. [Google Scholar]
  28. Castellano, G.; De Carolis, B.; Macchiarulo, N. Automatic emotion recognition from facial expressions when wearing a mask. In Proceedings of the 14th Biannual Conference of the Italian SIGCHI Chapter, Bolzano, Italy, 11–13 July 2021; ACM: New York, NY, USA, 2021; pp. 1–5. [Google Scholar]
  29. Araluce, J.; Bergasa, L.M.; Gómez-Huélamo, C.; Barea, R.; López-Guillén, E.; Arango, F.; Pérez-Gil, Ó. Integrating OpenFace 2.0 toolkit for driver attention estimation in challenging accidental scenarios. In Proceedings of the Advances in Physical Agents II: Proceedings of the 21st International Workshop of Physical Agents (WAF 2020), Alcalá de Henares, Spain, 19–20 November 2020; Springer: Cham, Switzerland, 2021; pp. 274–288. [Google Scholar]
  30. Shaikh, M.A.; Al-Rawashdeh, H.S.; Sait, A.R.W. Deep Learning-Powered Down Syndrome Detection Using Facial Images. Life 2025, 15, 1361. [Google Scholar] [CrossRef]
  31. Arabian, H.; Abdulbaki Alshirbaji, T.; Chase, J.G.; Moeller, K. Emotion Recognition beyond Pixels: Leveraging Facial Point Landmark Meshes. Appl. Sci. 2024, 14, 3358. [Google Scholar] [CrossRef]
  32. Zhu, C.; Li, X.; Li, J.; Dai, S. Improving robustness of facial landmark detection by defending against adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; IEEE Computer Society: Washington, DC, USA, 2021; pp. 11751–11760. [Google Scholar]
  33. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 6897–6906. [Google Scholar]
  34. Ma, F.; Sun, B.; Li, S. Robust facial expression recognition with convolutional visual transformers. arXiv 2021, arXiv:2103.16854. [Google Scholar]
  35. Liu, C.; Hirota, K.; Dai, Y. Patch attention convolutional vision transformer for facial expression recognition with occlusion. Inf. Sci. 2023, 619, 781–794. [Google Scholar]
  36. Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 2852–2861. [Google Scholar]
  37. Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z. Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 279–283. [Google Scholar]
  38. King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
  39. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  40. Llugsi, R.; El Yacoubi, S.; Fontaine, A.; Lupera, P. Comparison between Adam, Adamax and AdamW optimizers to implement a weather forecast based on neural networks for the Andean city of Quito. In Proceedings of the 2021 IEEE Fifth Ecuador Technical Chapters Meeting (ETCM), Cuenca, Ecuador, 12–15 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  41. McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947, 12, 153–157. [Google Scholar] [CrossRef] [PubMed]
  42. Susmaga, R. Confusion matrix visualization. In Proceedings of the Intelligent Information Processing and Web Mining: Proceedings of the International IIS: IIPWM ‘04 Conference, Zakopane, Poland, 17–20 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 107–116. [Google Scholar]
Figure 1. The challenge of Facial Expression Recognition (FER) under occlusion.
Figure 2. (a) Our network architecture is composed of three integral modules: the Facial Occlusion Parsing Module (FOPM), the Expression Feature Fusion Module (EFFM), and the Facial Expression Recognition Module (FERM). (b) The occluded face, denoted as {I}, is first processed by the FOPM to yield the face occlusion map and the facial landmarks. (c) These intermediate results, the face occlusion map and the facial landmarks, are then passed to the EFFM, which extracts exposed and occluded facial parsing features. (d) Finally, the facial parsing features are fed into the FERM, which produces the final output, denoted as {O}. This design enables facial expression recognition of occluded faces.
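For clarity, the data flow in Figure 2 can be summarized in code. The following is a minimal PyTorch sketch under assumed module internals (simple convolutional heads, 68 landmarks, 128-dimensional fused features); it illustrates only the FOPM → EFFM → FERM wiring and is not the exact implementation.

```python
import torch
import torch.nn as nn

class FOPM(nn.Module):
    """Facial Occlusion Parsing Module: predicts an occlusion map and landmarks (illustrative)."""
    def __init__(self, n_landmarks=68):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.occ_head = nn.Conv2d(64, 1, 1)  # per-pixel occlusion probability
        self.lmk_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, n_landmarks * 2))  # (x, y) per landmark
    def forward(self, img):
        f = self.backbone(img)
        return torch.sigmoid(self.occ_head(f)), self.lmk_head(f)

class EFFM(nn.Module):
    """Expression Feature Fusion Module: fuses appearance with occlusion/landmark cues (illustrative)."""
    def __init__(self, n_landmarks=68, dim=128):
        super().__init__()
        self.app = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.geo = nn.Linear(n_landmarks * 2, dim)
    def forward(self, img, occ_map, landmarks):
        x = torch.cat([img, occ_map], dim=1)      # appearance conditioned on the occlusion map
        return self.app(x) + self.geo(landmarks)  # fused exposed/occluded representation

class FERM(nn.Module):
    """Facial Expression Recognition Module: classifies the fused feature (illustrative)."""
    def __init__(self, dim=128, n_classes=7):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)
    def forward(self, feat):
        return self.cls(feat)

# Joint forward pass {I} -> {O}, mirroring Figure 2
fopm, effm, ferm = FOPM(), EFFM(), FERM()
I = torch.randn(1, 3, 224, 224)            # occluded face image
occ_map, landmarks = fopm(I)
O = ferm(effm(I, occ_map, landmarks))      # 7-way expression logits
```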
Figure 3. The dataset covers the seven basic facial expressions, including “Surprised” and “Sad”. Using this dataset, we generated masked faces for FER.
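Masked faces such as those in Figure 3 can be synthesized by covering the lower face guided by facial landmarks. The sketch below assumes dlib's 68-point predictor (cf. [38]) and uses a solid polygon over the jaw contour as a simple mask proxy; the dataset's actual generation procedure may differ.

```python
import cv2
import dlib
import numpy as np

# dlib's standard 68-point predictor file (the path is an assumption)
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def add_synthetic_mask(image_bgr, color=(255, 255, 255)):
    """Cover the lower face (jaw contour + nose bridge) with a filled polygon as a mask proxy."""
    faces = detector(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY))
    out = image_bgr.copy()
    for rect in faces:
        shape = predictor(image_bgr, rect)
        pts = [(shape.part(i).x, shape.part(i).y) for i in range(17)]  # jaw contour: points 0-16
        pts.append((shape.part(28).x, shape.part(28).y))               # nose-bridge point closes the polygon
        cv2.fillPoly(out, [np.array(pts, dtype=np.int32)], color)
    return out

masked = add_synthetic_mask(cv2.imread("face.jpg"))
cv2.imwrite("face_masked.jpg", masked)
```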
Figure 4. Confusion matrix of our model for the different emotions on the RAF-DB dataset.
Figure 5. Confusion matrix of our model for the different emotions in mask-aware FER on the FER+ dataset.
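Confusion matrices such as those in Figures 4 and 5 can be reproduced from per-image predictions with standard tooling. The sketch below assumes scikit-learn, a seven-class label order, and row normalization; the label order and the placeholder prediction arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

labels = ["Surprised", "Fear", "Disgust", "Happy", "Sad", "Angry", "Neutral"]  # assumed order

# y_true / y_pred would come from evaluating the model on the RAF-DB or FER+ test split
y_true = np.random.randint(0, 7, size=200)   # placeholder ground-truth indices
y_pred = np.random.randint(0, 7, size=200)   # placeholder predicted indices

cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized proportions (assumption)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(values_format=".2f", xticks_rotation=45)
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=200)
```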
Figure 6. Visualization of intermediate results: (a) input image, (b) occlusion map, (c) attention weights, (d) fused feature.
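Intermediate outputs such as the occlusion map and attention weights in Figure 6 can be inspected by upsampling them to the input resolution and blending them over the image. The sketch below is a generic heatmap overlay; the colormap and the 14 × 14 attention grid are assumptions.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def overlay_heatmap(image_bgr, weights, alpha=0.5):
    """Blend a low-resolution attention/occlusion map over the input image for visual inspection."""
    h, w = image_bgr.shape[:2]
    # Upsample the 2-D map to image resolution and rescale to [0, 1]
    m = F.interpolate(weights[None, None], size=(h, w), mode="bilinear", align_corners=False)[0, 0]
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    heat = cv2.applyColorMap((m.detach().cpu().numpy() * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(heat, alpha, image_bgr, 1 - alpha, 0)

img = cv2.imread("face.jpg")
attn = torch.rand(14, 14)                    # e.g., spatial attention weights from the model
cv2.imwrite("attention_overlay.jpg", overlay_heatmap(img, attn))
```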
Table 1. The performance (AP, average precision) of the FOPM, EFFM, and FERM modules on the RAF-DB and FER+ datasets (“✓” = module included, “×” = module removed).
Methods | FOPM | EFFM | FERM | RAF-DB | FER+
w/o FOPM | × | ✓ | ✓ | 88.91% | 87.13%
w/o EFFM | ✓ | × | ✓ | 90.37% | 89.37%
w/o FERM | ✓ | ✓ | × | 89.56% | 88.75%
Ours | ✓ | ✓ | ✓ | 91.24% | 90.18%
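The reported figures can be computed from per-image predictions as sketched below, which derives overall accuracy together with macro-averaged precision, recall, and F1; the macro averaging scheme and the placeholder arrays are assumptions, not the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred are per-image labels from a full pass over the masked test split
y_true = np.random.randint(0, 7, size=500)   # placeholder ground-truth indices
y_pred = np.random.randint(0, 7, size=500)   # placeholder predicted indices

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
    "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
}
print({k: f"{v:.2%}" for k, v in metrics.items()})
```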
Table 2. Performance comparison on the RAF-DB dataset.
Methods | Year | AP | Precision | Recall | F1-Score | Training Protocol
RAN | 2018 | 86.90% | 86.45% | 86.20% | 86.32% | Retrained on our masked data
SCN | 2020 | 87.03% | 86.88% | 86.70% | 86.79% | Retrained on our masked data
CVT | 2020 | 88.14% | 87.92% | 87.65% | 87.78% | Retrained on our masked data
MVT | 2021 | 88.62% | 88.40% | 88.15% | 88.27% | Retrained on our masked data
PACVT | 2023 | 88.21% | 87.95% | 87.73% | 87.84% | Retrained on our masked data
Ours | 2025 | 91.24% | 91.05% | 90.88% | 90.96% | Trained on our masked data
Table 3. Performance comparison on the FER+ dataset.
Methods | Year | AP | Precision | Recall | F1-Score | Training Protocol
RAN | 2018 | 87.85% | 87.50% | 87.20% | 87.35% | Retrained on our masked data
SCN | 2020 | 88.01% | 87.75% | 87.52% | 87.63% | Retrained on our masked data
CVT | 2020 | 88.81% | 88.55% | 88.30% | 88.42% | Retrained on our masked data
MVT | 2021 | 89.22% | 89.00% | 88.75% | 88.87% | Retrained on our masked data
PACVT | 2023 | 88.72% | 88.45% | 88.20% | 88.32% | Retrained on our masked data
Ours | 2025 | 90.18% | 89.95% | 89.70% | 89.82% | Trained on our masked data
Table 4. Performance comparison on the RealMask-FER dataset.
Methods | Year | AP | Precision | Recall | F1-Score | Training Protocol
MVT | 2021 | 87.92% | 87.65% | 87.40% | 87.52% | Retrained on our masked data
PACVT | 2023 | 87.30% | 87.02% | 86.75% | 86.88% | Retrained on our masked data
Ours | 2025 | 89.75% | 89.50% | 89.25% | 89.37% | Trained on our masked data
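Whether the gaps in Tables 2–4 are statistically meaningful can be checked with McNemar's test [41] on the two models' paired per-image correctness. The sketch below implements the continuity-corrected statistic directly; the correctness vectors shown are placeholders, and this is a generic illustration rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """McNemar's test with continuity correction on paired per-image correctness vectors."""
    correct_a, correct_b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    b = np.sum(correct_a & ~correct_b)     # images only model A classifies correctly
    c = np.sum(~correct_a & correct_b)     # images only model B classifies correctly
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c) if (b + c) > 0 else 0.0
    return stat, chi2.sf(stat, df=1)       # statistic and p-value (1 degree of freedom)

# Example: compare our model against a retrained baseline on the same test images
ours_correct = np.random.rand(500) < 0.91  # placeholder correctness indicators
base_correct = np.random.rand(500) < 0.88
stat, p = mcnemar_test(ours_correct, base_correct)
print(f"McNemar statistic = {stat:.3f}, p = {p:.4f}")
```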
