Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction

Li, Yunfei; Liu, Hao; Liang, Jiuzhen; Jiang, Daihong

doi:10.3390/app15095139

Open AccessArticle

Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction

¹

School of Information Engineering, Xuzhou University of Technology, Xuzhou 221018, China

²

School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 5139; https://doi.org/10.3390/app15095139

Submission received: 10 March 2025 / Revised: 26 April 2025 / Accepted: 27 April 2025 / Published: 6 May 2025

Download

Browse Figures

Versions Notes

Abstract

Facial occlusion represents a significant challenge in the domain of facial expression recognition (FER). The absence of feature information due to occlusion has been demonstrated to result in a reduction in recognition accuracy and model robustness. To address this challenge, a multi-angle feature extraction (MAFE) method is proposed in this paper, aiming to enhance the recognition accuracy under occlusion conditions by employing multi-scale global features, local fine-grained features, and important regional features. The MAFE approach involves three core modules: multi-feature extraction, regional detail feature fusion, and consistent feature recognition. In the multi-feature extraction module, PTIR-50 and Swin Transformer are used to extract global features and fine-grained features, and at the same time, the five key points of the face are combined to crop out the important regions from the global features. The Regional Bias Loss (RB-Loss) is then utilized to guide the model to focus on the key information regions. The subsequent Regional Detail Feature Fusion module combines fine-grained features with those from the important regions. This process enhances the expressiveness of the features. The Consistent Feature Recognition module proposes consistent feature loss (con-feature Loss) to ensure that global features and fused features guide each other, forcing the model to focus on more discriminative expression features. The experimental results demonstrate that MAFE attains 89.42% and 86.94% accuracies on the Occlusion-RAFDB and Occlusion-FERPlus datasets, thereby surpassing the existing methods. Accuracies of 92.11% and 90.15% are also obtained on the original RAF-DB and FERPlus datasets.

Keywords:

multi-angle feature extraction; occlusion facial recognition; con-feature loss

1. Introduction

Facial expressions are among the most pervasive and significant signals to convey human emotions and intentions, and they play a pivotal role in everyday interpersonal communication. Facial expression recognition (FER) is a significant area of artificial intelligence (AI) that is drawing more research attention because of its numerous potential applications in a variety of fields, such as healthcare, driver status detection, smart classrooms, etc. [1,2,3,4,5] While significant progress has been made in the development of FER, factors such as light, pose change, and occlusion still present challenges that limit its accuracy and robustness. Figure 1 illustrates some examples of FER with occluded faces.

Facial occlusion represents a significant challenge in the domain of face analysis. It hinders the FER model’s ability to fully discern the full range of facial features, a limitation that has garnered considerable attention in the field of facial identity recognition. However, this issue has been less extensively investigated in the context of FER due to the unavailability of datasets with dedicated annotation for OFER [1,2,5]. It is, therefore, important to address this issue in order to enhance the accuracy and robustness of FER [6].

In the initial studies on the OFER problem, the effect of partially artificially occluded faces on FER was investigated in controlled laboratory environments. Boucher et al. [7] examined which areas of the human face are most important for recognition by occluding crucial regions of the face. Bourel et al. [8] proposed the first OFER model, which recovered facial feature points. Kotsia et al. [9] conducted a comprehensive study of OFER based on Gabor features and an analysis of human observers, which revealed that occlusion of the mouth area had a greater impact on FER performance than occlusion of the eye area on the JAFFE [10] and CK [11] datasets. Consequently, numerous subsequent research works have employed subregion-based feature extraction and fusion schemes for OFER [12,13], which initially detect occluded regions and then remove them. Sub-region analysis methods [14,15] delineate regions based on the analysis of facial feature points, employing an attention mechanism to focus on crucial regions. However, the absence of discernible facial feature points due to occlusion may result in errors in the recognition process. The unoccluded image network-assisted method [16,17] employs two distinct networks; one is trained on unoccluded images and the other on occluded images. This method employs unoccluded images as privileged information to facilitate expression recognition in occluded scenarios. Nevertheless, this approach is not applicable in real-world scenarios, as it is challenging to ensure the consistent availability of unoccluded images in practical applications. Moreover, there are a number of techniques that are specifically designed for the detection, removal, and reconstruction of occluded regions [18].

It has been argued that the explicit removal of occluded regions is impractical, given the difficulty of precisely detecting and reconstructing occlusion. Ref. [15] points out that humans are capable of understanding facial expressions when the face is occluded and can effectively utilize both the regions and the whole face to perceive the semantics conveyed by an incomplete face. In light of these considerations, this paper presents MAFE, a novel approach to addressing the OFER problem by utilizing global, fine-grained, and crucial regional features in the samples. MAFE primarily comprises three modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition. The multi-feature extraction module is responsible for the extraction of both the global features and fine-grained features of the input image. This is achieved through the introduction of multi-scale and fine-grained branches. Concurrently, the region division strategy, which is based on facial feature points and the attention weight selection mechanism, enables the model to concentrate on the crucial facial regions. In the region detail feature fusion module, the method presented in this paper effectively combines crucial regional features and fine-grained features, ensuring that the fusion features both encompass crucial region information and retain the fine-grained information of the original image. The consistent feature recognition module uses con-feature loss to improve the consistency between fusion and global features, boosting expression recognition accuracy. Our method’s effectiveness is confirmed by experiments on two public datasets and two occlusion test datasets, showing that it outperforms several state-of-the-art OFER methods. Our work offers a promising solution for enhancing the performance of OFER models. In particular, the contributions of this paper can be summarized as follows:

1. A straightforward multi-scale branching PTIR-50 is devised for the purpose of extracting global information from an image. The multi-scale feature pyramid and transformer module in the PTIR-50 are capable of capturing multi-scale information from an image, along with the long-distance dependencies within the image. This results in more comprehensive and precise feature representation. Furthermore, the utilization of a region division strategy based on facial feature points and an attention weight selection mechanism enables the model to focus on crucial regions, while ignoring the occluded regions. This effectively addresses the OFER problem.

2. In order to obtain more discriminative expression features and prevent feature redundancy during the feature fusion process, a con-feature loss function is proposed for the processing of features extracted from different branches. Therefore, the global, regional, and detailed information present in the image is fully utilized for the purpose of expression recognition.

3. The experimental results demonstrate that our OFER solution attains state-of-the-art results on the Occlusion-RAFDB and Occlusion-FERPlus datasets, with 89.42% and 86.94% accuracy. Furthermore, high accuracy was attained on the original RAF-DB and FERPlus datasets, with respective rates of 92.11% and 90.15%.

The remainder of this paper is organized as follows. Section 2 reviews the related work in FER and OFER. Section 3 introduces the proposed method MAFE in detail, including its overall architecture and core components. Section 4 presents the experimental setup, datasets, and evaluation metrics, followed by a thorough analysis of the results. Finally, Section 5 concludes the paper.

2. Related Work

In this section, we focus on regular FER work and OFER-related work. By examining the advances in conventional FER approaches alongside the challenges and solutions proposed for OFER, we aim to provide a comprehensive overview of the current landscape in the field.

2.1. FER

An FER system typically comprises three principal phases: face detection, feature extraction, and expression recognition. In the field of face detection, a multitude of algorithms, such as MTCNN [19], are employed to identify and localize faces within complex environments. Furthermore, researchers have developed a number of techniques for capturing appearance features resulting from facial expressions, including texture-based, geometry-based, and fusion features. The ongoing advancement of deep learning, coupled with the advent of extensive datasets, such as AffectNet [3], RAF-DB [4], and EmotioNet [2], has prompted researchers to employ convolutional neural networks (CNNs) for feature extraction. Fasel [20] demonstrated that shallow CNNs are robust regarding facial pose and scale. Tang [21] and Kahou et al. [22] employed deep CNNs for feature extraction and achieved success in the FER2013 and Emotiw2013 challenges, respectively. A number of recent studies have employed high-accuracy facial feature point detectors to obtain facial feature points, which are then used as informative features for FER tasks. These features are subsequently fed into a classifier for final expression recognition. Hassani et al. [23] proposed a new approach, 3D InceptionResNet, which multiplies facial feature points with certain layers of image features for recognition. Khan et al. [24] initially cropped small regions utilizing face feature points, subsequently generating small regional features as inputs to the neural network. Shahzad et al. [25] proposed a zoning-based face expression recognition (ZFER) approach to identify facial feature points through zoning, thereby facilitating the perception of deeper facial emotions. The practical issues associated with images in FER, including pose, direction, and input resolution, were addressed by pyramid with super resolution (PSR) [26] through the utilization of a super-resolution pyramidal network architecture and the introduction of an a priori distribution label smoothing (PDLS) loss function. Liu et al. [27] put forth a novel approach combining a fine-tuned SwinTransformer with a multi-weighted optimality search (FST-MWOS) to enhance the expression recognition process. He et al. [28] proposed a multilayer feature recognition algorithm based on a three-channel convolutional neural network (HFT). The CNN method fuses the features extracted from three network channels to obtain classification results. Jung et al. [29] proposed an FER network based on two models. One network extracts the appearance features from image sequences, while the other network extracts geometric features from facial feature points. The two networks are combined to improve model performance.

2.2. OFER

In the real world, human facial regions are readily occluded by external factors such as sunglasses, hats, scarves, and masks. Occlusions are divided into two types, based on whether an object is in the original image or not. The first type is artificial occlusions, where objects are added to the image to mimic occlusions. The second type is real occlusions, which are already present in the original image. These occlusions can harm FER accuracy by increasing intra-class differences or inter-class similarity. There are numerous studies on the OFER problem, which can be divided into four principal categories: robust feature extraction under occlusion, sub-region analysis, non-occluded network image assistance, and occlusion recovery methods.

(1) Robust feature extraction under occlusion The objective of this method is to extract features that are less affected by occlusion while maintaining the capacity to discriminate between different categories. Kotsia et al. [9] demonstrated how the introduction of artificial occlusion affects the performance of the FER system and discussed methods to address this issue. Wang et al. [15] proposed a region attention network (RAN) to capture crucial regions of the face, thereby enabling the solution of FER tasks with occlusion and pose changes. They also introduced RB-Loss to improve the importance of region attention weights.

(2) Sub-region analysis This method is designed to exclude the occluded regions from the recognition process. The impact of occlusion on the recognition performance of the FER model is reduced by focusing only on the unoccluded facial regions. Li et al. [30] proposed a new gated convolutional neural network (PG-CNN) that focuses on different regions of the face image and automatically detects the occluded regions of the face so that the network can focus on the more informative unoccluded regions. Li et al. [14] designed a block-based attention network for OFER. This network inputs a selection of 24 blocks, cropped from the regions of the eyes, nose, mouth, etc., into an attention network to compute the corresponding weights. These weights are then fused with the global features, allowing the model to focus on the information-rich unoccluded facial regions.

(3) Non-occluded network image assistance The method employs unoccluded facial images as a reference to facilitate the operation of OFER. Pan et al. [16] addressed the OFER challenge by training two deep neural networks, with one network focusing on processing unoccluded facial images and the other specializing in processing occluded facial images. The unoccluded network provides guidance to the occluded network, facilitating the more precise recognition of facial features in the occluded region. The fundamental principle of this approach is to utilize the comprehensive data present in the unoccluded facial image to facilitate enhanced recognition in the occluded image. Xia et al. [17] employed distribution densities in the feature space to quantify the complexity of unoccluded data. The unoccluded data can inform the classifier, enabling the identification and utilization of more meaningful and discriminative samples. By relying on the full information provided by the unoccluded data, the classifier is able to better understand and distinguish between different classes of features, improving the overall recognition performance.

(4) Occlusion recovery methods The objective of this method is to recover the occluded region of the image. Cotter [31] proposed a novel FER method based on Gabor multidirectional feature fusion and local Gabor binary pattern histogram sequence (LGBPHS) techniques to address the issue of a partially occluded face image. In order to reconstruct the occluded face, Lu et al. [32] proposed a Wasserstein generative adversarial network-based method, which consists of an autoencoder-based generator and a discriminator. This approach allows for the reconstruction of the occlusion, thereby obtaining an information-rich appearance.

2.3. Summary

With the development of deep learning, facial expression recognition technology has achieved excellent results. However, it still faces the issue of insufficient robustness in complex environments, and its reliance on large annotated datasets limits its generalization ability. The OFER method reduces the impact of occlusions on recognition accuracy through robust feature extraction and sub-region analysis. However, many methods rely on manually designed features or region divisions, making it difficult to adapt to complex occlusion scenarios.

Our work builds upon the strengths of both [14,15]. Specifically, we adopt the RB-loss proposed in [15] to optimize attention weights, enabling the fusion of key regional features with fine-grained features to enrich feature representations. Additionally, we employ the con-feature loss to encourage mutual guidance between fused features and global features, helping the network learn more consistent representations and thereby improving recognition accuracy. Based on these insights, we propose a method that incorporates multi-scale and fine-grained branches to extract global, regional, and fine-grained features from input images. This design facilitates a more comprehensive and expressive feature representation.

3. Method

This section presents an overview of the MAFE process. Subsequently, each module within the MAFE framework and the loss functions employed are elucidated in comprehensive detail.

3.1. Overview

As mentioned above, our initial research attempted to identify and eliminate the occluded region of the face, thereby addressing the OFER issue. This was achieved by removing the occluded block, allowing the network to focus solely on the unoccluded region of the face and recognize facial expressions. With this in mind, the MAFE system was designed to extract a range of features. This was achieved by implementing a region delineation strategy that is based on facial feature points and an attention weight selection mechanism. The latter allows the model to focus on crucial regions of the face, thereby reducing or eliminating the influence of occluded and irrelevant regions of the face on the network. The MAFE proposed in this paper is capable of obtaining fine-grained, global, and important region information from the image, which can then be used for expression recognition in occluded images. The flow of MAFE is illustrated in Figure 2 and consists of three main modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition.

A face image is first input to PTIR-50 and SWIN-E, which extract the global and fine-grained features, respectively. At the same time, Dlib is used to detect the face and cut the image into several regions. These regions are then fed into PTIR-50 for feature extraction, resulting in the acquisition of regional features. Subsequently, crucial regional features are identified, based on the self-attention of each region, and then integrated with fine-grained features to enhance the extracted feature data. Finally, a con-feature loss function is employed to facilitate mutual guidance between the fused and global features, with the global features subsequently being fed into an MLP for expression recognition.

3.2. Multi-Feature Extraction Module

Before introducing the multi-feature extraction module, it is necessary to introduce the image-cropping operation, in which we input a facial image and use the Dlib algorithm to detect the five feature points of the image. This will give the facial feature points of the two eyes, the nose, and the left and right corners of the mouth. The image should then be cropped according to the points obtained, resulting in images of the four regions of the face: left bottom, left top, right bottom, and right top (referred to as Lb, Lt, Rb, and Rt). These should then be fed into the feature extraction branch, along with the original image. This cropping operation serves two purposes. First, it identifies those regions within the image that are more important for expression recognition. Secondly, it produces images that are similar to those in which the face is occluded. This results in a more comprehensive dataset and increases the robustness of the model with OFER samples. The cropping process is illustrated in Figure 3.

The feature extraction branch is constituted of two distinct networks, namely PTIR-50 and SWIN-E. The objective of this configuration is to obtain a comprehensive representation of the image at the global level while simultaneously focusing on the fine-grained details, thereby enhancing the richness and stability of the extracted features.

3.2.1. PTIR-50

Deep learning has led to many high-precision facial feature detectors for use in face recognition. Researchers are now focusing on using facial feature points as geometric features in FER tasks. In OFER, when the facial points are occluded, image features become more important than the points. The PTIR-50 system uses IResnet-50, a feature pyramid, and a transformer-encoder to capture global features, as illustrated in Figure 2.

First, the image is fed into IR-50 to extract the image features. A feature pyramid is used to generate multi-scale feature representations. Three feature levels (large, medium, and small) are input into the transformer-encoder to capture different scales. The transformer-encoder output is then combined to form the input image’s features and attention.

3.2.2. SWIN-E

Researchers typically refer to image recognition as coarse-grained image recognition, such as the recognition of categories, such as distinguishing between cats and dogs. However, within image recognition, there are more specific tasks, called fine-grained image recognition. The goal of fine-grained image recognition is to distinguish between subclasses within the same category, such as different kitten breeds. This idea can be applied to FER. For example, Figure 4 shows the JAFFE dataset, wherein expressions like happiness, sadness, and fear are labeled as happiness-person, sadness-person, and fear-person. FER is different from other recognition tasks because expression features are more subtly embedded in facial features. To extract finer details, we refer to methods commonly used in fine-grained image recognition.

Researchers have used the Swin Transformer to develop network models for fine-grained image recognition [33,34,35]. This approach has demonstrated that the Swin Transformer, with its shifted window attention mechanism, can effectively extract fine-grained features from images. The shifted window mechanism allows the model to capture detailed local features while maintaining the spatial relationship between different regions, making it particularly well suited for tasks like FER. Consequently, the fine-grained feature extraction branch in our model utilizes the Swin Transformer’s Encoder module.

The input image is designated as X, and the cropped image is designated as

X_{Lb}

,

X_{Lt}

,

X_{R b}

,

X_{R t}

. PTIR-50 is designated as

P (\cdot; θ_{p})

, and SWIN-E is designated as

S (\cdot; θ_{s})

. The features

F_{\cdot}

and self-attention

A t t n_{\cdot}

, which were acquired after PTIR-50 and SWIN-E for X and the region blocks

X_{Lb}

,

X_{Lt}

,

X_{R b}

,

X_{R t}

, are designated as follows:

\begin{matrix} [F_{X}, A t t n_{X}] = P (X; θ_{P}), F_{S w i n} = S (X; θ_{S}) \end{matrix} [F_{L b}, A t t n_{L b}] = P (X_{L b}; θ_{P}), [F_{L t}, A t t n_{L t}] = P (X_{L t}; θ_{p}) [F_{R b}, {A t t n}_{R b}] = P (X_{R b}; θ_{p}), [F_{R t}, {A t t n}_{R t}] = P (X_{r t}; θ_{P})

(1)

where

θ_{P}

is a parameter of PTIR-50 and

θ_{S}

is a parameter of SWIN-E.

The feature set F and the self-attention set A are obtained after sorting:

F = [F_{X}, F_{L b}, F_{L t}, F_{R b}, F_{R t}]

(2)

A = [A t t n_{X}, A t t n_{L b}, A t t n_{L t}, A t t n_{R b}, A t t n_{R t}]

(3)

As illustrated in Figure 5, a variety of features were extracted from the distinct region blocks.

3.3. Region Detail Feature Fusion Module

The self-attention set A, obtained from the multi-feature extraction module, is subjected to processing by the SE module [36], resulting in the generation of an attention weight

μ

. This attention weight A then functions as follows:

μ = f (A t t n, q)

(4)

where q denotes the parameters of the SE Module and f denotes the

s i g m o i d

function.

The obtained weight set W is:

W = [μ_{X}, μ_{L b}, μ_{L t}, μ_{R b}, μ_{R t}]

(5)

The maximum attention weight

W_{i}

and the corresponding crucial region feature

F_{i}

are obtained by maximizing W, The specific formula is as follows:

W_{i} = max W

(6)

To improve the expressiveness of the features, the fine-grained feature

F_{S w i n}

and the crucial regional features

F_{i}

are fused to obtain the fusion feature

F_{H}

:

F_{H} = S E (c o n c a t (F_{i}, F_{S w i n}))

(7)

where

S E (\cdot)

denotes SE Module.

3.4. Con-Feature Recognition Module

In the initial two modules, the fusion feature

F_{H}

and image feature

F_{X}

of input image X are obtained. In order for the network to learn more precise and consistent features, the feature selection process of the model is optimized using con-feature loss. Subsequently, the image feature

F_{X}

is passed to the MLP to return the predicted label Y.

As illustrated in Figure 6, Figure 6a represents

F_{X}

, extracted by PTIR-50, and

F_{H}

, whereas Figure 6b depicts

F_{X}

using con-feature loss. In Figure 6b, these two features inform one another in order to obtain more discriminative expression features, thereby improving the robustness and accuracy of the model.

We now provide a comprehensive description of the loss function that is employed for training purposes.

In the field of recognition tasks, the acquisition of more discriminative and precise image features is of paramount importance. This improves both the classification accuracy and the model’s robustness regarding different categories and details. Optimizing feature extraction reduces misclassification and improves real-world performance. Con-feature loss is a loss function that is used to combine the features extracted from two networks, with the objective of optimizing the model’s performance. This makes the extracted features more precise and discriminative. The concept of con-feature loss is based on the idea that ‘two networks do not make the same mistake’.

The proposed idea is that features identified by both networks in the expression recognition task will be more discriminative, improving model performance. Con-feature loss helps the model to capture more detailed and precise expression information through the mutual guidance of fused and global features. This approach enhances feature representation, increases sensitivity to subtle expression differences, and avoids feature redundancy from simple fusion. This significantly boosts expression recognition accuracy. The experiments and results supporting this hypothesis are shown in Section 4.

Con-feature loss is designed as follows:

L_{c o n - f} = \frac{1}{N} \frac{1}{L} \sum_{i = 1}^{N} \sum_{l = 1}^{L} {∥F_{H} (i, l) - F_{X} (i, l)∥}_{2}

(8)

where N denotes the number of samples and L denotes the length of the feature sequence.

RB-loss is designed as follows:

L_{R B} = max {0, ω - (μ_{max} - μ_{0})}

(9)

where

ω

is the hyperparameter,

μ_{0}

is the attentional weight of the original face image, and

μ_{max}

denotes the maximum weight of all region blocks.

Furthermore, the prevalence of ambiguous images in the FER dataset engenders uncertainty in the model with regard to specific samples, which, in turn, affects the model’s assessment of that classification. To mitigate this effect, a label-smoothing strategy

L o s s_{s m}

is employed, whereby label smoothing affords these images a certain degree of error tolerance probability, thereby preventing the model from placing undue credence in the sample labels.

L o s s_{s m}

is designed as follows:

\begin{matrix} L_{sm} & = \frac{1}{N} \sum_{i = 1}^{N} [(1 - τ) (- log (\frac{exp (x_{i, y_{i}})}{\sum_{k} exp (x_{i, k})})) \\ + τ (- \frac{1}{C} \sum_{j} log (\frac{exp (x_{i, j})}{\sum_{k} exp (x_{i, k})}))] \end{matrix}

(10)

where N denotes the number of samples, C denotes the number of categories, i denotes the index of the sample, j and k denote the index of the category, x denotes the output of the model,

x_{i, j}

denotes the output value of the i sample on the j category,

y_{i}

denotes the value of the sample label of the i sample, and

τ

denotes the smoothing factor.

In training, a joint optimization strategy is employed wherein the classification loss is weighted and combined with

L o s s_{s m}

,

L o s s_{R B}

, and

L o s s_{c o n - f}

, with the aim of enhancing the efficacy of the model. Specifically, the training loss function is defined as follows:

L_{t r a i n} = L_{c l s} + α L_{s m} + β L_{R B} + γ L_{c o n - f}

(11)

where

α

,

β

,

γ

are the weight hyperparameters corresponding to the loss.

4. Experiment

This section commences with an account of the dataset employed in the experiment, followed by a detailed exposition of the experimental setup. To demonstrate the efficacy of the aforementioned module, ablation experiments were designed and conducted. Furthermore, MAFE is benchmarked against state-of-the-art methods to illustrate its superiority.

4.1. Dataset

Two popular natural facial expression datasets, RAF-DB and FERPlus, were used to evaluate MAFE. These datasets cover facial images on different scales and in challenging situations. In addition, to evaluate the performance of MAFE on OFER, two datasets that were proposed in [15], Occlusion-FERPlus and Occlusion-RAFDB, are used for testing, and these datasets are briefly described below.

4.1.1. RAF-DB

The RAF-DB (Real-world Affective Faces Database) is the first real-world facial expression dataset that includes both basic and composite expressions. The images in this dataset vary greatly in terms of age, gender, ethnicity, head pose, lighting, occlusions (like glasses or facial hair), and post-processing effects (like filters). Experiments were conducted using images from the dataset that display six basic expressions (happy, surprised, sad, angry, disgusted, and fearful) and neutral expressions.

4.1.2. FERPlus

FERPlus is an extension of the FER2013 dataset for the ICML2013 Challenge. FERPlus consists of large-scale expression images collected by the Google search engine, along with new labels provided by Microsoft for FER2013. It contains 28,709 training images, 3589 validation images, and 3589 test images with a size of 48 × 48 pixels. The main difference between FER2013 and FERPlus is in the annotations. FER2013 was annotated by 1 annotator using 7 expression labels (neutral, happy, surprised, sad, angry, disgusted, and fearful), whereas FERPlus added the contempt label and was annotated by 10 annotators.

4.1.3. Occlusion-RAFDB

Occlusion-RAFDB is an occlusion test dataset composed by the authors of [15] based on the test dataset of the original RAF-DB dataset, which was manually labeled according to the type of occlusion (non-occlusion, wearing a mask, wearing glasses, left/right objects, top-face objects, and bottom-face objects) and selected images with at least one type of occlusion, with a total of 735 images containing the corresponding expression annotations and occlusion type annotations. In this paper, only the expression annotations in the test dataset are used.

4.1.4. Occlusion-FERPlus

Similarly, Occlusion-FERPlus is an occlusion test dataset composed by the authors of [15] based on the test dataset of the original FERPlus dataset, manually labeled according to the type of occlusion (non-occlusion, wearing a mask, wearing glasses, left/right objects, upper face objects, and lower face objects), and selecting those images with at least one type of occlusion, with a total of 605 images containing the corresponding expression labeling and the occlusion type annotations. In this paper, only the expression annotations in the test set are used.

4.2. Implementation Details

We used Pytorch to complete the experiments on two Nvidia Tesla V100 graphics cards. RAF-DB, FERPlus, Occlusion-RAFDB, and Occlusion-FERPlus were used as datasets during the experiments. Specifically, RAF-DB and FERPlus were used as the training dataset and validation dataset, and tests were performed on Occlusion-RAFDB and Occlusion-FERPlus to evaluate the performance of MAFE on the OFER task. The input image and face region image size were both resized to 224 × 224, and the parameters in

L_{t r a i n}

were set to

α

= 2,

β

= 1,

γ

= 2. We initialized the learning rate to 0.000025 and stopped training at the 70th epoch.

4.3. Comparison with SOTA

This section presents a comparative analysis of the performance of MAFE with other methods on the Occlusion-FERPlus and Occlusion-RAFDB datasets.

Summary of Occlusion-RAFDB: Table 1 presents the performance of the methods proposed by researchers in the field of FER, working on the Occlusion-RAFDB dataset for the five-year period from 2020 to 2024. MAFE demonstrated the highest accuracy, with an accuracy rate of 89.42%.

Summary of Occlusion-FERPlus: Table 1 presents the performance of the methodologies proposed by researchers in the field of FER, based on the Occlusion-FERPlus dataset over the five-year period from 2020 to 2024. The results demonstrate that the MAFE approach exhibits superior accuracy, with an accuracy rate of 86.94%.

The superior performance of MAFE on the Occlusion-RAFDB and Occlusion-FERPlus datasets validates its effectiveness in handling occlusion scenarios. Through its unique feature fusion strategy and optimized loss function, MAFE is able to significantly improve accuracy under occlusion and complex conditions.

Furthermore, MAFE achieved good performance when working on the original RAF-DB and FERPlus datasets.

Summary of RAF-DB: Table 2 lists the performance of the methods proposed by researchers in the field of FER, working on the RAF-DB dataset for the five years from 2020 to 2024. MAFE performed well in terms of accuracy at 92.11%, which is 2.51% and 2.57% higher than the previous Latent-OFER and MPA method performance.

Summary of FERPlus: Table 2 lists the performance of the methods proposed by researchers in the field of FER on the FERPlus dataset for the five years from 2020 to 2024, and MAFE performs well in terms of accuracy with 90.15%, which is 0.73% and 1.02% higher than the previous SCAN-CCI and MPA performance, respectively.

The results show that MAFE consistently outperforms previous methods on both datasets, making it a more robust model for FER.

4.4. Ablation Experiment

To investigate the contribution of each component within the MAFE framework, ablation experiments were conducted on the RAF-DB and Occlusion-RAFDB datasets. The outcomes of these experiments are presented in Table 3.

Observing the table, it is evident that compared to using only PTIR-50 or SWIN-E for expression recognition, MAFE achieves good performance on both the original RAF-DB dataset and the Occlusion-RAFDB dataset, with accuracies of 92.11% and 89.42%.

In order to demonstrate the efficacy of con-feature loss, a comparative analysis was conducted between con-feature loss and feature fusion. The objective was to evaluate the performance of processing

F_{X}

and

F_{H}

using con-feature loss in comparison to the performance of fusing

F_{X}

and

F_{H}

with features, as illustrated in Table 4.

The results of the experiment demonstrate that the method of using con-feature loss to process

F_{X}

and

F_{H}

, thereby enabling the model to learn more precise features, is significantly more accurate than the method of inputting

F_{X}

and

F_{H}

features into the MLP after fusion. Furthermore, the training process revealed that the time required to complete an epoch using con-feature loss was three minutes shorter than the time needed to train using feature fusion.

Figure 7 illustrates a visualization of the feature maps after con-feature loss and feature fusion. It can be observed that the visualization of the con-feature loss is concentrated in areas that are more indicative of an expression, while simple feature fusion also focuses on regions that are not related to expressions.

In addition, we also conducted an in-depth study of the four clipped regions to explore the impact of the important regions on the model’s performance. The features

F_{X}

,

F_{L b}

,

F_{L t}

,

F_{R b}

,

F_{R t}

(depicted in Figure 8) are features that are fused with the fine-grained features

F_{S w i n}

. The results of the experiments are shown in Table 5. ‘Max’ in the table indicates the selected crucial region.

4.5. Model Analysis

To evaluate the computational overhead of the proposed MAFE framework, we measured its parameter count and the number of floating-point operations (FLOPs) required during inference. With an input size of

224 \times 224

, the total number of parameters in the model is approximately 72 million, and the FLOPs is about 50.59 GFLOPs.

Although MAFE incurs a higher computational cost compared to traditional CNN architectures, considering its significant performance improvement on both standard and occluded datasets, this computational cost remains within an acceptable range. In the future, we will further optimize its deployment efficiency through methods such as pruning and replacing with lightweight backbones.

In addition, we recorded the F1 scores on different datasets: 87.13 on the RAF-DB dataset and 86.02 on the FERPlus dataset, indicating that MAFE maintains good generalization ability when facing different data distributions and annotation styles.

We further evaluated the generalization ability of the model across different datasets.

As shown in Table 6, the model’s accuracy drops significantly on the test datasets, especially under occlusion, indicating its limited generalization capacity. This is primarily due to the substantial distributional differences between datasets, which may cause the model to overfit to dataset-specific features during training, resulting in degraded performance on unseen data. In future work, we plan to incorporate domain generalization techniques to enhance the model’s robustness in complex environments.

To evaluate whether the performance improvements of the MAFE method over other approaches are statistically significant across different datasets, we computed the p-values for each set of results, as shown in Table 7. The results indicate that all p-values are below 0.1, suggesting marginal statistical significance and supporting the robustness and generalization capability of MAFE.

4.6. Visualization

In order to demonstrate the efficacy of MAFE regarding the OFER problem, some occluded samples were subjected to attentional visualization. Figure 9 illustrates the attentional visualization of some FER networks on OFER samples.

The first row of Figure 9 shows some OFER sample pictures, followed by the attention maps of FDRL [52] and ARM [53] on OFER samples in the next two rows, and then the attention of PTIR-50. The results show that these three networks focus significantly on the occluded part of the OFER samples, leading to poor recognition accuracy. In contrast, the last row of images for MAFE demonstrates that the occluded region receives minimal attention, highlighting MAFE’s advantage in handling OFER samples.

We selected representative samples with mouth-occluded and eye-occluded regions from the dataset and visualized their feature activation maps in Figure 10. These findings demonstrate that the model exhibits reasonable robustness to occlusion and is not overly sensitive to the position or nature of the occluded region.

Specifically, MAFE introduces a simplified global feature extraction network, PTIR-50, which efficiently captures global features while maintaining a lightweight structure. The fine-grained branch, SWIN-E, extracts detailed features from the image, improving the model’s ability to recognize subtle differences in facial expressions. Additionally, the con-feature loss function facilitates mutual guidance between global and fused features, allowing MAFE to learn more discriminative features, thereby enhancing both recognition accuracy and robustness.

5. Conclusions

This paper proposes a Multi-Angle Feature Extraction (MAFE) network, which leverages a two-branch feature extraction structure and tailored loss functions to guide the network’s focus toward the most critical and fine-grained facial expression regions, thereby enhancing occlusion-aware facial expression recognition (OFER) performance. MAFE consists of three core modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition.

In the multi-feature extraction module, we hypothesize that networks with different architectures and parameters will extract complementary features, even when addressing the same task. To exploit this diversity, we simultaneously utilize a multi-scale branch (PTIR-50) and a fine-grained branch (SWIN-E) to extract global, regional, and fine-grained features. The original image and its cropped versions—based on facial landmarks—are fed into PTIR-50 to extract global and regional features along with their self-attention weights. By applying the regional bias loss (RB-Loss), the attention weights of key regions are emphasized beyond those of the global context. Meanwhile, SWIN-E captures fine-grained details from the original image.

In the region detail fusion module, features from the key regions are combined with fine-grained details to form a unified representation, capturing both local and detailed information. Finally, in the consistent feature recognition module, a con-feature loss is introduced to enforce mutual guidance between global and fused features, promoting feature consistency and improving overall recognition performance.

Author Contributions

Conceptualization, Y.L. and H.L.; Methodology, Y.L. and H.L.; Software, Y.L.; Validation, Y.L. and D.J.; Writing—original draft, Y.L.; Writing—review & editing, Y.L. and H.L.; Visualization, Y.L.; Supervision, H.L. and J.L.; Project administration, D.J.; Funding acquisition, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

The study was financially supported by the Xuzhou Science and Technology Plan Project of China (GranNumber: KC23416).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: RAF-DB is available at http://www.whdeng.cn/RAF/model1.html, accessed on 26 April 2025, and FERPlus is available at https://github.com/microsoft/FERPlus, accessed on 26 April 2025.

Acknowledgments

This work was supported by the Xuzhou Science and Technology Plan Project under Grant KC23416.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following is the list of abbreviations used in this paper:

FER	Facial Expression Recognition
OFER	Occluded Facial Expression Recognition
MAFE	Multi-Angle Feature Extraction
RB-Loss	Regional Bias Loss
Swin-E	Swin Transformer Encoder
PTIR-50	Pyramid Transformer ResNet-50

References

Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In Proceedings of the International Conference on Multimodal Interaction (ICMI), Tokyo Japan, 12–16 November 2016; pp. 279–283. [Google Scholar]
Benitez-Quiroz, C.F.; Srinivasan, R.; Martinez, A.M. EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5562–5570. [Google Scholar]
Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Compuing 2017, 10, 18–31. [Google Scholar] [CrossRef]
Li, S.; Deng, W.; Du, J.P. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2584–2593. [Google Scholar]
Dhall, A.; Kaur, A.; Goecke, R.; Gedeon, T. EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction. In Proceedings of the International Conference on Multimodal Interaction (ICMI), Boulder, CO, USA, 16–20 October 2018; pp. 653–656. [Google Scholar]
Zeng, D.; Veldhuis, R.; Spreeuwers, L. A survey of face recognition techniques under occlusion. IET Biom. 2021, 10, 581–606. [Google Scholar] [CrossRef]
Boucher, J.D.; Ekman, P. Facial areas and emotional information. J. Commun. 1975, 21–29. [Google Scholar] [CrossRef] [PubMed]
Bourel, F.; Chibelushi, C.C.; Low, A.A. Recognition of Facial Expressions in the Presence of Occlusion. In Proceedings of the British Machine Vision Conference(BMVC), Manchester, UK, 10–13 September 2001; pp. 1–10. [Google Scholar]
Kotsia, I.; Buciu, I.; Pitas, I. An analysis of facial expression recognition under partial facial image occlusion. Image Vis. Comput. 2008, 26, 1052–1067. [Google Scholar] [CrossRef]
Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
Kanade, T.; Cohn, J.F.; Tian, Y. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 28–30 March 2000; pp. 46–53. [Google Scholar]
Happy, S.; Routray, A. Automatic facial expression recognition using features of salient facial patches. IEEE Trans. Affect. Comput. 2014, 6, 1–12. [Google Scholar] [CrossRef]
Zhang, L.; Tjondronegoro, D.; Chandran, V. Facial expression recognition experiments with data from television broadcasts and the World Wide Web. Image Vis. Comput. 2014, 32, 107–119. [Google Scholar] [CrossRef]
Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 2018, 28, 2439–2450. [Google Scholar] [CrossRef]
Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef]
Pan, B.; Wang, S.; Xia, B. Occluded facial expression recognition enhanced through privileged information. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 566–573. [Google Scholar]
Xia, B.; Wang, S. Occluded facial expression recognition with step-wise assistance from unpaired non-occluded images. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2927–2935. [Google Scholar]
Lee, I.; Lee, E.; Yoo, S.B. Latent-OFER: Detect, mask, and reconstruct with latent vectors for occluded facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), Paris, France, 1–6 October 2023; pp. 1536–1546. [Google Scholar]
Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
Fasel, B. Robust face analysis using convolutional neural networks. In Proceedings of the International Conference on Pattern Recognition (ICPR), Quebec City, QC, Canada, 11–15 August 2002; Volume 2, pp. 40–43. [Google Scholar]
Tang, Y. Deep learning using linear support vector machines. arXiv preprint, 2013; arXiv:1306.0239. [Google Scholar]
Kahou, S.E.; Pal, C.; Bouthillier, X.; Froumenty, P.; Gülçehre, Ç.; Memisevic, R.; Vincent, P.; Courville, A.; Bengio, Y.; Ferrari, R.C.; et al. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the International Conference on Multimodal Interaction (ICMI), Sydney, Australia, 9–13 December 2013; pp. 543–550. [Google Scholar]
Hasani, B.; Mahoor, M.H. Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 30–40. [Google Scholar]
Khan, F. Facial expression recognition using facial landmark detection and feature extraction via neural networks. arXiv 2018, arXiv:1812.04510. [Google Scholar]
Shahzad, T.; Iqbal, K.; Khan, M.A.; Iqbal, N. Role of zoning in facial expression using deep learning. IEEE Access 2023, 11, 16493–16508. [Google Scholar] [CrossRef]
Vo, T.H.; Lee, G.S.; Yang, H.J.; Kim, S.H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 2020, 8, 131988–132001. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
He, Y.; Zhang, Y.; Chen, S.; Hu, Y. Facial Expression Recognition Using Hierarchical Features With Three-Channel Convolutional Neural Network. IEEE Access, 2023; 84785–84794. [Google Scholar]
Jung, H.; Lee, S.; Yim, J.; Park, S.; Kim, J. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), Santiago, Chile, 7–13 December 2015; pp. 2983–2991. [Google Scholar]
Li, Y.; Zeng, J.; Shan, S.; Chen, X. Patch-gated CNN for occlusion-aware facial expression recognition. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2209–2214. [Google Scholar]
Cotter, S.F. Sparse representation for accurate classification of corrupted and occluded facial expressions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA, 14–19 March 2010; pp. 838–841. [Google Scholar]
Lu, Y.; Wang, S.; Zhao, W.; Zhao, Y. Wgan-based robust occluded facial expression recognition. IEEE Access 2019, 7, 93594–93610. [Google Scholar] [CrossRef]
Huo, X.; Sun, G.; Tian, S.; Wang, Y.; Yu, L.; Long, J.; Zhang, W.; Li, A. HiFuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomed. Signal Process. Control 2024, 87, 105534. [Google Scholar] [CrossRef]
Xu, Y.; Wu, S.; Wang, B.; Yang, M.; Wu, Z.; Yao, Y.; Wei, Z. Two-stage fine-grained image classification model based on multi-granularity feature fusion. Pattern Recognit. 2024, 146, 110042. [Google Scholar] [CrossRef]
Kim, J.H.; Kim, N.; Won, C.S. Global–local feature learning for fine-grained food classification based on Swin Transformer. Eng. Appl. Artif. Intell. 2024, 133, 108248. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Zhao, Z.; Liu, Q.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI-21), Online, 2–9 February 2021; Volume 35, pp. 3510–3519. [Google Scholar]
Zhao, Z.; Liu, Q.; Wang, S. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Trans. Image Process. 2021, 30, 6544–6556. [Google Scholar] [CrossRef]
Ma, F.; Sun, B.; Li, S. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Trans. Affect. Comput. 2021, 14, 1236–1248. [Google Scholar] [CrossRef]
Gong, W.; Qian, Y.; Fan, Y. MPCSAN: Multi-head parallel channel-spatial attention network for facial expression recognition in the wild. Neural Comput. Appl. 2023, 35, 6529–6543. [Google Scholar] [CrossRef]
Gera, D.; Balasubramanian, S. Landmark guidance independent spatio-channel attention and complementary context information based facial expression recognition. Pattern Recognit. Lett. 2021, 145, 58–66. [Google Scholar] [CrossRef]
Lo, L.; Xie, H.; Shuai, H.H.; Cheng, W.H. Facial chirality: From visual self-reflection to robust facial feature learning. IEEE Trans. Multimed. 2022, 24, 4275–4284. [Google Scholar] [CrossRef]
Guo, S.; Cai, S.; Zou, X.; Zhou, Z.; Wang, L. Facial Expression Recognition with Local Occlusion Based on Weighted Multi-Head Parallel Attention. Comput. Syst. Appl. Chin. 2024, 33, 254–262. [Google Scholar] [CrossRef]
Ding, H.; Zhou, P.; Chellappa, R. Occlusion-adaptive deep network for robust facial expression recognition. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–9. [Google Scholar]
Ma, H.; Celik, T.; Li, H.C. Lightweight attention convolutional neural network through network slimming for robust facial expression recognition. Signal Image Video Process. 2021, 15, 1507–1515. [Google Scholar] [CrossRef]
Ju, L.; Zhao, X. Mask-based attention parallel network for in-the-wild facial expression recognition. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2410–2414. [Google Scholar]
Gong, W.; Wang, C.; Jia, J.; Qian, Y.; Fan, Y. Multi-feature fusion network for facial expression recognition in the wild. J. Intell. Fuzzy Syst. 2022, 42, 4999–5011. [Google Scholar] [CrossRef]
Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6897–6906. [Google Scholar]
Xia, H.; Li, C.; Tan, Y.; Li, L.; Song, S. Destruction and reconstruction learning for facial expression recognition. IEEE MultiMedia 2021, 28, 20–28. [Google Scholar] [CrossRef]
Gao, W.; Li, L.; Zhao, H. Facial Expression Recognition Method Based on SpResNet-ViT. In Proceedings of the 2nd Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), Shenyang, China, 23–25 April 2022; pp. 182–187. [Google Scholar]
Jiang, J.; Deng, W. Disentangling identity and pose for facial expression recognition. IEEE Trans. Affect. Comput. 2022, 13, 1868–1878. [Google Scholar] [CrossRef]
Ruan, D.; Yan, Y.; Lai, S.; Chai, Z.; Shen, C.; Wang, H. Feature decomposition and reconstruction learning for effective facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 7660–7669. [Google Scholar]
Shi, J.; Zhu, S.; Liang, Z. Learning to amend facial expression representation via de-albino and affinity. arXiv 2021, arXiv:2103.10189. [Google Scholar]

Figure 1. Illustrative examples of occluded samples. The first row depicts some FER samples with faces occluded, while the second row displays the predicted outcomes employing the baseline Pyramid Transformer ResNet-50 (PTIR-50). The subsequent row depicts the results predicted by MAFE, demonstrating consistency with the sample labels.

Figure 2. A diagrammatic representation of the MAFE model. The upper half of the diagram represents the three modules of the MAFE model: multi-feature extraction, region detail feature fusion, and consistent feature recognition. The lower half of the figure is a graphical representation of PTIR-50.

Figure 3. Cropping process. The input image is processed through Dlib, wherein five feature points are identified and marked with red dots, as shown in Figure 3. Thereafter, the image is cropped.

Figure 4. Example of fine-grained image recognition—JAFFE expression recognition dataset. The JAFFE dataset was released in 1998 and comprises images of human facial expressions obtained from 10 Japanese women who were instructed to make a range of expressions in an experimental setting and were then photographed with a camera. The entire database comprises 213 images captured from 10 individuals, all of whom were female. Each subject displayed seven expressions: sad, happy, angry, disgusted, surprised, afraid, and neutral. Approximately 20 images were selected from each expression group.

Figure 5. Visualization of the regional features. As shown in the figure, different regions focus on distinct aspects, leading to variations in the extracted features.

Figure 6. Demonstration of the con-feature loss effect. The images in (a) illustrate the attentions of

F_{X}

and

F_{H}

; (b) depicts attention utilizing con-feature loss.

Figure 6. Demonstration of the con-feature loss effect. The images in (a) illustrate the attentions of

F_{X}

and

F_{H}

; (b) depicts attention utilizing con-feature loss.

Figure 7. Feature fusion and on-feature loss visualization. The left image shows that feature fusion focuses more on facial contours and areas like the eyebrows, while the right image demonstrates how con-feature loss helps concentrate the model’s attention on facial features.

Figure 8. Example of image segmentation based on facial feature points. From left to right, the segments are left top, right top, right bottom, and left bottom.

Figure 9. Attention visualization. The visualization of the attention of some FER networks on OFER samples demonstrates the superiority of MAFE when considering OFER samples.

Figure 10. The visualization of occluded eyes and mouth. The first row shows the mouth being occluded, and the second row shows the eyes being occluded (e.g., by glasses/sunglasses).

Table 1. Comparison with other methods for handling occlusion datasets.

Dataset	Method	Accuracy (%)
Occlusion-RAFDB	ResNet18 [15]	80.19
	RAN [15]	82.72
	EfficientFace [37]	83.24
	MA-Net [38]	83.65
	VTFF [39]	83.95
	MPCSAN [40]	86.26
	SCAN-CCI [41]	85.03
	Latent-OFER [18]	84.20
	Facial Chirality [42]	88.16
	MPA [43]	87.47
	MAFE	89.42
Occlusion-FERPlus	ResNet18 [15]	73.33
	RAN [15]	83.63
	VTFF [39]	84.79
	SCAN-CCI [41]	86.12
	MPCSAN [40]	86.12
	OADN [44]	84.57
	MPA [43]	86.28
	MAFE	86.94

Table 2. Comparison with other methods, using the original datasets.

Dataset	Method	Accuracy (%)
RAF-DB	RAN [15]	82.72
	EfficientFace [37]	83.24
	MA-Net [38]	88.40
	LANet [45]	86.70
	MAPNet [46]	87.26
	MFNet [47]	88.53
	SCAN-CCI [41]	89.02
	MPA [43]	89.54
	Latent-OFER [18]	89.60
	MAFE	92.11
FERPlus	RAN [15]	88.55
	SCN [48]	88.01
	VTFF [39]	88.81
	ADC-Net [49]	88.90
	SpResNet-ViT [50]	88.10
	SCAN-CCI [41]	89.42
	IPD-FER [51]	88.42
	MPA [43]	89.13
	MAFE	90.15

Table 3. Ablation study on MAFE.

Dataset	Method	Accuracy (%)
RAF-DB	PTIR-50	91.46
	SWIN-E	91.30
	PTIR-50+SWIN-E	91.57
	PTIR-50+Crop	91.53
	SWIN-E+Crop	91.37
	MAFE	92.11
Occlusion-RAFDB	PTIR-50	87.09
	SWIN-E	86.91
	PTIR-50+SWIN-E	88.76
	PTIR-50+Crop	88.47
	SWIN-E+Crop	87.64
	MAFE	89.42

Table 4. Comparison of con-feature loss and feature fusion.

Dataset	Method	Accuracy (%)
RAF-DB	con-feature loss	92.11
RAF-DB	Feature fusion	91.33
Occlusion-RAFDB	con-feature loss	89.42
Occlusion-RAFDB	Feature fusion	88.04

Table 5. Impact of the different regions on model performance.

Dataset	Region	Accuracy (%)
RAF-DB	Lt	91.53
	Rt	91.95
	Rb	91.78
	Lb	91.51
	Image	91.72
	MAX	92.11
Occlusion-RAFDB	Lt	87.79
	Rt	88.74
	Rb	88.47
	Lb	88.20
	Image	88.33
	MAx	89.42

Table 6. Cross-dataset evaluation results.

Train Dataset	Test Dataset	Accuracy (%)
RAF-DB–92.11%	FERPlus	26.81
RAF-DB–92.11%	Occlusion-FERPlus	25.84
FERPlus–90.15%	RAF-DB	37.06
FERPlus–90.15%	Occlusion-RAFDB	30.98

Table 7. p-values of methods evaluated on four datasets.

Dataset	RAN [15]	SCAN-CCI [41]	MPA [43]	MAFE
RAF-DB	82.72	89.02	89.54	92.11
FERPlus	88.55	89.42	89.13	90.15
Occlusion-RAFDB	82.72	85.03	87.47	89.42
Occlusion-FERPlus	83.63	86.12	86.28	86.94
p-value	0.057	0.086	0.038	-

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Y.; Liu, H.; Liang, J.; Jiang, D. Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Appl. Sci. 2025, 15, 5139. https://doi.org/10.3390/app15095139

AMA Style

Li Y, Liu H, Liang J, Jiang D. Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Applied Sciences. 2025; 15(9):5139. https://doi.org/10.3390/app15095139

Chicago/Turabian Style

Li, Yunfei, Hao Liu, Jiuzhen Liang, and Daihong Jiang. 2025. "Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction" Applied Sciences 15, no. 9: 5139. https://doi.org/10.3390/app15095139

APA Style

Li, Y., Liu, H., Liang, J., & Jiang, D. (2025). Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Applied Sciences, 15(9), 5139. https://doi.org/10.3390/app15095139

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction

Abstract

1. Introduction

2. Related Work

2.1. FER

2.2. OFER

2.3. Summary

3. Method

3.1. Overview

3.2. Multi-Feature Extraction Module

3.2.1. PTIR-50

3.2.2. SWIN-E

3.3. Region Detail Feature Fusion Module

3.4. Con-Feature Recognition Module

4. Experiment

4.1. Dataset

4.1.1. RAF-DB

4.1.2. FERPlus

4.1.3. Occlusion-RAFDB

4.1.4. Occlusion-FERPlus

4.2. Implementation Details

4.3. Comparison with SOTA

4.4. Ablation Experiment

4.5. Model Analysis

4.6. Visualization

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI