1. Introduction
Facial expressions are among the most pervasive and significant signals for conveying human emotions and intentions, and they play a pivotal role in everyday interpersonal communication. Facial expression recognition (FER) is a significant area of artificial intelligence (AI) that is drawing increasing research attention because of its numerous potential applications in fields such as healthcare, driver status detection, and smart classrooms [1,2,3,4,5]. While significant progress has been made in the development of FER, factors such as illumination, pose changes, and occlusion still present challenges that limit its accuracy and robustness.
Figure 1 illustrates some examples of FER with occluded faces.
Facial occlusion represents a significant challenge in the domain of face analysis. It hinders the FER model's ability to discern the full range of facial features, a limitation that has garnered considerable attention in the field of facial identity recognition. However, this issue has been less extensively investigated in the context of FER, owing to the unavailability of datasets with dedicated annotations for occlusion-aware facial expression recognition (OFER) [1,2,5]. It is therefore important to address this issue in order to enhance the accuracy and robustness of FER [6].
In the initial studies on the OFER problem, the effect of partially, artificially occluded faces on FER was investigated in controlled laboratory environments. Boucher et al. [7] examined which areas of the human face are most important for recognition by occluding crucial facial regions. Bourel et al. [8] proposed the first OFER model, which recovered facial feature points. Kotsia et al. [9] conducted a comprehensive study of OFER based on Gabor features and an analysis of human observers, which revealed that occlusion of the mouth area had a greater impact on FER performance than occlusion of the eye area on the JAFFE [10] and CK [11] datasets. Consequently, numerous subsequent research works have employed subregion-based feature extraction and fusion schemes for OFER [12,13], which first detect occluded regions and then remove them. Sub-region analysis methods [14,15] delineate regions based on the analysis of facial feature points, employing an attention mechanism to focus on crucial regions. However, the absence of discernible facial feature points due to occlusion may cause errors in the recognition process. The unoccluded image network-assisted method [16,17] employs two distinct networks: one is trained on unoccluded images and the other on occluded images. This method uses unoccluded images as privileged information to facilitate expression recognition in occluded scenarios. Nevertheless, this approach is not applicable in real-world scenarios, as it is challenging to ensure the consistent availability of unoccluded images in practice. Moreover, a number of techniques are specifically designed for the detection, removal, and reconstruction of occluded regions [18].
It has been argued that the explicit removal of occluded regions is impractical, given the difficulty of precisely detecting and reconstructing occlusions. Ref. [15] points out that humans are capable of understanding facial expressions when the face is occluded and can effectively utilize both individual regions and the whole face to perceive the semantics conveyed by an incomplete face. In light of these considerations, this paper presents MAFE, a novel approach to the OFER problem that utilizes global, fine-grained, and crucial regional features of the samples. MAFE comprises three main modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition. The multi-feature extraction module extracts both global and fine-grained features from the input image through multi-scale and fine-grained branches. Concurrently, a region division strategy based on facial feature points, together with an attention weight selection mechanism, enables the model to concentrate on crucial facial regions. The region detail feature fusion module combines crucial regional features with fine-grained features, ensuring that the fused features encompass crucial region information while retaining the fine-grained information of the original image. The consistent feature recognition module uses a con-feature loss to improve the consistency between the fused and global features, boosting expression recognition accuracy. The effectiveness of our method is confirmed by experiments on two public datasets and two occlusion test datasets, which show that it outperforms several state-of-the-art OFER methods. Our work offers a promising solution for enhancing the performance of OFER models. The contributions of this paper can be summarized as follows:
1. A straightforward multi-scale branch, PTIR-50, is devised to extract global information from an image. The multi-scale feature pyramid and transformer module in PTIR-50 capture multi-scale information and long-distance dependencies within the image, resulting in a more comprehensive and precise feature representation. Furthermore, a region division strategy based on facial feature points and an attention weight selection mechanism enables the model to focus on crucial regions while ignoring occluded regions, effectively addressing the OFER problem.
2. To obtain more discriminative expression features and prevent feature redundancy during feature fusion, a con-feature loss function is proposed to process the features extracted from the different branches. As a result, the global, regional, and detailed information present in the image is fully utilized for expression recognition.
3. The experimental results demonstrate that our OFER solution attains state-of-the-art results on the Occlusion-RAFDB and Occlusion-FERPlus datasets, with 89.42% and 86.94% accuracy. Furthermore, high accuracy was attained on the original RAF-DB and FERPlus datasets, with respective rates of 92.11% and 90.15%.
The remainder of this paper is organized as follows. Section 2 reviews the related work in FER and OFER. Section 3 introduces the proposed method MAFE in detail, including its overall architecture and core components. Section 4 presents the experimental setup, datasets, and evaluation metrics, followed by a thorough analysis of the results. Finally, Section 5 concludes the paper.
2. Related Work
In this section, we focus on regular FER work and OFER-related work. By examining the advances in conventional FER approaches alongside the challenges and solutions proposed for OFER, we aim to provide a comprehensive overview of the current landscape in the field.
2.1. FER
An FER system typically comprises three principal phases: face detection, feature extraction, and expression recognition. For face detection, a multitude of algorithms, such as MTCNN [19], are employed to identify and localize faces within complex environments. Furthermore, researchers have developed a number of techniques for capturing appearance features arising from facial expressions, including texture-based, geometry-based, and fusion features. The ongoing advancement of deep learning, coupled with the advent of extensive datasets such as AffectNet [3], RAF-DB [4], and EmotioNet [2], has prompted researchers to employ convolutional neural networks (CNNs) for feature extraction. Fasel [20] demonstrated that shallow CNNs are robust to facial pose and scale. Tang [21] and Kahou et al. [22] employed deep CNNs for feature extraction and achieved success in the FER2013 and EmotiW2013 challenges, respectively. A number of recent studies have employed high-accuracy facial feature point detectors to obtain facial feature points, which are then used as informative features for FER tasks and fed into a classifier for final expression recognition. Hassani et al. [23] proposed 3D InceptionResNet, which multiplies facial feature points with certain layers of image features for recognition. Khan et al. [24] first cropped small regions using facial feature points and then generated small regional features as inputs to the neural network. Shahzad et al. [25] proposed a zoning-based facial expression recognition (ZFER) approach that locates facial feature points through zoning, thereby facilitating the perception of deeper facial emotions. The practical issues associated with images in FER, including pose, direction, and input resolution, were addressed by pyramid with super resolution (PSR) [26] through a super-resolution pyramidal network architecture and an a priori distribution label smoothing (PDLS) loss function. Liu et al. [27] put forth a fine-tuned Swin Transformer with multi-weighted optimality search (FST-MWOS) to enhance expression recognition. He et al. [28] proposed a multilayer feature recognition algorithm based on a three-channel convolutional neural network (HFT), which fuses the features extracted from the three network channels to obtain classification results. Jung et al. [29] proposed an FER network based on two models: one network extracts appearance features from image sequences, while the other extracts geometric features from facial feature points; the two networks are combined to improve model performance.
2.2. OFER
In the real world, human facial regions are readily occluded by external factors such as sunglasses, hats, scarves, and masks. Occlusions can be divided into two types based on whether the occluding object is present in the original image. The first type is artificial occlusion, where objects are added to the image to mimic occlusion. The second type is real occlusion, which is already present in the original image. These occlusions can harm FER accuracy by increasing intra-class differences or inter-class similarity. There are numerous studies on the OFER problem, which can be divided into four principal categories: robust feature extraction under occlusion, sub-region analysis, non-occluded image network assistance, and occlusion recovery methods.
(1) Robust feature extraction under occlusion. The objective of this method is to extract features that are less affected by occlusion while maintaining the capacity to discriminate between different categories. Kotsia et al. [9] demonstrated how the introduction of artificial occlusion affects the performance of the FER system and discussed methods to address this issue. Wang et al. [15] proposed a region attention network (RAN) to capture crucial regions of the face, thereby enabling the solution of FER tasks with occlusion and pose changes; they also introduced RB-Loss to increase the importance of region attention weights.
(2) Sub-region analysis. This method is designed to exclude the occluded regions from the recognition process. The impact of occlusion on the recognition performance of the FER model is reduced by focusing only on the unoccluded facial regions. Li et al. [30] proposed a patch-gated convolutional neural network (PG-CNN) that focuses on different regions of the face image and automatically detects the occluded regions, so that the network can concentrate on the more informative unoccluded regions. Li et al. [14] designed a block-based attention network for OFER, which inputs a selection of 24 blocks, cropped from the regions of the eyes, nose, mouth, etc., into an attention network to compute the corresponding weights. These weights are then fused with the global features, allowing the model to focus on the information-rich unoccluded facial regions.
(3) Non-occluded image network assistance. This method employs unoccluded facial images as a reference to facilitate OFER. Pan et al. [16] addressed the OFER challenge by training two deep neural networks, one focusing on unoccluded facial images and the other specializing in occluded facial images. The unoccluded network provides guidance to the occluded network, facilitating more precise recognition of facial features in the occluded region. The fundamental principle of this approach is to utilize the comprehensive information present in the unoccluded facial image to enhance recognition in the occluded image. Xia et al. [17] employed distribution densities in the feature space to quantify the complexity of unoccluded data. The unoccluded data inform the classifier, enabling the identification and utilization of more meaningful and discriminative samples. By relying on the full information provided by the unoccluded data, the classifier can better understand and distinguish between different classes of features, improving overall recognition performance.
(4) Occlusion recovery methods. The objective of this method is to recover the occluded region of the image. Cotter [31] proposed a novel FER method based on Gabor multidirectional feature fusion and local Gabor binary pattern histogram sequence (LGBPHS) techniques to address partially occluded face images. To reconstruct the occluded face, Lu et al. [32] proposed a Wasserstein generative adversarial network-based method consisting of an autoencoder-based generator and a discriminator. This approach reconstructs the occlusion, thereby obtaining an information-rich appearance.
2.3. Summary
With the development of deep learning, facial expression recognition technology has achieved excellent results. However, it still faces the issue of insufficient robustness in complex environments, and its reliance on large annotated datasets limits its generalization ability. The OFER method reduces the impact of occlusions on recognition accuracy through robust feature extraction and sub-region analysis. However, many methods rely on manually designed features or region divisions, making it difficult to adapt to complex occlusion scenarios.
Our work builds upon the strengths of both [14,15]. Specifically, we adopt the RB-loss proposed in [15] to optimize attention weights, enabling the fusion of key regional features with fine-grained features to enrich feature representations. Additionally, we employ the con-feature loss to encourage mutual guidance between fused features and global features, helping the network learn more consistent representations and thereby improving recognition accuracy. Based on these insights, we propose a method that incorporates multi-scale and fine-grained branches to extract global, regional, and fine-grained features from input images. This design facilitates a more comprehensive and expressive feature representation.
3. Method
This section presents an overview of the MAFE process. Subsequently, each module within the MAFE framework and the loss functions employed are elucidated in comprehensive detail.
3.1. Overview
As mentioned above, our initial idea was to identify and exclude the occluded region of the face, thereby addressing the OFER issue: by removing the occluded blocks, the network can focus solely on the unoccluded regions of the face to recognize facial expressions. With this in mind, MAFE was designed to extract a range of features. This is achieved through a region division strategy based on facial feature points and an attention weight selection mechanism, which allows the model to focus on crucial facial regions and thereby reduces or eliminates the influence of occluded and irrelevant regions on the network. The proposed MAFE obtains fine-grained, global, and crucial region information from the image, which is then used for expression recognition in occluded images. The flow of MAFE is illustrated in Figure 2 and consists of three main modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition.
A face image is first input to PTIR-50 and SWIN-E, which extract the global and fine-grained features, respectively. At the same time, Dlib is used to detect the face and cut the image into several regions. These regions are then fed into PTIR-50 for feature extraction, resulting in the acquisition of regional features. Subsequently, crucial regional features are identified, based on the self-attention of each region, and then integrated with fine-grained features to enhance the extracted feature data. Finally, a con-feature loss function is employed to facilitate mutual guidance between the fused and global features, with the global features subsequently being fed into an MLP for expression recognition.
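For illustration, the following PyTorch sketch outlines this forward pass under simplifying assumptions; the submodules `ptir50`, `swin_e`, `se_weight`, and `se_fuse` are hypothetical placeholders for the actual branches, and their interfaces are not taken from the released implementation.

```python
import torch
import torch.nn as nn

class MAFESketch(nn.Module):
    """Illustrative sketch of the MAFE forward pass described above.
    `ptir50` returns (feature, attention) for an image, `swin_e` returns a
    fine-grained feature, `se_weight` maps an attention vector to a scalar
    weight, and `se_fuse` fuses two concatenated features; all of these
    interfaces are assumptions for illustration, not the released code."""

    def __init__(self, ptir50, swin_e, se_weight, se_fuse, feat_dim=512, num_classes=7):
        super().__init__()
        self.ptir50, self.swin_e = ptir50, swin_e
        self.se_weight, self.se_fuse = se_weight, se_fuse
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, num_classes))

    def forward(self, image, regions):
        f_o, a_o = self.ptir50(image)                  # global feature + self-attention
        f_d = self.swin_e(image)                       # fine-grained feature
        feats, attns = zip(*(self.ptir50(r) for r in regions))
        w_o = self.se_weight(a_o)                      # weight of the whole face (for RB-loss)
        w = torch.stack([self.se_weight(a) for a in attns], dim=1)   # (B, 4) region weights
        idx = w.argmax(dim=1)                          # crucial region per sample
        f_max = torch.stack(feats, dim=1)[torch.arange(image.size(0)), idx]
        f_fusion = self.se_fuse(torch.cat([f_max, f_d], dim=-1))     # region + detail fusion
        logits = self.mlp(f_o)                         # the global feature is classified by the MLP
        return logits, f_o, f_fusion, w, w_o           # f_o / f_fusion feed the con-feature loss
```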
3.2. Multi-Feature Extraction Module
Before introducing the multi-feature extraction module, it is necessary to describe the image-cropping operation. A facial image is input, and the Dlib algorithm is used to detect five facial feature points: the two eyes, the nose, and the left and right corners of the mouth. The image is then cropped according to these points, yielding images of four facial regions: left bottom, left top, right bottom, and right top (referred to as Lb, Lt, Rb, and Rt). These regions are fed into the feature extraction branch along with the original image. This cropping operation serves two purposes. First, it identifies the regions within the image that are more important for expression recognition. Second, it produces images that resemble those in which the face is occluded, which enriches the dataset and increases the robustness of the model to OFER samples. The cropping process is illustrated in Figure 3.
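A minimal sketch of such a cropping step is given below, assuming Dlib's standard detector and 68-point shape predictor; the paper uses five feature points, so the nose-tip landmark is taken here only as an assumed anchor for splitting the face into four blocks, and the exact crop geometry is illustrative.

```python
import dlib
import numpy as np

# Assumed setup: dlib's frontal face detector plus a shape predictor whose model
# file is a placeholder path; the quadrant geometry below is only a guess at the
# paper's region-division strategy, not the exact published crops.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_face_blocks(image: np.ndarray):
    """Split the first detected face into Lt, Rt, Lb, and Rb blocks around the nose tip."""
    faces = detector(image, 1)
    if len(faces) == 0:
        return None
    face = faces[0]
    landmarks = predictor(image, face)
    nose = landmarks.part(30)                 # nose tip in the 68-point layout
    cx, cy = nose.x, nose.y
    top, bottom = max(face.top(), 0), face.bottom()
    left, right = max(face.left(), 0), face.right()
    return {
        "Lt": image[top:cy, left:cx],
        "Rt": image[top:cy, cx:right],
        "Lb": image[cy:bottom, left:cx],
        "Rb": image[cy:bottom, cx:right],
    }
```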
The feature extraction branch is constituted of two distinct networks, namely PTIR-50 and SWIN-E. The objective of this configuration is to obtain a comprehensive representation of the image at the global level while simultaneously focusing on the fine-grained details, thereby enhancing the richness and stability of the extracted features.
3.2.1. PTIR-50
Deep learning has led to many high-precision facial feature detectors for use in face recognition, and researchers are now focusing on using facial feature points as geometric features in FER tasks. In OFER, however, when the facial feature points are occluded, image features become more important than the points themselves. PTIR-50 uses IResNet-50, a feature pyramid, and a transformer encoder to capture global features, as illustrated in Figure 2.
First, the image is fed into IR-50 to extract image features. A feature pyramid is then used to generate multi-scale feature representations. Three feature levels (large, medium, and small) are input into the transformer encoder to capture information at different scales. The transformer encoder outputs are then combined to form the input image's features and attention.
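The following PyTorch sketch illustrates this structure under stated assumptions: a backbone standing in for IR-50 returns three pyramid levels, each level is projected to a common dimension, flattened into tokens, and passed through a shared transformer encoder; the channel sizes and final pooling are guesses rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PTIR50Sketch(nn.Module):
    """Illustrative sketch of the PTIR-50 branch: a backbone standing in for IR-50
    yields three pyramid levels; each level is projected to a common dimension,
    flattened into tokens, and passed through a shared transformer encoder.
    Channel sizes, token handling, and the final pooling are assumptions."""

    def __init__(self, backbone, in_channels=(128, 256, 512), dim=512, heads=8, layers=2):
        super().__init__()
        self.backbone = backbone                      # returns [large, medium, small] feature maps
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x):
        levels = self.backbone(x)                      # three scales from the feature pyramid
        tokens = [p(f).flatten(2).transpose(1, 2)      # (B, H*W, dim) per level
                  for f, p in zip(levels, self.proj)]
        tokens = torch.cat(tokens, dim=1)              # combine all scales into one sequence
        encoded = self.encoder(tokens)                 # capture long-range dependencies
        feature = encoded.mean(dim=1)                  # pooled global image feature
        return feature, encoded                        # feature plus token-level attention cues
```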
3.2.2. SWIN-E
Image recognition is typically coarse-grained, i.e., the recognition of broad categories, such as distinguishing between cats and dogs. However, there are more specific tasks, called fine-grained image recognition, whose goal is to distinguish between subclasses within the same category, such as different cat breeds. This idea can be applied to FER. For example, Figure 4 shows samples from the JAFFE dataset, wherein expressions such as happiness, sadness, and fear are labeled as happiness-person, sadness-person, and fear-person. FER differs from other recognition tasks because expression features are more subtly embedded in facial features. To extract finer details, we draw on methods commonly used in fine-grained image recognition.
Researchers have used the Swin Transformer to develop network models for fine-grained image recognition [33,34,35]. This work has demonstrated that the Swin Transformer, with its shifted window attention mechanism, can effectively extract fine-grained features from images. The shifted window mechanism allows the model to capture detailed local features while maintaining the spatial relationship between different regions, making it particularly well suited for tasks like FER. Consequently, the fine-grained feature extraction branch in our model utilizes the Swin Transformer's encoder module.
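As an illustration only, a fine-grained branch of this kind could be wrapped around a pretrained Swin Transformer from the `timm` library as sketched below; the chosen variant and output projection are assumptions, not the configuration used in the paper.

```python
import timm
import torch.nn as nn

class SwinESketch(nn.Module):
    """Fine-grained branch sketch built on a pretrained Swin Transformer from timm.
    The variant name and the output projection dimension are assumptions, not the
    configuration used in the paper."""

    def __init__(self, variant="swin_tiny_patch4_window7_224", out_dim=512):
        super().__init__()
        self.encoder = timm.create_model(variant, pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.encoder.num_features, out_dim)

    def forward(self, x):
        pooled = self.encoder(x)      # pooled encoder output, shape (B, num_features)
        return self.proj(pooled)      # fine-grained feature F_d
```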
The input image is designated as X, and the cropped region images are designated as X_lb, X_lt, X_rb, and X_rt. PTIR-50 is designated as E_p, and SWIN-E is designated as E_s. The features F_i and self-attention maps A_i, which are acquired after PTIR-50 and SWIN-E for X and the region blocks X_lb, X_lt, X_rb, and X_rt, are designated as follows:

(F_i, A_i) = E_p(X_i; θ_p), i ∈ {o, lb, lt, rb, rt}, with X_o = X,
F_d = E_s(X; θ_s),

where θ_p is a parameter of PTIR-50 and θ_s is a parameter of SWIN-E.
The feature set F and the self-attention set A are obtained after sorting:

F = {F_o, F_lb, F_lt, F_rb, F_rt, F_d}, A = {A_o, A_lb, A_lt, A_rb, A_rt}.

As illustrated in Figure 5, a variety of features are extracted from the distinct region blocks.
3.3. Region Detail Feature Fusion Module
The self-attention set
A, obtained from the multi-feature extraction module, is subjected to processing by the SE module [
36], resulting in the generation of an attention weight
. This attention weight
A then functions as follows:
where
q denotes the parameters of the SE Module and
f denotes the
function.
The obtained weight set
W is:
The maximum attention weight
and the corresponding crucial region feature
are obtained by maximizing
W, The specific formula is as follows:
To improve the expressiveness of the features, the fine-grained feature
and the crucial regional features
are fused to obtain the fusion feature
:
where
denotes SE Module.
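The sketch below illustrates one possible realization of this weighting, selection, and fusion step; the SE-style weighting head and the concatenation-based fusion are assumptions consistent with the description above, not the exact published modules.

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """SE-style weighting head: squeeze a pooled attention vector and emit a scalar
    weight via a sigmoid (the activation choice is an assumption)."""

    def __init__(self, dim, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(),
                                nn.Linear(dim // reduction, 1), nn.Sigmoid())

    def forward(self, a):               # a: (B, dim) pooled self-attention of one block
        return self.fc(a).squeeze(-1)   # (B,) attention weight W_i

def select_and_fuse(region_feats, region_attn, f_detail, se_weight, se_fuse):
    """Pick the region block with the largest weight and fuse it with the fine-grained feature."""
    w = torch.stack([se_weight(a) for a in region_attn], dim=1)       # (B, 4) region weights
    w_max, idx = w.max(dim=1)                                         # crucial weight and its index
    f_max = torch.stack(region_feats, dim=1)[torch.arange(w.size(0)), idx]
    f_fusion = se_fuse(torch.cat([f_max, f_detail], dim=-1))          # fusion feature F_f
    return f_fusion, w_max, w
```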
3.4. Con-Feature Recognition Module
In the initial two modules, the fusion feature and image feature of input image X are obtained. In order for the network to learn more precise and consistent features, the feature selection process of the model is optimized using con-feature loss. Subsequently, the image feature is passed to the MLP to return the predicted label Y.
As illustrated in Figure 6, Figure 6a represents the global feature F_o, extracted by PTIR-50, and the fusion feature F_f without con-feature loss, whereas Figure 6b depicts F_o and F_f trained with con-feature loss. In Figure 6b, these two features inform one another in order to obtain more discriminative expression features, thereby improving the robustness and accuracy of the model.
We now provide a comprehensive description of the loss function that is employed for training purposes.
In the field of recognition tasks, the acquisition of more discriminative and precise image features is of paramount importance. This improves both the classification accuracy and the model’s robustness regarding different categories and details. Optimizing feature extraction reduces misclassification and improves real-world performance. Con-feature loss is a loss function that is used to combine the features extracted from two networks, with the objective of optimizing the model’s performance. This makes the extracted features more precise and discriminative. The concept of con-feature loss is based on the idea that ‘two networks do not make the same mistake’.
The proposed idea is that features identified by both networks in the expression recognition task will be more discriminative, improving model performance. Con-feature loss helps the model to capture more detailed and precise expression information through the mutual guidance of fused and global features. This approach enhances feature representation, increases sensitivity to subtle expression differences, and avoids feature redundancy from simple fusion. This significantly boosts expression recognition accuracy. The experiments and results supporting this hypothesis are shown in Section 4.
Con-feature loss is designed as follows:

L_con = (1 / (N · L)) Σ_{i=1}^{N} Σ_{l=1}^{L} (F_f^i(l) − F_o^i(l))^2,

where N denotes the number of samples and L denotes the length of the feature sequence.
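Under the assumption that the con-feature loss takes the form of a mean squared discrepancy between the fusion and global features, it can be sketched as follows.

```python
import torch.nn.functional as F

def con_feature_loss(f_fusion, f_global):
    """Con-feature loss sketch: a mean squared discrepancy between the fusion and
    global features, averaged over the N samples and the L feature elements
    (the exact form is an assumption consistent with the description above)."""
    return F.mse_loss(f_fusion, f_global)
```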
RB-loss is designed as follows:

L_RB = max(0, α − (W_max − W_o)),

where α is a margin hyperparameter, W_o is the attention weight of the original face image, and W_max denotes the maximum weight of all region blocks.
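A sketch of RB-loss in this margin-based form, following the description in [15], is shown below; the default margin value is illustrative.

```python
import torch

def rb_loss(region_weights, face_weight, alpha=0.02):
    """RB-loss sketch in the margin form used by RAN [15]: encourage the largest
    region weight to exceed the whole-face weight by a margin alpha (the default
    value here is illustrative)."""
    w_max = region_weights.max(dim=1).values                  # max weight over the four blocks
    return torch.clamp(alpha - (w_max - face_weight), min=0).mean()
```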
Furthermore, the prevalence of ambiguous images in FER datasets engenders uncertainty in the model with regard to specific samples, which, in turn, affects the model's assessment of that classification. To mitigate this effect, a label-smoothing loss L_ls is employed, whereby label smoothing affords these images a certain degree of error-tolerance probability, thereby preventing the model from placing undue credence in the sample labels. L_ls is designed as follows:

ŷ_ij = (1 − ε) y_ij + ε / C,
L_ls = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} ŷ_ij · log( exp(x_ij) / Σ_{k=1}^{C} exp(x_ik) ),

where N denotes the number of samples, C denotes the number of categories, i denotes the index of the sample, j and k denote the indices of the categories, x denotes the output of the model, x_ij denotes the output value of the i-th sample on the j-th category, y_ij denotes the label value of the i-th sample on the j-th category, and ε denotes the smoothing factor.
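The corresponding label-smoothing cross-entropy can be written as in the following sketch; the smoothing factor value is illustrative.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, epsilon=0.1):
    """Label-smoothing cross-entropy matching the formulation above; the smoothing
    factor value is illustrative."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smoothed = torch.full_like(log_probs, epsilon / num_classes)        # epsilon / C everywhere
    smoothed.scatter_(1, targets.unsqueeze(1),
                      1.0 - epsilon + epsilon / num_classes)            # (1 - eps) + eps / C at the label
    return -(smoothed * log_probs).sum(dim=1).mean()
```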
In training, a joint optimization strategy is employed wherein the label-smoothing classification loss L_ls is weighted and combined with the RB-loss L_RB and the con-feature loss L_con, with the aim of enhancing the efficacy of the model. Specifically, the training loss function is defined as follows:

L_total = λ_1 L_ls + λ_2 L_RB + λ_3 L_con,

where λ_1, λ_2, and λ_3 are the weight hyperparameters corresponding to each loss.
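Combining the loss sketches above, the joint objective can be assembled as follows; the mapping between each weight and each term is an assumption consistent with the formulation given here.

```python
def mafe_training_loss(logits, targets, f_global, f_fusion, region_weights, face_weight,
                       lambdas=(2.0, 1.0, 2.0)):
    """Joint objective assembled from the loss sketches above. The weights follow
    the values reported in Section 4.2, but the mapping of each weight to each
    term is an assumption."""
    lam1, lam2, lam3 = lambdas
    return (lam1 * label_smoothing_ce(logits, targets)
            + lam2 * rb_loss(region_weights, face_weight)
            + lam3 * con_feature_loss(f_fusion, f_global))
```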
4. Experiment
This section commences with an account of the dataset employed in the experiment, followed by a detailed exposition of the experimental setup. To demonstrate the efficacy of the aforementioned module, ablation experiments were designed and conducted. Furthermore, MAFE is benchmarked against state-of-the-art methods to illustrate its superiority.
4.1. Dataset
Two popular natural facial expression datasets, RAF-DB and FERPlus, were used to evaluate MAFE. These datasets cover facial images on different scales and in challenging situations. In addition, to evaluate the performance of MAFE on OFER, two datasets proposed in [15], Occlusion-FERPlus and Occlusion-RAFDB, are used for testing. These datasets are briefly described below.
4.1.1. RAF-DB
The RAF-DB (Real-world Affective Faces Database) is the first real-world facial expression dataset that includes both basic and composite expressions. The images in this dataset vary greatly in terms of age, gender, ethnicity, head pose, lighting, occlusions (like glasses or facial hair), and post-processing effects (like filters). Experiments were conducted using images from the dataset that display six basic expressions (happy, surprised, sad, angry, disgusted, and fearful) and neutral expressions.
4.1.2. FERPlus
FERPlus is an extension of the FER2013 dataset used in the ICML 2013 Challenge. FERPlus consists of large-scale expression images collected via the Google search engine, along with new labels provided by Microsoft for FER2013. It contains 28,709 training images, 3589 validation images, and 3589 test images with a size of 48 × 48 pixels. The main difference between FER2013 and FERPlus lies in the annotations: FER2013 was annotated by 1 annotator using 7 expression labels (neutral, happy, surprised, sad, angry, disgusted, and fearful), whereas FERPlus added the contempt label and was annotated by 10 annotators.
4.1.3. Occlusion-RAFDB
Occlusion-RAFDB is an occlusion test set composed by the authors of [15] from the test set of the original RAF-DB dataset. Images were manually labeled according to the type of occlusion (non-occlusion, wearing a mask, wearing glasses, left/right objects, top-face objects, and bottom-face objects), and images with at least one type of occlusion were selected, yielding a total of 735 images with the corresponding expression annotations and occlusion-type annotations. In this paper, only the expression annotations in the test set are used.
4.1.4. Occlusion-FERPlus
Similarly, Occlusion-FERPlus is an occlusion test set composed by the authors of [15] from the test set of the original FERPlus dataset. Images were manually labeled according to the type of occlusion (non-occlusion, wearing a mask, wearing glasses, left/right objects, upper-face objects, and lower-face objects), and images with at least one type of occlusion were selected, yielding a total of 605 images with the corresponding expression annotations and occlusion-type annotations. In this paper, only the expression annotations in the test set are used.
4.2. Implementation Details
We used PyTorch to conduct the experiments on two Nvidia Tesla V100 graphics cards. RAF-DB, FERPlus, Occlusion-RAFDB, and Occlusion-FERPlus were used as datasets during the experiments. Specifically, RAF-DB and FERPlus were used as the training and validation datasets, and tests were performed on Occlusion-RAFDB and Occlusion-FERPlus to evaluate the performance of MAFE on the OFER task. The input image and face region images were both resized to 224 × 224, and the weight hyperparameters in the training loss were set to λ_1 = 2, λ_2 = 1, and λ_3 = 2. We initialized the learning rate to 0.000025 and stopped training at the 70th epoch.
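For reference, the reported hyperparameters can be collected into a configuration sketch as below; the optimizer choice and batch size are assumptions not stated in the text.

```python
import torch

# Training-setup sketch matching the reported hyperparameters; the optimizer
# choice (Adam) and the batch size are assumptions not stated in the text.
config = {
    "input_size": (224, 224),
    "epochs": 70,
    "lr": 2.5e-5,
    "loss_weights": {"lambda_1": 2.0, "lambda_2": 1.0, "lambda_3": 2.0},
    "batch_size": 64,   # assumed
}

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    return torch.optim.Adam(model.parameters(), lr=config["lr"])
```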
4.3. Comparison with SOTA
This section presents a comparative analysis of the performance of MAFE with other methods on the Occlusion-FERPlus and Occlusion-RAFDB datasets.
Summary of Occlusion-RAFDB:
Table 1 presents the performance of methods proposed in the field of FER on the Occlusion-RAFDB dataset over the five-year period from 2020 to 2024. MAFE demonstrated the highest accuracy, at 89.42%.
Summary of Occlusion-FERPlus:
Table 1 also presents the performance of methods proposed in the field of FER on the Occlusion-FERPlus dataset over the same period. The results demonstrate that MAFE exhibits superior accuracy, at 86.94%.
The superior performance of MAFE on the Occlusion-RAFDB and Occlusion-FERPlus datasets validates its effectiveness in handling occlusion scenarios. Through its unique feature fusion strategy and optimized loss function, MAFE is able to significantly improve accuracy under occlusion and complex conditions.
Furthermore, MAFE achieved good performance on the original RAF-DB and FERPlus datasets.
Summary of RAF-DB:
Table 2 lists the performance of methods proposed in the field of FER on the RAF-DB dataset over the five years from 2020 to 2024. MAFE performed well, with an accuracy of 92.11%, which is 2.51% and 2.57% higher than the previous Latent-OFER and MPA methods, respectively.
Summary of FERPlus:
Table 2 also lists the performance of methods proposed in the field of FER on the FERPlus dataset over the five years from 2020 to 2024. MAFE performs well, with an accuracy of 90.15%, which is 0.73% and 1.02% higher than the previous SCAN-CCI and MPA methods, respectively.
The results show that MAFE consistently outperforms previous methods on both datasets, making it a more robust model for FER.
4.4. Ablation Experiment
To investigate the contribution of each component within the MAFE framework, ablation experiments were conducted on the RAF-DB and Occlusion-RAFDB datasets. The outcomes of these experiments are presented in Table 3.
Observing the table, it is evident that, compared to using only PTIR-50 or SWIN-E for expression recognition, MAFE achieves superior performance on both the original RAF-DB dataset and the Occlusion-RAFDB dataset, with accuracies of 92.11% and 89.42%, respectively.
In order to demonstrate the efficacy of con-feature loss, a comparative analysis was conducted between con-feature loss and feature fusion. The objective was to compare the performance of processing F_o and F_f using con-feature loss against the performance of directly fusing F_o and F_f, as illustrated in Table 4.
The results of the experiment demonstrate that using con-feature loss to process F_o and F_f, thereby enabling the model to learn more precise features, is significantly more accurate than inputting the fused F_o and F_f features into the MLP. Furthermore, the training process revealed that an epoch trained with con-feature loss required three minutes less than an epoch trained with feature fusion.
Figure 7 illustrates a visualization of the feature maps after con-feature loss and feature fusion. It can be observed that the visualization of the con-feature loss is concentrated in areas that are more indicative of an expression, while simple feature fusion also focuses on regions that are not related to expressions.
In addition, we conducted an in-depth study of the four cropped regions to explore the impact of the crucial region on the model's performance. The features F_lb, F_lt, F_rb, F_rt, and F_max (depicted in Figure 8) are each fused with the fine-grained feature F_d. The results of the experiments are shown in Table 5, where 'Max' indicates the selected crucial region.
4.5. Model Analysis
To evaluate the computational overhead of the proposed MAFE framework, we measured its parameter count and the number of floating-point operations (FLOPs) required during inference. With an input size of 224 × 224, the total number of parameters in the model is approximately 72 million, and the computation is about 50.59 GFLOPs.
Although MAFE incurs a higher computational cost compared to traditional CNN architectures, considering its significant performance improvement on both standard and occluded datasets, this computational cost remains within an acceptable range. In the future, we will further optimize its deployment efficiency through methods such as pruning and replacing with lightweight backbones.
In addition, we recorded the F1 scores on different datasets: 87.13 on the RAF-DB dataset and 86.02 on the FERPlus dataset, indicating that MAFE maintains good generalization ability when facing different data distributions and annotation styles.
We further evaluated the generalization ability of the model across different datasets.
As shown in Table 6, the model's accuracy drops significantly when tested across datasets, especially under occlusion, indicating its limited generalization capacity. This is primarily due to the substantial distributional differences between datasets, which may cause the model to overfit to dataset-specific features during training, resulting in degraded performance on unseen data. In future work, we plan to incorporate domain generalization techniques to enhance the model's robustness in complex environments.
To evaluate whether the performance improvements of the MAFE method over other approaches are statistically significant across different datasets, we computed the p-values for each set of results, as shown in Table 7. The results indicate that all p-values are below 0.1, suggesting marginal statistical significance and supporting the robustness and generalization capability of MAFE.
4.6. Visualization
In order to demonstrate the efficacy of MAFE on the OFER problem, some occluded samples were subjected to attention visualization. Figure 9 illustrates the attention visualization of several FER networks on OFER samples.
The first row of Figure 9 shows some OFER sample images, followed in the next two rows by the attention maps of FDRL [52] and ARM [53] on these samples, and then the attention of PTIR-50. The results show that these three networks focus significantly on the occluded part of the OFER samples, leading to poor recognition accuracy. In contrast, the last row of images, for MAFE, demonstrates that the occluded region receives minimal attention, highlighting MAFE's advantage in handling OFER samples.
We selected representative samples with mouth-occluded and eye-occluded regions from the dataset and visualized their feature activation maps in Figure 10. These findings demonstrate that the model exhibits reasonable robustness to occlusion and is not overly sensitive to the position or nature of the occluded region.
Specifically, MAFE introduces a simplified global feature extraction network, PTIR-50, which efficiently captures global features while maintaining a lightweight structure. The fine-grained branch, SWIN-E, extracts detailed features from the image, improving the model’s ability to recognize subtle differences in facial expressions. Additionally, the con-feature loss function facilitates mutual guidance between global and fused features, allowing MAFE to learn more discriminative features, thereby enhancing both recognition accuracy and robustness.
5. Conclusions
This paper proposes a Multi-Angle Feature Extraction (MAFE) network, which leverages a two-branch feature extraction structure and tailored loss functions to guide the network’s focus toward the most critical and fine-grained facial expression regions, thereby enhancing occlusion-aware facial expression recognition (OFER) performance. MAFE consists of three core modules: multi-feature extraction, region detail feature fusion, and consistent feature recognition.
In the multi-feature extraction module, we hypothesize that networks with different architectures and parameters will extract complementary features, even when addressing the same task. To exploit this diversity, we simultaneously utilize a multi-scale branch (PTIR-50) and a fine-grained branch (SWIN-E) to extract global, regional, and fine-grained features. The original image and its cropped versions—based on facial landmarks—are fed into PTIR-50 to extract global and regional features along with their self-attention weights. By applying the regional bias loss (RB-Loss), the attention weights of key regions are emphasized beyond those of the global context. Meanwhile, SWIN-E captures fine-grained details from the original image.
In the region detail fusion module, features from the key regions are combined with fine-grained details to form a unified representation, capturing both local and detailed information. Finally, in the consistent feature recognition module, a con-feature loss is introduced to enforce mutual guidance between global and fused features, promoting feature consistency and improving overall recognition performance.