1. Introduction
Face recognition (FR) has been one of the most important research topics for many years, and many researchers [1,2,3,4,5,6] have introduced robust methods to solve the FR problem. The development of FR methods appears to have almost reached its peak at the time of this writing: driven by convolutional neural networks (CNNs), current deep learning algorithms [1,2,3,4,5,6] achieve superior FR accuracy. FR-based systems are widely used across the world, including in airports, community gates, and healthcare; FR is also employed in authentication applications such as face-to-face attendance monitoring and mobile payment systems based on face profiles.
With the emergence of the COVID-19 pandemic, a viral infection caused by the severe acute respiratory syndrome coronavirus [7] spread globally and brought major challenges to daily human activities. To avoid COVID-19 infection, many people have worn and continue to wear masks. Mask wearing affects current FR application systems because the human face, the target of interest, is partially covered. In real-world FR applications, face occlusion, particularly masked face occlusion, significantly degrades existing FR performance and decreases re-identification accuracy [8].
Modern deep learning-based models are advanced enough to extract facial features and learn the important key features such as face edges, mouth, nose, and eyes [9]. However, a facial mask occludes most of these key features, complicating the feature extraction process. Since traditional FR methods were designed specifically to work with all face information available, a mask on a face causes the models to lose about 50% of the useful information [10]. The facial mask blocks important features such as the mouth and nose, thus obstructing the facial feature structure, as reported in [11]. This issue has recently emerged as a particularly serious barrier in the FR field. Therefore, to solve this problem, novel methods must be invented, or existing algorithms must be modified substantially.
Initially, researchers focused on facial mask detection and introduced many robust solutions [12,13,14] to detect facial masks. Many scholars have recently presented various methods that address masked face recognition (MFR) problems using deep learning techniques [9,11,15,16,17,18,19,20]. Alzu’bi et al. summarized the various MFR methods that have recently been proposed [21]. Further, because of the insufficient availability of masked face images for model training and testing, studies have proposed several masked face datasets [11,22,23] and data augmentation tools for generating simulated masked images [9,11].
MFR represents a special case in the occluded FR domain. In contrast to general occluded FR, MFR involves three major challenges: the key features of the face, such as the mouth, nose, and chin, are occluded; most FR methods were designed specifically to work with all facial features available; and there is currently no publicly available large-scale masked face training and testing benchmark dataset. Moreover, most existing methods have been developed on the specific masked face datasets used in their development, so a method may perform well on a limited dataset while performing badly on others. Further, the average accuracy of existing MFR methods is only 89.5% [21]. With this background, it is necessary to develop a method that can consistently achieve good results on all datasets.
Recently, methods based on attention mechanisms have been widely used to solve various vision tasks such as image classification [24], age-invariant face recognition [25,26], and, specifically, masked face recognition [11,19]. It should also be noted that existing attention-based MFR methods demonstrate high accuracy compared to other methods. Hence, this paper proposes a method for solving the MFR problem by verifying individuals with masked faces using an attention module and the ArcFace angular margin loss. The method uses a refined ResNet-50 [5] network as a backbone and integrates the attention module into the backbone network. Through the proposed method, the model can obtain highly discriminative features and improve facial feature representations, which overcomes the recognition accuracy problem of MFR. However, recognizing the face of a person wearing a facial mask together with a hat or glasses, or at different face angles, remains a limitation of this approach.
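To make the loss concrete, the additive angular margin of ArcFace can be sketched as follows. This is a minimal NumPy illustration of the loss idea only, not the paper's implementation; the scale s = 64 and margin m = 0.5 follow common ArcFace defaults, and the toy embeddings, class weights, and labels are invented for illustration.

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """ArcFace-style logits: L2-normalize features and class weights so the
    dot product equals cos(theta), add the angular margin m to the
    target-class angle, then scale by s."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                 # (N, C) cosine similarities
    theta = np.arccos(cos)
    logits = np.copy(cos)
    rows = np.arange(len(labels))
    logits[rows, labels] = np.cos(theta[rows, labels] + m)  # margin on true class
    return s * logits

def softmax_cross_entropy(logits, labels):
    """Standard cross-entropy over the margin-adjusted, scaled logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

# Toy example: 2 samples, 3 identity classes, 4-d embeddings (made-up numbers)
emb = np.array([[1.0, 0.2, 0.0, 0.0],
                [0.0, 1.0, 0.3, 0.0]])
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.1, 0.9, 0.0, 0.0]])
y = np.array([0, 2])
loss_with_margin = softmax_cross_entropy(arcface_logits(emb, W, y), y)
loss_no_margin = softmax_cross_entropy(arcface_logits(emb, W, y, m=0.0), y)
```

The margin shrinks the true-class logit, so the margin loss is larger than the plain softmax loss; training against this harder objective is what compacts intra-class features and separates inter-class features on the hypersphere.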
The main contributions of this work are four-fold:
A new MFR method using a deep learning network architecture based on the attention module and angular margin loss ArcFace is proposed to focus on the informative parts not occluded by the facial mask (i.e., the regions around the eyes).
The convolutional block attention module (CBAM) is integrated with a refined ResNet-50 network architecture for feature extraction without additional computational cost.
New simulated masked face images, generated from regular face recognition datasets using a data augmentation tool, are proposed for model training and evaluation. The datasets generated in this research are available through the website
https://github.com/MaskedFaceDataSet/SimulatedMaskedFaceDataset (accessed on 6 May 2022).
The experimental results on simulated and real masked face datasets demonstrate that the proposed method outperforms other state-of-the-art methods for all datasets.
2. Related Works
With the success of FR research, researchers have continued to focus on the challenges posed by occluded face recognition [17,27,28]. The recognition of an occluded face is challenging because the human face can be covered by visual obstacles of any size or shape appearing anywhere [29]. With the COVID-19 pandemic, MFR has become one of the greatest challenges in the FR domain. MFR is a specific facial occlusion problem since the essential parts of the face, such as the mouth, nose, or chin, are occluded. The objective of MFR research is to identify or verify the specific identity of a person when they are wearing a facial mask. Some of the existing methods that researchers have proposed to solve occluded face recognition and MFR problems are described in this section.
Song et al. [17] presented a technique to address partial occlusion by discovering and discarding corrupted face feature elements for recognition. This study decomposed the face recognition challenge under random partial occlusions into three stages. First, they developed a pairwise differential Siamese network (PDSN) to capture the differences in the face features between occluded and non-occluded face pairs. Second, they built a mask dictionary from the features obtained in the previous stage to compose the feature discarding mask (FDM). Third, the FDM combined from random partial occlusions in the dictionary is multiplied by the original feature to eliminate the effect of partial occlusions on recognition. This approach aims to remove the occluded areas from deep features. However, it is difficult in practice to obtain the matched occluded and non-occluded image pairs this approach requires.
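The core FDM idea, discarding corrupted feature elements via elementwise multiplication with a binary mask, can be sketched as follows. This is a hypothetical NumPy illustration with made-up 8-dimensional features, not the PDSN implementation.

```python
import numpy as np

def apply_fdm(feature, discard_mask):
    """Zero out feature elements flagged as corrupted by occlusion.
    discard_mask: 1 = keep the element, 0 = discard it."""
    return feature * discard_mask

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 8-d features: a clean gallery feature and an occluded probe whose
# last 4 elements were corrupted by the occlusion (values invented).
gallery = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.3, 0.5])
probe = gallery.copy()
probe[4:] = np.array([-0.9, 0.7, -0.5, 0.9])            # corrupted elements
fdm = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)   # discard corrupted dims

sim_raw = cosine_similarity(gallery, probe)
sim_masked = cosine_similarity(apply_fdm(gallery, fdm), apply_fdm(probe, fdm))
```

After masking, the match score is computed only over the uncorrupted dimensions, so the occlusion no longer drags the similarity down.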
Various studies have adopted restoration-based methods [30,31,32,33] to restore the missing part of the face image and reconstruct a new face image from the training dataset. Since generative adversarial nets (GANs) were first introduced [34], many researchers have used GAN methods to address facial occlusion problems. Yeh et al. [35] proposed a method that generates content for the corrupted pixels and reconstructs the missing region. Din et al. [18] proposed a model that can detect and remove the mask to provide complete, unobstructed facial images: the model first detects the mask region and produces a binary segmentation map, then uses two GAN-based discriminators to learn the global structure and the missing part of the facial image. However, these approaches did not evaluate the recognition performance of their models. In contrast to the previous GAN-based methods, Li et al. [36] presented an algorithm framework that consists of de-occlusion and distillation modules. The de-occlusion module uses a GAN to perform masked face completion, which recovers the occluded features beneath the mask and eliminates appearance uncertainty. The distillation module uses a pre-trained model to perform face classification. On the simulated LFW dataset, their highest recognition accuracy was 95.44%.
MFR became an urgent research topic to consider during the COVID-19 pandemic. Mandal et al. [15] proposed a new framework to handle the MFR problem using a deep network based on ResNet-50 [37]. The authors trained the network on the small Real-world Masked Face Recognition Dataset (RMFRD) described in [22]. However, this method did not yield adequate results because the underlying network works only with non-occluded faces. Anwar and Raychowdhury [9] presented a similar strategy using FaceNet [1], a deep network-based face recognition system, trained with VGGFace2-mini-SM1, their own proposed simulated masked face dataset. This method produced better results than the first since the network was trained from scratch on a large dataset.
Meanwhile, Huang et al. [38] used ArcFace [5], a deep network-based face recognition system, trained with their simulated dataset, which was generated with random occlusions (mask or glasses). The network was thereby able to learn more features than with a masked-only dataset; however, their performance greatly decreased when tested on the masked face dataset alone. Hariri [16] proposed a new method based on occlusion removal and deep learning-based features to discard the occluded region. They used a pre-trained network to handle the MFR problem, applying a cropping filter to remove the part covered by a facial mask and thereby extract features only from the non-masked face region. However, this occlusion removal technique cannot guarantee a clean elimination of the masked face parts, since facial masks are not all placed in the same position on the face. Moreover, their recognition performance on both simulated and real masked face images still needs to be improved.
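The cropping idea can be illustrated with a minimal sketch: assuming an aligned face image in which the mask covers roughly the lower half, only the upper rows (eyes and forehead) are kept before feature extraction. The 112x112 input size and the 0.5 keep ratio are illustrative assumptions, not the exact filter used in [16].

```python
import numpy as np

def crop_unmasked_region(face, keep_ratio=0.5):
    """face: (H, W, C) aligned face image; keep only the top rows that are
    assumed to be unoccluded by the mask. keep_ratio is an assumption here."""
    h = face.shape[0]
    return face[: int(h * keep_ratio)]

# Illustrative 112x112 aligned face (a common input size for FR networks)
face = np.zeros((112, 112, 3), dtype=np.uint8)
upper = crop_unmasked_region(face)   # region around the eyes and forehead
```

A fixed crop ratio is exactly why clean elimination cannot be guaranteed: a mask worn higher or lower than assumed leaks into, or cuts away from, the retained region.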
Recent works have attempted to deal with MFR using attention mechanisms. Li et al. [20] proposed a new strategy integrating a cropping-based and an attention-based approach with the CBAM [26]. The cropping-based process removes the masked region from face images; they examined several cropping proportions of the input image to find the one that achieved the best recognition accuracy. In the attention-based process, the masked face features and the features around the eyes were given low and high weights, respectively. The authors reported that their approach achieved 92.61% MFR accuracy. In another study, Deng et al. [11] proposed an algorithm using a cosine loss (MFCosface) to handle the MFR problem; their method improved masked face recognition accuracy compared to the first attention-based method. They also designed an Attention-Inception module that combined the CBAM with Inception-ResNet to help the model pay greater attention to the region not covered by the mask. This technique achieved a slight improvement in the verification task.
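For reference, the CBAM operations discussed above can be sketched in NumPy. This is a simplified illustration, not the modules used in [11,20]: the shared-MLP channel attention follows CBAM, but the 7x7 convolution of CBAM's spatial branch is replaced here by a plain average of the pooled maps to keep the sketch dependency-free, and all weights and the toy feature map are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM channel attention (simplified). feat: (C, H, W). A shared
    two-layer MLP (w1: (C/r, C), w2: (C, C/r)) is applied to both the
    average- and max-pooled channel descriptors, summed, then squashed."""
    avg = feat.mean(axis=(1, 2))          # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))            # (C,) max-pooled descriptor
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return feat * att[:, None, None]      # reweight channels

def spatial_attention(feat):
    """Simplified spatial attention: average/max over channels produce two
    (H, W) maps; CBAM fuses them with a 7x7 conv, replaced here by a mean."""
    avg = feat.mean(axis=0)
    mx = feat.max(axis=0)
    att = sigmoid((avg + mx) / 2.0)
    return feat * att[None, :, :]         # reweight spatial positions

# Toy feature map: 8 channels, 4x4 spatial, reduction ratio r = 4
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4, 4))
w1 = rng.normal(size=(2, 8)) * 0.1
w2 = rng.normal(size=(8, 2)) * 0.1
out = spatial_attention(channel_attention(x, w1, w2))
```

Because both attention maps lie in (0, 1), the module only rescales the feature map; this is why it can be dropped into a ResNet block without changing tensor shapes, and why an MFR model can learn to down-weight the masked region and up-weight the eye region.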
The existing works described above inspire the present work. Observing the strength of the attention module, which plays an important role in MFR, this study extends these works by proposing a novel network architecture that integrates the attention module into the refined ResNet-50 network implemented in the ArcFace repository.
6. Conclusions
This paper presents a new method to solve masked face recognition problems using deep learning technology. Traditional deep learning-based FR methods can address normal face recognition problems and achieve high performance; however, their performance drops dramatically when a face is covered with a mask. Through the analysis of masked face images, we found that some of the key facial features are covered by the facial mask, which prevents FR methods from recognizing the face properly. To tackle this problem, this study introduced a new network architecture based on an attention mechanism that focuses on the most informative part of a masked face image, the region around the eyes, and obtains more discriminative feature information. Moreover, the widely used ArcFace loss function is implemented in the proposed network to optimize the feature embedding by increasing the similarity of intra-class samples and the diversity of inter-class samples. To handle the problem of insufficient masked face datasets, new simulated masked face images were generated using data augmentation for model training and evaluation. Through the various experiments, the following points summarize the findings of this paper:
The attention module can focus on the non-occluded part of the masked face and significantly improve the recognition performance.
The newly generated masked face dataset can effectively help the model training and evaluation.
The results show that the proposed method provides outstanding performance and a better recognition rate on both generated masked face and real masked image datasets compared to the state-of-the-art methods.
We hope this research becomes a useful solution to the masked face recognition problem. In future work, we will consider improving the method to handle masked face recognition with different postures, expressions, illumination, and the presence of a hat.