Effective Attention-Based Mechanism for Masked Face Recognition

Abstract: Research on facial recognition has recently been flourishing, which has led to the introduction of many robust methods. However, since the worldwide outbreak of COVID-19, people have had to regularly wear facial masks, thus making existing face recognition methods less reliable. Although methods for normal (unmasked) face recognition are mature, masked face recognition (MFR), which refers to recognizing the identity of an individual who is wearing a facial mask, remains the most challenging topic in this area. To overcome the difficulties involved in MFR, a novel deep learning method based on the convolutional block attention module (CBAM) and the angular margin ArcFace loss is proposed. In the method, CBAM is integrated with convolutional neural networks (CNNs) to extract the feature maps of the input image, particularly of the region around the eyes. Meanwhile, ArcFace is used as the training loss function to optimize the feature embedding and enhance the discriminative features for MFR. Because of the insufficient availability of masked face images for model training, this study used a data augmentation method to generate masked face images from a common face recognition dataset. The proposed method was evaluated using the masked versions of the well-known LFW, AgeDB-30, and CFP-FP verification datasets and the real masked image dataset MFR2. A variety of experiments confirmed that the proposed method offers improvements for MFR compared to the current state-of-the-art methods.


Introduction
Face recognition (FR) has represented one of the most important research topics for many years. Many researchers [1][2][3][4][5][6] have introduced robust methods to solve the FR problem. The trend of developing methods for FR appears to have almost reached its peak at the time of this writing. Influenced by the convolutional neural networks (CNNs), the current algorithms using deep learning methods [1][2][3][4][5][6] have achieved superior accuracy for FR. Systems based on FR are widely used in many areas across the world including airports, community gates, and healthcare; FR is also employed in some authentication applications, such as face-to-face attendance monitoring and mobile payment systems based on face profiles.
With the emergence of the COVID-19 pandemic, a viral infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [7] has spread globally and brought many major challenges to daily human activities. To avoid COVID-19 infection, many people have worn and continue to wear masks. Mask wearing affects current FR application systems because the human face, the target of interest, is partially covered. In real-world FR applications, face occlusion, particularly masked face occlusion, significantly affects existing FR performance and decreases re-identification accuracy [8].
Modern deep learning-based models are advanced enough to extract face features and learn the important key features such as face edges, mouth, nose, and eyes [9]. However,

• A new MFR method using a deep learning network architecture based on the attention module and the angular margin loss ArcFace is proposed to focus on the informative parts not occluded by the facial mask (i.e., the regions around the eyes).

• The CBAM attention module is integrated with a refined ResNet-50 network architecture for feature extraction without additional computational cost.

• New simulated masked face images are generated from regular face recognition datasets using a data augmentation tool for model training and evaluation. Datasets generated in this research are available through the website https://github.com/MaskedFaceDataSet/SimulatedMaskedFaceDataset (accessed on 6 May 2022).

• The experimental results on simulated and real masked face datasets demonstrate that the proposed method outperforms other state-of-the-art methods for all datasets.

Related Works
With the success of FR research, researchers have continued to focus on the challenges posed by occluded face recognition [17,27,28]. The recognition of an occluded face is challenging because the human face can be covered by visual obstacles of any size or shape appearing anywhere [29]. With the COVID-19 pandemic, MFR has become one of the greatest challenges in the FR domain. MFR is a specific facial occlusion problem since the essential parts of the face, such as the mouth, nose, or chin, are occluded. The objective of research on MFR is to identify or verify the specific identity of a person when people are wearing a facial mask. Some of the existing methods that researchers have proposed to solve occluded face recognition and MFR problems are described in this section.
Song et al. [17] presented a technique to address partial occlusion by discovering and disposing of corrupted face feature elements for recognition. This study decomposed the face recognition challenge under random partial occlusions in three stages: First, they developed a pairwise differential Siamese Network (PDSN) to capture the differentiation in the face features between the occluded and non-occluded face pairs. Second, they built a masked dictionary for masked features that they obtained from the previous stage to composite the feature discarding mask (FDM). Third, a combination of the FDM of random partial occlusions from the dictionary is multiplied by the original feature to eliminate the effect of partial occlusions from recognition. This approach aims to remove the occluded areas from depth features. However, it is difficult in practice to meet the requirements of the matched image.
Various studies have adopted restoration-based methods [30][31][32][33] to restore the missing part of the face image and reconstruct a new face image from the training dataset. Since generative adversarial networks (GANs) were first introduced [34], many researchers have used GAN methods to address facial occlusion problems. Yeh et al. [35] proposed a method that involved generating the corrupted pixel(s) and reconstructing the missing content. Din et al. [18] proposed a model that can detect and remove the mask to provide complete, unobstructed facial images. First, the model detects the mask region and outputs it as a binary segmentation. Then, it uses two GAN-based discriminators to learn the global structure and the missing part of the facial image. However, these approaches have not evaluated the recognition performance of their models. In contrast to the previous GAN-based methods, Li et al. [36] presented an algorithm framework that consists of de-occlusion and distillation modules. The de-occlusion module uses a GAN to perform masked face completion, which recovers the occluded features beneath the mask and eliminates the appearance uncertainty. The distillation module uses a pre-trained model to perform face classification. On the simulated LFW dataset, their highest recognition accuracy is 95.44%.
MFR became an urgent research topic to consider during the COVID-19 epidemic. Mandal et al. [15] proposed a new framework with which to handle the MFR problem that used a deep network based on ResNet-50 [37]. The authors trained the network using the small Real-world Masked Face Recognition Dataset (RMFRD) described in [22]. However, this method did not yield adequate results because the network used only works with non-occlusion faces. Anwar and Raychowdhury [9] presented a similar strategy using FaceNet [1], a deep network-based face recognition system, to train with their dataset VGGFace2-mini-SM1. They used their own proposed simulated masked face dataset to train the network. This method produced better results than the first method since they trained with a large dataset from scratch.
Meanwhile, Huang et al. [38] used ArcFace [5], a deep network-based face recognition system, to train with their simulated dataset. Their simulated dataset was generated with random occlusion (mask or glasses). In that study, the network was able to learn more features than it could from a masked-only dataset. However, their performance results greatly decreased when tested with only the masked face dataset. Walid Hariri [16] proposed a new method based on occlusion removal and deep learning-based features to discard the occluded region. The author used a pre-trained network to handle the MFR problem and applied a cropping filter to remove the occluded part covered by a facial mask, thereby extracting features only from the non-masked face region. The occlusion removal technique can discard the masked face areas from each image. However, it cannot guarantee a clean elimination of the masked face parts since facial masks are not all placed in the same position on the face. Moreover, the recognition performance with both simulated masked face and real masked face images still needs to be improved.
Recent works have attempted to deal with MFR using attention mechanisms. Li et al. [20] proposed a new strategy by integrating a cropping-based and an attention-based approach with the CBAM [26]. The cropping-based process removes the masked face region from face images. They examined several cropping proportions of the input image to find the one that achieved the best recognition accuracy. In the attention-based process, the masked face features and the features around the eyes were given low and high weights, respectively. The authors reported that their approach achieved 92.61% MFR accuracy. In another study, Deng et al. [11] proposed an algorithm using a cosine loss (MFCosface) to handle MFR. Their method improved the accuracy of masked face recognition compared to the first attention-based method. They also designed an Attention-Inception module that combined the CBAM with Inception-ResNet to help the model pay greater attention to the region not covered by the mask. This technique achieved a slight improvement in the verification task.
The existing works described above inspire our present work. Having observed the strength of the attention module, which plays an important role in MFR, this study extends these works by proposing a novel network architecture that integrates the attention module into the refined ResNet-50 network implemented in the ArcFace repository.

Feature Extraction Network
Feature extraction, which is a crucial process in masked face recognition, aims to extract the key face components such as the eyes, nose, mouth, and texture from a face image. However, this process becomes more complicated when a mask covers the face in question. Therefore, the selection of the feature extraction network is a critical decision. The refined ResNet-50 CNN architecture implemented in the ArcFace work is selected as the network backbone to extract the face features. This study follows [5] in changing the number of layer blocks in the third stage, so that the original ResNet-50 [37] configuration {3, 4, 6, 3} becomes {3, 4, 14, 3}. Further, the improved residual unit architecture is also applied to the network; it has a BN-Conv-BN-PReLU-Conv-BN structure and sets the stride to two for the second convolutional layer instead of the first one (as shown in Figure 1). After the last layer, a batch normalization, dropout, fully connected layer, and batch normalization (BN-Dropout-FC-BN) structure is used to obtain the final 512-D face embedding feature.

Convolutional Block Attention Module (CBAM)
The proposed method adopts the CBAM presented by Woo et al. [24] in the network model. The CBAM consists of a channel attention module and a spatial attention module, which are arranged sequentially, as shown in Figure 2. It is a lightweight module that can be integrated smoothly with any CNN architecture. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$ of the convolutional layer, where C, H, and W are the channel size, height, and width, respectively, let $M_{channel} \in \mathbb{R}^{C \times 1 \times 1}$ denote a 1D channel attention map and $M_{spatial} \in \mathbb{R}^{1 \times H \times W}$ denote a 2D spatial attention map. The overall attention process can then be summarized as shown in Equations (1) and (2).
$$F' = M_{channel}(F) \otimes F \quad (1)$$
$$F'' = M_{spatial}(F') \otimes F' \quad (2)$$
where $\otimes$ denotes element-wise multiplication, $F'$ is the channel-refined feature map, and $F''$ is the final output of the feature maps, or refined feature maps.

Channel Attention Module
The channel attention module focuses on the major features of the input image. This module applies both average-pooling and max-pooling operations to the input feature map to generate two vectors that aggregate its spatial information, $F^{ch}_{avg}$ and $F^{ch}_{max}$, which denote the average-pooled and max-pooled features, respectively. Both vectors are forwarded through a shared multi-layer perceptron (MLP) with filter kernel size 1 × 1. Next, the two output feature vectors of the shared network are merged using element-wise summation. The merged vector is then passed to the sigmoid function σ to generate the channel attention map $M_{channel} \in \mathbb{R}^{C \times 1 \times 1}$, which holds the channel weights, as shown in Equation (3). The channel attention module process is depicted in Figure 3.
$$M_{channel}(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) = \sigma\big(MLP(F^{ch}_{avg}) + MLP(F^{ch}_{max})\big) \quad (3)$$
where σ is the sigmoid function and the MLP uses the ReLU activation function.
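The channel attention computation of Equation (3) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy weights `W0` and `W1`, and the hidden width (a reduction of the channel count, which the text does not specify) are assumptions made for illustration, and the 1 × 1 convolutions of the shared MLP are written as plain matrix products on the pooled C-dimensional vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def shared_mlp(vector, w0, w1):
    """Two-layer MLP with ReLU in between; the same weights serve both pooled vectors."""
    hidden = [max(0.0, h) for h in matvec(w0, vector)]
    return matvec(w1, hidden)


def channel_attention(feature_map, w0, w1):
    """Return one weight in (0, 1) per channel of a C x H x W nested list."""
    avg_pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    max_pooled = [max(map(max, ch)) for ch in feature_map]
    merged = [a + b for a, b in zip(shared_mlp(avg_pooled, w0, w1),
                                    shared_mlp(max_pooled, w0, w1))]
    return [sigmoid(x) for x in merged]


def apply_channel_attention(feature_map, weights):
    """Scale every channel by its attention weight."""
    return [[[w * x for x in row] for row in ch] for ch, w in zip(feature_map, weights)]


# Toy input: C = 2 channels of size 2 x 2 and a hidden width of 1 (hypothetical values).
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
W0 = [[0.5, 0.5]]        # shape 1 x 2
W1 = [[1.0], [-1.0]]     # shape 2 x 1
weights = channel_attention(F, W0, W1)
F_refined = apply_channel_attention(F, weights)
```

With these toy weights the first channel is kept almost unchanged and the second is almost suppressed, which is the re-weighting behaviour the module is meant to learn.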

Spatial Attention Module
The spatial attention module focuses on the informative regions of the input feature maps. Similar to the channel attention module, the spatial attention module adopts average-pooling and max-pooling operations, here applied along the channel axis, to obtain two 2D maps, $F^{sp}_{avg}$ and $F^{sp}_{max}$, which denote the average-pooled and max-pooled features, respectively. These are then concatenated and passed through a convolution layer with a filter kernel size of 7 × 7 to obtain a 2D spatial attention map $M_{spatial} \in \mathbb{R}^{1 \times H \times W}$. The spatial attention module process is illustrated in Figure 4 and Equations (4) and (5).
$$M_{spatial}(F) = \sigma\big(f^{7 \times 7}([AvgPool(F); MaxPool(F)])\big) \quad (4)$$
$$M_{spatial}(F) = \sigma\big(f^{7 \times 7}([F^{sp}_{avg}; F^{sp}_{max}])\big) \quad (5)$$
where σ is the sigmoid function and $f^{7 \times 7}$ denotes a convolution operation with a filter kernel size of 7 × 7.

Figure 5 shows the overall proposed network architecture diagram. As described in Section 3.1, this work uses the refined ResNet-50 architecture as a backbone to extract face features. The proposed network model uses non-masked and masked face images of size 3 × 112 × 112 as the input. The network backbone architecture consists of four main convolutional stages, each built from stacked blocks; the respective numbers of blocks stacked in the first, second, third, and fourth stages are {3, 4, 14, 3}. The sizes of the feature maps in the first, second, third, and fourth stages are 64 × 56 × 56, 128 × 28 × 28, 256 × 14 × 14, and 512 × 7 × 7, respectively, with a kernel size of 3 × 3. CBAM is applied to the output of each convolutional block of the backbone network to focus more effectively on the object of interest. $F$ represents the feature map produced by the convolution operations of the block. Then, the channel and spatial attention modules are applied sequentially to produce the refined feature maps $F''$. Finally, the refined output features $F''$ are summed with the feature maps passed on from the previous block. The network repeats the same operation until the last convolutional block, after which the batch normalization (BN), dropout, and fully connected layers are applied to obtain the 512-D face embedding features. ArcFace loss adds an angular margin m to the target (ground truth) angle and multiplies the result by the feature scale s. The softmax function is then applied, and its output contributes to the cross-entropy loss. This technique helps optimize the embedding feature to obtain highly discriminative features for MFR.
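The spatial attention step referred to in Equations (4) and (5) can be sketched in the same dependency-free style. This is an illustrative sketch rather than the authors' code: the function names, the zero padding that keeps the map at H × W, and the hand-made `center_kernel` are assumptions, and a trained model would learn the 2 × 7 × 7 kernel instead.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feature_map, kernel, bias=0.0):
    """Return an H x W attention map for a C x H x W nested list.

    kernel has shape 2 x k x k (k odd): one k x k filter for the
    average-pooled map and one for the max-pooled map.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    k = len(kernel[0])
    pad = k // 2  # zero padding keeps the output at H x W

    # Pool along the channel axis to get two H x W maps.
    avg_map = [[sum(feature_map[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    max_map = [[max(feature_map[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    pooled = [avg_map, max_map]  # the concatenated 2 x H x W input

    attention = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = bias
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < height and 0 <= x < width:
                            acc += kernel[p][di][dj] * pooled[p][y][x]
            attention[i][j] = sigmoid(acc)
    return attention


def apply_spatial_attention(feature_map, attention):
    """Scale every spatial position of every channel by its attention weight."""
    return [[[attention[i][j] * ch[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in feature_map]


def center_kernel(k=7):
    """Hypothetical 2 x k x k kernel that only passes the centre of the average map."""
    kernel = [[[0.0] * k for _ in range(k)] for _ in range(2)]
    kernel[0][k // 2][k // 2] = 1.0
    return kernel


# Toy input: two constant 3 x 3 channels (hypothetical values).
F = [[[2.0] * 3 for _ in range(3)], [[4.0] * 3 for _ in range(3)]]
M = spatial_attention(F, center_kernel())
F_refined = apply_spatial_attention(F, M)
```

In the full module this map would be computed from the channel-refined features and multiplied back onto them, so that positions with high attention (for example, around the eyes) contribute more to the block output.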

Loss Function
The loss function helps optimize the model and stabilize the training process. This method uses Additive Angular Margin Loss (ArcFace) [5], a margin loss function constructed by modifying the softmax loss function, which improves the discriminative power of the model. Furthermore, ArcFace optimizes the feature embedding so that the distance between samples of the same class is as small as possible and the distance between samples of different classes is as large as possible. ArcFace can be defined as follows:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}} \quad (6)$$
where $\theta_j$ denotes the angle between the weight and the deep feature, $s$ denotes the feature scale, $m$ denotes the angular margin penalty, and $N$ and $n$ denote the batch size and the class number, respectively.
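The ArcFace loss defined in this section can be evaluated directly from the cosines between features and class weights. The sketch below is a plain-Python illustration, not the training code: the function name and the input layout (`cosines[i][j]` holding cos θ_j for sample i, with features and class weights assumed to be L2-normalized beforehand) are assumptions, and practical implementations add extra handling for the case where θ + m exceeds π.

```python
import math


def arcface_loss(cosines, labels, s=64.0, m=0.5):
    """Mean ArcFace loss over a batch.

    cosines[i][j] is cos(theta_j) for sample i and class j;
    labels[i] is the ground-truth class index y_i.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        # Recover the target angle and add the angular margin m to it.
        theta_y = math.acos(max(-1.0, min(1.0, cos_row[y])))
        logits = [s * c for c in cos_row]
        logits[y] = s * math.cos(theta_y + m)
        # Softmax cross-entropy, using log-sum-exp for numerical stability.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[y]
    return total / len(labels)


# With m = 0 the loss reduces to the scaled softmax loss; the margin makes it stricter.
plain = arcface_loss([[0.0, 0.0, 0.0, 0.0]], [2], m=0.0)
with_margin = arcface_loss([[0.0, 0.0, 0.0, 0.0]], [2], m=0.5)
```

The defaults s = 64 and m = 0.5 mirror the values stated in the experimental setting; because the margin lowers the target logit, a sample must sit closer to its class centre to reach the same loss, which is what pushes the embedding toward more discriminative features.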

Datasets
The developed network needs to be verified on both simulated and real masked face datasets. The data augmentation method presented in [9] is used to generate masked versions of the face images from the existing normal face datasets for model training and evaluation. First, a multi-task cascaded convolutional neural network (MTCNN) [39] is used to detect faces in the raw images. The MTCNN detects the face and obtains five facial landmark key points (nose, right eye, left eye, right mouth corner, and left mouth corner), and then face alignment and rotation are performed. To generate more realistic masked face images, the method uses the Dlib [40] library to detect 68 key points of the face. Lastly, to overlay a mask on the face, the method calculates the mask positions on the face and selects a suitable facial mask. All generated masked face datasets are listed in Table 1. A small set of real masked faces, MFR2 [9], is also used to evaluate the model.
CASIA-WebFace_m is generated from the CASIA-WebFace [41] dataset for model training. CASIA-WebFace is a large-scale public face recognition dataset that contains 494,414 images of 10,575 unique identities. During masked face generation, around 20% of the face images could not be detected by the data augmentation tool. Therefore, after masked face generation, 394,648 masked images remain. The generated masked face images are then combined with the corresponding regular face images to produce CASIA-WebFace_m for model training. This means that the training set contains 789,296 images in total.
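The image counts quoted above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the check assumes that every retained masked image is paired with exactly one regular image, which is what the stated total implies.

```python
total_images = 494_414    # images in CASIA-WebFace
masked_images = 394_648   # masked images remaining after generation

undetected = total_images - masked_images
undetected_rate = undetected / total_images   # share of faces the tool could not process
training_samples = 2 * masked_images          # masked images plus their regular counterparts

print(f"undetected: {undetected} ({undetected_rate:.1%})")
print(f"training samples: {training_samples}")
```

The result is consistent with the text: roughly one fifth of the images are dropped, and doubling the retained masked images gives the stated training set size.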
More masked face images are generated from the most widely used benchmark datasets: LFW [42], AgeDB [43], and CFP [44]. MFR2 [9], by contrast, is a genuine masked face dataset rather than a simulated one. The LFW_m, AgeDB-30_m, CFP-FP_m, and MFR2 datasets are used for model evaluation. Each evaluation dataset is described briefly here.

• LFW_m is generated from the LFW dataset, which is the most commonly used dataset for face verification. This dataset contains 5749 unique identities and a total of 13,233 face images. The experiment in this paper follows the LFW standard protocol using 6000 predefined comparison pairs, of which 3000 pairs have the same identities and the other 3000 pairs have different identities.

• AgeDB-30_m is generated from the public benchmark dataset AgeDB, an unconstrained face recognition dataset that is most commonly used for cross-age face verification. This dataset contains 568 unique identities and a total of 16,588 face images. The experiment follows the AgeDB-30 protocol using 6000 predefined comparison pairs, of which 3000 pairs have the same identities and the other 3000 pairs have different identities.

• CFP-FP_m is generated from the public benchmark dataset CFP, which contains 500 celebrities in frontal and profile views. This dataset has two verification protocols: CFP-FF and CFP-FP. In the experiment, the method uses the CFP-FP protocol with 7000 predefined comparison pairs, of which 3500 pairs have the same identities and the other 3500 pairs have different identities.

• MFR2 is a small set of real masked face images. It contains 269 images of 53 identities of celebrities and politicians, with an average of five images per identity. This dataset includes unusual mask patterns. We collected 800 pairs of images for real masked face verification in the experiment, of which 400 pairs have the same identities and 400 pairs have different identities.
Typical sample images of different datasets are shown in Figure 6.

Experimental Setting
Initially, this work follows [5] to generate normalized face crops (112 × 112) in the data processing and applies the Batch-Normalization (BN) [45]-Dropout [46] structure after the last convolutional layer to obtain the 512-D output embedding feature. Dropout can effectively help avoid over-fitting and obtain better generalization for deep face recognition. In the experiment, the dropout parameter is set to 0.4. The feature scale s is set to 64 based on [4], and the angular margin penalty m is set to 0.5 based on [5]. All experiments in this work are implemented using the Python programming language and the PyTorch [47] open-source deep learning framework. The batch size is 128, and the model is trained on NVIDIA Quadro RTX 6000 (48GB) GPUs. The overall model architecture is trained for up to 100 epochs, and only the CASIA-WebFace_m dataset is used to train the model. The learning rate is set to 0.01 and divided by ten at epochs 13 and 21. Lastly, the momentum and weight decay are set to 0.9 and $5 \times 10^{-4}$, respectively.
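The step-wise learning rate schedule described above can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, epochs are counted from zero, and each drop is assumed to take effect from the milestone epoch onward (the convention followed by PyTorch's `MultiStepLR` scheduler); the text does not say which convention was used.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(13, 21), gamma=0.1):
    """Learning rate for a given epoch under a step schedule.

    The rate starts at base_lr and is multiplied by gamma once for
    every milestone that has been reached.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Rates at a few representative epochs of the 100-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 12, 13, 20, 21, 99)}
```

Under these assumptions the rate is 0.01 up to epoch 12, 0.001 from epoch 13 to 20, and 0.0001 from epoch 21 until the end of training.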

Evaluation Metrics
To assess the proposed method, four evaluation metrics are adopted: accuracy, precision, recall, and F1 score.
Accuracy. Accuracy is an intuitive performance measure that describes how often the algorithm is correct in recognition and classification problems. It represents the ratio of correctly predicted samples to the total number of samples, which can be computed as shown in Equation (7).
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$
where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively.
Precision. Precision measures how many of the positive predictions are correct, so it can be read as the accuracy over the samples predicted as positive. It is computed as the number of correctly predicted positive samples divided by the total number of samples predicted as positive, as defined in Equation (8).
$$Precision = \frac{TP}{TP + FP} \quad (8)$$
Recall. Recall measures the number of correct positive predictions out of all positive samples that could have been predicted. This is in contrast to precision, which only considers the correct positive predictions out of all positive predictions. Recall can be computed as defined in Equation (9).
$$Recall = \frac{TP}{TP + FN} \quad (9)$$
F1 score. The F1 score combines precision and recall into a single measure that captures both properties. A model can have high precision with poor recall or, alternately, poor precision with perfect recall; the F1 score expresses both concerns in one value. It can be computed as defined in Equation (10).
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (10)$$
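The four metrics of Equations (7) to (10) follow directly from the confusion counts. The sketch below uses a hypothetical function name and hypothetical counts for 800 verification pairs; the counts are not results from this study.

```python
def verification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 400 same-identity pairs and 400 different-identity pairs.
metrics = verification_metrics(tp=380, tn=390, fp=10, fn=20)
```

In a verification setting, a true positive is a same-identity pair accepted as a match, and a false positive is a different-identity pair that is wrongly accepted.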

Experimental Results
This section reports the model evaluation results. We performed experiments in the face verification task and used the 10-fold cross-validation technique to evaluate the model by randomly dividing the evaluation dataset into ten partitions: nine partitions are used as a training set, whereas the remaining partition is used as a validation set. This procedure was repeated ten times, and the average of the ten validation results was used as the recognition accuracy. The model was evaluated on the simulated masked face images LFW_m, AgeDB-30_m, and CFP-FP_m and on the real masked face images MFR2. The model extracts the features of all face pairs and then computes the cosine similarity between the two faces of each pair. The accuracy is expressed as the percentage of correct predictions, with the threshold chosen as the one that yields the highest accuracy. Table 2 reports the performance of the model in terms of the accuracy, precision, recall, and F1 score metrics. The results show that the proposed method achieved high performance in the face verification task. The average accuracy of 10-fold cross-validation on the LFW_m, AgeDB-30_m, and CFP-FP_m datasets reached 99.43%, 95.86%, and 97.74%, respectively. MFR2 reached 96.75%; this dataset contains different facial postures, expressions, and cloth masks in different textures and colors. We also conducted experiments with other state-of-the-art FR methods. Only the proposed method used the CASIA-WebFace_m dataset, whereas the other methods were trained from scratch on the original CASIA-WebFace dataset. The verification accuracies were compared by validating on the same validation datasets. The recognition accuracy results are listed in Table 3. As reported in Table 3, our method yielded better results on both generated masked face images and real masked face images. The accuracy rates with the generated images are high and comparable to the results of the existing FR methods. While the accuracy rates of the compared methods drop considerably with real mask images (MFR2), the proposed method maintains similar accuracy throughout all benchmark datasets.
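The verification procedure described in this section (cosine similarity between pair embeddings, a decision threshold, and 10-fold averaging) can be sketched as follows. Several details are assumptions for illustration: the nine "training" partitions are taken to be used only for selecting the threshold, as in the standard LFW protocol; folds are formed by interleaving indices rather than by random shuffling; and the candidate thresholds are a fixed grid. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def accuracy_at(threshold, sims, same):
    """Share of pairs classified correctly when sim >= threshold means 'same identity'."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(sims, same))
    return correct / len(sims)


def best_threshold(sims, same, candidates):
    """Candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, sims, same))


def kfold_verification_accuracy(sims, same, k=10, candidates=None):
    """Average held-out accuracy over k folds.

    For each fold, the threshold is chosen on the other k - 1 folds
    and then applied to the held-out fold.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(-100, 101)]
    n = len(sims)
    fold_accuracies = []
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        rest = [i for i in range(n) if i % k != fold]
        threshold = best_threshold([sims[i] for i in rest],
                                   [same[i] for i in rest], candidates)
        fold_accuracies.append(accuracy_at(threshold,
                                           [sims[i] for i in held_out],
                                           [same[i] for i in held_out]))
    return sum(fold_accuracies) / k
```

Here `sims` would hold one cosine similarity per comparison pair and `same` the corresponding ground-truth flags; selecting the threshold on held-in folds only keeps the reported accuracy from being tuned on the pairs it is measured on.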
Several MFR methods were evaluated with their own training and validation datasets. To compare the proposed method to current existing MFR methods, this study separated the comparison into two parts. In the first part, we compared the results of the presented method with the results of other MFR methods, as presented in Table 4. In the second part, another experiment was conducted to compare the proposed method with the current advanced method MFCosface [11] using its masked dataset VGGFace2_m. We trained the proposed network model using the same VGGFace2_m dataset and tested it with 400 pairs of the MFR2 dataset for face verification. The verification performance in terms of recognition accuracy, precision, recall, and F1 score is shown in Table 5. Tables 4 and 5 show that the proposed method performs slightly better than MFCosface [11] on both the LFW_m and MFR2 datasets when it is trained with VGGFace2_m. However, MFCosface shows better performance on MFR2 when the proposed method is trained with CASIA-WebFace_m.

Ablation Experiments
To prove the effectiveness of the proposed method, ablation experiments were performed. All experimental settings, including image size, batch size, and learning rate, were chosen to match the previous experiments. First, we experimented with the CBAM attention module on the proposed masked face dataset, and then explored each attention module with the backbone. We evaluated channel attention alone and then spatial attention alone using our backbone network. Each of the experimental models was then evaluated on all validation datasets. Table 6 shows the performance results of the ablation experiments. It can clearly be seen that, across all datasets, the best performance is achieved when both the channel and spatial attention modules are applied.

Discussion
MFR is a highly challenging problem that is currently attracting substantial research interest in computer vision and the face recognition field. As key features such as the mouth, nose, and chin are occluded by the mask, existing face recognition methods perform poorly. Further, the insufficient availability of training and validation datasets currently represents a major barrier to the adoption of deep learning approaches in MFR. Figure 7 illustrates the loss and accuracy curves of the model. The loss curve shows that the proposed model is learning from the data as it approaches the minimum, and the accuracy curve still increases slightly until the last epoch. The experimental results show that the proposed method can achieve high performance in the verification task on simulated masked datasets. However, this method exhibited slightly decreased performance when evaluated on the real masked dataset, owing to the limited amount of real masked face data available for training. By contrast, other methods exhibited a substantial decrease in performance when evaluated on the real masked dataset.

Conclusions
This paper presents a new method to solve the masked face recognition problem using deep learning technology. Traditional FR methods based on deep learning can address normal face recognition problems and achieve high performance. However, such methods show dramatically reduced performance when a face is covered with a mask. Through the analysis of masked face images, we found that some of the key facial features are covered by the facial mask, which prevents FR methods from recognizing the face properly. To tackle the problem, this study introduced a new network architecture based on an attention mechanism that can focus on the most informative part of the masked face images, the region around the eyes, and obtain more discriminative feature information. Moreover, ArcFace, one of the most widely used loss functions, is implemented in the proposed network to optimize the feature embedding and to increase the similarity of intra-class samples and the diversity of inter-class samples. To handle the problem of insufficient masked face datasets, new simulated masked face images are generated using data augmentation for model training and evaluation. The following points summarize the findings of the various experiments in this paper:

• The attention module can focus on the non-occluded part of the masked face and significantly improve the recognition performance.

• The newly generated masked face dataset can effectively help the model training and evaluation.

• The results show that the proposed method provides outstanding performance and a better recognition rate on both generated masked face and real masked image datasets compared to the state-of-the-art methods.
We hope this research study provides a useful solution to the masked face recognition problem. In future work, we will consider improving the method to handle masked face recognition with different postures, expressions, and illumination, and in the presence of a hat.
Author Contributions: V.P. designed and developed the proposed method, conducted the experiments and wrote the manuscript. H.J.L. designed the new concept, provided the conceptual idea and insightful suggestions to refine it further, and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.