1. Introduction
The outbreak of COVID-19 has led to many studies on masked face recognition (MFR) [1] using deep learning, which can be divided into two primary forms: the single-network model architecture and the multi-network model block combination architecture. The single-network model architecture uses only one primary network model and can be further divided into methods that combine loss functions [2,3], methods that enhance the primary network model architecture [4,5,6,7,8,9], and methods that do both [10,11].
In the loss-function approach, Hsu et al. [2] used ResNet-100 as the primary network architecture and trained it with different loss functions, including Center Loss, Marginal Loss, SphereFace, CosFace, and ArcFace, for classification. In the final test, the neural network proved more accurate than human visual recognition. Cheng et al. [3] performed MFR based on the FaceNet [12] training method combined with a Cosine Annealing (CA) [13] mechanism. In that research, three convolutional neural networks (CNNs) of different sizes, InceptionResNetV2 [14], InceptionV3 [15], and MobileNetV2 [16], were used as the primary network architecture of FaceNet. The advantage of the FaceNet method is that when new users are added, there is no need to retrain the model.
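This no-retraining property follows from FaceNet's design: the trained network maps each face image to a fixed-length embedding, and recognition reduces to nearest-neighbor matching against stored embeddings, so enrolling a new user only stores one more vector. A minimal sketch of that enrollment-and-matching logic, in which the `embed` function is a stand-in assumption for any trained embedding CNN (here a fixed random projection) and the threshold value is illustrative:

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Placeholder for a trained FaceNet-style CNN: maps a face image to an
    L2-normalized embedding. Here: a fixed random projection (seeded, so
    every call shares the same 'weights')."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((image.size, 128))
    v = image.ravel() @ w
    return v / np.linalg.norm(v)

# Enrollment: adding a new user only stores an embedding -- no retraining.
gallery = {}

def enroll(name: str, image: np.ndarray) -> None:
    gallery[name] = embed(image)

def recognize(image: np.ndarray, threshold: float = 1.1) -> str:
    """Nearest neighbor in embedding space; reject as unknown if too far."""
    q = embed(image)
    name, dist = min(((n, np.linalg.norm(q - e)) for n, e in gallery.items()),
                     key=lambda t: t[1])
    return name if dist < threshold else "unknown"
```

Because identity is decided by distances between embeddings rather than by a fixed-size classification layer, the number of enrollable identities is unbounded.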
Next is the method of modifying the primary model network architecture within the single-network model architecture. Hariri [4] first extracted images of the unmasked eye and forehead regions of a masked face, fed them into a pre-trained CNN model to obtain features, quantized the features using the bag-of-features paradigm, and finally classified them with a multilayer perceptron (MLP). Qiu et al. [5] proposed a deep learning model called Face Recognition with Occlusion Masks (FROM) that uses many occluded face images to learn the features of the corrupted regions and dynamically removes them. Wang et al. [6] built on the FaceNet training method, used ConvNeXt-T as their model framework, and added an Efficient Channel Attention (ECA) mechanism to strengthen feature extraction from the unobscured parts of the face. Chong et al. [7] used a histogram-based recurrent neural network (HRNN) model to solve the underfitting problem and reduce the computational burden of training on large datasets. Ge et al. [8] designed a Convolutional Visual Self-Attention Network (CVSAN) model for MFR using attention mechanisms. Zhu et al. [9] proposed FaceT, a single framework for masked face recognition based on Vision Transformers (ViTs) [17,18] that integrates two prompt strategies, one for original and one for masked faces.
Finally, in the architecture combining a loss function with an enhanced primary network model, Deng et al. [10] designed a network model named MFCosface, which combines a large-margin cosine-based loss function with an attention mechanism for MFR. Zhang et al. [11] proposed an architecture called masked face data uncertainty learning (MaskDUL), which addresses sample uncertainty and intra-class distribution in MFR using two weight-sharing CNNs combined with a Kullback-Leibler divergence (KLD) loss function.
The single-network model architecture is simpler to deploy than the multi-network model block combination architecture: it requires fewer hardware resources and can easily be applied to edge computing devices. Its disadvantage is lower accuracy than the multi-network model block combination architecture.
The multi-network model block combination architecture refers to MFR approaches consisting of multiple models, each with its own role. Ge et al. [19] proposed a model architecture called LLE-CNN, divided into three parts. The first part uses a pre-trained CNN model to obtain the features of faces in the input images. The second part creates a feature module with normal and masked faces, and the outputs of the two are transformed by a locally linear embedding (LLE) algorithm to recover the masked face region. Finally, the results are fed to a third verification module that performs classification and regression validation and identifies the results. Wan et al. [20] fed images of masked faces into two models. The first CNN model, MaskNet, assigns higher weights to the unmasked facial features. The second is the recognition network, which is divided into two parts, U and V: U extracts features from the input image, and the output weights of MaskNet and U are combined and passed to V for the final classification. Song et al. [21] proposed a framework called the mask learning strategy, divided into two parts. The first part uses a Pairwise Differential Siamese Network (PDSN) to build the Feature Discarding Mask (FDM), which captures the correspondence between the masked face region and the corrupted features. The second part uses a pre-trained CNN face recognition model to output the face features, which are then merged with the FDM to produce the results. Li et al. [22] first used generative adversarial networks (GANs) for face restoration. A distillation framework module is then used, in which a teacher model learns the correct facial features and a student model receives the output of the GAN and learns the facial features from the teacher model. Boutros et al. [23] based their study on three pre-trained CNN models. The first model takes a masked face as input, obtains its features, and feeds them to an Embedding Unmasking Model (EUM) to recover the unmasked face representation. The second model takes the same face image as the first, and the third takes a different face image. After features are obtained from the three models, the EUM recognition results are evaluated using a self-restrained triplet loss. Chen et al. [24] divided the process into four steps. The first step extracts the eye and forehead regions of the face image and performs image super-resolution with ESRGAN. The second step converts the image to the YCbCr color space for frequency-domain analysis and then applies fast independent component analysis for feature reduction. The third step uses the RGB image with an enhanced MBConvBlock in EfficientNet to obtain features. Finally, the features from the second and third steps are combined into a new feature vector, and the result is output by a multilayer perceptron (MLP). Yuan et al. [25] proposed a network architecture called multiscale segmentation-based mask learning (MSML). The first part is the face recognition branch (FRB) for face recognition, and the second is the occlusion segmentation branch (OSB), which obtains the features of the masked region of the face. The third part is hierarchical feature masking (FM), which takes the outputs of the FRB and OSB and purifies the masked-face features at multiple levels. Shakeel et al. [26] proposed a model consisting of two Bidirectional Attention Modules (BAMs), each composed of a spatial attention block (SAB) and a channel attention block (CAB). The SAB first highlights spatially informative features, and the CAB then assigns higher weights to information-rich channels; the final feature is generated by combining the two BAMs. Yang et al. [27] proposed knowledge distillation hashing (KDH), based on the deep hashing approach, to process obscured face images. Using only obscured face images as input, a teacher model is first trained on normal faces, and its knowledge is then used to guide and optimize the student model.
The advantage of the multi-network model block combination architecture is that each model has a single, well-defined task, so if one model's performance needs improvement, only that model has to be modified. The disadvantages are the computational complexity, the need to ensure that the models pass information to one another without error, the difficulty of practical deployment, and the need for a more powerful hardware environment.
In our previous research, we completed an MFR study based on the FaceNet method combined with the Cosine Annealing (CA) mechanism, and the accuracy reached about 93% [3]. To further improve the accuracy, in this research we propose a single-network model architecture based on the FaceNet approach that combines a loss function with an enhanced primary network model architecture. The training method uses the triplet loss (TL) [12] function together with a SoftMax [28] classifier and the categorical cross-entropy loss (CCEL) [29] function, in which the classifier assists the training of FaceNet's single CNN model while retaining the advantages of the FaceNet method.
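A per-batch combination of these three components can be sketched as follows; the function names and the equal weighting via `alpha` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embedding batches (shape [B, D]):
    pull anchor-positive pairs together, push anchor-negative apart."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()

def softmax_ccel(logits, labels):
    """SoftMax followed by categorical cross-entropy over class logits
    (shape [B, C]), with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def combined_loss(anchor, positive, negative, logits, labels, alpha=1.0):
    # alpha weights the auxiliary classification term (illustrative choice)
    return triplet_loss(anchor, positive, negative) \
        + alpha * softmax_ccel(logits, labels)
```

The auxiliary SoftMax/CCEL term supplies a per-class gradient signal during training and is discarded at inference time, where only the embedding distances are used.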
On the other hand, Thompson et al. [30] and Strubell et al. [31] studied and reported [32] that deep learning requires a large amount of computational power, often at very high economic, energy, and environmental costs, to train a model or improve its accuracy. Lucic et al. [33] concluded that the network architecture itself has little effect on deep learning performance, whereas the hyperparameters and random weight restarts are the most influential factors. Therefore, this research uses InceptionResNetV2, InceptionV3, and MobileNetV2, three CNNs of different sizes pre-trained on the ImageNet [34] database, for transfer learning, together with the CA mechanism for dynamically adjusting the learning rate (LR). The model is trained using the TL function combined with the SoftMax classifier and the CCEL function. The advantage of SoftMax classifier-assisted training is that it accelerates the process of pulling features of the same class closer together and pushing those of different classes farther apart. Because CCEL makes the output features more discriminative across categories of face images, it enables better performance on the custom MASK600 dataset. In addition, the TL function has the advantage that, once the model is trained, it does not need to be retrained when new users are added, and there is no limit on the number of categories.
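The CA mechanism itself has a simple closed form: the LR decays from a maximum to a minimum along a half cosine over the schedule length. A minimal sketch, in which the default `eta_max` and `eta_min` values are illustrative assumptions rather than the paper's settings:

```python
import math

def cosine_annealing_lr(step, t_max, eta_max=1e-3, eta_min=1e-5):
    """Cosine-annealed learning rate for step in [0, t_max]:
    eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / t_max)).
    Starts at eta_max (step 0) and decays smoothly to eta_min (step t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) \
        * (1 + math.cos(math.pi * step / t_max))
```

Compared with a fixed LR, the large early steps speed up coarse convergence while the small late steps let the weights settle into a minimum.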