1. Introduction
The outbreak of COVID-19 has led to many studies on masked face recognition (MFR) [1] using deep learning, which can be divided into two primary forms: the single-network model architecture and the multi-network model block combination architecture. The single-network model architecture uses only one primary network model and can be further divided into methods that combine loss functions [2,3], methods that enhance the primary network model architecture [4,5,6,7,8,9], and methods that do both [10,11].
In the loss-function approach, Hsu et al. [2] used ResNet-100 as the primary network architecture and trained it with different loss functions, including Center Loss, Marginal Loss, SphereFace, CosFace, and ArcFace, for classification. In the final test, the neural network proved more accurate than human visual recognition. Cheng et al. [3] performed MFR based on the FaceNet [12] training method combined with a Cosine Annealing (CA) [13] mechanism. In that research, three convolutional neural networks (CNNs) of different sizes, InceptionResNetV2 [14], InceptionV3 [15], and MobileNetV2 [16], were used as the primary network architecture of FaceNet. The advantage of the FaceNet method is that when new users are added, there is no need to retrain the model.
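This no-retraining property follows from FaceNet's design: the trained network maps each face image to a fixed-length embedding, and recognition reduces to nearest-neighbor matching against stored embeddings, so enrolling a new user only stores one more vector. A minimal sketch of that enrollment-and-matching logic, in which the `embed` function is a stand-in assumption for any trained embedding CNN (here a fixed random projection) and the threshold value is illustrative:

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Placeholder for a trained FaceNet-style CNN: maps a face image to an
    L2-normalized embedding. Here: a fixed random projection (seeded, so
    every call shares the same 'weights')."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((image.size, 128))
    v = image.ravel() @ w
    return v / np.linalg.norm(v)

# Enrollment: adding a new user only stores an embedding -- no retraining.
gallery = {}

def enroll(name: str, image: np.ndarray) -> None:
    gallery[name] = embed(image)

def recognize(image: np.ndarray, threshold: float = 1.1) -> str:
    """Nearest neighbor in embedding space; reject as unknown if too far."""
    q = embed(image)
    name, dist = min(((n, np.linalg.norm(q - e)) for n, e in gallery.items()),
                     key=lambda t: t[1])
    return name if dist < threshold else "unknown"
```

Because identity is decided by distances between embeddings rather than by a fixed-size classification layer, the number of enrollable identities is unbounded.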
Next is the method of modifying the primary model network architecture within the single-network model architecture. Hariri [4] first extracted images of the unmasked eye and forehead regions of a masked face, fed them into a pre-trained CNN model to obtain features, quantized the features using the bag-of-features paradigm, and finally classified them with a multilayer perceptron (MLP). Qiu et al. [5] proposed a deep learning model called Face Recognition with Occlusion Masks (FROM) that uses many occluded face images to learn the features of the corrupted regions and dynamically removes them. Wang et al. [6] built on the FaceNet training method, used ConvNeXt-T as their model framework, and added an Efficient Channel Attention (ECA) mechanism to strengthen feature extraction from the unobscured parts of the face. Chong et al. [7] used a histogram-based recurrent neural network (HRNN) model to solve the underfitting problem and reduce the computational burden of training on large datasets. Ge et al. [8] designed a Convolutional Visual Self-Attention Network (CVSAN) model for MFR using attention mechanisms. Zhu et al. [9] proposed FaceT, a single framework for masked face recognition based on Vision Transformers (ViTs) [17,18] that integrates two prompt strategies, one for original and one for masked faces.
Finally, in the architecture combining a loss function with an enhanced primary network model, Deng et al. [10] designed a network model named MFCosface, which combines a large-margin cosine-based loss function with an attention mechanism for MFR. Zhang et al. [11] proposed an architecture called masked face data uncertainty learning (MaskDUL), which addresses sample uncertainty and intra-class distribution in MFR using two weight-sharing CNNs combined with a Kullback-Leibler divergence (KLD) loss function.
The single-network model architecture is simpler to deploy than the multi-network model block combination architecture: it requires fewer hardware resources and can easily be applied to edge computing devices. Its disadvantage is lower accuracy than the multi-network model block combination architecture.
The multi-network model block combination architecture refers to MFR approaches consisting of multiple models, each with its own role. Ge et al. [19] proposed a model architecture called LLE-CNN, divided into three parts. The first part uses a pre-trained CNN model to obtain the features of faces in the input images. The second part creates a feature module with normal and masked faces, and the outputs of the two are transformed by a locally linear embedding (LLE) algorithm to recover the masked face region. Finally, the results are fed to a third verification module that performs classification and regression validation and identifies the results. Wan et al. [20] fed images of masked faces into two models. The first CNN model, MaskNet, assigns higher weights to the unmasked facial features. The second is the recognition network, which is divided into two parts, U and V: U extracts features from the input image, and the output weights of MaskNet and U are combined and passed to V for the final classification. Song et al. [21] proposed a framework called the mask learning strategy, divided into two parts. The first part uses a Pairwise Differential Siamese Network (PDSN) to build the Feature Discarding Mask (FDM), which captures the correspondence between the masked face region and the corrupted features. The second part uses a pre-trained CNN face recognition model to output the face features, which are then merged with the FDM to produce the results. Li et al. [22] first used generative adversarial networks (GANs) for face restoration. A distillation framework module is then used, in which a teacher model learns the correct facial features and a student model receives the output of the GAN and learns the facial features from the teacher model. Boutros et al. [23] based their study on three pre-trained CNN models. The first model takes a masked face as input, obtains its features, and feeds them to an Embedding Unmasking Model (EUM) to recover the unmasked face representation. The second model takes the same face image as the first, and the third takes a different face image. After features are obtained from the three models, the EUM recognition results are evaluated using a self-restrained triplet loss. Chen et al. [24] divided the process into four steps. The first step extracts the eye and forehead regions of the face image and performs image super-resolution with ESRGAN. The second step converts the image to the YCbCr color space for frequency-domain analysis and then applies fast independent component analysis for feature reduction. The third step uses the RGB image with an enhanced MBConvBlock in EfficientNet to obtain features. Finally, the features from the second and third steps are combined into a new feature vector, and the result is output by a multilayer perceptron (MLP). Yuan et al. [25] proposed a network architecture called multiscale segmentation-based mask learning (MSML). The first part is the face recognition branch (FRB) for face recognition, and the second is the occlusion segmentation branch (OSB), which obtains the features of the masked region of the face. The third part is hierarchical feature masking (FM), which takes the outputs of the FRB and OSB and purifies the masked-face features at multiple levels. Shakeel et al. [26] proposed a model consisting of two Bidirectional Attention Modules (BAMs), each composed of a spatial attention block (SAB) and a channel attention block (CAB). The SAB first highlights spatially informative features, and the CAB then assigns higher weights to information-rich channels; the final feature is generated by combining the two BAMs. Yang et al. [27] proposed knowledge distillation hashing (KDH), based on the deep hashing approach, to process obscured face images. Using only obscured face images as input, a teacher model is first trained on normal faces, and its knowledge is then used to guide and optimize the student model.
The advantage of the multi-network model block combination architecture is that each model has a single, well-defined task, so if one model's performance needs improvement, only that model has to be modified. The disadvantages are the computational complexity, the need to ensure that the models pass information to one another without error, the difficulty of practical deployment, and the need for a more powerful hardware environment.
In our previous research, we completed an MFR study based on the FaceNet method combined with the Cosine Annealing (CA) mechanism, and the accuracy reached about 93% [3]. To further improve the accuracy, in this research we propose a single-network model architecture based on the FaceNet approach that combines a loss function with an enhanced primary network model architecture. The training method uses the triplet loss (TL) [12] function together with a SoftMax [28] classifier and the categorical cross-entropy loss (CCEL) [29] function, in which the classifier assists the training of FaceNet's single CNN model while retaining the advantages of the FaceNet method.
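A per-batch combination of these three components can be sketched as follows; the function names and the equal weighting via `alpha` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embedding batches (shape [B, D]):
    pull anchor-positive pairs together, push anchor-negative apart."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()

def softmax_ccel(logits, labels):
    """SoftMax followed by categorical cross-entropy over class logits
    (shape [B, C]), with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def combined_loss(anchor, positive, negative, logits, labels, alpha=1.0):
    # alpha weights the auxiliary classification term (illustrative choice)
    return triplet_loss(anchor, positive, negative) \
        + alpha * softmax_ccel(logits, labels)
```

The auxiliary SoftMax/CCEL term supplies a per-class gradient signal during training and is discarded at inference time, where only the embedding distances are used.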
On the other hand, Thompson et al. [30] and Strubell et al. [31] studied and reported [32] that deep learning requires a large amount of computational power, often at very high economic, energy, and environmental costs, to train a model or improve its accuracy. Lucic et al. [33] concluded that the network architecture itself has little effect on deep learning performance, whereas the hyperparameters and random weight restarts are the most influential factors. Therefore, this research uses InceptionResNetV2, InceptionV3, and MobileNetV2, three CNNs of different sizes pre-trained on the ImageNet [34] database, for transfer learning, together with the CA mechanism for dynamically adjusting the learning rate (LR). The model is trained using the TL function combined with the SoftMax classifier and the CCEL function. The advantage of SoftMax classifier-assisted training is that it accelerates the process of pulling features of the same class closer together and pushing those of different classes farther apart. Because CCEL makes the output features more discriminative across categories of face images, it enables better performance on the custom MASK600 dataset. In addition, the TL function has the advantage that, once the model is trained, it does not need to be retrained when new users are added, and there is no limit on the number of categories.
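The CA mechanism itself has a simple closed form: the LR decays from a maximum to a minimum along a half cosine over the schedule length. A minimal sketch, in which the default `eta_max` and `eta_min` values are illustrative assumptions rather than the paper's settings:

```python
import math

def cosine_annealing_lr(step, t_max, eta_max=1e-3, eta_min=1e-5):
    """Cosine-annealed learning rate for step in [0, t_max]:
    eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / t_max)).
    Starts at eta_max (step 0) and decays smoothly to eta_min (step t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) \
        * (1 + math.cos(math.pi * step / t_max))
```

Compared with a fixed LR, the large early steps speed up coarse convergence while the small late steps let the weights settle into a minimum.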