Facial Recognition System for People with and without Face Mask in Times of the COVID-19 Pandemic

Abstract: In the face of the COVID-19 pandemic, the World Health Organization (WHO) declared the use of a face mask as a mandatory biosafety measure. This has caused problems in current facial recognition systems, motivating the development of this research. This manuscript describes the development of a system for recognizing people from photographs, even when they are wearing a face mask. A classification model based on the MobileNetV2 architecture and OpenCV's face detector is used. With these stages, the system identifies where the face is and determines whether or not it is wearing a face mask. The FaceNet model is used as a feature extractor and a feedforward multilayer perceptron performs the facial recognition. For training the facial recognition models, a set of observations made up of 13,359 images is generated: 52.9% with a face mask and 47.1% without a face mask. The experimental results show an accuracy of 99.65% in determining whether a person is wearing a mask or not. An accuracy of 99.52% is achieved in the facial recognition of 10 people with masks, while for facial recognition without masks, an accuracy of 99.96% is obtained.


Introduction
In recent decades, facial recognition has become the object of research worldwide [1][2][3][4][5]. In addition, with the advancement of technology and the rapid development of artificial intelligence, very significant advances have been made [6,7]. For this reason, public and private companies use facial recognition systems to identify and control the access of people in airports, schools, offices, and other places [8][9][10][11][12]. On the other hand, with the spread of the COVID-19 pandemic, government entities have established several biosafety regulations to limit infections [13][14][15]. Among them is the mandatory use of face masks in public places, as they have been shown to be effective in protecting users and those around them [16][17][18][19].
As the spread of the virus occurs through physical contact, conventional recognition systems (such as fingerprints) or typing a password on a keyboard become insecure. Thus, facial recognition systems are the best option, as they do not require physical interaction as in other cases. However, the use of the face mask within these systems has represented a great challenge for artificial vision [20], because at the time of facial recognition, half of the face is covered and several essential data are lost. This clearly denotes the need to create algorithms that recognize a person when they are wearing a face mask [21]. This has made it necessary to implement new strategies to achieve robustness in the current systems [22].
In this sense, convolutional neural networks (CNN) belong to a set of techniques grouped under so-called deep learning [23][24][25]. In continuity with the works described in the bibliography, this document presents a facial recognition system for people regardless of whether they use a face mask or not. For this purpose, the work has been organized into four sections: Section 2 contains the materials and methods, Section 3 shows the results, and Section 4 presents the discussion.

Description of the Problem
The effects of COVID-19 on the global economy can be seen with the naked eye, as the confinement of people in their homes brings with it less production and slows down commercial dynamism. However, it should be noted that in a health crisis such as the one that continues to be experienced, it is relevant to put people's health before any productive activity. That is why biosecurity measures and social distancing protocols have been implemented to limit the spread of this dangerous virus. The capacity of public institutions, industries, and other establishments has also been limited, with telework standing out in certain cases. Thus, companies have implemented various methodologies, strategies, and techniques to protect the integrity and health of people, both when entering and when staying in face-to-face work sessions. As previously mentioned, CNNs have been an important technological tool during this pandemic. Although most approaches have been directed towards the diagnosis of the disease, monitoring and prevention have also been covered.
Today, the use of a personal face mask is a mandatory preventive measure. Keeping the mouth, nose, and cheeks covered has now made people only recognizable by their eyes, eyebrows, and hair, which is a problem for the human eye, which tends to find similarities in several faces that have similar features. This problem also affects computer systems, as facial recognition systems are now very common. They are used to unlock the smartphone, access sensitive applications, and to enter certain places. Current systems usually process information from the entire face of the person, which is why technology must adapt to these new conditions. All this is done with the purpose of maintaining the biosecurity of the user, but giving them the opportunity to continue with the activities as naturally as possible.
The literature has shown that there are systems that seek to identify whether people use the face mask properly. These works have had very good results. However, facial recognition of people wearing biosecurity material has not yet been explored. All of this motivated the present investigation, in which a detection system with two approaches is presented. The first is to develop a face classifier, starting from a database of people with and without a mask. The second describes a facial recognition algorithm for controlled environments, which allows personnel to be identified automatically without removing the face mask. This can be implemented as a low-cost access system to an institution or a home. The low cost is ensured by using open-source programming software and simple features that reduce computational expense. For this reason, the possibility of improving the adaptability of current facial recognition systems in the face of new circumstances has been established as a starting hypothesis.

Requirements
The programming language used here is Python. For optimal operation, high-performance computing equipment (GPU) is needed. However, as we received no external financing, we chose to work with the free Google servers available in Google Colab. Another essential requirement is to have the necessary databases in order to carry out the training and obtain the classification and recognition models. It must be taken into account that building these databases requires a large investment of time when working with artificial intelligence, especially with convolutional neural networks.
Additionally, it is necessary to develop a consent form for the people who allow their photographs to be taken for the facial recognition database. This is necessary because there are currently no databases for the recognition of people with face masks. For this, it is necessary to rely on Art. 6.1d of the European General Data Protection Regulation (GDPR), in connection with article 46.3 of the LOU. Here, it is mentioned that the data of a person will be treated in accordance with the exercise of public powers conferred on the universities as responsible for the treatment of the data of the students. The same applies to biometric data (article 9.2.a of the GDPR), for which consent is needed so that they can be part of the exams where facial recognition techniques are applied. The collection, filing, processing, distribution, and dissemination of these data will require the authorization of the owner and the mandate of the law.

System Development
It is proposed to design a system that is capable of identifying a person's face, whether or not they are wearing a mask. For the system to work properly, it is necessary to use two databases. The first is for classifier training and consists of a large number of images of people who wear a face mask and others who do not. The second is used for training the facial recognition system, and it contains people with and without the biosafety material (face mask). The input data are obtained either from an image or from a video, and the architecture used is MobileNet, with the aim of achieving better precision and robustness. This project is divided into three stages, which are described below.

First Stage
This stage focuses on finding the location and dimension of one or more faces, regardless of whether or not they wear a mask, within an image. For this, the OpenCV Deep Learning-based face detection model is used and, as a result, the region of interest (ROI) is obtained, which contains data such as the location, width, and height of the face.
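As a minimal sketch of how the ROI could be derived from the detector output, the function below assumes the detection layout produced by OpenCV's SSD-based deep learning face detector: an array of shape (1, 1, N, 7) whose rows hold the image id, class id, confidence, and the two box corners normalized to [0, 1]. The function name `detections_to_rois` and the 0.5 confidence threshold are illustrative assumptions, not values taken from this work, and the input is a synthetic array rather than a real network output.

```python
import numpy as np


def detections_to_rois(detections, img_w, img_h, min_conf=0.5):
    """Convert SSD-style detections into (x, y, width, height, confidence) ROIs.

    `detections` is assumed to have shape (1, 1, N, 7), each row being
    [image_id, class_id, confidence, x1, y1, x2, y2] with normalized corners.
    """
    rois = []
    for det in detections[0, 0]:
        conf = float(det[2])
        if conf < min_conf:  # hypothetical threshold, tune per application
            continue
        # Scale normalized corners to pixels and clip them to the image.
        x1, y1, x2, y2 = det[3:7] * np.array([img_w, img_h, img_w, img_h])
        x1, x2 = np.clip([x1, x2], 0, img_w - 1).astype(int)
        y1, y2 = np.clip([y1, y2], 0, img_h - 1).astype(int)
        if x2 > x1 and y2 > y1:  # discard degenerate boxes
            rois.append((int(x1), int(y1), int(x2 - x1), int(y2 - y1), conf))
    return rois


# Synthetic detector output: one confident face and one weak detection.
example = np.array([[[[0, 1, 0.98, 0.25, 0.25, 0.75, 0.75],
                      [0, 1, 0.10, 0.00, 0.00, 0.10, 0.10]]]])
rois = detections_to_rois(example, img_w=400, img_h=300)
```

Each returned tuple carries the location, width, and height of a face, which is the information handed on to the following stages.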

Second Stage
A diagram of the operation of the second stage is shown in Figure 1. This is where the classifier is trained to distinguish faces with a mask from faces without a mask. For this, the "Real-World-Masked-Face-Dataset" database available on GitHub is used. Unzipping the files makes available a large number of images of people of Asian origin wearing a mask. From this database, the training of the first classifier is carried out. For the classifier, the MobileNetV2 architecture is used, as it provides smaller models with low latency and low parameter counts. This improves the performance of mobile models in multiple tasks and benchmarks, resulting in better accuracy. It also retains simplicity and does not require any special operator to classify multiple images or to perform various detection tasks in mobile applications. MobileNetV2 is 35% faster at object detection than the first version when combined with Single Shot Detector Lite. It is also more accurate and requires 30% fewer parameters, as the bottleneck encodes the intermediate inputs and outputs, as shown in Figure 2. At the same time, the inner layer encapsulates the model's ability to transform from lower-level concepts (pixels) to higher-level descriptors (image categories). This architecture is available in the TensorFlow library in Python and, through transfer learning, the last layer of the convolutional neural network is changed. It should be noted that this architecture is selected because it does not require excessive computational expense and is therefore efficient in terms of processing speed. For the training of the first classifier presented in Figure 1, the input data of the neural network come from scaling the color images to a size of 224 × 224 pixels.
The architecture used comprises a max-pooling layer (7 × 7), a flatten layer, a hidden layer of 128 neurons with a ReLU activation function, a dropout of 0.5, and an output layer with two neurons and a softmax activation function. Regarding the training of the classifier, the configurations from Table 1 are used. At this point (first and second stages together), we have information about where the face is and whether or not it has a face mask. Therefore, it is possible to pass this information to the third stage, which is where the recognition task is carried out. Figure 3 shows a diagram of the operation of the first and second stages.
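To make the classification head described above concrete, the following sketch runs its forward pass with NumPy. It assumes the standard MobileNetV2 backbone output of 7 × 7 × 1280 for a 224 × 224 input; the weights are random placeholders, so the output only illustrates the layer shapes and not the trained classifier. Dropout is omitted because it is inactive at inference time, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_head(in_features=1280, hidden=128, n_classes=2):
    """Random placeholder weights for the dense layers of the head."""
    return {
        "W1": rng.normal(0.0, 0.01, (in_features, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.01, (hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }


def head_forward(feature_map, params):
    """Max-pool (7x7) -> flatten -> Dense(128, ReLU) -> Dense(2, softmax)."""
    pooled = feature_map.max(axis=(0, 1))          # 7x7 max pooling -> (1280,)
    flat = pooled.reshape(-1)                      # flatten layer
    hidden = np.maximum(0.0, flat @ params["W1"] + params["b1"])  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()


features = rng.random((7, 7, 1280))   # stand-in for the MobileNetV2 output
probs = head_forward(features, init_head())
```

The two entries of `probs` play the role of the "mask" and "no mask" scores; in a transfer learning setting only the parameters of this head would typically be trained while the backbone stays frozen.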

Third Stage
Once the face of the person has been located, facial recognition is carried out in the third stage. For this, a purpose-built set of observations is used, based on the faces of various people. For the construction of the set of observations, a balance is sought in terms of gender: the images are obtained from five women and five men. Figure 4 shows the set of faces with a face mask and Figure 5 shows the set of faces without a face mask.

Training of Facial Recognition Models
The procedure to obtain the images is as follows: (i) over a week, daily videos of the face of each individual are recorded, seeking to capture different angles and different environmental conditions (lighting changes); (ii) from the videos obtained, images are captured at different moments in order to build the set of observations; (iii) the images where the face does not appear in its entirety are eliminated. At the end of the procedure, a total of 13,359 images are obtained: 7067 (52.9%) with a face mask and 6292 (47.1%) without a face mask. In practice, acquiring this many images of a face requires only a short video recording of a few seconds showing different views of the face. Regarding the identification of the images used for the recognition of people, the people in the images have been labeled from left to right. In this way, when a person is recognized in an image, the name information can be accessed and placed in the image.
Once the necessary data have been obtained, the recognition models are trained. From this set of observations, facial recognition follows two approaches:
1. The first model is aimed at recognizing people using a face mask.
2. The second model is aimed at recognizing people not using a face mask.
The two facial recognition models follow the architecture of Figure 6, which is briefly explained below.
Database: Set of observations of the faces of people using a mask and without using a mask for both approaches of the third stage.
Preprocessing: For the facial recognition model of people using a face mask, only the upper 3/5 of the face is extracted. This is done in order to discard the mask that the person is wearing, as the experimental tests showed that this information is not useful for the recognition model. For the facial recognition model without a face mask, the image of the full face is used. In both cases, the resulting image is scaled to a size of 164 × 164 pixels.
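The preprocessing step can be sketched as follows. This is an illustrative version under stated assumptions: the face ROI is a NumPy array with rows running from the forehead down, and a simple nearest-neighbour resize stands in for whatever library resize is used in practice. The helper names `resize_nearest` and `preprocess_face` are hypothetical.

```python
import numpy as np


def resize_nearest(img, size):
    """Nearest-neighbour resize to (size, size); stands in for a library resize."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]


def preprocess_face(face, masked, size=164):
    """Keep the upper 3/5 of a masked face, then scale to size x size pixels."""
    if masked:
        keep = (face.shape[0] * 3) // 5   # rows above the mask region
        face = face[:keep]
    return resize_nearest(face, size)


face = np.zeros((200, 150, 3), dtype=np.uint8)   # synthetic face ROI
face[120:] = 255                                 # lower 2/5 plays the role of the mask
with_mask = preprocess_face(face, masked=True)
without_mask = preprocess_face(face, masked=False)
```

In the synthetic example the bright lower band disappears from `with_mask`, which mirrors how the mask region is kept away from the feature extractor, while `without_mask` retains the full face.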
Characteristics extraction: The FaceNet model is used to extract the most essential characteristics of the input image, in this case a face, and returns a vector of 128 features. The input of the network is an image with a human face and, using a deep metric learning technique, it calculates the metric to generate vectors of real-valued characteristics [17]. This network is a model belonging to PyTorch, which can generate neural networks similar to Caffe [18]. The image is passed through the neural network to obtain the embedding of each face, represented by $f(x) \in \mathbb{R}^d$. This method attempts to ensure that the image of a specific person ($x_i^a$) is closer to all of the images of the same person ($x_i^p$) and away from images of other people ($x_i^n$). Equation (1) shows the calculation of the loss ($L$), where $\alpha$ is the margin applied between positive and negative pairs [54].
$$L = \sum_{i} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \quad (1)$$
The model receives an image of 164 × 164 pixels as the input data, and a vector of characteristics of 128 elements called "face embedding" is obtained at the output.
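The behaviour of the loss in Equation (1) can be illustrated numerically. The sketch below computes, for each triplet, the squared distance between anchor and positive minus the squared distance between anchor and negative plus the margin, clipped at zero, and sums the result. The embeddings are random stand-ins for real FaceNet outputs, and the margin of 0.2 is an assumption (it is the value commonly cited for FaceNet, not one stated in this work).

```python
import numpy as np


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Sum over triplets of max(||a-p||^2 - ||a-n||^2 + alpha, 0)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))


def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(1)
anchor = l2_normalize(rng.normal(size=(4, 128)))
positive = l2_normalize(anchor + 0.01 * rng.normal(size=(4, 128)))  # same person
negative = l2_normalize(rng.normal(size=(4, 128)))                  # other people

easy = triplet_loss(anchor, positive, negative)   # negatives already far away
hard = triplet_loss(anchor, negative, positive)   # roles swapped: margin violated
```

When images of the same person are already much closer than images of other people, the loss vanishes; swapping the roles violates the margin and yields a positive loss, which is the signal that drives the embeddings apart during training.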
ANN classification: Once the vector of characteristics has been obtained, any machine learning classification model can be applied. In this case, a feedforward multilayer perceptron (ANN) is chosen because it does not need a large amount of data, it has been demonstrated that an ANN is a universal approximator with excellent results, and it has a low computational cost. According to the literature, convolutional neural networks generally perform best on images, but they are not necessary at this stage, as the best characteristics have already been extracted by FaceNet and the classifier receives a vector of characteristics instead of a raw image.
The architecture of our simple neural network applied to the two approaches is as follows:

Application of the Facial Recognition System
Once the facial recognition models have been trained, they are applied following the results obtained in the second stage. In this way, in addition to defining whether or not the person uses the biosafety material, it is possible to know their identity, as long as the face is within the selected database. The models obtained are applied to the previously identified faces, and a label is returned with the name and the probability of a match for the face. The procedure is described in Figure 7.

Implementation Costs
In the market, there is a wide range of high-performance equipment oriented to data science and deep learning. Because of a lack of resources for acquiring such equipment, the free servers that Google provides through the Google Colab platform are used.
The characteristics of the computer that Google offers for free are as follows:
• Sessions last up to 12 h and disconnect after 90 min of inactivity.
• A randomly assigned GPU is used.
In these conditions, the implementation costs do not exceed 400 euros, as a portable computer with basic characteristics is enough to send the Python code to the cloud computer for execution.
A computer with similar characteristics in the international market, including a 4.1 GHz processor, has an average price of 1500 euros. If a monitor and a camera with an approximate value of 220 euros are added to this, the complete equipment reaches a cost of 1720 euros.

Results
The experiments carried out have the purpose of demonstrating the potential use of the system, so tests are carried out using the recall, precision, and F1-score metrics, together with the corresponding macro avg and weighted avg. The objective of using these metrics is to evaluate the system from different perspectives. Recall and precision indicate the ability of the model to correctly detect true positives: recall also takes into account the false negatives, and precision the false positives produced by the model. False positives, in this case of face detection with masks, occur when an object that is not a face is labeled as a face. For example, the system frames a plant as a face with a face mask. The reasons this can occur in our system are numerous, which is why hard work is needed to collect a large database so that the model being trained can better distinguish the desired classes (faces). False negatives occur when a face is not detected in the first stage because the face has covered areas that make classification difficult; in this proposal, this initial classifier is a pre-developed tool. On the other hand, the F1-score provides a global measure of the system's performance. It combines precision and recall in a single value, with 0 being the lowest performance and 1 being the best possible performance (all cases detected correctly).
By considering the macro avg metric, one can get a general idea of the average over all of the classes, while the weighted avg establishes an average measure that also considers the number of observations of each class. Thus, if one class has a higher score, it does not dominate the final weighted avg; each class score is given an importance proportional to its number of observations. When considering these metrics, what is sought is to verify the robustness of the method when classifying both classes. In the second stage, training the classifier of faces with and without a mask takes approximately 10 h. In the third stage, the extraction of face embeddings takes approximately 5 h and the ANN-based classifier training takes 10 min.
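The relationship between these metrics can be sketched from a confusion matrix. The code below is a minimal illustration: the counts are made up for the example and are not results from this work, and the function name `classification_report` is chosen only for readability.

```python
import numpy as np


def classification_report(cm):
    """Per-class precision, recall and F1 plus macro and weighted averages.

    `cm[i, j]` counts observations of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)                 # observations per true class
    predicted = cm.sum(axis=0)               # predictions per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": f1.mean(),                            # every class counts equally
        "weighted_f1": np.average(f1, weights=support),   # weighted by class size
        "accuracy": tp.sum() / cm.sum(),
    }


# Illustrative confusion matrix (made-up counts, not the paper's results).
report = classification_report([[90, 10],
                                [5, 45]])
```

In this example the larger class also has the higher F1-score, so the weighted average ends up above the macro average; when both averages are close, the performance does not depend strongly on the class sizes.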

Face Classification-First and Second Stages
For the training of face classification, 20% of the data is used for testing and the rest is used to train the model; Figure 8 shows the graph corresponding to the training. The model converges at approximately 10 epochs and an accuracy of 99.6% is reached for the training data. Finally, analyzing the trained model with the test data, the confusion matrix is obtained (Figure 9). With the values obtained, the accuracy, precision, sensitivity, and specificity are calculated, the results of which are shown in Table 2. The values obtained confirm the good performance of the classifier in detecting whether or not there is a mask on the face. It can be noted that the F1-score for both classes (mask and no mask) has a value close to one, which indicates the good performance of the model. In addition, when observing the weighted avg, it can be verified that the model works correctly for both classes, regardless of the amount of data per class.

Facial Recognition-Third Stage
For the facial recognition model, 20% of the data is used in training and 80% in testing in order to avoid overtraining in the model. Figure 10 shows the graph corresponding to facial recognition training with a face mask. In this model, convergence is reached at approximately 10 epochs, with an accuracy of 98%. When evaluating the model with the test data, the accuracy is 99.52% and the confusion matrix shown in Figure 11 is obtained. Table 3 shows the metrics obtained from the facial recognition model with a face mask. When observing the F1-score, it can be seen that, in almost all cases, the score is close to 1. When comparing this result with the weighted avg, it can be seen that a good performance occurs in all classes, regardless of the amount of data in each class.
On the other hand, in Figure 12, the training graph of the facial recognition model of people without a mask is shown. In training, an accuracy of 99% is obtained. When evaluating the model with the test data without a face mask, an accuracy of 99.96% is obtained and the confusion matrix is observed in Figure 13.
As can be seen in Table 4, the model successfully classifies the face of a given person. When comparing the results obtained in the macro avg and the weighted avg of the F1-score, an adequate functioning is corroborated, which is not influenced by the number of observations in a class. It should be noted that, if the person in the image is in the database, the name and the probability of success are placed on the label. On the other hand, if a face is not in the database, it is still detected, but a label of "mask" or "no mask" is added (because the name is not known), indicating whether or not the person is wearing a mask. In addition, it should be considered that the confidence percentage used in the system to accept that a face belongs to a person is 60%. The final results obtained from the complete system in operation are shown in Figure 14. The average time to run the entire system is 0.8399 s, which is acceptable for use in a practical application.
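The labeling rule described above can be sketched in a few lines. This is an illustration under assumptions: the recognition model is taken to output one probability per enrolled person, the comparison against the 60% threshold is assumed to be inclusive, and the names and the function `label_face` are hypothetical.

```python
def label_face(probabilities, names, has_mask, threshold=0.60):
    """Return the text placed on a detected face.

    `probabilities` are the recognition model outputs, one per enrolled person.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] >= threshold:
        return f"{names[best]}: {probabilities[best]:.0%}"
    # Unknown face: only the mask status from the second stage is reported.
    return "mask" if has_mask else "no mask"


names = ["person_0", "person_1", "person_2"]          # hypothetical identities
known = label_face([0.05, 0.90, 0.05], names, has_mask=True)
unknown = label_face([0.40, 0.35, 0.25], names, has_mask=False)
```

Here `known` carries the name and match probability, while `unknown` falls back to the mask status because no enrolled person reaches the threshold.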

Discussion
As time passes, the COVID-19 pandemic poses greater challenges for humans worldwide. Over the past two years, leaders in biometric systems, such as FLIR Systems and Zkteco, among others, have launched proposals for updated access control and temperature detection. Also noteworthy are the products previously launched by the companies SenseTime, Telpo, and Herta. All of this is focused mainly on temperature control in airports and high-traffic places, so the development of facial recognition systems has been relegated.
As previously mentioned, the regular use of face masks has directly affected facial recognition systems. This has altered the usual development of activities and delayed innovation processes. For example, Apple's Face ID is designed to identify users based on the mouth, nose, and eyes, which must be fully visible. In the same way, other important initiatives worldwide, such as Amazon Go and Walmart's "Store of the Future", are affected, as they use this technology to interact with their customers. There are independent investigations, as in [55], where Google resources have been used to develop systems for detecting the use of masks, but still with low performance.
As can be seen in the bibliography presented, during this pandemic CNNs have been used primarily for the diagnosis of the disease. Various standard architectures have been used, such as EfficientNet [39] and ResNet-34 [43], as well as other purpose-built ones such as nCOVnet [42]. Similarly, for face mask use detection systems, several methods have been tested, such as RCNN [46], VGG-16 [47], MobileNetV2 [48], SVM [49], InceptionV3 [50], and SRCNet [51]. All of them have been analyzed and it is determined that using MobileNetV2 gives efficient results with a low consumption of computational resources. In addition, when comparing the efficiency of the various systems, it is found that this proposal is competitive, as it reaches 99.65%, only surpassed by [49,50] (100% in both cases). From the point of view of facial recognition, a similar proposal has not been found, which clearly shows the added value offered. By having a precision greater than 99%, the starting hypothesis is corroborated, establishing variability in this type of system and also adding robustness. Table 5 presents a comparison of this proposal with the other works. Other proposals in face detection have also been reviewed, but with the ocular recognition variant, as seen in [52,53]. However, it should be clarified that focusing on the human eye represents a different approach from the one analyzed in this document. In addition, this would require the acquisition of other hardware and software components that would increase the requirements presented. This document seeks to capture the upper part of the face, i.e., eyes, eyebrows, forehead, and hair. Thus, it has been shown that this approach can be successful.
On the subject of convolutional neural networks, the most important thing is to have a good database (thousands of observations). This is not a rule, but the larger the database, the better the resulting model, as more characteristics are extracted and a model that represents them can generalize. It does not matter if there are many or few classes (people); for each class (person), a considerable number of observations (images) is needed. This type of architecture is ideal for artificial vision, as the literature shows that convolutional neural networks are leading the advances in this field. This is why this architecture is chosen: to have the best possible results, as facial recognition is an area where the system should fail as little as possible.
The small number of people is not a problem, because the proposal is aimed at access control, which is a structured environment where a restricted area is limited to 10 or fewer people with entry authorization. In addition, the literature reviewed presents cases with the same performance for differentiating people with and without a face mask. The architecture has a heavy component in transfer learning (MobileNet), which is used to train the first classifier (second stage); this model is the one that classifies whether a face is wearing a mask or not. This has nothing to do with the classes for facial recognition, as facial recognition uses a fairly simple and lightweight architecture, namely a feedforward multilayer perceptron.
Although the tools used in this work use existing models, the authors did not find a proposal that uses them in combination with a similar application. Another contribution is the use of cloud computing for this architecture, offering a low-cost system.

Conclusions and Future Works
This prototype system allows for the facial recognition of people with and without a mask, and could be used as a low computational consumption proposal for personnel access control. The two models of this system are tested with images, thus achieving better precision and optimization for each model. The face of someone found in the database is successfully classified to provide the name tag and probability of success. The three stages of the system allow the relevant characteristics of a person's face to be extracted, so that a simple neural network can be used for the classification task. In this sense, the use of face embeddings as input to the neural network obtained satisfactory results in the experiments carried out.
During the training of the third stage, some overfitting can be noticed. This is because the database for this stage is built with few participants; although it is composed of many images, there is not much variability. However, the system shows potential to be used in differentiated facial recognition applications. It should be noted that if a face is not found in the database, it will be detected, but the tag "mask" or "no mask" will be added, which refers to whether the person is wearing a mask. It should be considered that the level of confidence used in the system to accept that a face belongs to someone is 0.6. When defining whether people are wearing a mask or not, the accuracy is 99.65%. When evaluating the facial recognition model with the test data of people who do not use a mask, an accuracy of 99.96% is obtained, and for those who use a mask, an accuracy of 99.52% is obtained. In this way, the basis for future research that can expand the study in this field is established.
In the bibliographic review, the use of MATLAB has been evidenced as an alternative proposal that could provide a lower number of false positives, which should be evaluated. It is also proposed to investigate new extraction architectures that can be compared with FaceNet, and thus to choose the best one. One important thing to note is that the system has difficulty detecting certain faces when they are wearing a mask. This problem arises because the OpenCV deep learning-based face detector used initially is not designed to detect faces with masks. It could also be observed that the face recognition stage is not robust when the detected face presents a certain angle of inclination. However, this is not a problem of great impact, as this application is oriented to access control, where the person must maintain a firm and straight posture in front of the device that acquires the image.
Therefore, as future work, merging the first and second stages into a single model and creating a purpose-built algorithm that directly searches for a face with or without a mask should be considered. This avoids first using the OpenCV face detector and then classifying faces with and without a mask. It will further reduce the processing time and make the model more robust. In addition, it is proposed that a comparative study of the models used for transfer learning be carried out in the future, in order to determine the best model and network trained under unfavorable evaluation conditions. Once the final models have been trained, they can be compressed and deployed on low-cost embedded devices such as Raspberry Pi or mobile devices.

Funding: This research and the APC was funded by Universidad Tecnológica Indoamérica (Ecuador).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest: The authors declare no conflict of interest.