Face Mask Detection in Smart Cities Using Deep and Transfer Learning: Lessons Learned From the COVID-19 Pandemic

Abstract: After different consecutive waves, the pandemic phase of Coronavirus disease 2019 does not look to be ending soon for most countries across the world. To slow the spread of the COVID-19 virus, several measures have been adopted since the start of the outbreak, including wearing face masks and maintaining social distancing. Ensuring safety in public areas of smart cities requires modern technologies, such as deep learning and deep transfer learning, and computer vision for automatic face mask detection and accurate control of whether people wear masks correctly. This paper reviews the progress in face mask detection research, emphasizing deep learning and deep transfer learning techniques. Existing face mask detection datasets are first described and discussed before presenting recent advances to all the related processing stages using a well-defined taxonomy, the nature of object detectors and Convolutional Neural Network architectures employed and their complexity, and the different deep learning techniques that have been applied so far. Moving on, benchmarking results are summarized, and discussions regarding the limitations of datasets and methodologies are provided. Last but not least, future research directions are discussed in detail.


Introduction 1.Preliminary
In early 2020, the World Health Organization (WHO) considered Coronavirus disease (COVID-19) a spreading epidemic [1]. It has seriously threatened the lives of individuals worldwide since its outbreak. Moreover, different variants and mutations of COVID-19 have since worsened the situation [2]. To that end, the global scientific community has focused on proposing adequate mechanisms to handle this calamity, and a significant research effort has been made. This includes developing different kinds of vaccines, introducing timely treatment approaches for severely infected patients, etc. Many studies, e.g., [3][4][5], show that COVID-19 spreads by droplet and aerosol transmission, mainly occurring during social interactions with infected persons. Thus, pushing individuals to wear protective face masks properly has been established as a pressing and scientifically grounded solution to slow down the spread of COVID-19. Typically, face masks help filter and block virus particles in the air [6]. This approach has been recommended by the WHO and adopted by most governments across the globe. However, ensuring efficient adoption of the mask-wearing strategy and people's compliance with the requested norms requires developing automatic face mask detection (FMD) techniques. The latter help check whether individuals in public areas are wearing masks and whether they are wearing them correctly. In doing so, intelligent control and prevention of the COVID-19 epidemic are facilitated. This comes in addition to insisting that people respect social distancing norms and undergo routine temperature screenings [3].
FMD and masked face recognition (MFR) have become key challenges due to the growing number of people wearing masks to prevent infection by the COVID-19 virus [7,8]. However, the lack of efficient technological solutions to detect face masks and non-adherence to COVID-19 policy measures leave the door open for new infection waves [9]. Additionally, the detection of face masks with image processing techniques is challenging due to (i) the diversity of mask types and shapes, (ii) the variation in camera pixel resolution, (iii) the variation of obstruction degrees (e.g., illumination, rotation, resolution, angle of view, etc.), and (iv) the computational complexity of AI-based video analytics. In addition to the above, maintaining people's privacy during the mask detection task is another critical issue [10]. To that end, various studies have been proposed to develop face detection systems using AI-based video analytics, ensuring that only the proper wearing of masks is detected and individual privacy is not exposed [11]. The success of artificial intelligence (AI) and machine learning (ML) models in object detection and face recognition makes this technology suitable for developing face mask detection techniques [12].
Monitoring, managing, and preventing the COVID-19 pandemic and other infectious disease outbreaks require path-breaking tools and innovative solutions. To that end, AI and ML have played a crucial role in the fight against the Corona pandemic [13]. For instance, computer vision (CV) has significantly contributed to teaching computers to comprehend and analyze visual scenes thanks to ML and deep learning (DL) models [14]. Specifically, intelligent machines can be deployed for (i) identifying and tracking objects, (ii) measuring the distance between them, and (iii) responding to ascertained scenes using smartphones, cameras, and other sensory devices [15]. In this regard, CV augmented with DL has considerably contributed to monitoring human crowds, capturing human activity, monitoring social distancing behaviors, and detecting violations of face mask wearing in public, for example through the surveillance cameras of the Paris Metro system [15].

Contributions
This survey aims to shed light on the scientific community's progress since the start of the pandemic on the development of DL and Deep Transfer Learning (DTL) tools for FMD. More specifically, a well-designed taxonomy of frameworks is created to better organize existing approaches and guide future researchers toward the appropriate branch of work. The taxonomy covers two principal AI tasks related to controlling the spread of COVID-19: social distancing analysis and face-mask-wearing detection. The comparative study of existing AI solutions used to perform the aforementioned tasks, emphasizing state-of-the-art DL models, provides a sturdy basis for future research and applications. Additionally, insightful observations are made to identify both solved challenges and those that remain unresolved, such as the lack of publicly available and annotated real-world datasets, computational cost requirements and the demand for real-time solutions, lack of storage space on edge devices to manage image/video datasets, etc. Accordingly, the main contributions of this paper can be summarized as follows:

• Identifying and systematically reviewing existing DL and TL models used to monitor social distancing in indoor and outdoor environments.

• Describing the state-of-the-art ML- and DL-based methods applied to detect mask-covered faces in the wild.

• Analyzing and discussing the performance of ML and DL models in detecting compliance with social distancing and face mask usage, and identifying their pros and cons.

• Highlighting the open issues for ongoing research in the field and providing insights about the research directions and applications that can attract considerable interest in the near future.

Review Methodology
The bibliometric research is performed in the context of a narrative review. Recent works on FMD based on DL and DTL have been searched. We searched for relevant keywords, such as "face mask detection", "deep learning", "transfer learning", "deep transfer learning", and "domain adaptation", in different combinations, in the titles, abstracts, and keywords indexed by different databases (including IEEEXplore, Elsevier, ACM Digital Library, Scopus, etc.). The adopted search procedure is explained in Figures 1 and 2.
A total of 196 articles have been returned in response to the relevant queries, out of which 179 are non-duplicates.These articles are further investigated for their relevancy to the theme of the present study, as explained in Figure 2. Finally, 130 papers are considered for this study.

Background 2.1. FMD Related Tasks
The detection of face masks in the related literature has evolved into a multi-faceted problem that defines different tasks for each aspect. In the following, we examine each task separately, explain the differences from other tasks and the challenges it raises, and highlight the leading works in the field.

Mask Occlusion Detection
The mask occlusion detection task refers to automatically inferring that a person has covered his/her face entirely or partially. It is a binary classification task that employs the main concepts of face detection in reverse: rather than confirming that a face is visible, the ML algorithms aim to detect whether the face is covered by a mask, glasses, or any other object. Such applications are used for security [16] (e.g., of automated teller machines (ATMs)), authentication [17] (e.g., for fraud prevention), and public surveillance.

Incorrect Face Mask Wearing Detection
Detecting whether a face mask is worn correctly is of utmost importance since incorrect wearing can markedly reduce the mask's efficiency. This is reinforced by the fact that many individuals still refuse to wear protective masks correctly. To that end, numerous studies have investigated this issue by detecting the placement of masks on faces [18][19][20]. In [21], the authors propose a dataset for evaluating incorrect face mask-wearing detection techniques. The dataset includes images of incorrectly and correctly masked faces (CMFD), which are artificially created by applying a mask on many different faces with different levels of face coverage. Also, the authors in [18] introduce a real-time CNN-based approach to identify incorrect utilization of face masks. The authors in [19] combined super-resolution (SR) networks with simpler neural networks trained on the Medical Masks Dataset (MedMasks) to classify faces as correctly covered by a mask, improperly covered, or not wearing a mask in images with multiple persons, and achieved an accuracy of 98%.

Masked Face Recognition (MFR)
Wearing a mask obstructs conventional face recognition methods, which focus on detecting facial components such as the eyes, nose, mouth, and ears to identify individuals. In the corresponding detection task [22], which falls under the generic task of occluded face detection, the focus is on whether the detected mask covers the mouth and nose. The recognition task focuses on matching masked faces with either unmasked or masked faces. This challenge has gained increasing interest since the Corona pandemic, when mask-wearing became part of people's daily life to slow down the spread of the virus.
Three main issues differentiate the task from conventional occluded face recognition and make it much harder to solve: (i) the lack of large-scale face datasets with masks; (ii) the fact that the characteristics of the nose and mouth are seriously damaged, and the number of pertinent features is considerably decreased; and (iii) the community interest in detecting and recognizing faces with masks, especially in the case of pandemics. In the same direction, re-using existing DL models can face a significant challenge related to the discrepancy between the source data (SD) and target data (TD). This occurs, for instance, when unmasked faces are used for training while masked faces are considered at test time, or the opposite. Also, when different types of masks are applied, apart from medical masks, standard face recognition systems may fail to adapt to new tasks (i.e., new mask types), and further training is needed to efficiently and effectively recognize masked faces [23].

Partial Face Recognition
Another task related to FMD is handling the distortion that may occur in unconstrained environments, where the photo may be shot at an angle, or other objects and faces may partially cover the target face [24]. The resulting faces are of different sizes; some of their features are hidden or out of view, and consequently, conventional algorithms tend to fail. DL techniques can automatically learn visual features that can be used for face recognition [25] and can thus be beneficial for this task. When the task has to be combined with mask detection, the difficulty increases considerably. The detection of masks can then be based on advanced DL techniques that combine low-level features with attention mechanisms to use all the visible face features and the potential mask features if a mask is worn [26].

Datasets
Since the pandemic's start, a great effort has been devoted to launching different FMD datasets.However, many of these repositories are either artificially created, so they don't appropriately represent real-world scenarios, or they have noisy samples and wrong annotations, further hindering the detection task.Choosing the suitable dataset for training a mask detection model requires effort.
The SSDMNv2 model [27] was trained using a dataset that included a variety of open-source datasets and images, such as the Medical Mask Dataset from Kaggle (KMMD), the PyImageSearch dataset (PBD), and the masked face recognition dataset (MFRD) [28]. The KMMD dataset contains 678 pictures of people wearing medical masks, along with XML files that describe the masks and provide other details. The PBD dataset includes 1,376 images divided into two classes: people wearing masks (690 images) and people not wearing masks (686 images). This dataset was created using standard images of faces with facial landmarks applied. The use of facial landmarks in the training dataset allowed for the artificial creation of mask-wearing images by positioning masks on faces based on the location of specific facial features, such as the eyes, eyebrows, nose, mouth, and jawline.
To prevent introducing bias into the task, the original images used to create the artificially masked samples were not re-used as non-masked samples. Instead, a larger dataset called the Face Mask Dataset (FMD) was created by combining images from the Wider Face dataset [29] and the MAsked FAces (MAFA) dataset [30]. The resulting FMD dataset includes 7959 images from two classes labeled as "with a mask" and "without a mask". The MAFA dataset, created in 2017 before the COVID-19 pandemic, includes a total of 30,811 internet images containing 35,806 masked faces, some of which are masked using hands or other objects rather than physical masks. The Wider Face dataset is much larger, containing 393,703 faces annotated in 32,203 images and featuring a wide range of scales, poses, and occlusions. Moving on, a much smaller dataset is the masked-faced dataset (MFDS) [31], which includes 200 images. This small number does not allow the dataset to be used for training DL-based algorithms. In addition, different objects have been used to mask faces, reducing the samples that can be used for mask detection tasks. The Face Mask Dataset (FMDS), consisting of 853 images, is designed and released in [32].
In [27], a real-time medical mask detection (RTMMD) dataset is proposed, which includes 5521 images that have two classes annotated as "with_mask" and "without_mask". It is also well adapted to detecting assailants covering their faces while performing unlawful deeds. In [33], two FMD datasets are presented for masked face verification (MFV) and masked face identification. The first one includes 800 images of 200 different persons (identities), whereas the second comprises 4916 images of 669 identities. In [34], the medical masks dataset (MMD) is launched by Mikolaj Witkowski as part of a Kaggle data contest. The MMD dataset contains 682 images, including more than 3000 faces wearing medical masks.
In [35], the face mask 12k images dataset (FMID-12k) is released, which includes 11,792 images gathered from various backgrounds. Moving forward, the face mask classification (FMC) data is presented in [36], which has 440 images equally divided into two classes and collected on noisy backgrounds.
In [37], the real-world masked face dataset (RMFD) has been proposed, which contains 90,000 normal faces and 5000 masked faces of 525 individuals. In [38], two face image datasets (FIDS1 and FIDS2) have been proposed. The balanced FIDS1 consists of 3835 images that belong to two distinct categories, i.e., 1919 images of "without_mask" and 1916 images of "with_mask". These images have been collected from different sources, such as the RMFD, Kaggle datasets, and the Bing search application programming interface (API). FIDS2 comprises 1376 images divided into two classes (686 images of "without_mask" and 690 images of "with_mask"). FIDS2 is designed using images from the simulated masked face dataset (SMFD) [39]. The authors in [40] introduced the thermal mask dataset (TMD) to cover the lack of masked-face image datasets in the infrared or thermal spectrum. The dataset contains 153,360 images in both visual and thermal spectra of people wearing surgical masks. Finally, because of the lack of unified and diversified datasets for evaluating both FMD and MFR, the authors in [41] develop a large-scale mask detection dataset to quantify the performance of algorithms for both tasks.
Table 1 lists the publicly accessible datasets used for face mask-wearing detection and classification and summarizes their main features. As can be seen, most of these datasets are small and cannot be used on their own to train DL models, so they have been augmented in most face mask detection frameworks using standard data augmentation techniques. This approach introduces a bias, which can make the comparative analysis of the results collected on the augmented repositories unfair.

Evaluation Metrics
Various metrics have been used to assess the performance of FMD techniques in the literature. Each method's performance is often evaluated using different criteria to ensure its reliability under different scenarios. The metrics most frequently used for assessing FMD solutions are based on the confusion matrix of the binary classification problem (i.e., mask, no mask). Using the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) sample predictions, researchers measure the prediction accuracy, error rate, recall, precision, and F1 score.
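As a minimal sketch, the confusion-matrix metrics named above can be computed from the four counts using their standard textbook definitions. The names `ConfusionCounts` and `classification_metrics` are hypothetical and introduced only for illustration; returning 0.0 when a denominator is zero is an assumed convention, not something prescribed by the surveyed works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a binary (mask / no mask) confusion matrix."""
    tp: int  # masked faces predicted as masked
    tn: int  # unmasked faces predicted as unmasked
    fp: int  # unmasked faces predicted as masked
    fn: int  # masked faces predicted as unmasked


def _safe_div(num: float, den: float) -> float:
    # Assumed convention: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def classification_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions:
    accuracy   = (TP + TN) / (TP + TN + FP + FN)
    error rate = (FP + FN) / (TP + TN + FP + FN)
    precision  = TP / (TP + FP)
    recall     = TP / (TP + FN)
    F1         = 2 * precision * recall / (precision + recall)
    """
    total = c.tp + c.tn + c.fp + c.fn
    precision = _safe_div(c.tp, c.tp + c.fp)
    recall = _safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": _safe_div(c.tp + c.tn, total),
        "error_rate": _safe_div(c.fp + c.fn, total),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
    }
```

Calling `classification_metrics(ConfusionCounts(tp=90, tn=80, fp=10, fn=20))` returns all five scores at once, which makes it easy to report them side by side when methods are compared on the same test split.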
Researchers also examine the time performance of their solutions, measuring both the training time of their models and the inference time. They also measure the throughput of their systems, which is critical for developing real-time face mask detection systems. Table 2 summarizes current evaluation metrics widely used in FMD frameworks.
Table 2 (excerpt). Metric: description.
Coefficient of determination (R²): The fitness of a regression model to the data. It measures the proportion of variance in the dependent variable that can be explained by the independent variable(s).
Theil U1 index: Quantifies the difference between the observed and predicted values, with a lower value (closer to 0) indicating a better fit and more accurate predictions.
Theil U2 index: Measures the quality of the predicted results.
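The regression-style metrics in Table 2 can be illustrated with a short sketch. The formulas below are assumptions, since the table gives descriptions only: the coefficient of determination is taken as 1 - SS_res / SS_tot, and Theil U1 is taken in its common form, the root mean squared error divided by the sum of the root mean squares of the observed and predicted series. Under that form, U1 lies in [0, 1] and 0 corresponds to perfect predictions. The function names are hypothetical.

```python
import math


def _check(observed, predicted):
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    _check(observed, predicted)
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observed values are constant")
    return 1.0 - ss_res / ss_tot


def theil_u1(observed, predicted):
    """Theil U1 (assumed form): RMSE / (RMS(observed) + RMS(predicted))."""
    _check(observed, predicted)
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    rms_obs = math.sqrt(sum(o * o for o in observed) / n)
    rms_pred = math.sqrt(sum(p * p for p in predicted) / n)
    denom = rms_obs + rms_pred
    # Both series are identically zero: predictions are trivially perfect.
    return rmse / denom if denom else 0.0
```

Theil U2 is omitted here because several non-equivalent definitions are in use and the table does not indicate which one is meant.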

FMD Based on Conventional ML
The rapidly growing literature on FMD [13] indicates that FMD approaches are divided into several categories. In this section, we review the most significant works done on FMD. As elucidated in the previous sections, even though research on detecting masks that cover people's faces had been going on for decades before the outbreak of the coronavirus pandemic, the methods and algorithms earmarked for the FMD task still need to be improved. Existing FMD techniques can be divided into two categories: conventional ML-based and DL-based models. This work focuses on the latter group (i.e., deep FMD techniques), which demonstrates better prediction performance and more benefits than the ML alternative [49]. In the following paragraphs, we briefly provide the main characteristics of ML-based methods.
As explained in Section 2, a main subtask of FMD is the detection of human faces in images or video streams. The related ML literature on face detection employs Haar-like features and Haar classifiers to detect faces. This statistical approach builds on the categorization of subsections of an image, which implicitly correspond to facial characteristics (e.g., eyes, eyebrows, nose, nostrils, etc.) and their relative positions [50]. Works that build on face detection and then extract image features to perform FMD employ decision tree classifiers. These give a speedy, simple, and efficient method for detecting masks [51]. Similarly, other techniques build on conventional object detection approaches, such as Haar Cascade [52] and Histogram of Oriented Gradients [53], to detect valuable features, which are then employed by an ML classifier to solve the FMD task. Such techniques have proven to be prominent and influential. Still, they depend heavily on feature engineering and demonstrate worse prediction accuracy than modern deep learning techniques.
A group of ML methods for detecting partially occluded faces was presented before 2010; these methods proposed the use of Support Vector Machines [54], neural networks [55], Gappy PCA, which fits around the wrong values [56], AdaBoost, and other classification techniques on top of handcrafted features extracted from images. The occlusion could be on any part of the face, caused by a mask, the hair, sunglasses, etc. [57]. Since the methods take as input the features detected in each face image, they can be modified to detect whether the face is clear or has any kind of occlusion, to perform face detection and identification, or to address any other face-related classification or regression task (e.g., age estimation, sentiment analysis) [58]. Still, it would take more work to detect the occluded parts and their effect on the accuracy of predictions.
As mentioned above, this work focuses on the DL-based face mask detection (deep FMD) techniques, which are illustrated in Figure 3 using a comprehensive taxonomy. The objective of the taxonomy is to provide a visual summary of the various face mask detection frameworks focusing on (i) the detectors, (ii) the learning processes, and (iii) the data collection alternatives. In the following section, we survey existing deep FMD methods and the main architectures and models proposed in the literature. We organize their extended presentation according to different criteria, such as the model type, the number of processing steps, and the complexity of the resulting models.

[Figure 3: Taxonomy of FMD frameworks based on DL & DTL. Node labels: CNN-based detectors, GAN, Hybrid, Transformers; Single-stage, Two-stage, Transfer learning.]

DL-Based FMD
Even though the majority of research works on FMD that employ DL methods have been published during the last three years, the related literature is much richer than that of ML methods. In this section, we list the main approaches in the field, organized by the general architecture employed, the number of processing stages, and the complexity. The works detailed below and their main features and performance are summarized in Table 3.

Sorted by the Employed Architecture
The detection of face mask wearing is mainly performed on images; thus, image analysis DL models are employed for the task. Although Convolutional Neural Networks (CNNs) dominate the field, a few works still use different DL architectures to solve the task.

Convolutional Neural Networks (CNNs)
CNNs are the most popular architectures for object detection and recognition in images and for image classification. Several pre-trained CNN architectures, such as AlexNet, VGG16, VGG19, Inceptionv3 (GoogLeNet), ResNet50, EfficientNet, etc., have been used in a variety of image classification and detection tasks, including medical image detection [59], sentiment analysis and facial expression classification [60], etc. However, they have been trained for entirely different tasks. Transfer learning is the primary ML technique that allows these pre-trained models to adapt to new tasks and achieve state-of-the-art performance quickly [61,62]. In [63], transfer learning has been applied to a face mask annotated dataset using AlexNet and VGG16 models to detect whether people are wearing face masks. The authors also combined these architectures with long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) layers to further improve the classification process. The dataset used for training comprised a total of 2,000 images organized into four classes: "unmasked", "masked", "masked but under the chin", and "masked but nose open".
The VGG-16 architecture has been used in [64] as a basis to train a CNN classifier for real-time FMD. The authors collected over 20,000 images from the web, and the trained image classifier was used as a face mask-wearing alarm system. The authors report an accuracy of 98% on test images and claim that the developed solution is computationally efficient for real-time setups.
A two-stage CNN architecture for FMD is proposed in [65] for detecting instances of improper face mask wearing. The proposed method was trained using 7855 images divided into two classes in the main task. In a side task, the authors tried to detect faces in images with multiple persons, using a pre-trained RetinaFace model, and then classify the persons as "mask-wearing" or "not wearing a mask". Finally, they employed a centroid tracking technique to track the detected faces between consecutive images and proposed an architecture for video analysis.
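To make the centroid tracking step more concrete, the following is a minimal sketch of one possible variant, not the implementation used in [65]. It assumes face boxes are given as `(x1, y1, x2, y2)` pixel tuples, matches each new centroid greedily to the nearest centroid of the previous frame, and drops a track as soon as it is not matched. The class name `CentroidTracker` and the `max_distance` parameter are hypothetical.

```python
import math


class CentroidTracker:
    """Greedy nearest-centroid tracker (illustrative simplification)."""

    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance  # largest allowed jump, in pixels
        self.next_id = 0
        self.tracks = {}  # track id -> centroid seen in the previous frame

    @staticmethod
    def _centroid(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    def update(self, boxes):
        """Return one track id per input box, in input order."""
        centroids = [self._centroid(b) for b in boxes]
        # All (distance, track id, detection index) candidates, closest first.
        pairs = sorted(
            (math.dist(prev, cur), tid, idx)
            for tid, prev in self.tracks.items()
            for idx, cur in enumerate(centroids)
        )
        assigned, used_tracks = {}, set()
        for dist, tid, idx in pairs:
            if dist > self.max_distance:
                break  # remaining candidates are even farther apart
            if tid in used_tracks or idx in assigned:
                continue
            assigned[idx] = tid
            used_tracks.add(tid)

        ids, new_tracks = [], {}
        for idx, centroid in enumerate(centroids):
            tid = assigned.get(idx)
            if tid is None:  # unmatched detection starts a new track
                tid = self.next_id
                self.next_id += 1
            new_tracks[tid] = centroid
            ids.append(tid)
        self.tracks = new_tracks  # unmatched old tracks are dropped at once
        return ids
```

A practical tracker would usually keep unmatched tracks alive for a few frames to survive short occlusions; that bookkeeping is left out here to keep the matching logic visible.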

Generative Adversarial Networks (GANs)
Different GAN models have been utilized to develop reliable FMD solutions that can preserve individuals' identities. For instance, a GAN model that automatically removes the masks covering the face areas and regenerates face images by filling in the missing regions is proposed in [66]. The outcome of this framework is a full-face image that looks realistic and natural. A GAN-based method has been employed in [67] for generating masked faces. A domain-constrained loss is used to train this architecture, bringing the inpainted masked faces as close as possible to their corresponding identified complete faces. Similarly, a GAN-based identity-preserved inpainting solution that alleviates the problem of occluded face recognition is proposed in [68]. Besides, Ding et al. [33] attempted to overcome the lack of large-scale annotated FMD datasets by creating two datasets of synthetic masked face images designed for masked face detection and recognition. Specifically, the former includes 400 pairs of 200 identities for verification, while the latter encompasses 4,916 images of 669 identities for identification.

Sorted by the Number of Processing Stages
DL-based object detection performs well because of its robustness and capability to extract pertinent features. Two popular categories are found in the literature: one-stage and two-stage object detectors. Examples of one-stage detectors, which process images in a single step, are the single shot detector (SSD), you only look once (YOLO), and RetinaNet. Two-stage detectors, which select regions and then refine them, comprise region-based CNN (R-CNN) and its variations, as well as techniques that combine popular CNN architectures for object detection with classifiers for filtering the detected regions.

One-Stage FMD
The simpler one-stage approaches divide the image into fixed-sized cells and then apply an object locator to find objects in each cell. The YOLO family of architectures is quite popular for object detection. However, these approaches leave room for improvement when detecting objects of different sizes. A better alternative is to apply multi-scale detection to a single-shot detector, conducting detection on several feature maps to detect faces of various sizes. Another problem to be handled is class imbalance. Several one-stage techniques include focal loss functions to reduce the loss for easy samples (i.e., cells) and focus on the hard ones.
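The effect of a focal loss on easy and hard samples can be sketched for a single binary prediction. The form below follows the focal loss popularized by RetinaNet, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the defaults `gamma=2.0` and `alpha=0.25` are the commonly quoted values and are assumptions here, not settings reported by the surveyed FMD works. The function name is hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample.

    p: predicted probability of the positive class (e.g., "mask").
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=1` the expression reduces to the ordinary cross-entropy of a positive sample, which is a quick way to check the modulation: a confident correct prediction such as `p=0.95` keeps only `0.05 ** 2` of its cross-entropy at `gamma=2`, while a hard sample such as `p=0.3` keeps about half.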
Single-Shot Detector (SSD): It relies on multi-scale detection, performing detection on different feature maps to detect faces of distinct sizes [69]. In [27], the authors introduce a real-time CNN-based FMD system using SSD and MobileNetV2, namely SSDMNv2. In [70], Anithadevi et al. introduce a single-shot multi-box FMD approach based on MobileNetv2 to extract pertinent features and enhance object detection. After considering the significance of developing an accurate model and the limitations of existing models, this work proposes a DL-based framework that automates and simplifies the task of monitoring social distancing and face masks through intelligent video analytics. In [19], Qin et al. combined super-resolution images and SRCNet to develop an FMD approach. They were able to classify three categories of facemask-wearing conditions (i.e., correct facemask-wearing, incorrect facemask-wearing, and no facemask-wearing), and their proposed method achieved 98.70% accuracy in the face detection phase.
YOLOv1: Ref. [71] presents a YOLOv1-based FMD, which has been combined with a stack-based virtual machine (WebAssembly) and a high-performance neural network inference computing framework (NCNN) to improve performance. This approach enables the privacy preservation of users' data as it is implemented on an edge computing architecture.
YOLOv2: It is a single-stage real-time object detection model that has improved the performance of YOLOv1 in different aspects, e.g., using (i) anchor boxes to predict bounding boxes, (ii) a high-resolution classifier, (iii) batch normalization, and (iv) Darknet-19 as a backbone. Despite these improvements, it did not attract significant attention in FMD applications. In [72], YOLOv2 with ResNet50 has been used to develop a medical FMD scheme. Typically, this approach is designed in two stages: (i) extracting deep features using a ResNet50-based DTL architecture and (ii) detecting face masks using YOLOv2.
YOLOv3: It is an improved version of YOLOv2, which is built using a DarkNet53 backbone. It includes 107 layers in total, 53 of them being convolutional layers [73]. Ref. [74] discusses the use of DL techniques for visual analytics of the spread of COVID-19 infection in crowded urban environments. The authors describe a CNN model that they developed for visual analytics of the spread of COVID-19 and demonstrate the effectiveness of their model on a dataset of real-world video sequences from crowded urban environments. They show that it is possible to detect individuals potentially infected with COVID-19 based on their physical behavior and interactions with others, with an average detection precision of 69.41%. In [75], the authors use a YOLOv3 model for monitoring social distance in the context of the COVID-19 pandemic. The model is trained to identify individuals in video sequences and to estimate the distance between them, allowing it to identify instances of social distancing violations. The authors evaluate their model on a dataset of real-world video sequences and demonstrate that it can accurately estimate the distance between individuals and identify social distancing violations with a high degree of accuracy. They also discuss the potential applications of their model for real-time monitoring and enforcement of social distancing measures in various settings.
In the same spirit, the authors in [76,77] calibrate videos into a bird's eye view before feeding them as inputs to the pre-trained YOLOv3 model. However, both studies provide limited assessment results. Wu et al. [78] introduce FMD-YOLO, an FMD approach based on Im-Res2Net-101. The latter combines the Res2Net module with deep residual networks (DRN), where non-local mechanisms, deformable convolution, and hierarchical convolutional structure are applied to thoroughly extract information from the input. It is worth noting that many more frameworks based on the YOLOv3 architecture have been found in the literature, but they did not provide extensive evaluation results.
The main objective in [100] was to analyze the effectiveness of deep-learning-based face detection algorithms when applied to thermal images, especially images that depict faces covered by virus-protective face masks. As part of their work, the authors compiled more than 7900 thermal images containing faces with and without masks. Selected raw data pre-processing methods were also investigated, and their influence on the face detection results was evaluated. It was shown that the use of transfer learning based on features learned from visible light images results in a mean average precision (mAP) greater than 82% for half of the investigated models. The best model was based on the YOLOv3 model (mAP was at least 99.3%, while the precision was at least 66.1%).
YOLOv4: A significant number of FMD systems have been built upon YOLOv4 [101-108], but not all of these works present satisfactory experimental results. In the following, we therefore focus on the studies presenting significant contributions and empirical validations. YOLOv4 originally uses CSPDarknet53, a deep backbone with many parameters, whereas the scenes encountered in real-time FMD are generally simple. To detect whether people are wearing masks or not, only two kinds of objects are detected: faces with a mask and faces without a mask. Using the complex CSPDarknet53 as the backbone is inappropriate in this case because the computational cost increases unnecessarily. To overcome this issue, an improved CSPDarknet19, which combines the benefits of CSPDarknet53 and Darknet19, has been proposed in [109]. Cao et al. [110] propose MaskHunter, a real-time FMD approach based on an improved YOLOv4. Concretely, the improved YOLOv4, with an improved CSPDarknet19 backbone, a neck, and a prediction head, is developed along with an enhanced mosaic data augmentation approach. The neck architecture of MaskHunter comprises path aggregation network (PAN) [111], FPN, and spatial pyramid pooling (SPP) modules [112]. Figure 4 portrays the flowchart of MaskHunter, which includes the improved CSPDarknet19, the improved neck with BiFPN, the double prediction heads, and the mask-guided module that is used to discriminate between people wearing masks and those not wearing them in night environments.
The authors in [113] propose a DL-based VSD detection scheme based on the YOLOv4 model. A single, fixed time-of-flight (ToF) camera is used to record video data. After detecting people, the Euclidean distance between the detected bounding boxes is measured and then mapped to a real-world distance. An empirical evaluation has shown a mean average precision (mAP) score of 97.84% and a mean absolute error (MAE) of 1.01 cm between actual and measured social distance values. In [114], a DL-based crowd-counting solution is developed for monitoring and controlling the capacity of commercial buildings during the COVID-19 pandemic. It is based on YOLOv4 and has been validated on the COCO dataset. Using route and direction information, the system can determine whether a person leaves or enters the building and count the people still inside it. It can detect violations by comparing the count with a pre-defined threshold. A limitation of this study is the lack of significant assessment. Similarly, in [101], a YOLOv4-based FMD is presented but lacks a thorough empirical evaluation. The authors in [115] developed a YOLOv4-based FMD and thermal scanning kiosk to (i) ease the classification of individuals wearing or not wearing masks and (ii) measure pedestrians' body temperatures using a temperature sensing unit. In [116], Gola et al. have shown the superiority of YOLOv4 over MobileNetv2, SSD, and YOLOv3 in terms of FMD accuracy, whereas in [117] the authors verified the superiority of YOLOv4 over other variations, with a mAP value of 71.69%, and proposed YOLOv4-tiny for environments with limited computational resources.
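The distance check described for [113] can be illustrated with a minimal sketch. It assumes that each detected person is given as an axis-aligned box, that distances are measured between box centres, and that a single hypothetical scale factor (`metres_per_pixel`) maps pixels to real-world units; the actual mapping used in [113] is not specified here and is likely more elaborate.

```python
import math
from itertools import combinations


def box_center(box):
    """Return the (x, y) centre of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def distance_violations(boxes, metres_per_pixel, min_distance_m):
    """Return (i, j, distance_m) for every pair of boxes closer than min_distance_m.

    metres_per_pixel is an assumed, constant pixel-to-metre scale factor.
    """
    centres = [box_center(b) for b in boxes]
    violations = []
    for (i, a), (j, b) in combinations(enumerate(centres), 2):
        pixel_dist = math.dist(a, b)              # Euclidean distance in pixels
        real_dist = pixel_dist * metres_per_pixel  # mapped to metres
        if real_dist < min_distance_m:
            violations.append((i, j, real_dist))
    return violations


# Hypothetical detections: three people in one frame.
boxes = [(0, 0, 40, 100), (60, 0, 100, 100), (400, 0, 440, 100)]
violations = distance_violations(boxes, metres_per_pixel=0.02, min_distance_m=1.5)
print(violations)
```

In this toy frame only the first two people stand closer than the threshold, so a single pair is reported.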
YOLOv5: It is a Python implementation of an improved version of YOLOv3 [118] for PyTorch. As in YOLOv4, it includes changes to the activation functions and to data augmentation, with postprocessing added to the YOLO architecture. It employs self-adversarial training (SAT) and aggregates images in training, which results in accelerated inference [119]. It has been released in five model sizes: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large). In [120], Walia et al. proposed an approach using YOLOv5, which takes input images from CCTV cameras. Once a face is detected, it is processed using a stacked ResNet50 to classify between two mask-wearing conditions: correct or not. A dataset consisting of 1916 masked and 1930 unmasked images has been collected from various sources, and data augmentation has then been applied. Experimental evaluation has shown an accuracy score of 96%, but the method's robustness has not been tested with images from other collections and datasets. A promising alternative could be to use cutting-edge facial detection algorithms and transfer learning (hyper-parameter tuning) to detect whether the face has a mask.
YOLOR: It is a unified network that uses both implicit and explicit knowledge. The former is used to identify features of the deep layers, while the latter can be attained from annotated data. YOLOR has been employed in FMD tasks and applied to related datasets, such as VIDMASK [121], where it outperformed YOLOv4 and YOLOv5 with a maximum mAP of 92.1%.
Others: The authors in [122] introduce a single-stage FMD approach called RetinaFaceMask, which relies on RetinaFace [123]. It is built upon (i) using FPN for high-level semantic information fusion and (ii) presenting a context attention module to enable RetinaFaceMask to focus on the characteristics of masks and faces. A cross-class removal approach is also developed to remove the regions with low scores and high IoU values. While most existing FMD methods are developed for simple scenes, monitoring face mask utilization is often challenging in dense crowds with different occlusions and scales. Additionally, privacy preservation is another issue that impedes the use of public data in centralized training. To overcome these challenges, a cascaded network is introduced in [124] using the Dilation RetinaNet Face Location (DRFL) network, which helps reduce network parameters and identify faces at different scales. In [125], the authors introduced a new human ear detection pipeline based on the YOLOv3 detector. The well-known RetinaFace face detector was also added to the detection system to narrow the regions of interest and enhance accuracy. The proposed method has been evaluated on an unconstrained dataset, which shows its effectiveness. The authors in [126] also employed YOLOv3 and FMNMobile using NASNet Mobile and Resnet_SSD300 algorithms. This achieved an mAP of 91.7% on two datasets of 680 and 1400 images of masked and non-masked faces.
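The idea of removing regions with low scores and high IoU values across classes can be sketched as a class-agnostic non-maximum suppression. This is only an illustration under assumptions: the thresholds (`score_thr`, `iou_thr`) and the function names are hypothetical and are not taken from [122].

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cross_class_filter(detections, score_thr=0.5, iou_thr=0.4):
    """Keep confident boxes and suppress overlapping ones regardless of class.

    detections: list of (box, score, label) tuples.
    """
    # Step 1: discard low-score regions.
    candidates = [d for d in detections if d[1] >= score_thr]
    # Step 2: visit boxes from most to least confident and drop any box that
    # overlaps an already kept box too much, even if the labels differ.
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for det in candidates:
        if all(iou(det[0], k[0]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections: the same face is reported once per class.
detections = [
    ((10, 10, 110, 110), 0.9, "mask"),
    ((12, 12, 112, 112), 0.6, "no_mask"),
    ((200, 200, 300, 300), 0.8, "no_mask"),
    ((400, 400, 450, 450), 0.2, "mask"),
]
kept = cross_class_filter(detections)
print([(label, score) for _, score, label in kept])
```

Suppressing across classes prevents one face from being reported both as masked and as unmasked.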

Two-Stage FMD
Two-stage FMD techniques generate region proposals in the first stage and refine these proposals in the second stage. Overall, two-stage FMD methods can perform better than one-stage FMD solutions but at a higher computational cost.
R-CNN: In region-based CNN (R-CNN), candidate regions that may contain objects of interest are proposed using selective search [127]. The proposals are then fed into a CNN module to extract the characteristics, and an SVM is deployed to recognize object classes. Unfortunately, R-CNN's second stage has a high computational load because the proposals are processed one by one and an SVM algorithm is used for the final classification.
In [128,129], the authors have chosen the R-CNN, Fast R-CNN, and Faster R-CNN algorithms for mask detection and social distance monitoring. In [130], to address the problems of large inter-class and small intra-class distances and the absence of real FMD datasets, Zhang et al. first proposed a practical dataset containing 8,635 faces with various mask-wearing conditions. Next, a context-attention R-CNN (CA-RCNN) was introduced to extract distinguishing features and improve the discrimination ability when classifying face masks.
Fast R-CNN: This object detector has been used in a few FMD frameworks, mainly for comparison purposes, because it has two main issues: (i) it performs many passes through a single image to extract all the required objects, and (ii) since multiple modules work one after another, the performance of a specific module depends on that of the previous modules. For instance, in [129], Fast R-CNN is utilized along with R-CNN and Faster R-CNN to detect face masks. Typically, face mask regions are identified with a CNN based on pixel prediction, mixing pictures, and specific enhancements. Similarly, in [121], Fast R-CNN with a feature pyramid network (FPN) and an R101 backbone is deployed for FMD along with other object detectors (i.e., YOLOv4, YOLOv4-tiny, YOLOv5, and YOLOR).
Faster R-CNN: Sahraoui et al. [131] propose DeepDist, a DL framework for real-time object detection and distance violation detection from video sequences, which is based on a Faster R-CNN model. Evaluation on a dataset of real-world video sequences demonstrates the framework's ability to identify objects and estimate their distances accurately. In [42], a transfer learning-based FMD is introduced, which also employs a Faster R-CNN for detecting face masks and counting persons wearing them. Faster R-CNN is shown to have better precision but to be less efficient than YOLOv3.
Mask R-CNN: An expanded Mask R-CNN (Ex-Mask R-CNN) model, based on ROI wrapping with Resnet-152, is proposed and used for developing an FMD solution in [132]. It is mainly used to reduce the computation complexity by (i) detecting whether a pedestrian is wearing a mask and (ii) using multi-CNN to forecast the suspicious conventional abnormalities in video frames.
Other methods: More two-stage methods (also called multi-stage) are found in the FMD literature. For instance, in [133], the authors introduce a DL-based FMD, which localizes faces and detects masks in video frames using MTCNN and MobileNetv2, respectively. However, this approach still needs further improvement and more experiments to evaluate its performance. Loey et al. [134] present a hybrid FMD scheme that combines CNN and conventional ML, which is based on developing two components that (i) extract features using a pre-trained ResNet50 and (ii) classify face masks using support vector machines (SVM), an ensemble algorithm, and decision trees. In [135], the authors propose an ensemble of one-stage and two-stage detectors to reach accurate and real-time results. They first use a pre-trained ResNet50 as a baseline and then apply the transfer learning concept to improve performance. In [136], a multi-stage FMD approach is proposed based on (i) integrating FPN and ResNet50 (FPN-ResNet50) into a unique DL structure for detecting pedestrians in video frames and (ii) utilizing an MTCNN model for detecting and extracting human faces from these videos.
Most existing object detectors generally rely on designing CNN-based network architectures to extract discriminative characteristics. However, the difference between faces (with and without masks) is essential when the training dataset size is small. To overcome these issues, Yan et al. [137] propose an FMD scheme, namely CenterFace, which uses a context attention module to activate the adequate attention of the feedforward CNN by adjusting the feature refinement of its attention maps. Additionally, an anchor-free detection scheme based on triplet-consistency representation learning is proposed. It integrates the triplet loss and the consistency loss to overcome the data scarcity issue and reduce the similarity between occlusions and masks.
Figure 5.
In [138], a two-stage FMD, based on VGG-16 and CNN classifiers, is developed and evaluated on public transportation systems (i.e., buses). The solution has been implemented on a Raspberry Pi and employs an open-source data analytics toolkit. In [139], Bansal et al. propose a CNN-based FMD approach based on (i) detecting face masks using an object detection API and (ii) applying different face mask classifiers, including SSD-MobileNetv1, SSD-MobileNetv2, and SSD-Resnet50v1, to classify masks into four classes, i.e., "bare", "N95", "surgical", and "homemade".

Discussion
Overall, the YOLO series and Faster R-CNN attract increasing attention, especially YOLOv3, YOLOv4, and YOLOv5. Moreover, lightweight models, e.g., Tiny YOLO-based detectors, are gaining special attention as they can play a crucial role in deploying real-time FMD systems. Improved face detectors like RetinaFaceMask are also promising techniques. Through a transfer learning strategy, existing object and face detectors can be applied to masked face detection.

Sorted by the Complexity of the Models
Researchers have handled the face mask detection task either as an offline or as an online task. In the former case, more complex and resource-demanding architectures are trained and evaluated on offline datasets that comprise unmasked and masked face images, with the latter being either synthetically generated (masks are inpainted) or really masked. In the case of FMD as an online task, the driving requirement is that of providing predictions in almost real time, and in this case, more lightweight and fast models are employed, which can run on edge devices with limited resources.

Complex Object Detectors
In [140], a deep color 2-D PCA (principal component analysis)-CNN (deep C2D CNN) is used to detect face masks. Typically, the attributes of original pixels are mixed with feature representations learned by the CNN before performing decision fusion to improve detection. The classification of detected face masks is then performed using AlexNet. In [141], an MTCNN-based FMD approach is introduced, which is, however, not appropriate for real-time monitoring due to its computational requirements. In [41], the authors propose DeepmaskNet, a unified model that can be employed in FMD and masked facial recognition (MFR) tasks. The respective framework relies on two CNN scaling techniques, i.e., scaling network depth and input image resolution. The first helps improve the FMD accuracy, while the second allows capturing pertinent fine-grained characteristics with higher-resolution input images. In [142], a near real-time CNN-based FMD approach is proposed. The proposed method first relies on detecting human posture to perform spatial reduction and background filtering in video frames. OpenPose [143] is exploited in the first stage to recognize the human body's skeletons and capture facial regions before spatially reducing them. In the second stage, a CNN model that processes images detects the presence of face masks.

Lightweight Object Detectors
Developing lightweight face mask detectors is of utmost importance to enable their implementation on edge devices with limited computation resources, e.g., drones, mobile cameras, etc. Therefore, various studies have focused on developing FMD solutions using lightweight object detectors.
MobileNetv1: This lightweight detector is effective for mobile devices, and its performance has been investigated in various studies, such as [144-146]. Typically, it has been used for comparison purposes with other improved variants, e.g., MobileNetv2, MobileNetv3, and MobileNetv4.
In the same direction of lightweight models, after investigating different DL models for FMD, including EfficientNet, ResNet50, ResNet-101, Inceptionv3, VGG19, VGG16, MobileNetv1, and MobileNetv2, Habib et al. [145] introduced a real-time FMD solution which is appropriate for edge devices. This system relies on the MobileNetv2 architecture to extract discriminative features from video frames, which are then fed to an auto-encoder for feature representation abstraction and finally to the classification layer. In [155], an FMD system is developed by employing a classification model based on the MobileNetv2 architecture and OpenCV's face detector to identify the location of the face and determine whether or not it is wearing a mask. Additionally, the FaceNet model is utilized as a feature extractor, together with a feedforward multilayer perceptron, to perform facial recognition. The model was trained using a set of 13,359 images, including 52.9% with masks and 47.1% without masks.
YOLOv1-tiny: It is a real-time object detection model designed based on YOLOv1, which has been pre-trained on the VOC dataset with 20 classes. Very few FMD studies have used YOLOv1-tiny, such as [156], where the performance of an FMD system based on YOLOv1-tiny is assessed and compared with YOLOv1, YOLOv2, YOLOv2-tiny, YOLOv3-tiny, and YOLOv4-tiny.
YOLOv2-tiny: It is the compressed version of YOLOv2, which has been pre-trained on 80 object classes from the COCO dataset. YOLOv2-tiny has been adopted in some studies to develop real-time FMD systems. For example, in [157], a weight quantization scheme is presented to design a compact CNN model that detects individuals with or without masks based on YOLOv2-tiny.
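To give an intuition of how weight quantization compacts a model, the following sketch applies generic uniform 8-bit quantization to a list of weights. It is not the scheme of [157], whose details are not given here; the function names and the choice of an affine mapping with a zero point are assumptions made for illustration.

```python
def quantize(weights, num_bits=8):
    """Map float weights to integer codes in [0, 2**num_bits - 1]."""
    q_max = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Width of one quantization step; fall back to 1.0 for constant weights.
    scale = (w_max - w_min) / q_max or 1.0
    # Integer offset chosen so that w_min maps to code 0.
    zero_point = round(-w_min / scale)
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float weights from the integer codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights of one layer.
weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight is stored as one byte instead of a 32-bit float, at the cost of a reconstruction error on the order of one quantization step.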
YOLOv3-tiny: It is the real-time compressed version of YOLOv3; it has been pre-trained on the COCO dataset with 80 object classes. Many studies have considered YOLOv3-tiny as the core of their FMD systems, including [158,159]. For instance, in [159], an accuracy of 95% has been reached on a customized dataset of 135 images. The FMD approach proposed in [160] is also based on YOLOv3-tiny, which has been improved accordingly to solve the FMD task. In [161], the FMD problem is considered a multi-task object detection problem to detect wrong and correct ways of wearing masks using YOLOv3-Slim.
YOLOv4-tiny: It is the compressed version of YOLOv4 developed for training on low computational resources. Its weights occupy 16 megabytes, which enables it to be trained on 350 images in less than one hour using a Tesla P100 GPU. Many studies have utilized YOLOv4-tiny to develop real-time FMD solutions, such as [162-165]. For instance, in [163], a lightweight FMD approach based on YOLOv4-tiny, namely SAI-YOLO, is proposed to detect drivers wearing masks. In [164], the SMD-YOLO approach for FMD is presented using an improved variant of YOLOv4-tiny. In [165], the superiority of YOLOv4-tiny over YOLOv4 has been demonstrated regarding recall and frames per second (FPS). In [166], ETL-YOLOv4, an improved version of YOLOv4-tiny, is introduced for FMD tasks. In [167], the performance of YOLOv4-tiny for FMD is investigated and compared with YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. In [168], a lightweight FMD approach is proposed based on an improved YOLOv4-tiny. An SPP structure is designed for the backbone network of YOLOv4-tiny to pool and fuse the input features at multiple scales. This helps improve the receptive field of the network before combining these multi-scale features with the path aggregation network to repeatedly fuse and enhance them in two paths and thus enhance the expressive ability of the feature representations. Lastly, label smoothing is utilized to optimize the loss function and overcome over-fitting. In the end, it is worth noting that the YOLO series has been the most widely used detector family to build FMD systems. Figure 6 portrays the number of FMD research studies conducted based on the different YOLO series detectors.
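The label smoothing step mentioned for [168] can be sketched with its standard formulation, in which a one-hot target y over K classes becomes (1 - epsilon) * y + epsilon / K. The value of `epsilon` and the two-class example below are assumptions for illustration, not settings reported in [168].

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Soften a one-hot target: (1 - epsilon) * y + epsilon / K."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Two classes: "mask" and "no mask"; the true class is "mask".
hard_target = [1.0, 0.0]
soft_target = smooth_labels(hard_target, epsilon=0.1)

# A very confident prediction is penalised more under the smoothed target,
# which discourages the over-confident outputs associated with over-fitting.
confident = [0.99, 0.01]
loss_hard = cross_entropy(hard_target, confident)
loss_soft = cross_entropy(soft_target, confident)
print(soft_target, loss_hard, loss_soft)
```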
NasNetMobile: Chavda et al. [65] presented a deep learning-based model to detect people who do not wear face masks properly. In their study, where the NasNetMobile, DenseNet-121, and MobileNetv2 architectures were used, the highest classification accuracy, 99.49%, was achieved with the DenseNet-121 architecture.
Other methods: In [169], a lightweight CNN-based FMD scheme is developed. The proposed model consists of four convolutional layers, one fully connected layer, and one output layer. The performance of this approach has been compared against that of MobileNet, NasNetMobile, ResNet101, VGG19, VGG16, and AlexNet, and it has been shown to offer a better balance among accuracy, model size, and time complexity. In [170], inappropriate mask use is targeted by (i) collecting 2,075 face mask usage images, (ii) labeling them as either "mask", "no mask", or "improper mask", and (iii) studying three scenarios, namely "scenario 1-mask versus no mask versus improper mask", "scenario 2-mask versus no mask + improper mask", and "scenario 3-mask versus no mask". A hybrid deep feature-based face mask detector is then trained and tested. The detector is implemented in three steps: (i) using pre-trained DenseNet201 and ResNet101 as feature generators, (ii) selecting the most discriminative characteristics using an improved RelieF selector, and (iii) considering the selected characteristics for classification by an SVM.
A balanced masked face dataset of 3832 images.

N/A
The CNN architecture has been trained using the balanced dataset and used in real-world scenarios, though without evaluation.
[150] CNN The CNN architecture has been made similar to MobileNetv2 for an efficient computational cost.

FMDD Acc = 99%
MobileNetv2 has been used to classify pre-processed video frames using OpenCV.
There is no comparison with other methods on the training/test splits of the FMDD dataset. The performance drops under noisy environments (i.e., the hard set of Wider Face), and privacy preservation is not addressed.

FMD Based on Deep Transfer Learning (DTL)
When developing FMD techniques, researchers have faced different challenges. For instance, training DL models is challenging, considering the diversity of mask types, camera angles in video frames, etc. Another issue is the unavailability of real large-scale FMD datasets (data scarcity) to train DL models. To that end, DTL has recently been adopted in many FMD solutions. DTL consists of two steps: (i) using the knowledge learned on one domain/task and (ii) transferring it to a new, similar domain/task. For example, the knowledge from detecting face masks in annotated datasets could be transferred to another face mask dataset that does not have labels [177,178]. In the context of DL, one popular way to apply transfer learning is through a series of steps that comprise: (i) taking layers from a previously trained network, (ii) freezing them to preserve the information they contain (during future training stages), (iii) adding some new and trainable layers on top of the frozen ones (this helps in turning the old features into predictions on the target dataset), and (iv) training the new layers on the target dataset [179].
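The four steps above can be sketched in PyTorch as follows. This is a minimal illustration under assumptions: the small `backbone` is a hypothetical stand-in for a network pre-trained on a source task (in practice its weights would be loaded from a checkpoint), and random tensors replace the target FMD dataset.

```python
import torch
from torch import nn

torch.manual_seed(0)

# (i) Layers taken from a previously trained network. Here a tiny stand-in
# is used; its weights are assumed to come from the source task.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# (ii) Freeze them so that training does not alter the stored knowledge.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# (iii) Add a new trainable layer on top: two classes, mask / no mask.
head = nn.Linear(8, 2)
model = nn.Sequential(backbone, head)

# (iv) Train only the new layer on the target data (random placeholders).
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 2, (16,))
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

frozen_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```

Only the parameters of `head` are passed to the optimizer, so the frozen backbone acts as a fixed feature extractor while the new layer learns the target task.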
An FMD approach combining TL and DL is proposed in [180], which is based on (i) introducing an Efficient-YOLOv3-based DTL, where EfficientNet is the feature extraction backbone, and (ii) using a loss function based on CIoU to improve the accuracy of face mask detection and decrease the number of network parameters. Moreover, the masks have been categorized into two groups: unqualified masks (scarves, sponge masks, cotton masks, etc.) and qualified masks (disposable medical masks, N95 masks, etc.). Also, an FMD dataset has been created, and the DTL approach has been combined with MobileNet to improve the overall solution's generalizability, solve the data scarcity problem, and tackle the over-fitting problem. In [181], a YOLOv3-based DTL scheme is proposed for creating a mask detection model, where an improved FMD dataset of 300 images has been created using data augmentation, i.e., image filtering. In [182], a DTL scheme based on PaddlePaddle-YOLOv3 (PP-YOLOv3) combined with data augmentation and model compression is introduced to solve an FMD task. This resulted in an mAP of 86.69% in public scenes with a processing time of 11.84 ms per frame. Experimental results showed that, compared with YOLOv3 and Faster R-CNN, the model has higher accuracy and faster detection speed. In [183], DTL has been applied to YOLOv3 using public datasets and donated datasets for training. The resulting models can recognize faces with a 98.7% accuracy rate and identify faces, including those with face masks, with a 92.7% accuracy rate.
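For readers unfamiliar with CIoU, the sketch below implements the commonly used complete-IoU loss for a single pair of boxes: one minus the IoU, plus a normalised centre-distance term and an aspect-ratio term. It follows the standard definition and is not necessarily the exact variant used in [180]; the function name and the `eps` stabiliser are assumptions.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss for two boxes given as (x_min, y_min, x_max, y_max)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centres.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(px2, tx2) - min(px1, tx1)
    enc_h = max(py2, ty2) - min(py1, ty1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


loss_same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
loss_apart = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
print(loss_same, loss_apart)
```

Unlike a plain IoU loss, the centre-distance term still provides a useful signal when the predicted and ground-truth boxes do not overlap at all.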
Fine-tuning: A last, optional step in DTL is fine-tuning, which consists of unfreezing the entire model (or part of it) and re-training it on the new data with a very low learning rate. This can potentially achieve meaningful improvements by incrementally adapting the pre-trained features to the new data. An increasing number of DTL studies have been developed to tackle different FMD tasks. For instance, in [18], an approach for detecting incorrect face mask-wearing using CNN and DTL is proposed. Accordingly, various CNN models are fine-tuned, including MobileNetv2, Xception, Inceptionv3, ResNet50, NASNet, and VGG19.
In a survey of deep learning models for the FMD task [184], the authors concluded that deep learning models such as the Inceptionv3 CNN achieved almost perfect results and stated the need for sharing real-world FMD images in order to fine-tune DL methods.
In [43], a DTL-based FMD system is proposed by initializing several CNN models with parameters learned from the ImageNet dataset. They are then fine-tuned on FMLD using a standard cross-entropy loss. Jiang et al. [122] propose RetinaFaceMask, a single-stage FMD approach to assist in the monitoring of the COVID-19 pandemic. The approach helps to resolve some of the problems encountered by other studies by (i) establishing a new annotated dataset that helps in fine-tuning the DL models to discriminate between correct and incorrect mask-wearing scenarios, (ii) proposing a context attention framework to learn discriminating characteristics corresponding to the states of wearing face masks, and (iii) transferring the knowledge from the face detection task to FMD, emulating humans' ability to adapt to similar tasks quickly.
In [171], Chowdary et al. propose a DTL method based on Inceptionv3. They fine-tune the pre-trained Inceptionv3 and test it on SMFD. In [185], the pre-trained MobileNet, ResNet classifier, and VGG are used to build an FMD and SDM framework. A real-time DTL-based FMD approach for mobile devices that employs MobileNetv2 is implemented in [144]. The pre-trained MobileNetv2 network has been utilized to extract discriminative features, while a softmax classifier has been used to classify video frames as mask or no mask. Additionally, an image augmentation technique has been deployed to avoid overfitting and improve the overall performance [186]. In [46], a hybrid TL and broad learning FMD solution is proposed. First, Faster R-CNN and Inceptionv2 are combined and fine-tuned to detect face mask regions. Second, broad learning is used to verify real facial masks.
In [171], a DTL-based FMD framework is proposed by fine-tuning the pre-trained Inceptionv3 model. Moreover, data augmentation is employed to overcome the data shortage problem and improve training and testing performance. In [187], a DTL-based FMD on the Spartan Face Detection and Facial Recognition System is proposed. The approach is based on a stacking ensemble of deep learning models and covers four primary tasks: mask detection, mask type classification, mask position classification, and identity recognition. In [188], the pre-trained Faster R-CNN Inception ResNet v2 model is used to implement a DTL-based FMD, where the model weights pre-trained on the COCO dataset are used as the starting point for DTL on another FMD dataset, which includes 3300 images with different types of face masks. Besides, because the lower parts of human faces are usually occluded and cannot be utilized in the face recognition learning process, the authors in [189] propose a framework that recognizes human faces using any available facial components. Such components may vary depending on whether a mask is worn and on the mask position. Specifically, a DTL-based FMD approach based on an improved FaceNet model is proposed. With DTL, the initial weights have been transferred, and the model has then been fine-tuned on the masked CASIA (M-CASIA) dataset [190]. Similarly, in [191], a Faster R-CNN model is fine-tuned and assessed for the FMD task on the FMD dataset. In [192], a DTL-FMD is introduced by fine-tuning a MobileNetv2 base model with the ImageNet weights and adding a new fully-connected head to process input data for analysis. Finally, in [193], a real-time CNN-based lightweight mobile FMD system is proposed. Three different pre-trained models, namely VGG16, MobileNet, and ResNet50, are fine-tuned to enhance the detection performance. For instance, the first 23 layers of the MobileNet architecture are frozen, then three more dense layers are added before fine-tuning the new model.
Domain adaptation (DA): It refers to the process of adapting models across different domains. It is needed because training and test data may differ or come from different data distributions. For example, when an FMD model is trained only on unmasked face data and tested on masked face data, the performance drops significantly, and DA can be used to solve this issue. DA aims at building ML tools that generalize well to the target domain and deal with the gap across domain distributions. In this regard, some studies have investigated the role of DA for better generalization of developed models. For instance, in [135], an FMD scheme aggregating both one-stage and two-stage object detectors is proposed to achieve accurate and real-time mask detection. Typically, ResNet50 is first adopted as a baseline before applying a DTL technique. Figure 7 illustrates the architecture of the FMD system using ResNet50-based TL. In [194], Mandal et al. propose a supervised DA to boost the performance of an MFR scheme. In doing so, faces without masks are considered the source domain (SD), while faces with masks are reserved for the target domain (TD). A ResNet50 model is then trained and validated under two scenarios. In the first scenario, the model is trained only on the SD and then tested on the TD. In the second scenario, the model is trained on the SD and a portion of the TD before being tested on the remaining part of the TD. Table 4 summarizes the most relevant DTL-based FMD frameworks and their characteristics.

Remarks: No assessment on real-world masked face datasets.
Work: [195] | Model: Transfer learning | Description: Relies on adopting transfer learning to detect face masks in both images and video streams. | Dataset: RMFD | Performance: Acc = 98% | Remarks: (i) Works on a variety of devices (e.g., smartphones) and is able to process images and video streams in real time; (ii) the approach is not well interpretable since it does not use activation maps.
Dataset: SMFD | Performance: Acc = 99% | Remarks: Validation on a small masked face dataset. FMD assessment in real-life video streaming is missing.
Description: Detection of incorrect face mask-wearing using CNN and DTL. | Dataset: Private data | Performance: Acc = 83% | Remarks: (i) Implemented via an Android app that works with real scenarios, and the solution can identify mask misuse; (ii) unable to detect incorrect lateral adjustment and glasses underneath. Moreover, the system was applied with surgical and FP2 masks. Not applied with masks that have sequins and other drawings.
Work: [196] | Model: DTL based on combining SVM and MobileNetv2 | Description: MFD using deep feature selection and award-winning pre-trained DL models. | Dataset: Collected data of 1376 images | Performance: Acc = 97.1% | Remarks: Tested on a small-sized dataset. Not tested on challenging occluded faces.
Work: [197] | Model: YOLOv3 and Darknet53 | Description: Data augmentation and DTL for FMD. | Dataset: Data collected from Kaggle | Performance: Prec = 99.8% | Remarks: The automated system detects masks using an augmented dataset.

Other approaches to domain adaptation for FMD include using adversarial training or cycle consistency to align the feature distributions between the source and target domains or using domain-invariant features that are less sensitive to domain shift.

Comparative Analysis
Singh et al. [42] utilize two object detectors, namely YOLOv3 and Faster R-CNN, for FMD. While Faster R-CNN outperforms YOLOv3 in terms of accuracy, YOLOv3 runs faster, which makes it a good fit for real-time applications. Overall, selecting the best model for a specific application is mainly related to environmental conditions. The work conducted by Alganci et al. [199] has resulted in similar conclusions. Besides, SSD, Faster R-CNN, YOLOv3, and YOLOv3-Tiny are employed in [44] to address the issues of FMD in medical environments. Based on the results derived from the Moxa3K dataset, YOLOv3-Tiny outperforms the other models in terms of accuracy and inference speed, which makes it the most appropriate for real-time applications. Table 5 compares existing FMD frameworks in terms of their performance and characteristics. In [170], pre-trained ResNet101 and DenseNet201 have been fine-tuned and aggregated to form an efficient feature generation module tested on three different scenarios. In Scenario 1, 2075 images are used to form a three-class classification task with three individual classes, i.e., mask, no mask, and improper mask. The improper mask and no mask classes are then combined in Scenario 2 to form a two-class classification task. Hence, a "non-compliance" set is formed using 2075 images. Similarly, by excluding the improper mask set, another two-class classification problem is developed in Scenario 3, using only 1546 images. Figure 8a reports the obtained results in terms of different evaluation criteria, including the accuracy (ACC), average precision (AP), unweighted average recall (UAR), Matthews correlation coefficient (MCC), F1 score, Cohen's kappa (CK), and geometric mean (GM) [71,72].
Besides, various pre-trained CNN models have been fin-tuned and applied as feature generators to solve the problem of FMD under the first scenario.Figure 8b portrays the accuracy performance of 12 pre-trained CNN models.
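The three scenarios described above can all be derived from a single three-class annotation. The sketch below shows one possible relabeling; the function name and label strings are hypothetical and are not taken from [170].

```python
def to_scenario(labels, scenario):
    """Relabel three-class annotations (mask / no mask / improper mask)."""
    if scenario == 1:
        # Scenario 1: keep the three original classes.
        return list(labels)
    if scenario == 2:
        # Scenario 2: merge "no mask" and "improper mask" into "non-compliance".
        return ["mask" if label == "mask" else "non-compliance" for label in labels]
    if scenario == 3:
        # Scenario 3: drop the improper-mask samples, leaving two classes.
        return [label for label in labels if label != "improper mask"]
    raise ValueError("scenario must be 1, 2 or 3")


annotations = ["mask", "no mask", "improper mask", "mask"]
scenario_2 = to_scenario(annotations, 2)
scenario_3 = to_scenario(annotations, 3)
```

In practice, the images would be filtered together with their labels in Scenario 3, which is why that scenario uses fewer samples than the other two.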

Critical Discussion
The detection of face masks is challenging due to the variations in the appearance of faces with masks, including the types and degrees of obstruction, as well as the diverse types of masks that are used. Face mask detection is essential for facilitating interactions between humans and computers and managing image databases. Despite the successes of existing face detectors, there is a need for more advanced models that can handle event analysis and video surveillance tasks, which can be challenging due to a lack of suitable datasets with correctly masked faces and facial recognition, as well as the presence of noise on the face caused by the mask. While some research has addressed these issues, there is still a need for a large dataset to develop an efficient face mask detection model. Overall, there are several current challenges to the task of FMD:
• Lack of suitable datasets: one major challenge is the lack of datasets with a sufficient number of images of faces with masks, as well as a diverse range of mask types and wearing conditions. This can make it difficult to train and evaluate face mask detection algorithms, as the performance of these algorithms often depends on the quality and diversity of the training data.
• Large intra-class distance and small inter-class distance: another challenge is the large variation within each class combined with the small distance between masked and non-masked faces, which makes it difficult to accurately distinguish between these two classes. This may require specialized algorithms or techniques that can extract distinguishing features and increase the separation between these classes.
• Noise caused by masks: the presence of masks on the face can also introduce noise that can interfere with the performance of face mask detection algorithms. This may be due to factors such as the mask's texture, reflections or shadows, and the occlusion of facial features.
Together, these challenges highlight the need for developing robust and efficient face mask detection algorithms that can handle a wide range of conditions and variations in the appearance of masked faces. Besides, the use of masks has been shown to effectively reduce the infection rate of COVID-19 by 40%, but detecting the wearing of masks in the real world can be challenging due to factors such as lighting conditions, occlusion, and the presence of multiple objects. This can lead to poor detection performance, and using non-medical masks such as cotton masks, sponge masks, and scarves may also reduce the protective effect of mask-wearing.

Open Challenges
Over the past two years, researchers have proposed various deep learning-based methods for face mask detection. However, many of these methods struggle to detect masks that are small, poorly shot, or occluded by other objects, which leads to low detection accuracy.

Lack of Annotated Datasets
A major challenge for researchers is the lack of properly annotated real datasets containing images of faces with and without masks shot under various conditions (e.g., during the day or night, indoors or outdoors, from a distance or close up). Most of the datasets presented in Section 2.2 and summarized in Table 1 contain either synthetic images, where masks have been painted over the faces, or properly shot and cropped face images that lack the distortion and noise typical of images collected by surveillance cameras. The datasets that do contain images from such cameras mostly target the social distancing detection task, and it is hard to detect mask-wearing on them.
Another issue that is not properly addressed yet is that face masks (including those used for preventing COVID-19 diffusion) come in different shapes and colors.There are still no properly annotated datasets that cover this.Most of the existing datasets assume the typical white or teal medical masks only and ignore all other fabric-based or multi-colored medical masks that became so popular in the last few years [137].
By ignoring the above issues and creating synthetic datasets that reproduce the researchers' beliefs about how a face mask looks, is shaped, or is worn, we introduce a bias into the dataset, which may hinder the trained model's performance in real conditions.

Computational Cost
There are many challenges to using computational platforms for face mask detection (FMD). The hardware used for this purpose must be affordable, compact, and energy-efficient, with enough memory and processing power to quickly analyze images using convolutional neural networks (CNNs) or other models. To protect the privacy of the people being monitored, all processing should be done on the device itself, without any communication with cloud servers. One potential solution to these challenges is to use low-power binary neural network classifiers to detect the presence and proper positioning of face masks. These classifiers can be implemented on field-programmable gate arrays (FPGAs), which are capable of high-throughput binary operations, and can perform FMD classification tasks on an edge device.
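The reason binary networks map well onto hardware built for high-throughput binary operations can be seen in a small sketch: once weights and activations are restricted to +1 and -1, a dot product reduces to an XNOR followed by a bit count. The code below is a generic illustration of that identity in plain Python, with hypothetical function names and made-up values; it is not the implementation used by any of the systems cited in this section.

```python
def binarize(values):
    """Map real values to {+1, -1} by sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {+1, -1} vector into an int, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {+1, -1} vectors of length n."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # positions where signs match
    return 2 * agree - n  # matches contribute +1, mismatches contribute -1


weights = [0.7, -1.2, 0.1, -0.4]      # made-up real-valued weights
activations = [0.3, 0.9, -0.5, -2.0]  # made-up real-valued activations
w, x = binarize(weights), binarize(activations)

reference = sum(wi * xi for wi, xi in zip(w, x))  # ordinary multiply-accumulate
fast = xnor_popcount_dot(pack_bits(w), pack_bits(x), len(w))
```

Both computations give the same result, but the second one needs only bitwise logic and a population count, which is far cheaper in hardware than floating-point multiplication.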
Low-power binary NN-based COVID-19 FMD, such as BinaryCoP [202], can be employed to identify correct facial mask wearing and positioning using edge computing and mobile devices. Another solution is to use lightweight CNN models, such as that in [203], for performing multiple access control-related tasks, including the detection of face masks, the monitoring of body temperature, and the counting of people entering a building or an area (e.g., a concert hall). Such models can be trained offline and used for inference on single-board computers (SBCs) such as Raspberry Pi, Jetson Nano, or others. The performance of edge-based solutions in FMD tasks in indoor and outdoor environments has been shown to be better than that of their cloud-based alternatives [204], although their image processing time is slightly worse. However, if the time needed to transfer images to the cloud is also considered, edge-based solutions are almost 50% faster.
Cooperative edge computing [205], task offloading, and federated learning (FL) [124] are a few of the techniques that are expected to gain the interest of researchers in the next few years, since they preserve data privacy, move processing to the edge, and minimize transfer times.

Security and Privacy
Large-scale monitoring and FMD of individuals in public areas not only need well-performing AI tools but also require privacy preservation modules. This is because any kind of surveillance can violate citizens' privacy and regulations, including the General Data Protection Regulation (GDPR), which sets strict rules on the use and sharing of personal data [11]. Although FMD developers often dismiss the privacy problems that come with FMD systems, since in most cases these systems do not identify individuals and only detect the presence of face masks, adding privacy protection mechanisms to them will increase the users' trust in and acceptance of FMD technology.
To cope with this challenge, the authors in [11] explain how a privacy-preserving FMD system might be developed, demonstrating various implementation and performance evaluation options.

Difficulty in Recognizing People's Emotions
Due to the obligation to wear masks in public areas that has been set in various countries, emotion recognition has become challenging. Despite the promising progress achieved in tackling this problem, the new requirement to wear face masks compromises what has been done in recent years. To close this gap, new studies have explored detecting persons' moods using DL and ML models when face masks are worn. For instance, ref. [206] validates a real-time CNN-based emotion recognition approach on a modified version of the AffectNet dataset. Synthetic face masks have been added to each subject from the AffectNet dataset. Additionally, the number of emotions has been reduced from eight (Happiness, Disgust, Anger, Fear, Surprise, Sadness, Contempt, Neutral) to five (Anger-Disgust, Fear-Surprise, Happiness, Sadness, Neutral).

Masked Face Attacks
Since the start of the COVID-19 pandemic, face masks have significantly flattened the COVID-19 curve. However, this opens new challenges for face recognition, as wearing a mask hides multiple discriminative features of a face. On the other hand, face presentation attack detection (PAD) is critical for ensuring the security of face recognition systems. Despite the increasing number of FMD frameworks, few studies have investigated the impact of masked faces on PAD. For instance, to reflect actual real-world situations, Fang et al. [207] investigate (i) attacks with subjects wearing masks and (ii) attacks with real face masks placed on presentations. Moreover, in [208], a physical universal adversarial perturbation (UAP), namely the adversarial mask, is proposed to simulate attacks against face recognition systems; the perturbation is deployed on face masks using carefully crafted patterns.

Interpretability and Explainability
Explainable Artificial Intelligence (XAI) refers to methods and techniques for creating AI systems that can provide understandable and interpretable explanations for their predictions, decisions, and actions. XAI aims to build AI systems that can be trusted and used safely and effectively by humans. This can be achieved through techniques such as transparency, interpretability, and post-hoc explanations [209,210].
As is evident from this study, the accuracy of face mask detection algorithms is the main concern of researchers, along with the complexity of the models and the respective training and inference (time) performance. Although neural network architectures have significantly boosted the prediction/detection performance over their ML predecessors, the research community still needs to understand deep learning architectures, break down the structure of pre-trained models, and show how predictions are made. As a consequence, the results of FMD models are expected to be interpretable and explainable [211], which is in accordance with the general trend toward interpretability and XAI. Works in the field of face recognition aim to develop comprehensible facial representations in which different dimensions represent different face segments or features, and propose end-to-end learnable filters that are locally activated using spatial loss functions, such as the spatial activation diversity loss function in [212]. The generation of expression pattern-maps and their association with expression features is another approach that improves the interpretability of facial expression detection tasks [213]. Such approaches can be further combined with ML or DL classifiers in a two-stage approach to provide interpretable FMD and robustness in cases of partial face occlusion.
In an attempt to explain the results of the facial matching process, the authors in [214] have defined a new evaluation protocol called the "inpainting game", and introduced an explainable face recognition (XFR) algorithm that explains which facial characteristics (e.g., nose, eyebrows, etc.) are matched in each case. The XFR algorithm generates network attention maps for facial images consistent across the human participants and provides insight into the features that make each face unique. In a similar line of research, saliency maps can be employed to explain face matching by measuring the contribution of different face parts [215]. Examining the effect of perturbation, occlusion, or noise on each of the parts alone or in a combination of parts is a first step towards the explainability of face recognition tasks. Such techniques can be combined with FMD algorithms to provide explainable decisions that link the detection of a mask (or proper mask-wearing) with the coverage of specific facial features (i.e., nose, mouth).
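The occlusion-based analysis mentioned above can be sketched in a few lines: each image patch is blanked in turn, and the resulting drop in the model's score indicates how much that region contributed to the decision. The example below is a minimal sketch under stated assumptions; the scoring function is a hypothetical stand-in for a trained detector (it simply averages the lower half of the image, where the nose and mouth would be), and the function names and the 4x4 "image" are illustrative.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Score drop caused by occluding each non-overlapping patch of a 2D image."""
    h, w = len(image), len(image[0])
    base = score_fn(image)
    drops = []
    for top in range(0, h, patch):
        row = []
        for left in range(0, w, patch):
            occluded = [r[:] for r in image]  # copy so the input stays intact
            for y in range(top, min(top + patch, h)):
                for x in range(left, min(left + patch, w)):
                    occluded[y][x] = fill
            row.append(base - score_fn(occluded))
        drops.append(row)
    return drops


def toy_mask_score(image):
    """Hypothetical stand-in for a detector: mean intensity of the lower half."""
    lower = image[len(image) // 2:]
    return sum(sum(r) for r in lower) / (len(lower) * len(lower[0]))


face = [[1.0] * 4 for _ in range(4)]  # toy 4x4 image
saliency = occlusion_map(face, toy_mask_score)
```

With this toy scorer, only the patches in the lower half of the image change the score, so the resulting map highlights exactly the region the scorer relies on. For a real FMD model, the same procedure would show whether a "properly worn" decision actually depends on the nose and mouth area.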

Further Generalization for FMD Techniques
The current study has revealed the large variety of techniques and models for FMD and the plethora of datasets used for their training. Many models are trained and validated on custom and private datasets, whereas others employ larger benchmarks. The main characteristic of the datasets is the variety of images they contain (e.g., controlled or uncontrolled poses, real or synthetic images, two or more classes, etc.), and this consequently affects the models being trained and limits their ability to generalize to other related tasks.
Despite the many works, datasets, and models trained for the general image classification task, such as VGG-16, ResNet50, Inceptionv3, or EfficientNet, and the works that focus on building generalized models for face detection, such as MTCNN or FaceNet [216], there is still an open need for generalized pre-trained models that detect masked faces and extract their features. Research in this direction will allow deep transfer learning techniques to be applied to multiple FMD-related tasks and boost the performance of lightweight models trained with only a few training samples [217][218][219]. The first works that apply transfer learning of Inceptionv3 [171], RetinaFace [65], MobileNet [220], and similar networks show the way for future researchers who want to build general models for the face mask detection task.
Transformers can also be employed in the specific FMD task, following the successful paradigms in the generic image classification task [221,222]. The resulting models can be further trained using only a small number of images for the task and achieve state-of-the-art performance [223].

Federated FMD
FL is an ML technique where multiple decentralized devices, such as smartphones or edge devices, work together to improve a shared model without sharing their raw data. The devices train a local version of the model on their data and then share the updates with a central server, which aggregates the updates to improve the global model. This allows for training models on distributed data without the need to centralize and share the data [224]. The need to continuously monitor the proper use of masks, especially in public areas, and more importantly, without breaching citizens' privacy, calls for the implementation of lightweight detection methods. The latter can process video streams at the edge in real time without storing any information locally. Solutions such as WearMask [71] or BinaryCoP [202] apply transfer learning to very fast and lightweight models such as YOLO-Fastest or adopt fast binarized NN inference frameworks such as Finn [225], and operate as mask-wearing sensors. They can be deployed as mobile apps or run on FPGA-powered hardware that allows high-speed face detection and classification in two stages [226].
FL is another approach that can help protect user privacy by design, since DL or ML models are trained locally on sensitive data, such as people's faces. By splitting the FMD task into two or more cascaded stages, such as the location of faces in images, the extraction of image features, and the classification of the detected faces as masked and non-masked (and/or improperly masked), it would be easier to federate the learning task [124] and take advantage of locally trained models. Similar architectures can be employed for social distancing screening [227], detecting littered face masks [228], etc. Besides, another way to implement FL for face mask detection is to have multiple parties contribute to the model training by providing their datasets of images. Each party could train a local model on its data and then send the model parameters to a central server, which aggregates the parameters and uses them to update the global model. This process could be repeated iteratively until the global model has learned to detect masks accurately from the combined datasets [229].
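The aggregation step described above can be sketched as a weighted average of the clients' parameters, in the style of federated averaging, where each client's weight is the number of local training samples. This is a minimal sketch under that assumption; the source does not prescribe a specific aggregation rule, and the function name, parameter vectors, and sample counts below are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors (weights = local sample counts)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Three hypothetical cameras, each holding a locally trained 2-parameter model.
clients = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
sizes = [100, 100, 200]  # number of local training images per client

global_params = federated_average(clients, sizes)
```

In one training round, the server would send `global_params` back to the clients, each client would continue training on its own images, and the averaging would be repeated. Only parameters leave the devices, never the face images themselves.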
Moreover, there are many potential benefits to using FL for face mask detection. For example, it can allow organizations to train a model without requiring access to sensitive or personal data. It can also enable organizations to collaborate on ML projects without sharing data [230]. Additionally, FL can be used to train models on data distributed across multiple devices, such as smartphones or cameras, which can be useful for real-time face mask detection in various settings [231].
Lastly, it is worth noting that all the aforementioned technologies constitute the future of face detection and identification tasks and consequently find their application to face mask detection. An in-depth knowledge of the existing literature as presented in the current study and a careful examination of the open challenges and future solutions is the first step for researchers who wish to contribute to this field. The next step involves the collection of benchmark datasets that can be used for training and fine-tuning the new state-of-the-art models.

Conclusions
The use of FMD systems has become increasingly important in recent times due to the COVID-19 pandemic. Wearing a face mask is an effective way to reduce the spread of the virus. Many businesses and organizations have implemented policies requiring face masks in public spaces. Face mask detection systems can help enforce these policies and ensure that people follow the guidelines to protect themselves and others from the virus. This paper comprehensively reviewed FMD based on deep and transfer learning models. It was clear that DL and TL have the potential to be practical tools for FMD, with the ability to handle a wide range of variations in the appearance of masked faces and achieve high accuracy rates in many cases. At the same time, it is essential to note that any FMD algorithm's performance is likely to depend on the quality and diversity of the training data, as well as the specific design and implementation of the algorithm. As such, further research is needed to fully understand the limitations and potential of deep learning for face mask detection and to develop robust and efficient algorithms that can handle real-world conditions and variations.
On the other hand, it was demonstrated that FMD is a multi-faceted task that poses various challenges to computer vision and ML engineers. One challenge relates to the proper definition of the task, i.e., whether the goal is to detect a covered face or a mask that has been properly worn, which affects the model that has to be trained to handle the respective task.
Another challenge relates to the scale of face and face mask detection, which spans from a single person in controlled conditions to detection in the wild, usually in combination with the detection of proper social distancing. The algorithms suited to the mass identification of persons and the detection of mask-wearing in a video or image stream differ from those suited to mask-wearing detection in controlled environments and single-person images. However, the biggest challenge lies in the scalability of solutions, which must handle video streams with dozens of persons in the case of public surveillance applications. This, in turn, creates the need for distributed processing using lightweight models and for federated training of the respective models.
The future of face mask detection in smart cities resides in using generic models that will be fine-tuned on specific detection tasks, using fewer resources for training and processing. The FL approach will take advantage of the resources available on edge devices and allow lightweight models to adapt better and faster to the emerging needs of each task.

Figure 3. A taxonomy of the approaches introduced for solving the Face Mask Detection problem using deep learning.
portrays the flowchart of the CenterFace solution and the CBAM module.

Figure 6. Statistics on the number of FMD studies built upon the different YOLO detectors.
, and Faster-RCNNReal-time analysis of FMD and SDM.Private dataset Acc = 93% Validation on a small dataset without assessment in real-world scenarios.

Figure 8. Performance comparison of DTL-based CNN models: (a) performance of combining pre-trained ResNet101 and DenseNet201 under three different scenarios, and (b) accuracy performance of 12 pre-trained CNN models under Scenario 1. Data from: [170].

Table 1. Summary of existing publicly accessible FMD datasets and their characteristics.

Table 2. A summary of the common evaluation metrics used to evaluate FMD frameworks.

Table 3. Summary of existing DL-based FMD techniques and their characteristics.

Table 4. Summary of existing DTL-based FMD frameworks and their characteristics.

Table 5. Comparison of existing FMD frameworks in terms of different characteristics: (i) adopted method, (ii) support of real-time implementation, (iii) used dataset, (iv) sample size, (v) mask type, (vi) recognition type, and (vii) validation accuracy.
• Variability in mask appearance: face masks come in a wide variety of shapes, sizes, and colors, and they may also be worn in different ways (e.g., covering the nose and mouth, covering only the nose, or hanging around the neck). This variability can make it challenging for a model to detect masks accurately.
• Occlusions: face masks can occlude parts of the face, making it difficult for the model to identify features such as the eyes, nose, and mouth.
• Lighting and background: the model may have difficulty detecting masks in low light conditions or against cluttered or complex backgrounds.
• False positives and false negatives: the model needs to minimize false positives (incorrectly identifying a mask when none is present) and false negatives (failing to identify a mask when one is present).