FMDNet: An Efficient System for Face Mask Detection Based on Lightweight Model during COVID-19 Pandemic in Public Areas

A new artificial intelligence-based approach is proposed: a deep learning (DL) model for identifying people who violate the face mask protocol in public places. To achieve this goal, a private dataset was created, comprising face images with and without masks. The proposed model was trained to detect face masks in real-time surveillance videos. The proposed face mask detection (FMDNet) model achieved an accuracy of 99.0% in identifying violations (no face mask) in public places. The model outperformed other recent DL models such as FSA-Net, MobileNet V2, and ResNet by 24.03%, 5.0%, and 24.10%, respectively. The model is also lightweight and achieved a confidence score of 99.0% in a resource-constrained environment. It can perform the detection task in real time at 41.72 frames per second (FPS). Thus, the developed model can help governments enforce standard operating procedure (SOP) rules.


Introduction
The primary and essential precautions against COVID-19 have been maintaining distance between people and wearing a mask [1]. According to many health experts, social distancing is an efficient way to curb the spread of COVID-19 [2]. This involves consciously maintaining a certain distance between individuals to prevent the spread of the virus.
The advantage of a fast face detector is that it is suitable for programmable devices and has a homogeneous structure. Zuo et al. [14] presented a face detector using a hierarchical ensemble. Hong et al. [15] introduced both phone-to-face and face detection in real time. The algorithm retrieves two face regions based on the data estimated from experiments and identifies the face region using vertical and horizontal histograms. Sun et al. [16] designed a DL-based face detection model tested on the FDDB benchmark. The proposed work was more efficient than the traditional Faster RCNN model in face detection. Xiao et al. [17] presented a face detection scheme using the positioning of the optimal occlusion area (POOA) algorithm. This algorithm employs the Haar feature, principal component analysis (PCA), and AdaBoost, and it provided good results for face detection. Sundararajan and Biswas (2019) [18] developed different face detection methods by enhancing the images. The research was tested on the IDEAL-LIVE Distorted Face Database and provided good results. Zhang et al. [19] proposed a feature agglomeration network (FANet) to build a novel single-stage face detector. The proposed model was tested on several datasets and achieved good results. Guo et al. (2020) [20] proposed a method for face detection based on complete discriminative features associated with a CNN. This approach is useful for face detection and had outstanding results. Sen and Sawant [21] proposed a face mask detection system using the MobileNet V2 model, trained on 9000 face images with and without masks, with an accuracy of 79%. Rahman et al. [22] proposed a deep learning architecture for the FMD task trained on 858 images and obtained an accuracy of about 98.7%.
Previously, Khandelwal et al. [23] developed a DL model that was trained on more than 300 images with masks and 460 images without masks. In [24], the authors proposed a method to identify the face mask wearing condition by using a classification network with image super-resolution to help prevent COVID-19. Jiang et al. [25] proposed a face mask detector that used ResNet and MobileNet on a dataset of 7959 images. Li et al. [26] proposed the HGL method for head pose classification, which achieved an accuracy of 93.64%. Matthias et al. [27] proposed an FMD model using a dataset intended for detecting the main facial features such as the nose, eyes, and mouth. A number of studies have addressed head pose estimation and FMD using deep learning models such as FSA-Net [28], FaceMaskNet [29], and ResNet [30]. Sethi et al. [31] proposed a deep-learning-based approach for detecting masks over faces in public places. Their work handled occlusions by using an ensemble of different pre-processing methods in one and two stages. Gupta et al. [32] developed a method for detecting face masks in images and videos using deep learning. They used a Region-based Convolutional Neural Network (RCNN) with ResNet-152 as the base model for mask detection and action recognition in public places, which can help manage social distancing to mitigate COVID-19. Ullah et al. [33] proposed a novel DeepMaskNet model that detects face masks and performs masked face recognition. Teboulbiat et al. [34] conducted a comparative study of face mask detection using different pre-trained models; the evaluation achieved a confidence score of 100%. Goyal et al. (2021) [35] proposed a face mask detection approach using a Convolutional Neural Network (CNN)-based model to determine the presence or absence of a mask in a given input image or video sequence. The approach attained 98% accuracy for the detection part, compared with existing CNN-based architectures such as the Visual Geometry Group (VGG), MobileNet, DenseNet, and Inception models. Furthermore, a ResNet-based Single-Shot Detector framework was used to identify faces. In other studies, Mestetskiy et al. [36] proposed biomedical image classification and detection using image processing and computational geometry [37]. DL-based face mask detection plays an essential role in various scenarios such as public transport, workplace management, schools, and hospitals.
In the present research work, a novel deep learning architecture, FMDNet, was designed to mitigate the challenges identified regarding the state-of-the-art methods. The proposed approach, an AI-based "FMDNet" model, was developed to identify people violating the COVID-19 protocols by not wearing masks or wearing them incorrectly. This model first scans the complete face and detects and recognizes the facial landmarks. Thereafter, the region of interest is identified and selected based on the facial landmarks for the detection of the mask. If a mask is not found, it raises an alarm. If the person is wearing a mask, the system scans to detect if the mask is being worn properly or not, and if the mask is found to be worn inappropriately, it raises an alarm. Hence, the "FMDNet" model detects people who do not wear face masks or improperly wear face masks. The contributions of our proposed work are summarized as follows:

• We designed a novel FMDNet model based on deep learning for the detection of face masks.
• We designed and developed a face mask recognition system based on computer vision for real-time deployment.
• Our system achieved the best accuracy compared to existing techniques.
• Our system easily found those people who were not wearing a mask in a gathering place.

Materials and Methods
The high-level workflow of the proposed research work is presented in Figure 1. It consists of mainly three steps: (i) data augmentation, (ii) the development of the proposed FMDNet, and (iii) the model building using FMDNet.

Pins Face Recognition Dataset
This dataset was collected from Kaggle (https://www.kaggle.com/hereisburak/pinsface-recog, accessed on 1 October 2022) for facial recognition purposes [38]. A detailed description of the dataset is given in Table 1.
We also developed a face-mask-detection model that was trained on 7553 RGB images with and without masks (dataset accessed on 1 November 2022). There were 3725 images of faces with a mask and 3828 images of faces without a mask. The training accuracy of the custom CNN architecture was 94%, while the validation accuracy was 96%. The dataset is available online.

Labelled Faces in the Wild Dataset
The Labelled Faces in the Wild (LFW) dataset (https://www.kaggle.com/datasets/muhammeddalkran/lfw-simulated-masked-face-dataset, accessed on 1 December 2022) was used to create our dataset. The LFW dataset contains images of notable individuals taken from the web. The collection includes 5713 unique faces, totaling 13,117. The training dataset includes 5713 images of people wearing masks out of 13,027 faces. The testing dataset comprises 70 masked faces of 48 people. The dataset is available online.

Real-World-Masked-Face-Dataset
As a result of the COVID-19 pandemic, which has spread across the globe, people around the nation wear masks, and many samples of masked faces can be found (https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset, accessed on 20 March 2023). To gather data resources for the potential future intelligent administration and control of comparable public safety incidents, the dataset's authors generated the largest masked-face dataset in the world. When communities were under lockdown, masked face detection and identification algorithms were created based on a masked face dataset to assist those entering and leaving the community. The authors also created and trained a face-eye-based multi-granularity masked face recognition model using the datasets they created. The face identification accuracy on the dataset is over 95%. The dataset is available online.

YOLO-Medical-Mask-Dataset
This dataset is based on Mikolaj Witkowski's dataset (https://www.kaggle.com/datasets/gooogr/yolo-medical-mask-dataset, accessed on 2 April 2023). All images have been converted to JPG format, and they show only people properly wearing medical masks. Further details about these data are given with Mikolaj Witkowski's dataset. The dataset is available online.

Data Augmentation
Subsequently, the faces in the images were detected by applying facial landmarks [39], and the regions of interest, i.e., the landmarks of the detected "nose" and "chin", were used to apply a transparent face mask over the face. The work is illustrated in Figure 2. To place the mask over the face, four main points were taken from the facial landmarks: (i) the top of the nose, i.e., the topmost point on the bridge of the nose, (ii) the bottom of the chin, i.e., the bottommost point of the chin, (iii) the left of the chin, i.e., its leftmost point, and (iv) the right of the chin, i.e., its rightmost point. After obtaining these four points, the transparent mask image was resized by calculating its new length and width as in Equations (1) and (2). Then, the center coordinates were computed for positioning the mask as in Equations (3) and (4).
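The placement step described above can be sketched in a few lines of Python. Equations (1)-(4) are not reproduced in this text, so the geometry below is an assumption rather than the authors' formulas: the mask width is taken as the horizontal distance between the two chin points, the height as the vertical distance from the top of the nose to the bottom of the chin, and the center as the midpoint of that box. The names `place_mask` and `MaskPlacement` and the sample coordinates are hypothetical.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


class MaskPlacement(NamedTuple):
    width: int
    height: int
    center_x: int
    center_y: int


def place_mask(nose_top: Point, chin_bottom: Point,
               chin_left: Point, chin_right: Point) -> MaskPlacement:
    """Size and position a mask overlay from four facial landmarks.

    Assumed geometry for illustration; not the paper's Equations (1)-(4).
    """
    # Assumption: the mask spans the chin from its leftmost to rightmost point.
    width = abs(chin_right[0] - chin_left[0])
    # Assumption: the mask spans from the nose bridge down to the chin bottom.
    height = abs(chin_bottom[1] - nose_top[1])
    # Assumption: the mask is centered on the midpoint of that bounding box.
    center_x = (chin_left[0] + chin_right[0]) / 2
    center_y = (nose_top[1] + chin_bottom[1]) / 2
    return MaskPlacement(round(width), round(height),
                         round(center_x), round(center_y))


# Hypothetical landmark coordinates for a single face.
placement = place_mask(nose_top=(100, 80), chin_bottom=(102, 180),
                       chin_left=(60, 150), chin_right=(140, 150))
print(placement)
```

The resulting width and height would be passed to an image-resize call, and the center would be used to paste the transparent mask onto the face image.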

Proposed Model
Deep learning models require considerable resources to work efficiently. Still, the majority of real-world problems require lightweight applications that can be deployed in a resource-constrained environment. The MobileNet deep learning model is highly lightweight, so it operates efficiently in such environments. MobileNets differ from traditional CNNs in their use of depthwise separable convolutions. The proposed model was based on an enhanced MobileNetV2, whose building blocks expand the input representations [30]. In addition, non-linearities must be removed to maintain representational power. The structural details, dimensions, and architecture of the proposed FMDNet model are shown in Tables 2 and 3. The FMDNet architecture includes two types of blocks: (i) a residual block, which has a stride of one, and (ii) a downsizing block with a stride of two. Each block has three layers. The first layer performs a 1 × 1 convolution followed by a Rectified Linear Unit (ReLU). The second layer performs a depthwise convolution. The third layer performs another 1 × 1 convolution, but without a non-linearity. The reasoning is that if ReLU were applied again, the deep network would only have the power of a linear classifier on the non-zero volume part of the output domain. Additionally, a head of five layers, comprising one average pooling layer with a pool size of 5, a flattening layer, and three dense layers, was added on top of the model.
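To illustrate why depthwise separable convolutions keep such a model lightweight, the following sketch compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 × 1 pointwise convolution. The channel sizes are illustrative assumptions and are not taken from Tables 2 and 3; biases and batch-normalization parameters are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```

For this example layer, the separable form needs roughly one eighth of the weights of the standard convolution, which is the kind of saving that makes MobileNet-style models practical on constrained hardware.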
The architectural highlights are summarized as follows:
• A kernel size of 3 × 3 was used for the spatial convolutions.
• The total number of parameters was 2,619,074, of which 2,584,962 were trainable and 34,112 were non-trainable.
• The network was trained on an NVIDIA GPU (NVIDIA-SMI 455.32.00), with a batch size of 32.

Experimental Environment
The architecture was trained in the Colab environment with 12 GB of RAM and GPU settings. The experiments were carried out using several libraries, such as OpenCV 4.5.5, Keras 2.12.0, and TensorFlow 2.12.0. MobileNetV2 and FMDNet were evaluated using the accuracy, recall, precision, F1-score, and loss performance metrics in a real-time environment. The customized Dataset-1 and the benchmark datasets, Dataset-2, Dataset-3, Dataset-4, and Dataset-5, were each split into two portions of 80% and 20%; the first portion was used for training, and the second for testing. The training process was performed for 100 epochs with a batch size of 32. MobileNetV2 and FMDNet were integrated with YOLOv5 for real-time testing and deployment.
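A minimal sketch of the 80%/20% split follows. The text does not say whether the data were shuffled or which random seed was used, so the shuffling, the seed, the function name `split_dataset`, and the placeholder file names are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into training and testing parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(samples)
    # Assumption: a seeded shuffle, so that the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (file name, label) pairs standing in for labelled face images.
samples = [(f"img_{i:04d}.jpg", i % 2) for i in range(1000)]
train, test = split_dataset(samples)
print(len(train), len(test))
```

In practice a stratified split, which preserves the mask and no-mask proportions in both portions, may be preferable when the classes are imbalanced.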

Confidence and Frames Per Second
During the real-time implementation of the proposed model, the face detector first returns the bounding box coordinates of each face detected in the input image along with the confidence score of the detection. The confidence score represents how likely the box is to contain an object of interest and how accurate the bounding box is. If no object exists in a cell, the confidence score should be zero. A confidence threshold with a default value of 0.5 was used to filter out unwanted detections. The detected face, given by its bounding box coordinates, was passed as the input to the classifier. For each input, the classifier returns the prediction score for both the mask and no-mask classes, and the class with the higher score was taken as the final output. The frames per second (FPS) value represents how fast the object detection model processed the input images and generated the desired outcome.
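The thresholding, class selection, and FPS measurement described above can be sketched as follows. This is an illustrative outline, not the authors' implementation: the function names, the data layout of a detection, and the choice to keep detections whose confidence equals the threshold are assumptions.

```python
import time
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Detection = Tuple[Box, float]     # (bounding box, confidence score)


def filter_detections(detections: Iterable[Detection],
                      threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d[1] >= threshold]


def classify(scores: Tuple[float, float]) -> str:
    """Pick the class with the higher score from (mask, no_mask) scores."""
    mask_score, no_mask_score = scores
    return "mask" if mask_score >= no_mask_score else "no_mask"


def measure_fps(process_frame: Callable[[object], object],
                frames: Iterable[object]) -> float:
    """Return the number of frames processed per second of wall-clock time."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Hypothetical detector output: one confident face and one weak detection.
detections = [((10, 20, 110, 140), 0.99), ((200, 40, 260, 120), 0.31)]
kept = filter_detections(detections)
print(kept, classify((0.03, 0.97)))
```

Here `measure_fps` would wrap the full per-frame pipeline (detection followed by classification), so the reported rate reflects end-to-end throughput rather than the classifier alone.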
Table 4 shows that the detector model had high confidence in detecting faces, scoring 99.99 for faces with no mask and 98.59 for faces with a mask on the customized Dataset-1. At the same time, the detector model ran at an average of 41.72689 FPS for no mask and 41.9384 FPS for faces with a mask, which was sufficient for real-time feedback. In a resource-constrained environment (without GPU support), the lightweight FMDNet model, with a size of 14 MB, provided a 20 FPS detection rate with a 97.97 confidence score, which makes it suitable for deployment in a real-time environment. In addition, FMDNet was evaluated on the public Dataset-2, Dataset-3, Dataset-4, and Dataset-5: the accuracy was almost 99.99% and 98.9% for the no-mask and mask categories, respectively. The FPS scores were also suitable for real-time use of the model.

Comparison between MobileNetV2 and FMDNet Models
This section compares the performance of MobileNetV2 and FMDNet on the training and testing sets in terms of accuracy and loss. Figure 3 shows the learning curves of MobileNetV2 and FMDNet, i.e., the model learning performance over the epochs. In each epoch, the values of the accuracy and loss metrics on the training and validation data form the learning curves of the model. The accuracy was also used as an evaluation metric for testing the model. While the curves of both models can be considered a good fit, FMDNet provided a better model, since its loss and accuracy for both the validation and training sets converged at better values than those of the base model, MobileNetV2. The FMDNet loss converged at a lower value than that of MobileNetV2 for both the validation set and the training set, and the FMDNet accuracy converged at a higher value than that of the base model for both sets.
The accuracy and loss on the training dataset for MobileNetV2 and FMDNet are presented in Figure 4. Figure 4a shows the accuracy, which reflects the proportion of correct decisions made by the classifier over the training set. For the same training set, FMDNet (99%) provided considerably better accuracy than the base model, MobileNetV2 (94%). Figure 4b shows the loss, which in simple terms is the divergence of the predicted probability from the actual result. For the same training set, FMDNet (0.039) achieved a considerably lower loss than the base model, MobileNetV2 (0.18). In FMDNet, the layers were designed to require fewer floating-point operations than the other methods, and the number of trainable parameters was also smaller. With fewer trainable parameters, the model yielded better performance than the state-of-the-art methods. The system worked efficiently and detected faces either with a mask or with no mask.
Figure 5 and Table 5 give the proportion of correct decisions made by FMDNet over the total number of decisions made, calculated over the test set. For the same test set, FMDNet (99%) achieved higher accuracy than the base model, MobileNetV2 (94%). The report includes significant evaluation metrics such as the precision, recall, F1-score, and accuracy [40][41][42][43][44][45][46][47][48][49], defined in Equations (5)-(8). These metrics are based on the true positives, true negatives, false positives, and false negatives.
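Equations (5)-(8) are not reproduced in this text. The sketch below computes the four metrics from the confusion-matrix counts using their standard definitions, on the assumption that the paper's equations follow them. The function name and the example counts are hypothetical and are not results from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard accuracy, precision, recall, and F1-score from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a binary mask / no-mask test set.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The zero-division guards return 0.0 when a denominator is empty, which is one common convention; other conventions (such as reporting the metric as undefined) are equally valid.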

Comparison of Proposed and Existing Models
Machine and deep learning approaches are widely utilized for object identification and recognition problems [50][51][52]. For the real-time face detection problem, some researchers have developed the parallel convolutional face finder method, which can analyze 127 QVGA pictures per second. Others have suggested a reconfigurable model for rotation-invariant multi-view face identification based on a unique two-stage boosting approach, in which a tree-structured detector is built whose nodes are trained by combining several two-stage weak classifiers. In the present work, a deep learning model was proposed as a new AI-based strategy for finding those who disregard the face mask rules in public settings. To do this, a private dataset was built from many facial photos, both with and without masks. The proposed model was trained to recognize face masks in live surveillance footage. The deep learning classifier was integrated with YOLOv5 for real-time processing. The proposed model can identify violations (no face mask) in public areas with a detection accuracy of 99.0%. Compared with other contemporary DL models such as the Fine-Grained Structure Aggregation Network (FSA-Net), MobileNetV2, and ResNet, the model improved the detection capability by 24.03%, 5.0%, and 24.10%, respectively. The model is resource-efficient and lightweight and has a confidence score of 99.0%. It can complete the detection task at 41.72 frames per second (FPS) in real-time situations. As a result, governments may be able to adopt the model to uphold Standard Operating Procedure (SOP) requirements. Table 6 presents a comparative analysis of various state-of-the-art (SOTA) models and techniques used for face mask identification.

Limitations and Future Recommendations
The method is extremely sensitive to the spatial placement of the camera, so it may need to be adjusted to fit the corresponding field of view. Models of this type can be used in conjunction with security cameras in busy public places such as train stations, metros, and office buildings to monitor rule compliance and mask wearing. By training on bigger datasets, the learned weights can be further improved and used in real-world applications. To make the task of guards easier, body temperature detection can later be added to this system. Additionally, it is hoped that this system will be installed in high-population areas that require face mask detectors. The approach may be used in any setting where accuracy and precision are crucial to the task at hand, including public spaces, stations, business settings, roadways, malls, and testing facilities. It may also be used for smart city innovation and could hasten the pace of development in many underdeveloped nations. Our examination of the existing situation offers the opportunity to assess the consequences of significant social change or to become better prepared for the next disaster.

Conclusions
The purpose of this work was to identify face masks in streaming videos and photos. This article proposed a DL-based model called FMDNet, which can be used to detect those who disobey the rules on wearing face masks in public. FMDNet was deployed to process live video streams and photos; it extracted faces from photos and videos and recognized them successfully. Compared with the state-of-the-art models, FMDNet was superior in terms of deployment in a low-resource environment, the confidence score, the accuracy, and the FPS for real-time feedback. FMDNet can accurately identify mask wearers and non-mask wearers from an input image or frame under a variety of lighting conditions and in clear environments. The average accuracy attained after training the FMDNet model was 99%. Because FMDNet accurately determines whether someone is wearing a mask, its straightforward framework and speed can help control the spread of COVID-19. Videos taken in public spaces can also be used for mass screening of people with and without masks, which makes the model useful in densely populated public areas. Public authorities can use the presented study to identify violations efficiently and limit the spread of COVID-19. The model was tested on a real-time lightweight device and can be deployed for real-time applications. It can be utilized for biometric purposes and for continuous monitoring systems in schools, colleges, and various other scenarios. Furthermore, it can be deployed in public places to monitor crowd clusters.