Review

Lightweight Deep Learning Models for Face Mask Detection in Real-Time Edge Environments: A Review and Future Research Directions

Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Mach. Learn. Knowl. Extr. 2026, 8(4), 102; https://doi.org/10.3390/make8040102
Submission received: 5 January 2026 / Revised: 7 April 2026 / Accepted: 10 April 2026 / Published: 15 April 2026

Abstract

Automated face mask detection remains an important component of hygiene compliance, occupational safety, and public health monitoring, even in post-pandemic environments where real-time and non-intrusive surveillance is required. Traditional deep learning models provide strong recognition performance but are often impractical for deployment on embedded and edge devices due to their computational and energy demands. Recent research has therefore emphasized lightweight and hybrid architectures that seek to preserve detection accuracy while reducing model complexity, inference latency, and power consumption. This review presents an architecture-centered synthesis of face mask detection systems, examining conventional convolutional models, lightweight convolutional networks such as the MobileNet family, and hybrid frameworks that integrate efficient backbones with optimized detection heads. Comparative analysis of reported results highlights key trade-offs between accuracy, efficiency, and deployment feasibility under heterogeneous datasets, evaluation protocols, and hardware settings. Open challenges, including improper mask detection, domain adaptation, model compression, and the extension of mask detection toward broader Personal Protective Equipment (PPE) compliance monitoring, are discussed to outline a forward-looking research agenda. Overall, this review consolidates current understanding of architectural design strategies for face mask detection and provides guidance for developing scalable, robust, and real-time deep learning solutions suitable for embedded and mobile platforms.

1. Introduction

The widespread proliferation of airborne infectious diseases such as COVID-19 highlighted the importance of face coverings as an effective measure for reducing person-to-person transmission in both clinical and community environments [1]. However, the relevance of automated mask detection extends well beyond pandemic response. In many occupational settings, including healthcare, pharmaceutical laboratories, hospitality services, manufacturing plants, and food-handling industries, mask-wearing remains a mandatory hygiene and safety requirement [2]. In such environments, inconsistent or improper mask usage can compromise safety, reduce operational compliance, and increase the risk of contamination. Automated, computer-vision-based monitoring systems therefore provide meaningful value by detecting mask-wearing violations and triggering real-time alerts to reduce reliance on manual supervision [2,3].
Deep learning-based face mask detection (FMD) has consequently emerged as a scalable, non-intrusive solution for compliance monitoring across public and professional domains [2,3]. As artificial intelligence (AI) increasingly shifts toward on-device and edge-computing environments, research emphasis has expanded from merely achieving high accuracy to achieving high accuracy with real-time efficiency under hardware limitations [4,5]. This shift reframes the central research question from “How do we detect masks accurately?” to “How do we detect masks accurately and efficiently enough for deployment on constrained edge platforms?”
Traditional Convolutional Neural Network (CNN) architectures such as Visual Geometry Group Network (VGGNet) and Residual Network (ResNet) deliver strong feature-learning capabilities but are computationally intensive and often unsuitable for deployment on low-power devices. Lightweight architectures, including the MobileNet family and other efficient convolutional models, address these challenges by reducing parameter count, lowering latency, and enabling deployment on embedded systems [3]. More recently, hybrid architectures that integrate lightweight feature extractors with optimized one-stage detectors (e.g., Single Shot Detector (SSD), You Only Look Once (YOLO), or attention-enhanced variants) have demonstrated promising improvements in balancing accuracy, speed, and resource usage [5,6]. These advancements suggest that robust and efficient mask detection is feasible even within strict real-time constraints.

1.1. Problem Statement and Rationale

Despite considerable progress, practical deployment of mask detection systems continues to face two major challenges:
(1) Maintaining high recognition reliability under real-world variability such as occlusion, lighting changes, and diverse mask types.
(2) Operating efficiently on hardware-limited platforms where computational resources, memory, and energy budgets are restricted [4,5].
Lightweight and hybrid architectures offer promising solutions, yet the rapid diversification of efficient model designs has created uncertainty regarding which architectural choices best balance accuracy, inference speed, and deployment feasibility. Existing surveys and review efforts remain limited in several respects. Many emphasize pandemic-driven solutions without considering broader compliance applications beyond COVID-19 [7,8,9]. Others focus heavily on accuracy while giving insufficient attention to practical factors such as latency, model size, generalization capacity, or the constraints of embedded deployment [4,5,8]. Issues related to improper mask detection, domain shift, and long-term sustainability remain largely unresolved, and emerging methods for compression, distillation, and hardware-aware optimization are underexplored in the current literature. To address these gaps, this review provides an architecture-centric synthesis of contemporary approaches, summarizes performance and deployment trade-offs, and outlines open challenges to guide future research [6,7,8,9].

1.2. Review Objectives

This review aims to:
(a) Examine conventional, lightweight, and hybrid deep learning architectures used for face mask detection.
(b) Compare reported performance with respect to accuracy, inference efficiency, and deployment suitability.
(c) Analyze the core challenges affecting real-world deployment, including improper mask detection, domain shift, and computational constraints.
(d) Identify future research directions focused on model compression, knowledge distillation, domain adaptation, and broader compliance-oriented applications.
The remainder of this paper is organized as follows. Section 2 presents the methodology adopted for conducting the review. Section 3 analyzes the major architectural families employed in face mask detection systems. Section 4 provides a comparative evaluation of model performance, focusing on accuracy and efficiency considerations. Section 5 discusses the open challenges and outlines potential directions for future research. Finally, Section 6 concludes the paper by summarizing the key findings and contributions.

2. Methodology

The methodology of this review was designed to rigorously analyze lightweight and hybrid deep learning approaches for face mask detection, with particular emphasis on their feasibility in real-time and resource-constrained deployment environments. The assessment focused not only on reported performance metrics such as accuracy and inference speed, but also on architectural efficiency, evaluation settings, hardware compatibility, and scalability. The selection and evaluation process followed a structured and reproducible literature search protocol designed to identify studies relevant to lightweight and hybrid deep learning architectures for face mask detection. Once included, each study was examined through a multi-dimensional analytical framework considering the problem addressed, the proposed architecture, and the resulting research gaps.
Following the initial database search, the candidate set of studies was progressively refined through screening and relevance assessment. In addition to keyword-based retrieval, backward and forward citation tracing were used to identify influential studies that were not directly captured in the initial search results. After applying the screening and eligibility criteria described in Section 2.1, Section 2.2 and Section 2.3, 81 studies were retained for qualitative synthesis. During manuscript revision and reviewer feedback, five additional relevant studies were incorporated, resulting in a final corpus of 86 studies included in the analysis. Figure 1 illustrates the overall literature identification, screening, analysis, and categorization framework adopted in this review.

2.1. Literature Search Strategy

The literature search was conducted using peer-reviewed digital libraries including ScienceDirect, IEEE Xplore, Springer Link, ACM Digital Library, and MDPI. These sources were selected because they host a substantial proportion of research publications in computer vision, machine learning, and edge computing. Publications from 2020 to 2025 were primarily targeted in order to capture developments associated with post-pandemic computer vision applications and the increasing deployment of lightweight deep learning models on edge platforms. However, earlier studies were also considered when they provided foundational architectural concepts or widely adopted model designs relevant to face mask detection or lightweight deep learning frameworks.
To ensure transparency and reproducibility, the search followed a structured protocol based on predefined keywords and Boolean query combinations. Searches were performed within the title, abstract, and keyword fields of each database. The search terms covered three conceptual categories: (1) the application domain of face mask detection, (2) deep learning architectures, and (3) deployment environments associated with real-time edge inference.
Application-related terms included “face mask detection,” “mask detection,” “mask wearing detection,” and “improper mask detection.” Architecture-related terms included “lightweight deep learning,” “MobileNet,” “YOLO,” “EfficientNet,” “ShuffleNet,” “SqueezeNet,” and “Vision Transformer.” Deployment-oriented terms included “edge computing,” “embedded deployment,” and “real-time inference.” These terms were combined using Boolean operators to retrieve studies addressing both face mask detection and deployment-oriented deep learning architectures. A representative primary query used across several databases was formulated as:
(“face mask detection” OR “mask detection” OR “mask wearing detection” OR “improper mask detection”)
AND
(“deep learning” OR “lightweight CNN” OR “MobileNet” OR “YOLO” OR “EfficientNet” OR “ShuffleNet” OR “SqueezeNet” OR “Vision Transformer”)
AND
(“edge computing” OR “embedded deployment” OR “real-time inference”)
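The query above can be assembled programmatically from its three term blocks, which also keeps per-database adaptations consistent with the protocol. A minimal sketch (the term lists are copied from the protocol above; the helper names are illustrative):

```python
# Term blocks from the search protocol: application domain,
# model architecture, and edge-deployment context.
application = ["face mask detection", "mask detection",
               "mask wearing detection", "improper mask detection"]
architecture = ["deep learning", "lightweight CNN", "MobileNet", "YOLO",
                "EfficientNet", "ShuffleNet", "SqueezeNet", "Vision Transformer"]
deployment = ["edge computing", "embedded deployment", "real-time inference"]

def or_block(terms):
    """Join quoted terms with OR and wrap the block in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The primary Boolean query: the three OR-blocks joined with AND.
query = " AND ".join(or_block(b) for b in (application, architecture, deployment))
print(query)
```

Shortened per-database variants (such as the ScienceDirect query below) can then be produced by passing reduced term lists to the same helper.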
Databases that support complex Boolean expressions, such as IEEE Xplore, SpringerLink, and the ACM Digital Library, were queried using the primary search expression. Under the applied time window (2020–2025), these searches returned 20 studies from IEEE Xplore, 85 from SpringerLink, and 44 from the ACM Digital Library.
Because database search engines differ in the complexity of Boolean expressions they support, equivalent shortened queries were used where necessary while preserving the same conceptual components of the search protocol. ScienceDirect, for example, limits the number of Boolean connectors per query field. Consequently, a reduced query was applied:
(“face mask detection” OR “mask wearing detection”)
AND
(MobileNet OR YOLO)
AND
(“edge computing” OR “real-time inference”)
This adapted query retained the three conceptual search blocks (application domain, model architecture, and edge deployment context) and returned 35 studies within the same time window.
Similarly, the MDPI search interface did not reliably process the full Boolean query, yielding no results when the complete expression was used. Therefore, a simplified targeted keyword query focusing on mask detection and edge deployment (e.g., “face mask detection using deep learning for edge computing”) was applied, which returned 10 studies for the period 2020–2025.
All retrieved records from the five databases were exported and merged into a consolidated dataset, resulting in 194 studies prior to deduplication. In addition to database queries, backward and forward citation tracing was performed to identify influential studies with architectural innovations or deployment-oriented experimentation that were not directly retrieved through the keyword search.
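Merging exports from several databases into one consolidated, duplicate-free set is a simple keyed pass. The sketch below assumes DOI as the primary key with a normalized title as fallback; this keying convention is an illustrative assumption, not a procedure stated in the review:

```python
def deduplicate(records):
    """Keep one copy per study. Each record is a dict exported from a
    database; the DOI is preferred as the merge key (assumption), with a
    whitespace/case-normalized title as fallback when no DOI is present."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get("doi") or " ".join(rec["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records retrieved from two databases.
records = [
    {"doi": "10.3390/x1", "title": "SSDMNV2 mask detection"},
    {"doi": "10.3390/x1", "title": "SSDMNV2 mask detection"},  # same DOI, other DB
    {"title": "Lightweight CNN  survey"},
    {"title": "lightweight cnn survey"},  # same study, different casing/spacing
]
print(len(deduplicate(records)))  # 2 unique studies
```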

2.2. Inclusion and Exclusion Criteria

To ensure consistency and reproducibility in the screening process, a set of inclusion and exclusion criteria was defined to determine which studies were retained for detailed analysis.
Studies were included if they presented a deep learning-based method for face mask detection (binary or multi-class) and provided quantitative evaluation using publicly available or well-described datasets. Eligible studies were required to demonstrate empirical validation through measurable performance indicators such as accuracy, precision–recall metrics, inference latency, or frames-per-second (FPS). Particular emphasis was placed on research exploring lightweight architectures, real-time inference, resource-constrained deployment, or model-level optimization, as these aspects are central to practical edge-based mask detection systems. To ensure methodological reliability and comparability across sources, only peer-reviewed journal articles and reputable conference proceedings indexed in SCI, SCIE, or Scopus were retained.
Studies were excluded if they lacked empirical experimentation, relied exclusively on traditional image processing techniques without deep learning architectures, or focused solely on cloud-based inference without considering deployment feasibility in edge or embedded environments. In addition, non-peer-reviewed preprints, duplicate records retrieved across multiple databases, and sources lacking sufficient methodological detail were excluded from further analysis.
These criteria ensured that the final corpus of studies consisted of experimentally validated deep learning approaches relevant to face mask detection and deployment-oriented computer vision systems.

2.3. Screening and Selection Approach

Screening was conducted through a multi-stage filtering process designed to progressively refine the corpus of retrieved studies according to the eligibility criteria defined in Section 2.2. The initial database search across the selected digital libraries yielded 194 studies prior to deduplication.
In the first stage, duplicate records retrieved from multiple databases were identified and removed in order to construct a consolidated set of unique publications. In the second stage, title and abstract screening was performed to assess the topical relevance of each study. Articles that clearly did not address face mask detection, did not employ deep learning architectures, or focused on unrelated computer vision tasks were excluded at this stage. In the third stage, the remaining studies underwent full-text eligibility assessment. During this stage, the methodological characteristics of each study were examined in greater detail, including the proposed architecture, the presence of empirical experimentation, dataset description, reported evaluation metrics, and the relevance of the proposed method to real-time or resource-constrained deployment scenarios. Studies lacking sufficient experimental validation or architectural relevance were excluded from further analysis.
Through this staged screening process, studies that did not meet the defined methodological and thematic criteria were progressively filtered out, ultimately resulting in 81 studies retained for detailed literature analysis.
During the revision stage of the manuscript, five additional relevant studies were identified through backward and forward citation tracing as well as reviewer recommendations. These studies satisfied the same eligibility criteria and were incorporated into the analysis, resulting in a final corpus of 86 studies included in this review.
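The staged filtering described in this subsection amounts to applying an ordered sequence of keep-predicates to the retrieved corpus. In the sketch below, the records and the per-stage fields are illustrative stand-ins, not the actual corpus or its counts:

```python
def run_screening(records, stages):
    """Apply ordered screening stages; each stage is (name, keep_predicate).
    Prints the record count before and after each stage for transparency."""
    for name, keep in stages:
        before = len(records)
        records = [r for r in records if keep(r)]
        print(f"{name}: {before} -> {len(records)}")
    return records

# Hypothetical records carrying the fields checked at each stage.
corpus = [
    {"id": 1, "dup": False, "on_topic": True,  "empirical": True},
    {"id": 2, "dup": True,  "on_topic": True,  "empirical": True},
    {"id": 3, "dup": False, "on_topic": False, "empirical": True},
    {"id": 4, "dup": False, "on_topic": True,  "empirical": False},
]
stages = [
    ("deduplication",            lambda r: not r["dup"]),
    ("title/abstract screening", lambda r: r["on_topic"]),
    ("full-text eligibility",    lambda r: r["empirical"]),
]
final = run_screening(corpus, stages)
print([r["id"] for r in final])  # [1]
```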

2.4. Data Extraction and Categorization

For each selected study, key information was systematically extracted to support thematic and architectural analysis. The extracted attributes included the research problem addressed, architectural design, enhancement strategies, datasets and evaluation conditions, training setup, performance metrics, model complexity (e.g., parameter count), inference performance (e.g., frames-per-second or latency), and deployment-related characteristics such as hardware platforms or optimization pipelines.
To ensure the reliability of the reviewed literature, the methodological rigor of each study was qualitatively assessed during the data extraction process. Each study was examined with respect to the clarity of the experimental setup, transparency of the dataset description, reporting of evaluation metrics, and reproducibility of the proposed methodology. Studies lacking sufficient experimental detail or methodological transparency were treated cautiously during the comparative analysis and were not considered primary sources when interpreting architectural performance trends.
Following data extraction and quality assessment, the reviewed works were categorized according to their architectural characteristics, including conventional convolutional neural networks (CNNs), lightweight convolutional architectures, and hybrid detection frameworks. This categorization enabled a structured comparison of architectural design strategies, computational efficiency, and deployment feasibility across the reviewed literature.
The extracted evidence reveals substantial variation in architectural choices, dataset configurations, evaluation protocols, and deployment assumptions. Such diversity underscores the importance of an architecture-centered analytical perspective when interpreting reported performance results, as differences in experimental settings and optimization strategies can significantly influence the observed outcomes. The analysis also highlights several research gaps, including inconsistent reporting of inference metrics, limited multi-class annotations for improper mask detection, and scarce evaluation under real-world edge deployment constraints.
To ensure transparency and consistency in the comparative review, the essential characteristics of each included study were synthesized into a structured analytical summary describing the experiment objective, research goal, datasets and materials used, methodological approach, reported results, and the principal conclusions derived from the study. These extracted attributes correspond to the fields summarized in Table 1, which provides a comparative overview of the experimental design, methodological approaches, and findings reported in the selected literature. Together, the search protocol, eligibility criteria, screening stages, and structured data extraction framework provide a transparent and reproducible methodological basis for the comparative analysis presented in this review.
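The extraction attributes listed above map naturally onto a flat record per study. The field names in the sketch below are illustrative paraphrases of those attributes, not a schema defined by the review:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StudyRecord:
    """One extracted study; fields mirror the extraction protocol above
    (names are illustrative assumptions, not the review's actual schema)."""
    problem: str                          # research problem addressed
    architecture: str                     # backbone / detection framework
    datasets: List[str] = field(default_factory=list)
    accuracy: Optional[float] = None      # reported accuracy (%), if any
    params_millions: Optional[float] = None  # model complexity, if reported
    fps: Optional[float] = None           # inference throughput, if reported
    hardware: Optional[str] = None        # deployment platform, if stated
    category: str = "uncategorized"       # conventional / lightweight / hybrid

# Example record, using figures reported later in this review.
s = StudyRecord(problem="binary mask classification",
                architecture="MobileNetV2",
                datasets=["MaskedFace-Net"],
                accuracy=99.21,
                category="lightweight")
print(s.architecture)
```

Leaving efficiency fields as `None` when a study does not report them makes the inconsistent reporting of inference metrics, noted above as a research gap, explicit in the data.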

3. Architectural Landscape of Face Mask Detection Models

Deep learning-based face mask detection systems rely fundamentally on the architectural design of their underlying neural networks. Architectural choices determine not only a model’s feature-extraction capacity and recognition accuracy but also its computational footprint, memory usage, inference speed, and overall suitability for deployment on resource-constrained edge environments. As mask detection has transitioned from high-performance computing systems toward embedded and real-time monitoring platforms, the efficiency and scalability of these architectures have become just as important as their classification accuracy. To provide a clearer analytical structure for the diverse approaches reported in the literature, the reviewed face mask detection models can be further interpreted through a structured architectural taxonomy. Rather than viewing each model independently, this taxonomy organizes existing approaches according to three complementary dimensions: architectural design philosophy, detection paradigm, and deployment orientation.
The first dimension, architectural design philosophy, concerns how the neural network is structured to balance representational capacity and computational efficiency. From this perspective, face mask detection models can broadly be grouped into three primary families. Conventional convolutional neural networks rely on deep hierarchical feature extraction and are typically derived from large-scale image classification architectures such as VGGNet, ResNet, DenseNet, or Inception. Lightweight convolutional architectures, including MobileNet, EfficientNet, ShuffleNet, and SqueezeNet, reduce computational complexity through techniques such as depthwise separable convolution, channel shuffling, and compound network scaling, making them suitable for embedded and mobile inference. Hybrid architectures combine lightweight feature extractors with specialized detection heads or additional machine learning components in order to balance computational efficiency with improved detection capability.
A second dimension, detection paradigm, concerns the computational workflow adopted by the system. Face mask detection frameworks typically follow one of three computational workflows: two-stage detection pipelines, derived from the Region-based Convolutional Neural Network (R-CNN) family, in which candidate regions are first proposed and then classified; single-stage detection architectures, such as YOLO and SSD variants, which directly predict object locations and mask-wearing classes in a unified inference pipeline; and classification-based pipelines, where face detection is performed separately and a convolutional network subsequently determines mask compliance.
The third dimension, deployment orientation, reflects the computational environment in which the system is expected to operate. Some architectures are primarily designed for cloud or server-scale environments, where computational resources permit the use of deeper networks and larger parameter counts. Others are designed for edge-optimized deployment, prioritizing low latency, reduced memory requirements, and energy-efficient inference. A further category includes hybrid edge–cloud solutions, in which computational workloads are distributed between local embedded devices and centralized servers.
Together, these three dimensions provide a structured framework for interpreting the architectural evolution of face mask detection systems. The following subsections analyze representative models within each architectural family, beginning with conventional convolutional neural networks and progressing toward increasingly efficient architectures designed for real-time edge deployment. The proposed taxonomy is illustrated in Figure 2, which summarizes the architectural landscape of face mask detection systems according to the three complementary dimensions of architectural design philosophy, detection paradigm, and deployment orientation.
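The three-axis taxonomy can be expressed as a small lookup structure for categorizing studies consistently. The axis labels below paraphrase the categories named in this section; the validation helper is an illustrative sketch:

```python
# Three complementary taxonomy axes from this section.
AXES = {
    "design_philosophy": {"conventional", "lightweight", "hybrid"},
    "detection_paradigm": {"two-stage", "single-stage", "classification-based"},
    "deployment": {"cloud", "edge", "edge-cloud"},
}

def tag_study(design, paradigm, deployment):
    """Validate and return a study's position along the three axes."""
    tags = {"design_philosophy": design,
            "detection_paradigm": paradigm,
            "deployment": deployment}
    for axis, value in tags.items():
        if value not in AXES[axis]:
            raise ValueError(f"unknown {axis}: {value}")
    return tags

# e.g., a lightweight backbone in a single-stage, edge-oriented detector:
print(tag_study("lightweight", "single-stage", "edge"))
```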

3.1. Conventional CNN-Based Approaches

Early work on automated face mask detection relied heavily on conventional convolutional neural network (CNN) architectures, which formed the foundation of modern deep learning in computer vision. These models, originally developed for large-scale image classification and object detection tasks, were adopted due to their strong representational capacity and readily available pre-trained weights. Architectures such as AlexNet, VGGNet, Inception, ResNet, DenseNet, Xception, and the R-CNN family provided the initial backbone for mask detection pipelines during the early phase of the COVID-19 pandemic.
The evolution of conventional CNNs reflects a gradual improvement in depth, efficiency, and stability. LeNet-5 introduced the first successful convolutional network structure for digit recognition, demonstrating the feasibility of learned hierarchical features [16]. AlexNet catalyzed the deep learning revolution by winning the ImageNet 2012 challenge, leveraging ReLU activations and Graphics Processing Unit (GPU) training to demonstrate substantial performance improvements over traditional hand-crafted descriptor-based approaches [17]. VGG16/VGG19 refined network depth with small, uniform 3 × 3 convolutions, producing strong yet computationally heavy models [18]. The Inception family introduced multi-branch processing to improve efficiency while preserving expressive power [19], while Inception-ResNet blended residual learning with inception modules to enable deeper and more stable optimization [20].
Several major CNN families subsequently introduced important structural innovations. ResNet, with its identity skip connections, addressed the vanishing-gradient problem and enabled successful training of networks exceeding 100 layers [21]. DenseNet extended this concept by connecting each layer to all subsequent layers, improving feature reuse and reducing parameter growth [22]. Xception reorganized convolutions into a fully depthwise separable form, a precursor to later lightweight models such as MobileNet [23]. More recent backbone families such as RegNet attempted to design regular, scalable architectures via design-space exploration, though these were not widely adopted for mask detection primarily due to their higher computational requirements [24].
Beyond classification backbones, conventional object detection frameworks also played a significant role. The R-CNN family including R-CNN, Fast R-CNN, and Faster R-CNN, extended CNNs to region-based object detection using two-stage processing [25,26,27]. Mask R-CNN further introduced an instance segmentation branch, enabling pixel-level analysis of facial regions or mask boundaries [28]. These models were frequently adopted in early COVID-19 surveillance systems, especially in industrial or controlled environments requiring both face detection and classification.
In the context of face-mask detection, conventional CNN backbones were widely adopted in early studies implementing either face-classification pipelines or end-to-end object-detection frameworks. A typical classification workflow included: (i) frame acquisition from CCTV or mobile cameras, (ii) pre-processing such as resizing and normalization, (iii) face detection using Haar cascades, Histogram of Oriented Gradients (HOG) detectors, or CNN-based region detectors, and (iv) classification of each cropped face using a CNN such as VGG16, ResNet50, DenseNet121, or InceptionV3. Conventional CNNs and detection frameworks were thus employed early on to assess the feasibility of automated mask compliance. For instance, MaskedFace-Net, a large publicly available dataset containing over 137,000 images with correct and incorrect mask wearing annotations, was used to train and test classifiers based on VGG, ResNet, or DenseNet backbones; results demonstrated that these models could reliably distinguish between masked and unmasked faces under controlled image conditions [29].
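The four-step classification workflow above can be sketched as a generic pipeline with injected stages, so that any face detector (Haar cascade, HOG, or CNN region detector) and any CNN classifier can be plugged in. The toy detector and classifier below are placeholders, not real models:

```python
def mask_compliance_pipeline(frame, detect_faces, preprocess, classify_face):
    """Classification-based FMD workflow, per steps (i)-(iv) above:
    the frame comes in (i), each detected face region (iii) is cropped,
    resized, and normalized (ii), then labeled by a classifier (iv)."""
    results = []
    for box in detect_faces(frame):      # (iii) face detection
        face = preprocess(frame, box)    # (ii) crop, resize, normalize
        label = classify_face(face)      # (iv) e.g. "mask" / "no_mask"
        results.append((box, label))
    return results

# Toy stand-ins to show the control flow (no real detector or CNN here).
frame = "camera-frame"
boxes = [(10, 10, 64, 64), (100, 40, 60, 60)]
out = mask_compliance_pipeline(
    frame,
    detect_faces=lambda f: boxes,
    preprocess=lambda f, b: b,
    classify_face=lambda face: "mask" if face[2] >= 64 else "no_mask",
)
print(out)  # [((10, 10, 64, 64), 'mask'), ((100, 40, 60, 60), 'no_mask')]
```

In a real deployment the stages would wrap, for example, an OpenCV cascade detector and a fine-tuned backbone, but the control flow is the same.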
Conventional CNN backbones have been employed in both face-classification and object-detection pipelines for mask detection. On the detection side, SE-YOLOv3, a YOLOv3-based detector enhanced with a Squeeze-and-Excitation module, was proposed and trained on the Properly Wearing Masked Face Dataset (PWMFD) dataset of 9205 images. The reported results indicated high detection accuracy along with real-time inference capability, demonstrating that conventional CNN backbones remain effective when combined with modern detection heads [30].
However, performance of conventional CNN-based detectors may degrade under real-world variability. A recent survey on masked-face recognition and detection notes that occlusion (e.g., by hands or non-mask objects), non-frontal poses, diverse mask designs, and inconsistent lighting conditions significantly reduce the reliability of classification and detection models based on conventional CNNs [31]. Even large annotated datasets, despite their scale, often contain predominantly frontal views or synthetically generated mask images, which limit their representativeness for unconstrained real-world environments [29].
Overall, while conventional CNN architectures continue to provide a valuable baseline for face-mask detection tasks, their robustness and generalization can remain limited in challenging, real-world scenarios. The observed deficiencies under occlusion, pose variation, and diverse usage conditions motivate the shift toward lightweight CNNs, optimized detection heads, and hybrid architectures, which strive to balance accuracy with computational efficiency and deployment feasibility in edge or surveillance environments. A comparative overview of major conventional CNN architectures, highlighting their innovations, parameter scales, and deployment implications, is presented in Table 2.
Despite their strengths, these networks exhibit important limitations for real-time deployment and motivated the rise of lightweight CNNs (e.g., MobileNet, ShuffleNet) and hybrid architectures designed for edge-optimized deployment while retaining adequate accuracy.

3.2. Lightweight Convolutional Models

Lightweight convolutional neural networks are designed to provide competitive recognition accuracy while significantly reducing memory footprint, parameter count and computational cost. Instead of relying on wide and deep stacks of standard convolutions, these models adopt mechanisms such as depth-wise separable convolutions, inverted residual bottlenecks, grouped pointwise convolutions with channel shuffling, or squeeze–expand fire modules to minimize the number of multiplications and parameters. As a result, they are well suited for deployment on mobile and embedded platforms, including edge devices used in camera-based mask monitoring systems.
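The savings from depthwise separable convolution can be verified with simple arithmetic: a standard k × k convolution over C_in input and C_out output channels uses k·k·C_in·C_out weights, while the depthwise separable form uses k·k·C_in (depthwise) plus C_in·C_out (pointwise). A quick sketch, with biases omitted:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dws_conv_params(k, c_in, c_out):
    """Depthwise separable: k x k depthwise conv + 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# A representative 3x3 layer with 256 input and 256 output channels.
std = conv_params(3, 256, 256)       # 589,824 weights
dws = dws_conv_params(3, 256, 256)   # 67,840 weights
print(f"reduction: {std / dws:.1f}x")
```

For 3 × 3 kernels at these widths the reduction is close to 9×, which is why the technique underpins most of the architectures discussed in this subsection.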
Among lightweight convolutional architectures, the MobileNet family represents one of the most influential design frameworks for efficient visual recognition. The original MobileNet architecture introduced depth-wise separable convolutions together with global width and resolution multipliers to reduce computational complexity while preserving recognition performance on mobile vision tasks [32]. MobileNetV2 extended this idea by incorporating inverted residual blocks with linear bottlenecks, improving accuracy across multiple benchmarks while preserving efficiency [33]. Later variants such as MobileNetV3 [34] further refined block design and activation functions, but MobileNetV2 remains the most widely adopted backbone in the face mask detection literature due to its balance between representational power and computational cost.
For example, a MobileNetV2-based face mask recognition system incorporating data augmentation and optimization strategies achieved an accuracy of 99.21% on a two-class dataset, demonstrating robust performance in real-time scenarios. The model was shown to effectively detect masks across different viewing angles, including side-face orientations, highlighting the practical reliability of lightweight architectures in real-world monitoring applications [35]. Similarly, another implementation reports an accuracy of approximately 98.69% in a real-time prototype system processing live video streams, highlighting the practical feasibility of lightweight models for deployment in resource-constrained environments [36].
In face mask detection systems, lightweight backbones derived from this architectural family are frequently employed either as feature extractors within object-detection pipelines or as classifiers in transfer-learning frameworks. For example, one real-time system integrates a lightweight backbone into a single-shot detection MobileNetV2 architecture (SSDMNV2) and demonstrates effective two-class mask detection using OpenCV, TensorFlow, and Keras on live video streams [37]. Other studies fine-tune the architecture on custom mask datasets and attach classical machine-learning classifiers such as Support Vector Machine (SVM) or decision trees to the extracted deep features, illustrating how lightweight embeddings can support multiple deployment strategies. Additional implementations incorporate the model into embedded or Internet of Things (IoT)-oriented monitoring systems, highlighting its ability to achieve real-time inference on resource-constrained hardware while maintaining strong recognition performance [37].

Other lightweight architectures follow different optimization strategies. EfficientNet introduced compound scaling of depth, width and input resolution, combined with an architecture discovered via neural architecture search, producing a family of models that reported very high accuracy while using fewer parameters than many conventional CNN architectures [11]. ShuffleNet instead reduces computational cost by using pointwise group convolutions and channel shuffle, allowing efficient information mixing while maintaining high throughput on ARM-based mobile devices [38]. SqueezeNet targets extreme parameter reduction by replacing many standard convolution layers with fire modules (1 × 1 squeeze layers followed by 1 × 1 and 3 × 3 expand layers), achieving AlexNet-level ImageNet accuracy with roughly 50× fewer parameters [39]. These architectures form the algorithmic foundation upon which many recent lightweight mask detection systems are built.
EfficientNet-based models have also been explored for mask detection, typically as stronger yet still relatively compact alternatives to MobileNet. One study employing EfficientNet-B0 for binary mask classification reports an accuracy of around 99–99.7% on a two-class problem, reporting competitive performance relative to several heavier backbones while remaining deployable in practical systems [40]. In [41] the authors employ EfficientNet-B0 as the feature extractor backbone and a large-margin piecewise-linear (LMPL) classifier on top of deep features. The method reported accuracies of 99.53% and 99.64% for the two tasks respectively, showing improved performance relative to several conventional end-to-end CNN models and classical image classification approaches in the cited study. The authors argue that their solution achieves both high accuracy and efficiency, making it suitable for real-world deployment in scenarios like biometric authentication or public-health compliance checks. Other works using EfficientNet variants for mask identification similarly highlight their favorable accuracy–efficiency trade-off compared with legacy CNNs, though their computational requirements are generally higher than those of MobileNetV2 on extremely low-power devices [42].
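EfficientNet's compound scaling rule can be stated in a few lines; a sketch using the base coefficients reported in the original EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15, chosen under the constraint α · β² · γ² ≈ 2 so that FLOPs roughly double per unit increase of the compound coefficient φ):

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases

def compound_scale(phi):
    # Jointly scale network depth, width, and input resolution
    # by a single compound coefficient phi.
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

depth_mult, width_mult, res_mult = compound_scale(2)  # two steps up from B0
```

Scaling all three dimensions together, rather than deepening or widening alone, is what lets larger EfficientNet variants retain a favorable accuracy-to-parameter ratio.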
Architectures derived from SqueezeNet provide another lightweight option. SqueezeNet itself was designed as an extremely compact classification network, but face mask detection has inspired specialized derivatives such as SqueezeMaskNet, which integrates fire modules with channel-attention mechanisms to support four-way classification (correctly worn mask, mask covering only the mouth, mask not covering, and no mask) while running in real time on edge GPUs [43].
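The parameter economy of a fire module is straightforward to verify arithmetically; a sketch with dimensions resembling SqueezeNet's early fire2 block (biases ignored; treat the sizes as illustrative):

```python
def fire_module_params(c_in, s1x1, e1x1, e3x3):
    # Squeeze (1x1 convs down to s1x1 channels), then parallel 1x1 / 3x3 expand.
    squeeze = c_in * s1x1
    expand = s1x1 * e1x1 + 9 * s1x1 * e3x3
    return squeeze + expand

def conv3x3_params(c_in, c_out):
    # Plain 3x3 convolution producing the same number of output channels.
    return 9 * c_in * c_out

fire = fire_module_params(96, 16, 64, 64)  # output = 64 + 64 = 128 channels
plain = conv3x3_params(96, 128)            # same input/output shape, 3x3 only
```

Because the 3 × 3 filters only ever see the narrow squeeze output, the fire module uses roughly an order of magnitude fewer weights than a plain 3 × 3 layer of the same width.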
Table 3 presents a structured comparison of representative lightweight convolutional architectures, summarizing their key architectural concepts, approximate model complexity, and typical usage in face mask detection systems.
Overall, lightweight convolutional architectures have improved the feasibility of real-time face mask detection on embedded platforms by reducing computational overhead while maintaining adequate recognition performance. Evidence reported in the literature suggests that MobileNetV2-based solutions are frequently adopted in deployment-oriented implementations due to their balance between efficiency and predictive performance, particularly in transfer-learning and single-shot classification pipelines. EfficientNet derivatives have also been explored and are often reported to achieve high classification accuracy, including in improper mask detection scenarios, although typically with moderately higher computational requirements. SqueezeNet-based variants such as SqueezeMaskNet prioritize parameter efficiency and high frame rates, making them suitable for multi-class compliance monitoring in resource-constrained edge environments. These trends and trade-offs among lightweight backbones are summarized in Table 3.

3.3. Hybrid Architectures

Hybrid architectures represent an advanced design philosophy that extends beyond standalone conventional CNNs or lightweight backbones by combining multiple complementary modules, typically a feature-extraction network and a detection or classification head, to handle both spatial localization and mask classification effectively. Unlike the approaches in Section 3.1 and Section 3.2, hybrid architectures integrate networks such as MobileNetV2, EfficientNet, or SqueezeNet with detection frameworks like YOLO, SSD, or Faster R-CNN, or with classical machine-learning classifiers. These combinations are not random; they arise from deliberate architectural reasoning to meet real-time deployment demands, especially in public health monitoring during COVID-19. Researchers have demonstrated that hybridization improves robustness, speed, and accuracy under real-world constraints, such as occlusion, inconsistent lighting and crowd density (e.g., YOLOv2 + ResNet50 [44], MobileNetV2 + SSD [45], YOLOv5 with attention mechanisms [46], or CNN + SVM hybrids [47]).
Lightweight hybrid architectures are crucial for face mask detection, especially in the context of the COVID-19 pandemic, where real-time monitoring in public spaces is essential for public health compliance. These architectures enable deployment on edge devices, which are often resource-constrained, by reducing computational costs and memory requirements while maintaining high detection accuracy. The integration of lightweight models such as MobileNetV2, and the use of heavier feature extractors like VGG16 within hybrid pipelines, allows for efficient feature extraction and classification, making them suitable for real-time applications in environments with limited hardware capabilities [48,49,50,51,52]. This motivation aligns with findings from multiple hybrid studies, where integrating lightweight backbones with advanced detection heads significantly improved end-to-end performance for mask localization and classification in constrained settings (e.g., YOLOv3-based hybrids [30], ensemble hybrid detectors [2], and smart-city surveillance systems that combine detection and classification modules within unified pipelines [53]).
The design of lightweight hybrid architectures typically involves combining the strengths of different model types, such as convolutional neural networks (CNNs) and transformer models, to optimize both efficiency and accuracy. Techniques like depth-wise separable convolutions, inverted residual structures, and parallel hybrid architectures are commonly employed to minimize the number of parameters and computational operations. These principles ensure that models can operate effectively on mobile and edge devices without sacrificing performance [48,49,51,54]. Such design principles are evident in real hybrid implementations. For example, YOLOv5 combined with coordinate attention blocks improves detection precision while maintaining fast inference [46], and CNN + ML hybrids (deep features + SVM) exploit efficiency while preserving discriminative power [47]. These examples demonstrate that hybridization follows systematic design logic rather than ad hoc network pairing.
Developers of lightweight hybrid architecture must balance the trade-off between computational efficiency and detection accuracy. While lightweight models are optimized for speed and resource usage, they may experience accuracy degradation under challenging conditions such as illumination changes or occlusions. Hybrid models attempt to mitigate these issues by leveraging complementary strengths of different architectures, but some loss in robustness may still occur in extreme scenarios [55,56,57]. This trade-off is well-documented in comparison studies, where hybrid detectors show better occlusion-robustness than pure classifiers. For instance, YOLOv2-ResNet50 hybrids handle masked and partially occluded faces more effectively than standalone CNNs [44], and multi-detector ensembles enhance detection under crowd conditions [2]. Nevertheless, no hybrid fully eliminates robustness issues, especially under severe occlusions or rapid motion.
To further enhance the performance of lightweight hybrid models, various optimization techniques are applied. These include model compression, quantization, pruning, and knowledge distillation, all aimed at reducing model size and computational demands. Additionally, architectural innovations such as ghost modules and Bidirectional Feature Pyramid Networks (Bi-FPN) are explored to improve feature extraction and detection capabilities without increasing resource consumption [51,56]. These optimization strategies extend beyond neural architecture design alone. Hybrid systems increasingly integrate acceleration frameworks such as TensorRT, automated data labeling pipelines, and edge–cloud collaboration mechanisms, as demonstrated in optimized YOLOv5-based hybrids [46]. At a broader level, system-level hybrid deployments combine edge intelligence with IoT infrastructure to compensate for on-device constraints, enabling scalable public-health monitoring applications [53].
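As one concrete instance of these compression techniques, symmetric post-training int8 quantization can be sketched in a few lines of pure Python (toy weight values; real toolchains such as TensorRT or mobile runtimes typically quantize per-tensor or per-channel with calibration data):

```python
def quantize_int8(weights):
    # Symmetric post-training quantization: map floats into [-127, 127]
    # with a single per-tensor scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.42, -1.27, 0.05, 0.8]   # toy float32 weights
q, scale = quantize_int8(w)     # int8 storage: 4x smaller than float32
w_hat = dequantize(q, scale)    # approximate reconstruction at inference time
```

The 4× storage reduction (and the ability to use integer arithmetic units) is what makes quantization a standard step before deploying mask detectors on embedded accelerators.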
Several case studies demonstrate the successful deployment of lightweight hybrid architectures for face mask detection. For example, systems combining MobileNetV2 and VGG19 have been implemented for real-time surveillance, achieving high accuracy and robustness in drone-based monitoring and public space compliance checks. Other models, such as SSD with MobileNetV2, have been used for automated face mask compliance monitoring, highlighting the flexibility and effectiveness of these architectures in diverse real-world scenarios [50,51,52,58].
Overall, hybrid detectors based on MobileNetV2 + SSD [45], YOLOv3 variants [30] and multi-component ensembles [2] are frequently reported to provide strong performance in deployment scenarios requiring person-level localization, mask-wearing classification and real-time operation. Additionally, smart-city systems integrating detection models with sensor networks, edge-computing nodes and cloud-based analytics highlight the broader potential of hybrid architectures in scalable public-health monitoring [53]. These examples reinforce that hybrid architectures serve as a bridge between computational feasibility and large-scale real-world applicability. Table 4 summarizes the main hybrid architectures employed in face-mask detection, highlighting their backbones, detection heads and reported strengths.
In summary, hybrid architectures provide a principled balance between speed, accuracy and deployment feasibility, making them the dominant and most practical design strategy for real-time face-mask detection in embedded and resource-constrained environments.

4. Comparative Performance Analysis

Building on the architectural taxonomy introduced in Section 3, this section focuses on the comparative performance of representative face mask detection models reported in recent studies. Rather than revisiting the architectural principles of conventional, lightweight, and hybrid networks, the analysis here emphasizes how these families perform under different evaluation metrics and deployment constraints.

4.1. Evaluation Metrics

Performance evaluation of face mask detection systems commonly relies on standard metrics from the classification and object-detection literature to quantify prediction correctness, robustness, and class discrimination. Across the reviewed studies, the most frequently reported metrics include Accuracy, Precision, Recall (Sensitivity), and F1-Score, computed using confusion-matrix statistics True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN).
1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision = TP / (TP + FP)
3. Recall (Sensitivity) = TP / (TP + FN)
4. F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
While accuracy is often reported in early binary mask-classification studies, it can be misleading in imbalanced scenarios where compliant cases dominate. Consequently, precision and recall are routinely used to capture false-alarm and missed-detection behavior, particularly for the improper-mask and no-mask classes. The F1-score provides a balanced summary of precision and recall and is therefore widely adopted in both lightweight and hybrid detection frameworks, including SSD-based detectors, attention-augmented variants such as SE-YOLOv3 (Squeeze-and-Excitation YOLOv3), and YOLO-tiny variants.
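These confusion-matrix metrics can be computed in a few lines; a minimal sketch with illustrative counts chosen to mimic an imbalanced compliance dataset:

```python
def confusion_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 100 non-compliant faces among 1000 samples.
acc, prec, rec, f1 = confusion_metrics(tp=90, fp=5, fn=10, tn=895)
```

With these counts, accuracy stays at 0.985 while recall on the minority (non-compliant) class drops to 0.90, illustrating exactly why accuracy alone can be misleading under class imbalance.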
Several deep learning-based face mask detection studies [59,60,61,62] have reported accuracy, precision, recall, and F1-score as their primary evaluation metrics, typically derived from confusion matrices built over test sets or cross-validation folds. For example, a transfer-learning system based on a lightweight backbone network was developed for binary mask classification and evaluated on two datasets (one in-house and one public), demonstrating the effectiveness of lightweight architectures for mask recognition tasks. The study compared a generic Deep Convolutional Neural Network (DCNN) with a MobileNetV2-based model and reported per-model accuracy, precision, recall, and F1-score. The results indicate that lightweight backbone networks can achieve accuracy values approaching 99%, while maintaining similarly strong precision, recall, and F1-scores, thereby supporting model selection based on balanced performance across multiple evaluation metrics rather than accuracy alone [59]. Similarly, a custom CNN incorporating a four-stage image-processing pipeline was proposed for face mask detection and evaluated on a Real-Image-Labeled Face Dataset (RILFD) as well as two public datasets (MAFA and MOXA). The model was compared against YOLOv3 and Faster R-CNN, with reported precision, recall, accuracy, and F1-scores across datasets illustrating robustness under real-world variations in lighting, mask type, and occlusion [61].
Ensemble and hybrid classification approaches also rely heavily on these four measures to demonstrate robustness under different data distributions. An ensemble model combining ResNet50, Inception-v3, and VGG-16 was proposed for real-time face-mask detection, with class-wise precision, recall, and F1-scores, as well as macro-averaged metrics, reported for evaluation. The best-performing configuration achieved F1-scores of approximately 0.997, indicating that both false positives and false negatives were extremely rare on the evaluated test sets [63]. Similarly, a multi-class face-mask detection approach focusing on incorrect mask-wearing cases (e.g., masks worn under the nose or chin) was proposed, with particular emphasis on achieving high recall to avoid missing non-compliant instances. The reported results included an accuracy of 99.4%, precision of 99.4%, recall of 98.6%, and an F1-score of 99.0% on the evaluated dataset [64]. These works highlight that precision and recall are often more informative than accuracy when the cost of missing violators is high.
Several architectures that jointly address face-mask detection and masked face recognition use the same four metrics to evaluate both tasks. DeepMaskNet, an end-to-end framework for face-mask detection and identity recognition under mask occlusion, was introduced and trained on the MDMFR dataset. Its authors report accuracy, precision, recall, and F1-score for both detection and recognition modules, showing that the model maintains high recall and F1 even under variations in pose, illumination, mask type, and occlusion, thereby demonstrating that the chosen metrics are sensitive enough to capture performance degradation under real-world conditions [5]. More recent works on parallel hybrid architectures or multi-branch CNNs (integrating VGG16 and MobileNetV2) for mask detection similarly present confusion matrices and derive all four metrics to compare variants with different backbones or feature-fusion strategies [48].
In object-detection-oriented approaches (e.g., YOLO or SSD variants), accuracy, precision, recall, and F1-score are frequently complemented by detection-specific metrics such as Average Precision (AP) and mean Average Precision (mAP), which jointly evaluate both localization and classification performance. For instance, these metrics were employed in the ETL-YOLOv4 framework to demonstrate improvements in mean Average Precision (mAP) across multiple Intersection over Union (IoU) thresholds when enhancing the YOLOv4 backbone for mask-related detection tasks. Beyond this, several recent studies have demonstrated how mAP, along with precision–recall curves, provides deeper insights into robustness under real-world variability [65]. A Rapid Real Time Face Mask Detection System (RRFMDS) was evaluated using accuracy, precision, recall, F1-score, and mAP, demonstrating that balanced behavior across these metrics is essential for consistent detection in crowded surveillance settings [15]. Similarly, a CNN-based detector was compared against YOLOv3 and Faster R-CNN using accuracy, precision, recall, F1-score, and average precision (AP) values across multiple datasets. The reported results illustrate how detection-oriented metrics reveal performance gaps in scenarios involving occlusion, lighting variation, and diversity in mask-wearing patterns [61]. Collectively, these findings indicate that AP, mAP and precision–recall analysis play a crucial role when evaluating detection-centric mask-monitoring systems, especially for models deployed in unconstrained or real-time environments.
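The AP used in these detection-oriented evaluations can be made concrete with a small sketch computing all-point interpolated AP from a precision–recall curve (one common Pascal VOC-style scheme; the curve points below are illustrative, not taken from any cited study):

```python
def average_precision(recalls, precisions):
    # All-point interpolated AP: make the precision envelope monotonically
    # non-increasing from right to left, then integrate it over recall.
    # Assumes recalls are sorted in ascending order.
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))
```

A perfect detector yields AP = 1.0; mAP is then the mean of per-class APs, typically reported at a fixed Intersection over Union threshold such as 0.5.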
Survey and review papers on masked-face detection and recognition reinforce the central role of these four evaluation metrics. A systematic analysis of deep learning-based masked face detection methods reports that almost all recent studies include accuracy and F1-score, with many also reporting precision and recall to better capture trade-offs between false alarms and missed detections [66]. Further, several studies and surveys note that additional metrics such as specificity (true-negative rate), Receiver Operating Characteristic (ROC) curves, Area Under the ROC Curve (AUC), and Intersection over Union (IoU), are increasingly employed alongside accuracy, precision, recall, and F1-score to provide a more comprehensive picture of model performance, especially for real-time, safety-critical deployments and highly imbalanced datasets [1,6,7,8,9,31,67,68,69,70,71]. Table 5 summarizes which evaluation metrics are reported (tick marked) across a representative subset of face mask detection studies.
Notably, only a few recent works explicitly report ROC curves and AUC values for face-mask detection, such as ensemble-based detectors and IoT access-control systems, whereas most studies rely primarily on accuracy, precision, recall, F1-score, and mAP.
Despite the widespread reporting of high accuracy, precision, recall, F1-score, and other performance metrics in face mask detection literature, direct comparison of these results across studies remains inherently challenging. Reported performances are often obtained under heterogeneous experimental conditions, including differing dataset compositions, annotation standards, class definitions (binary versus multi-class compliance), train–test splits, hardware platforms, and evaluation protocols. In many cases, accuracy values exceeding 99% are achieved on curated or internally collected datasets with limited environmental variability, which may not reflect real-world deployment conditions. Consequently, the performance values summarized throughout this review, including those presented in comparative tables, should not be interpreted as normalized benchmarks across studies. Instead, the analysis focuses on identifying architectural trends, efficiency patterns, and deployment-oriented trade-offs reported in the literature. Absolute performance figures should therefore be interpreted with caution in light of these methodological variations.

4.2. Trade-Offs Between Accuracy and Efficiency

While Section 3 focuses on architectural design principles, this section analyzes the empirical performance and deployment trade-offs of these architectures across different scenarios. The trade-off between accuracy and computational efficiency is a central challenge in the development and deployment of deep learning models for face-mask detection. Heavyweight architectures, such as VGGNet [18], ResNet [21], DenseNet [22], and Inception variants [19,20], typically achieve high accuracy on curated datasets but incur substantial computational and memory demands. Their high parameter counts, slow inference speeds on embedded processors, and significant GPU memory requirements make them less suitable for real-time surveillance or IoT applications with strict latency and power constraints. Consequently, practical deployments increasingly favor architectures that aim to balance representational capacity with computational efficiency, especially in resource-constrained public health monitoring scenarios.

4.2.1. Impact of Model Size and Architecture on Accuracy and Efficiency

Lightweight models, including EfficientNet, ShuffleNet, YOLOv3-tiny, and YOLOv4-tiny, reduce parameter count and inference latency by design, enabling fast deployment on embedded or mobile devices. For example, YOLOv4-tiny contains approximately six million parameters, nearly one-tenth of YOLOv4, and offers significantly higher detection speed, making it more suitable for real-time applications while still incurring a reduction in detection accuracy compared to the full model [72,73]. Similar observations are reported across lightweight design attempts: reducing the dimensionality of feature maps improves computational efficiency and inference time, but often at the cost of increased false detections or reduced robustness.
Some lightweight architectures demonstrate strong and balanced performance in real-time face mask detection scenarios. For instance, models based on MobileNetV2 report accuracy around 92–93% with F1-scores above 0.90 in real-time settings [73]. Similarly, an ultra-lightweight model with only 0.12 M parameters, up to 496 times fewer than other state-of-the-art models, achieves competitive accuracy (95.41% for binary and 95.54% for three-class classification), showing that substantial parameter reduction is possible without a proportional loss in performance [74].

4.2.2. Comparative Performance of Lightweight and Heavyweight Models

Heavyweight convolutional architectures have demonstrated strong performance in face mask detection tasks due to their deep feature representation capabilities. Models such as DenseNet201, ResNet50, InceptionV3, VGG16, and EfficientNet variants are typically categorized as heavyweight architectures, whereas MobileNetV2 represents a lightweight alternative designed for efficient deployment on resource-constrained devices. While heavyweight models generally achieve higher accuracy due to their deeper and more complex structures, they often incur significantly higher computational costs, making them less suitable for real-time edge applications.
A representative example of heavyweight model performance is the DenseMaskNet framework [75], which is built on a fine-tuned DenseNet201 backbone and reports superior performance compared to baseline models. DenseMaskNet achieves the highest classification accuracy of approximately 99%, outperforming VGG16 and MobileNetV2 (approximately 97–98%), while ResNet50 shows reduced performance (approximately 87%) and EfficientNetB7 performs significantly worse. In addition, DenseMaskNet maintains consistently high precision (0.98–1.00) across all mask categories and demonstrates strong recall and F1-score performance, particularly in detecting improperly worn masks, highlighting the effectiveness of deeper architectures in complex classification scenarios.
A related approach in [76] further highlights the effectiveness of heavyweight architectures when combined with sequential modeling techniques. In this study, transfer learning using VGG16 and AlexNet is integrated with Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) layers to enhance classification performance across multiple mask-wearing conditions. The proposed hybrid model achieves a classification accuracy of 95.67%, demonstrating that combining convolutional feature extraction with temporal or contextual learning can improve detection robustness. However, such hybrid and heavyweight designs introduce additional computational complexity, reinforcing the trade-off between accuracy and efficiency and motivating the need for lightweight or hybrid architectures for real-time edge deployment.
Transformer-based architectures have recently emerged as powerful alternatives to conventional CNNs. The Swin Transformer is a general-purpose vision backbone designed for a wide range of computer vision tasks, including classification and detection. It introduces a hierarchical architecture with shifted window-based self-attention, enabling efficient modeling with linear computational complexity relative to image size [77]. As a result, it achieves strong performance compared to traditional CNN backbones. However, despite these efficiency improvements, Swin-based models still involve substantial computational overhead and parameter size, placing them within the category of advanced heavyweight architectures rather than suitable solutions for real-time edge deployment. A recent study has further explored the application of the Swin Transformer in face mask detection tasks, demonstrating its superior performance over conventional CNN-based models. Experimental results show that the Swin Transformer achieves the highest accuracy of 99.8%, outperforming EfficientNetV2 (99.3%), MobileNetV2 (99.0%), DenseNet (98.3%), and ResNet50 (97.2%). This trend is consistently reflected across other evaluation metrics, where Swin Transformer attains approximately 99.9% precision, 99.8% recall, and 99.8% F1-score, indicating strong classification reliability. In contrast, while lightweight models such as MobileNetV2 achieve competitive performance with significantly fewer parameters (3.2 M), they exhibit slightly lower overall accuracy. Despite its superior performance, the Swin Transformer involves a higher parameter count (28.4 M), reinforcing its classification as a computationally intensive, heavyweight architecture [78].
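The linear-versus-quadratic complexity claim can be checked arithmetically. Following the FLOP decomposition given in the Swin Transformer paper (projection terms plus attention terms), windowed attention replaces the quadratic (hw)² attention term with M² · hw; a sketch using the Swin-T stage-1 dimensions (assumed here purely for illustration):

```python
def msa_flops(h, w, c):
    # Global self-attention: 4*h*w*C^2 (projections) + 2*(h*w)^2*C (attention).
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h, w, c, m):
    # Window attention: tokens attend only within m x m windows, so the
    # quadratic (h*w)^2 term becomes linear in h*w: m^2 * h*w.
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

# Assumed Swin-T stage-1 shape: 56x56 tokens, 96 channels, 7x7 windows.
global_cost = msa_flops(56, 56, 96)
window_cost = wmsa_flops(56, 56, 96, 7)
attention_speedup = (56 * 56) / (7 * 7)  # 64x smaller attention term
```

The shifted-window scheme then restores cross-window information flow between successive layers without reintroducing the quadratic cost.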
Hybrid and ensemble approaches offer a middle ground between the high accuracy of heavyweight models and the efficiency of lightweight architectures. By combining complementary strengths, these approaches can improve detection performance while maintaining practical computational requirements. For example, an ensemble of single- and two-stage detectors achieved approximately 98.2% accuracy with an average inference time of 0.05 s, demonstrating that such hybrid strategies can meet both accuracy and latency requirements in real-time deployments [2]. Some studies emphasize that model selection for deployment should reflect hardware limitations and operational scenarios, balancing accuracy with speed, power consumption, and available compute resources [72,73,79].
To better contextualize these performance trends, it is necessary to compare the underlying architectural families that shape each model’s computational behavior. Differences in parameter count, convolutional design, memory footprint, and inference throughput directly determine whether the architecture can meet real-time requirements on edge devices or embedded systems. Unlike the architectural descriptions provided in Section 3, Table 6 presents a deployment-oriented comparison of architecture families in terms of computational efficiency, inference speed, and real-time suitability. Note that reported values are reproduced from the cited studies and should be interpreted in light of differences in datasets, hardware platforms, and evaluation procedures.
The architectural comparison presented in Table 6 reveals a clear and consistent relationship between model complexity, computational efficiency, and deployment feasibility. Conventional CNN backbones and two-stage detectors demonstrate strong representational capacity and high detection accuracy; however, their large parameter sizes, high memory requirements, and low inference speeds make them impractical for real-time edge deployment. This pattern reflects a clear inverse relationship between model complexity and computational efficiency, where increases in parameter scale are consistently associated with reduced inference speed and higher memory demand. In contrast, lightweight convolutional models significantly reduce computational overhead and achieve high frame rates, but are often limited to classification-based pipelines that require separate face detection modules. Lightweight single-stage detectors address this limitation by integrating detection and classification within a unified framework, offering a more practical balance between accuracy and efficiency. Furthermore, hybrid and attention-enhanced architectures extend this balance by improving robustness under real-world conditions while maintaining acceptable computational costs. Overall, the table highlights that deployment suitability is governed not by accuracy alone, but by a combination of latency, memory footprint, and hardware compatibility.

4.2.3. Energy Consumption Considerations for Edge Deployment

Beyond accuracy and inference latency, energy consumption is a decisive constraint for deploying face mask detection (FMD) systems on embedded and edge platforms. Recent deployment-oriented studies consistently demonstrate that the architecture type, model size, and inference configuration have a direct and measurable impact on power demand during real-time computer vision inference [4,80].
Empirical analyses on commercial GPUs and embedded devices show that conventional CNN backbones, including VGG and ResNet-based architectures, incur substantially higher power consumption during sustained inference. The increase in parameter count, convolutional depth, and dense feature extraction pipelines leads to elevated GPU utilization and power demand, particularly under continuous detection workloads. Measurements reported across multiple platforms indicate that such architectures can exceed 10–20 W on edge-class accelerators and scale to significantly higher values on desktop-class GPUs, limiting their practicality for battery-powered or large-scale multi-camera deployments despite strong accuracy performance [80,81].
In contrast, lightweight convolutional architectures demonstrate markedly lower power requirements compared with conventional deep backbones. Studies analyzing inference behavior on ARM-based embedded systems, such as Raspberry Pi and similar low-power edge devices, as well as embedded GPU platforms, attribute this reduction to depthwise separable convolutions, reduced parameterization, and simplified feature hierarchies. As a result, lightweight models sustain real-time inference while operating within single-digit watt ranges on embedded platforms, making them well suited for continuous vision tasks under strict power budgets [4,81].
Hybrid lightweight detectors, such as SSD variants built on lightweight backbones and compact YOLO configurations, occupy an intermediate position in the accuracy–energy spectrum. Although the inclusion of detection heads and multi-scale feature fusion introduces additional computational overhead, empirical measurements indicate that these architectures remain significantly more energy-efficient than heavy backbones. Their moderate power consumption, combined with improved localization capability, makes them a practical compromise for real-time compliance monitoring scenarios where both accuracy and energy efficiency are required [80].
Overall, existing evidence confirms a strong correlation between architectural complexity and power demand, reinforcing the preference for lightweight and hybrid models in power-constrained FMD deployments. However, the reviewed studies also highlight the lack of standardized power-measurement protocols, which complicates direct cross-study comparison. This limitation underscores the importance of incorporating deployment-aware metrics, such as energy-per-inference (J/inference) or joules-per-frame (J/frame), alongside accuracy and latency in future evaluations.
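The deployment-aware metrics proposed above can be derived directly from quantities most studies already report. A minimal sketch, assuming only average power and per-frame latency are available (function names are illustrative, not drawn from any cited work):

```python
def energy_per_inference(avg_power_w: float, latency_ms: float) -> float:
    """Energy per inference in joules: E = P * t, with power in watts and time in seconds."""
    return avg_power_w * (latency_ms / 1000.0)

def joules_per_frame(avg_power_w: float, fps: float) -> float:
    """Energy per processed frame: E = P / FPS."""
    return avg_power_w / fps

def fps_from_latency(latency_ms: float) -> float:
    """Throughput implied by per-frame latency (single stream, no batching)."""
    return 1000.0 / latency_ms

# Example: a lightweight model drawing 5 W at 40 ms per frame
# consumes 0.2 J per inference and sustains 25 FPS.
print(energy_per_inference(5.0, 40.0))
print(fps_from_latency(40.0))  # 25.0
```

Reporting either J/inference or J/frame alongside accuracy would make models measured on different platforms comparable on a common energy axis.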
This contextual foundation supports the subsequent analysis of accuracy–efficiency patterns across representative models. The strategies discussed next build upon these architectural characteristics to further enhance model performance under resource constraints. By understanding the inherent computational profiles of each architecture family, it becomes clearer how transfer learning, data augmentation, and model-level modifications can be applied to strengthen accuracy without compromising real-time efficiency.
To facilitate comparison across representative studies, Table 7 summarizes commonly reported deployment platforms, model architectures, inference performance (FPS or latency), and power or energy characteristics where available. However, it should be noted that the literature reporting detailed energy or power measurements on edge devices remains relatively limited. Consequently, the table distinguishes between values that are experimentally measured and those that are estimated or discussed in the literature without direct experimental validation, particularly in cases where deployment on embedded platforms is suggested but not empirically evaluated.
The results summarized in Table 7 clearly demonstrate that inference performance in face mask detection systems is strongly influenced by the underlying hardware platform rather than by model architecture alone. Even when similar lightweight backbones are used, substantial variations in latency are observed across platforms such as Raspberry Pi 4, Jetson Nano, and Jetson Xavier NX. Notably, the same model family consistently exhibits significantly lower latency on Jetson Xavier NX compared with other edge devices. This pattern holds across both image classification and object-detection tasks, indicating that hardware capability plays a dominant and task-independent role in determining real-time performance. These findings suggest that deployment efficiency can be improved not only through architectural optimization but also through appropriate selection of hardware platforms.
The table further reflects the inherent inverse relationship between latency and throughput, where lower latency corresponds to higher frames per second (FPS). This relationship provides an important basis for interpreting reported performance metrics across studies. However, the observed latency and FPS values are not solely determined by model design; rather, they are jointly influenced by task complexity and execution platform. As a result, FPS and latency should be interpreted as complementary indicators of real-time capability rather than independent performance measures.
A consistent distinction can be observed between image classification and object-detection tasks. Classification-based implementations generally achieve lower latency due to their simpler computational pipeline, which involves only feature extraction and label prediction. In contrast, object-detection frameworks introduce additional overhead associated with localization, multi-scale feature processing, and bounding-box regression, leading to increased latency even when built on similar backbone architectures. This distinction highlights that task formulation is a critical factor when evaluating and comparing deployment performance.
Despite the importance of energy efficiency for edge deployment, the table reveals significant inconsistency in how power and energy metrics are reported across studies. Several entries provide experimentally measured latency or FPS without corresponding power measurements, while others report estimated power values or omit energy-related metrics entirely. In particular, some studies report “measured” results that refer only to inference performance, not to power consumption. This lack of uniformity limits the ability to perform comprehensive energy-efficiency comparisons and highlights a broader issue in deployment-oriented deep learning research.
Overall, Table 7 underscores that performance evaluation in face mask detection systems is inherently multi-dimensional, involving interactions between model architecture, task type, and hardware platform. The observed variability in measurement practices and deployment settings indicates that the reported values should be interpreted as indicative of relative trends rather than directly comparable benchmarks. These findings reinforce the need for standardized evaluation protocols that jointly consider latency, throughput, and energy consumption under consistent hardware conditions to enable fair and reproducible comparison across models.

4.2.4. Strategies to Improve the Trade-Off

Transfer Learning: Utilizing pre-trained models and transfer learning can enhance the accuracy of lightweight models without significantly increasing computational demands. For example, transfer learning with MobileNetV2 and ResNet50 has been shown to improve detection accuracy in real-time applications [2,9,61].
Data Augmentation and Balanced Datasets: Techniques such as random over-sampling and data augmentation help improve model accuracy by addressing class imbalance, as seen in studies that reduced imbalance ratios and achieved high accuracy with efficient models [2,74].
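Random over-sampling, as referenced above, simply duplicates minority-class examples until class counts match. A minimal stdlib-only sketch, with placeholder file names and label strings:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples (with replacement) until every class
    matches the majority-class count, reducing the imbalance ratio to 1:1."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_s, out_y = list(samples), list(labels)
    for y, items in by_class.items():
        for _ in range(target - counts[y]):
            out_s.append(rng.choice(items))  # resample from the minority class
            out_y.append(y)
    return out_s, out_y

# Toy imbalance: 4 "mask" vs. 1 "no_mask" images (names are placeholders)
X = ["m1.jpg", "m2.jpg", "m3.jpg", "m4.jpg", "n1.jpg"]
y = ["mask", "mask", "mask", "mask", "no_mask"]
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # both classes now have 4 samples
```

In practice, over-sampling is usually combined with augmentation (flips, crops, color jitter) so the duplicated images are not pixel-identical.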
Model Modifications: Modifying backbone networks, activation functions, and loss functions, as done in YOLOv4 and its variants, can help maintain good accuracy while improving speed and reducing resource consumption [9,72].
The comparative relationships between accuracy and computational efficiency across representative models are summarized in Table 8, providing a consolidated view of the trade-offs discussed in this section.
The comparative evidence presented in Table 8 highlights that no single model consistently optimizes both accuracy and computational efficiency across all deployment scenarios. Instead, the results reveal a clear trade-off spectrum in which heavyweight architectures achieve the highest accuracy but incur substantial computational cost, limiting their suitability for real-time, hardware-constrained environments. In contrast, lightweight models significantly reduce resource requirements and enable real-time deployment, although they may exhibit reduced robustness under complex or unconstrained conditions. Notably, recent lightweight and custom-designed architectures demonstrate that substantial parameter reduction can be achieved without proportional loss in performance, indicating ongoing progress in efficient model design. Hybrid and ensemble approaches emerge as a particularly effective compromise, achieving near-state-of-the-art accuracy while maintaining practical inference speed, making them well suited for real-time monitoring in smart-city, IoT, and embedded applications. However, the variability in how performance is reported, with some studies providing precise numerical accuracy values and others offering qualitative comparisons, limits direct cross-study benchmarking. Consequently, the results should be interpreted as indicative of general performance trends, and the optimal model selection ultimately depends on deployment context, where accuracy-critical applications may tolerate heavier models, while large-scale real-time systems benefit from efficient lightweight or hybrid architectures.

4.2.5. Deployment-Oriented Architecture Selection Framework

Selecting an appropriate architecture for face mask detection depends not only on recognition accuracy but also on the computational constraints of the target deployment environment. Real-world systems often operate under strict limitations in processing power, memory capacity, and energy availability. Consequently, architectural selection should be guided by a deployment-oriented decision framework that balances accuracy, efficiency, and scalability. At the lowest resource level, embedded platforms such as Raspberry Pi devices, microcontrollers, or low-power ARM processors require extremely compact models capable of running with minimal computational overhead. Lightweight convolutional architectures such as MobileNetV2, ShuffleNet, and SqueezeNet are particularly suitable in such environments due to their reduced parameter counts and efficient convolutional operations. These models are typically used in classification pipelines where face detection is performed separately using lightweight detectors.
For edge GPU platforms, such as NVIDIA Jetson Nano, Xavier NX, or similar embedded accelerators, slightly larger models can be deployed while still maintaining real-time inference. In these environments, lightweight single-stage detectors such as SSD-MobileNetV2, YOLOv4-tiny, YOLOv5s, or SqueezeMaskNet provide an effective balance between detection accuracy and inference speed. These architectures are frequently adopted in surveillance systems requiring simultaneous face localization and mask classification.
In high-performance computing environments, including cloud servers and GPU clusters, computational constraints are significantly reduced. In these scenarios, heavier architectures such as Faster R-CNN, Mask R-CNN, or large transformer-based models can be employed to achieve higher detection accuracy and improved robustness under challenging conditions. However, such models are generally unsuitable for direct deployment on embedded or mobile platforms. Table 9 summarizes a deployment-oriented decision framework that maps common hardware environments to appropriate architecture families.
Table 9 synthesizes the preceding architectural and performance analysis into a deployment-oriented decision framework that maps computational environments to suitable model families. Rather than emphasizing individual model performance, the table highlights how architectural selection should be guided by system-level constraints, including processing capability, memory availability, and real-time requirements. A clear progression can be observed from ultra-low-power edge devices to cloud-based systems, where increasing computational resources enable the use of more complex architectures with higher accuracy. At the same time, the framework demonstrates that lightweight and hybrid models are essential for maintaining real-time performance in edge and embedded environments, while heavyweight architectures remain more appropriate for accuracy-critical applications in high-performance settings. This structured mapping reinforces the principle that optimal model selection is inherently context-dependent and must balance accuracy, efficiency, and scalability according to deployment conditions.
Overall, the comparative analysis presented in this section demonstrates that effective face mask detection in real-world deployments is not determined by model accuracy alone, but by a multi-dimensional balance between architectural efficiency, task complexity, hardware capability, and energy constraints, thereby emphasizing the need for deployment-aware model design.
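The tiered mapping described in this section can be expressed as a simple rule-based selector. The sketch below merely restates the hardware-to-architecture pairings from the text; the tier names and dictionary layout are illustrative, not a prescriptive API:

```python
# Deployment-oriented selection sketch mapping hardware tiers to the
# architecture families discussed above (illustrative structure only).
RECOMMENDATIONS = {
    "embedded_cpu": {   # Raspberry Pi, microcontrollers, low-power ARM
        "families": ["MobileNetV2", "ShuffleNet", "SqueezeNet"],
        "pipeline": "classification with a separate lightweight face detector",
    },
    "edge_gpu": {       # Jetson Nano, Xavier NX, similar accelerators
        "families": ["SSD-MobileNetV2", "YOLOv4-tiny", "YOLOv5s"],
        "pipeline": "unified single-stage detection and mask classification",
    },
    "cloud_gpu": {      # servers and GPU clusters
        "families": ["Faster R-CNN", "Mask R-CNN", "transformer-based detectors"],
        "pipeline": "accuracy-critical two-stage or transformer detection",
    },
}

def select_architectures(tier: str):
    """Return the model families recommended for a given deployment tier."""
    if tier not in RECOMMENDATIONS:
        raise ValueError(f"unknown deployment tier: {tier}")
    return RECOMMENDATIONS[tier]["families"]

print(select_architectures("edge_gpu"))
```

A production selector would additionally weigh memory budget, target FPS, and power ceiling, but the tiered structure remains the same.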

5. Future Research Directions

Despite the significant progress achieved through conventional, lightweight, and hybrid architectures, several open challenges continue to limit the robustness, generalizability, and scalability of face-mask detection systems. These challenges present rich opportunities for future research, particularly in improving multi-class capability, domain adaptation, model compression, and extending mask-detection frameworks toward broader applications within public-health monitoring and intelligent surveillance.

5.1. Improper Mask Detection and Multi-Class Analysis

Most existing models are optimized for binary mask detection (mask vs. no mask), yet improper mask wearing remains a major real-world compliance issue. Identifying masks worn below the nose, under the chin, partially covering the mouth, or loosely attached requires fine-grained, region-aware feature extraction that many lightweight systems struggle with. Conventional CNNs offer strong discriminative capacity but are computationally expensive, while lightweight architectures may misclassify subtle misuse cases due to reduced representational depth. Hybrid detectors improve robustness, but their performance still degrades when improper mask patterns exhibit high variability across individuals, mask materials, or face shapes.
Future work must therefore address multi-class imbalance, generate or augment datasets with diverse improper-wear scenarios, and incorporate region-specific constraints or anatomical priors. Approaches such as fine-grained visual reasoning, dense part-based attention, or landmark-guided mask positioning analysis could substantially improve improper-mask detection accuracy.

5.2. Domain Adaptation and Real-World Variability

A persistent challenge across all model families is domain shift: the discrepancy between curated training datasets and real-world deployment conditions. Surveillance environments introduce uncontrolled variables including illumination changes, shadows, motion blur, camera angle variations, diverse mask designs, facial accessories, occlusions (hands, hair, objects), and crowd density. Models trained purely on curated datasets such as MaskedFace-Net or PWMFD often suffer significant accuracy drops when deployed in unseen domains, demonstrating insufficient generalization.
Future research must explore domain-adaptation strategies, such as:
  • Unsupervised domain adaptation (UDA) for aligning feature distributions across environments;
  • Self-supervised representation learning to reduce dependency on labels;
  • Cross-dataset training pipelines that incorporate heterogeneous noise, mask materials, and cultural variations;
  • Synthetic domain randomization to simulate low-quality or occluded footage.
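Synthetic domain randomization, the last item above, is typically implemented by sampling a fresh perturbation configuration per training image. A minimal sketch with illustrative parameter ranges (the ranges are assumptions for demonstration, not values from any cited study):

```python
import random

def sample_domain_params(rng: random.Random) -> dict:
    """Sample one randomized corruption configuration to apply to an image."""
    return {
        "brightness": rng.uniform(0.5, 1.5),      # illumination change factor
        "blur_radius": rng.choice([0, 1, 2, 3]),  # motion/defocus blur strength
        "jpeg_quality": rng.randint(30, 95),      # simulate low-quality footage
        "occlusion_frac": rng.uniform(0.0, 0.3),  # fraction of the face occluded
    }

rng = random.Random(42)
params = [sample_domain_params(rng) for _ in range(1000)]
# Every sampled configuration stays inside its declared range.
assert all(0.5 <= p["brightness"] <= 1.5 for p in params)
print(params[0])
```

Each sampled configuration is then realized by an image-processing backend (e.g., blur and JPEG re-encoding), forcing the detector to become invariant to the randomized factors.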
Enhancing robustness under real-world variability is critical for scalable deployment across transportation hubs, hospitals, campuses, and crowded public spaces.

5.3. Knowledge Distillation and Model Compression

Lightweight models perform efficiently on edge devices but may lose accuracy on challenging scenes due to reduced feature capacity. Knowledge distillation, where a compact student model learns from a larger teacher model, can recover lost accuracy while maintaining computational efficiency. This approach has seen early success in related vision tasks, but its application to face-mask detection remains limited.
Future work should investigate:
  • Teacher–student pipelines using powerful hybrids (e.g., enhanced YOLOv5/YOLOv8) as teachers and MobileNet- or ShuffleNet-based students;
  • Quantization-aware training to reduce model size without introducing significant accuracy loss;
  • Structured pruning of convolutional layers to remove redundant channels;
  • Neural Architecture Search (NAS) for identifying optimal low-complexity architectures tailored to improper-mask detection.
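The teacher-student idea in the first bullet reduces to a temperature-softened cross-entropy between the two models' output logits. A stdlib-only sketch of the standard distillation loss (the three-class logit values are made up for illustration):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher T softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy of the student against the teacher's softened targets,
    scaled by T^2 as in the standard knowledge-distillation formulation."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    ce = -sum(ti * math.log(si) for ti, si in zip(t, s))
    return temperature ** 2 * ce

# Three-class FMD logits (mask / no_mask / improper); values are illustrative.
teacher = [4.0, -1.0, 0.5]
student = [2.5, -0.5, 0.2]
print(distillation_loss(student, teacher))
```

In training, this term is usually mixed with the ordinary hard-label cross-entropy, so the student learns both the ground truth and the teacher's inter-class similarity structure.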
Such compression-oriented strategies will be essential for enabling large-scale deployment across IoT sensor networks or multi-camera smart-city infrastructures.

5.4. Expanding Applications Beyond Mask Detection

As mask usage declines post-pandemic, face-mask detection systems should evolve toward broader public-safety, healthcare, and compliance-monitoring applications. Since many architectural foundations, such as lightweight CNNs, attention-enhanced detectors, and hybrid YOLO variants, are readily adaptable, researchers can repurpose these systems for:
  • Personal Protective Equipment compliance monitoring (helmets, gloves, lab coats, face shields, safety goggles, etc.);
  • Human behavior analysis (face-touching detection, cough detection, proximity violations);
  • Health screening (visible respiratory cues, temperature screening integration);
  • Access-control and identity verification under occlusion;
  • Crowd analytics and anomaly detection for smart-city infrastructure.
Furthermore, integrating mask detection with multimodal sensing (audio, thermal imaging, RFID) can enable holistic public-health monitoring systems. The shift from task-specific detectors to generalizable compliance-monitoring frameworks represents a major future research opportunity.

5.5. Standardized Evaluation Protocols and Benchmarking

A fundamental limitation in current face mask detection research is the absence of standardized evaluation protocols and benchmark datasets. Existing studies report performance using diverse datasets, varying class definitions (binary versus multi-class compliance), inconsistent train–test splits, and heterogeneous metric aggregation strategies. As a result, accuracy and F1-score values, often exceeding 99%, are not directly comparable across studies, particularly when derived from curated or application-specific datasets.
Establishing standardized benchmarks, shared evaluation splits, and cross-dataset testing protocols would significantly enhance reproducibility and enable more objective architectural comparisons. However, such standardization would require coordinated dataset curation, unified annotation guidelines, and large-scale validation across deployment environments, an effort that falls outside the scope of a review article and calls for coordinated community action. Accordingly, this work emphasizes relative architectural trends and deployment trade-offs rather than absolute performance ranking, while highlighting this gap as essential for guiding future research toward reproducible, deployment-relevant evaluation and the development of common benchmarking frameworks.

5.6. Energy-Aware Evaluation and Power-Centric Benchmarking

While accuracy and inference latency remain the most commonly reported metrics in face mask detection (FMD) research, energy consumption is rarely evaluated in a systematic and comparable manner, despite being a decisive factor for real-world edge deployment. As highlighted by recent empirical studies on embedded platforms, architectural choices can lead to substantial variation in power demand even among models with similar accuracy and frame rates. However, the lack of standardized measurement methodologies—covering hardware configuration, workload duration, and power-sensing granularity—limits the interpretability and reproducibility of reported results.
Future research should therefore emphasize energy-aware evaluation frameworks that explicitly incorporate metrics such as energy per inference, joules per frame, or performance per watt, alongside conventional accuracy and latency measures. Establishing common benchmarking practices across representative edge platforms (e.g., Raspberry Pi and NVIDIA Jetson devices) would enable fairer comparison of conventional, lightweight, and hybrid architectures, and support deployment-oriented model selection. Such efforts would not only improve experimental transparency but also facilitate the design of energy-optimized FMD systems capable of long-term, large-scale operation in power-constrained environments.
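A measurement protocol of the kind advocated here can be sketched as follows. Here `run_inference` is a stand-in for the actual model call, and the power figure would come from an external meter or on-board sensor (e.g., a Jetson power rail) rather than being hard-coded:

```python
import time

def benchmark(run_inference, n_warmup=10, n_runs=100):
    """Average single-frame latency in seconds, after a warm-up phase
    that excludes cold-start effects (cache fills, JIT, clock ramp-up)."""
    for _ in range(n_warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    return (time.perf_counter() - start) / n_runs

def report(latency_s, avg_power_w):
    """Joint latency/throughput/efficiency report for one model-platform pair."""
    fps = 1.0 / latency_s
    return {
        "latency_ms": latency_s * 1000.0,
        "fps": fps,
        "fps_per_watt": fps / avg_power_w,  # performance-per-watt metric
    }

# Dummy workload standing in for a real model; power is a placeholder reading.
lat = benchmark(lambda: sum(i * i for i in range(10_000)))
print(report(lat, avg_power_w=5.0))
```

Fixing the warm-up policy, run count, and power-sampling window across studies would address the measurement-granularity inconsistencies noted above.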

5.7. Emerging Transformer-Based Architectures for Edge Mask Detection

Recent advances in Vision Transformers (ViTs) and hybrid CNN–transformer architectures have demonstrated strong representation capability in a wide range of vision tasks, including image classification and object detection. Unlike conventional CNNs, transformers leverage self-attention mechanisms to model long-range spatial dependencies, which may offer advantages for face mask detection in complex scenes involving occlusion, crowded environments, and varying mask appearances [82,83,84].
However, the direct deployment of standard ViT models on edge devices remains challenging due to their computational and memory requirements, which are often incompatible with the constraints of embedded platforms. To address this limitation, recent research has explored lightweight and efficient transformer variants, such as MobileViT [85], DeiT [82], and hybrid CNN–transformer designs including Mobile-Former [86], which aim to balance global context modeling with reduced parameter counts and inference cost. These architectures present promising opportunities for enhancing mask detection robustness while remaining feasible for edge-based deployment.
Future research may therefore investigate the integration of efficient transformer components within lightweight detection pipelines, as well as systematic comparison between CNN-based, hybrid, and transformer-based models under realistic edge constraints. Such studies would help clarify whether attention-driven architectures can provide measurable benefits for mask detection accuracy and generalization without compromising real-time performance and energy efficiency on embedded platforms.

6. Conclusions

This review examined the architectural evolution of deep learning-based face mask detection systems, tracing the progression from conventional convolutional neural networks to lightweight and hybrid designs tailored for real-time deployment in resource-constrained environments. While traditional architectures such as VGGNet, ResNet, and DenseNet provide strong feature representation capabilities, their substantial computational and memory requirements limit their practicality for embedded, edge, and IoT-based applications. In contrast, lightweight convolutional architectures, including MobileNet, EfficientNet, ShuffleNet, and related efficient model families, significantly reduce model complexity and inference latency, enabling practical deployment without severe degradation in recognition performance.
Hybrid architectures that integrate lightweight backbones with optimized detection heads, including frameworks such as YOLO and SSD, multi-stage pipelines, or complementary classification modules further enhance the balance between accuracy, robustness, and efficiency. These designs have emerged as particularly effective for real-world mask compliance monitoring, where spatial localization, multi-class discrimination, and real-time operation must be achieved simultaneously under hardware constraints. However, comparative analysis across the literature confirms that no single architectural paradigm consistently optimizes all performance dimensions; instead, each model family exhibits strengths that align with specific deployment requirements and operational priorities.
Several open challenges remain. Improper mask detection, robustness to unconstrained environmental variability, and domain adaptation across diverse surveillance contexts continue to limit generalization. Moreover, direct cross-study comparison is hindered by heterogeneous datasets, evaluation protocols, and hardware configurations, reinforcing the need for standardized benchmarking practices. Future research should therefore prioritize architectures that generalize reliably across environments, support multi-class compliance scenarios, and enable energy-efficient, real-time computation on embedded and edge platforms. Compression-oriented strategies such as pruning, quantization, and knowledge distillation, as well as emerging hybrid CNN–transformer designs, represent promising directions for advancing this goal.
Overall, this review provides an architecture-centered synthesis of face mask detection research, clarifying design trade-offs, highlighting deployment-oriented constraints, and outlining future research directions to support the development of efficient, reliable, and scalable compliance-monitoring systems for modern edge-computing environments.

Funding

This research received no external funding.

Data Availability Statement

Data sharing is not applicable to this article as no new datasets were created or analyzed. This study is based on the analysis and synthesis of previously published research articles, which are cited within the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Liang, M.; Gao, L.; Cheng, C.; Zhou, Q.; Uy, J.P.; Heiner, K.; Sun, C. Efficacy of face mask in preventing respiratory virus transmission: A systematic review and meta-analysis. Travel Med. Infect. Dis. 2020, 36, 101751. [Google Scholar] [CrossRef] [PubMed]
  2. Sethi, S.; Kathuria, M.; Kaushik, T. Face mask detection using deep learning: An approach to reduce risk of Coronavirus spread. J. Biomed. Inform. 2021, 120, 103848. [Google Scholar] [CrossRef]
  3. Wu, P.; Li, H.; Zeng, N.; Li, F. FMD-Yolo: An efficient face mask detection method for COVID-19 prevention and control in public. Image Vis. Comput. 2022, 117, 104341. [Google Scholar] [CrossRef]
  4. Kolosov, D.; Kelefouras, V.; Kourtessis, P.; Mporas, I. Anatomy of Deep Learning Image Classification and Object Detection on Commercial Edge Devices: A Case Study on Face Mask Detection. IEEE Access 2022, 10, 109167. [Google Scholar] [CrossRef]
  5. Ullah, N.; Javed, A.; Ghazanfar, M.A.; Alsufyani, A.; Bourouis, S. A novel DeepMaskNet model for face mask detection and masked facial recognition. J. King Saud Univ.—Comput. Inf. Sci. 2022, 34, 9905–9914. [Google Scholar] [CrossRef]
  6. Abbas, S.F.; Shaker, S.H.; Abdullatif, F.A. Face Mask Detection Based on Deep Learning: A Review. J. Soft Comput. Comput. Appl. 2024, 1, 7. [Google Scholar] [CrossRef]
  7. Amer, F.; Ali, M.; Al-Tamimi, M.S.H. Face mask detection methods and techniques: A review. Int. J. Nonlinear Anal. Appl. 2022, 13, 2008–6822. [Google Scholar] [CrossRef]
  8. Vibhuti; Jindal, N.; Singh, H.; Rana, P.S. Face mask detection in COVID-19: A strategic review. Multimed. Tools Appl. 2022, 81, 40013–40042. [Google Scholar] [CrossRef]
  9. Alturki, R.; Alharbi, M.; AlAnzi, F.; Albahli, S. Deep learning techniques for detecting and recognizing face masks: A survey. Front. Public Health 2022, 10, 955332. [Google Scholar] [CrossRef]
  10. Anggraini, N.; Ramadhani, S.H.; Wardhani, L.K.; Hakiem, N.; Shofi, I.M.; Rosyadi, M.T. Development of Face Mask Detection using SSDLite MobilenetV3 Small on Raspberry Pi 4. In Proceedings of the 2022 5th International Conference on Computer and Informatics Engineering, IC2IE 2022, Jakarta, Indonesia, 13–14 September 2022; Institute of Electrical and Electronics Engineers Inc.: New Jersey, NJ, USA, 2022; pp. 209–214. [Google Scholar] [CrossRef]
  11. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 10–15 June 2019; Volume 2019, pp. 10691–10700. Available online: https://arxiv.org/pdf/1905.11946 (accessed on 1 December 2025).
  12. Sanjaya, S.A.; Rakhmawan, S.A. Face Mask Detection Using MobileNetV2 in the Era of COVID-19 Pandemic. In Proceedings of the 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy, ICDABI 2020, Sakheer, Bahrain, 26–27 October 2020; Institute of Electrical and Electronics Engineers Inc.: New Jersey, NJ, USA, 2020. [Google Scholar] [CrossRef]
  13. Shao, Y.; Ning, J.; Shao, H.; Zhang, D.; Chu, H.; Ren, Z. Lightweight face mask detection algorithm with attention mechanism. Eng. Appl. Artif. Intell. 2024, 137, 109077. [Google Scholar] [CrossRef]
  14. Dodda, R.; Raghavendra, C.; Swamy, U.R.; Azmera, C.N.; Sreenu, M.; Nimmala, S. Real-Time Face Mask Detection Using Deep Learning: Enhancing Public Health and Safety. E3S Web Conf. 2025, 616, 02013. [Google Scholar] [CrossRef]
  15. Sheikh, B.U.H.; Zafar, A. RRFMDS: Rapid Real-Time Face Mask Detection System for Effective COVID-19 Monitoring. SN Comput. Sci. 2023, 4, 288. [Google Scholar] [CrossRef]
  16. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2323. [Google Scholar] [CrossRef]
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  18. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015; Available online: https://arxiv.org/pdf/1409.1556 (accessed on 4 December 2025).
  19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  20. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI 2017, San Francisco, CA, USA, 4–9 February 2017; Association for the Advancement of Artificial Intelligence: Palo Alto, CA, USA, 2017; pp. 4278–4284. [Google Scholar] [CrossRef]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
  22. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  23. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2016; pp. 1800–1807. [Google Scholar] [CrossRef]
  24. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2020; pp. 10425–10433. [Google Scholar] [CrossRef]
  25. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2014; pp. 580–587. [Google Scholar] [CrossRef]
  26. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
  28. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  29. Cabani, A.; Hammoudi, K.; Benhabiles, H.; Melkemi, M. MaskedFace-Net—A dataset of correctly/incorrectly masked face images in the context of COVID-19. Smart Health 2020, 19, 100144. [Google Scholar] [CrossRef]
  30. Jiang, X.; Gao, T.; Zhu, Z.; Zhao, Y. Real-Time Face Mask Detection Method Based on YOLOv3. Electronics 2021, 10, 837. [Google Scholar] [CrossRef]
  31. Mahmoud, M.; Kasem, M.S.E.; Kang, H.S. A Comprehensive Survey of Masked Faces: Recognition, Detection, and Unmasking. Appl. Sci. 2024, 14, 8781. [Google Scholar] [CrossRef]
  32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  33. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  34. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  35. Al-Rammahi, A.H.I. Face mask recognition system using MobileNetV2 with optimization function. Appl. Artif. Intell. 2022, 36, 2145638. [Google Scholar] [CrossRef]
  36. Fadly, F.; Kurniawan, T.B.; Dewi, D.A.; Zakaria, M.Z.; Hisham, P.A.A.B. Deep Learning Based Face Mask Detection System Using MobileNetV2 for Enhanced Health Protocol Compliance. J. Appl. Data Sci. 2024, 5, 2067–2078. [Google Scholar] [CrossRef]
  37. Nagrath, P.; Jain, R.; Madan, A.; Arora, R.; Kataria, P.; Hemanth, J. SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 2021, 66, 102692. [Google Scholar] [CrossRef]
  38. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  39. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv 2016, arXiv:1602.07360. [Google Scholar] [CrossRef]
  40. Sharma, M.; Gunwant, H.; Saggar, P.; Gupta, L.; Gupta, D. EfficientNet-B0 Model for Face Mask Detection Based on Social Information Retrieval. Int. J. Inf. Syst. Model. Des. 2022, 13, 15. [Google Scholar] [CrossRef]
  41. Azouji, N.; Sami, A.; Taheri, M. EfficientMask-Net for face authentication in the era of COVID-19 pandemic. Signal Image Video Process. 2022, 16, 1991–1999. [Google Scholar] [CrossRef]
  42. Thuan, C.H.; Nguyen, V.D. Face Mask Detection Using YOLOv8 with Fine-Tuning and EfficientNet Backbone. In Proceedings of the International Conference on Sustainable Computing, ICSC 2025, Ho Chi Minh City, Vietnam, 16–17 June 2025; Lecture Notes in Electrical Engineering; Goyal, N., Nguyen, T.N., Lata, M., Ogunmola, G.A., Eds.; Springer: Singapore, 2026; Volume 1530. [Google Scholar] [CrossRef]
  43. Benitez-Garcia, G.; Prudente-Tixteco, L.; Olivares-Mercado, J.; Takahashi, H. SqueezeMaskNet: Real-Time Mask-Wearing Recognition for Edge Devices. Big Data Cogn. Comput. 2025, 9, 10. [Google Scholar] [CrossRef]
  44. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. Fighting against COVID-19: A novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2021, 65, 102600. [Google Scholar] [CrossRef] [PubMed]
  45. Karthikeyan, B.; Gowri, S. A Real-Time Face Mask Detection Using SSD and MobileNetV2. In Proceedings of the 2021 4th International Conference on Computing and Communications Technologies, ICCCT 2021, Chennai, India, 16–17 December 2021; pp. 144–148. [Google Scholar] [CrossRef]
  46. Pham, T.N.; Nguyen, V.H.; Huh, J.H. Integration of improved YOLOv5 for face mask detector and auto-labeling to generate dataset for fighting against COVID-19. J. Supercomput. 2023, 79, 8966–8992. [Google Scholar] [CrossRef]
  47. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 2021, 167, 108288. [Google Scholar] [CrossRef] [PubMed]
  48. Tabassum, T.; Talukder, A.; Rahman, M.; Rashiduzzaman; Kabir, Z.; Islam, M.; Uddin, A. A Parallel Convolutional Neural Network for Accurate Face Mask Detection in the Fight Against COVID-19. Biomed. Mater. Devices 2025, 4, 2347–2357. [Google Scholar] [CrossRef]
  49. Haque, S.B.U. A fuzzy-based frame transformation to mitigate the impact of adversarial attacks in deep learning-based real-time video surveillance systems. Appl. Soft Comput. 2024, 167, 112440. [Google Scholar] [CrossRef]
  50. Dubey, P.; Dubey, P.; Iwendi, C.; Biamba, C.N.; Rao, D.D. Enhanced IoT-Based Face Mask Detection Framework Using Optimized Deep Learning Models: A Hybrid Approach with Adaptive Algorithms. IEEE Access 2025, 13, 17325–17339. [Google Scholar] [CrossRef]
  51. Parikh, D.; Karthikeyan, A.; Ravi, V.; Shibu, M.; Singh, R.; Sofana, R.S. IoT and ML-driven framework for managing infectious disease risks in communal spaces: A post-COVID perspective. Front. Public Health 2025, 13, 1552515. [Google Scholar] [CrossRef]
  52. Truong, C.D.; Mishra, S.; Long, N.Q.; Ngoc, L.A. Efficient Face Mask Detection for Banking Information Systems. In Creative Approaches Towards Development of Computing and Multidisciplinary IT Solutions for Society; Scrivener Publishing LLC: Beverly, MA, USA, 2024; pp. 435–454. [Google Scholar] [CrossRef]
  53. Himeur, Y.; Al-Maadeed, S.; Varlamis, I.; Al-Maadeed, N.; Abualsaud, K.; Mohamed, A. Face Mask Detection in Smart Cities Using Deep and Transfer Learning: Lessons Learned from the COVID-19 Pandemic. Systems 2023, 11, 107. [Google Scholar] [CrossRef]
  54. George, A.; Ecabert, C.; Shahreza, H.O.; Kotwal, K.; Marcel, S. EdgeFace: Efficient Face Recognition Model for Edge Devices. IEEE Trans. Biom. Behav. Identity Sci. 2024, 6, 158–168. [Google Scholar] [CrossRef]
  55. Anh, T.N.; Nguyen, V.D. MAPBoost: Augmentation-resilient real-time object detection for edge deployment: Augmentation-resilient lightweight detection. J. Real-Time Image Process. 2026, 23, 10. [Google Scholar] [CrossRef]
  56. Hamdi, A.; Noura, H.; Azar, J.; Pujolle, G. Frugal Object Detection Models: Solutions, Challenges and Future Directions. In Proceedings of the 21st International Wireless Communications and Mobile Computing Conference, IWCMC 2025, Montreal, QC, Canada, 12–16 May 2025; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2025; pp. 1694–1701. [Google Scholar] [CrossRef]
  57. Qian, J.; Mu, S.; Lu, H.; Xu, S. Two-stage model re-optimization and application in face recognition. Neurocomputing 2025, 651, 130805. [Google Scholar] [CrossRef]
  58. Mostafa, S.A.; Ravi, S.; Zebari, D.A.; Zebari, N.A.; Mohammed, M.A.; Nedoma, J.; Martinek, R.; Deveci, M.; Ding, W. A YOLO-based deep learning model for Real-Time face mask detection via drone surveillance in public spaces. Inf. Sci. 2024, 676, 120865. [Google Scholar] [CrossRef]
  59. Hussain, D.; Ismail, M.; Hussain, I.; Alroobaea, R.; Hussain, S.; Ullah, S.S. Face Mask Detection Using Deep Convolutional Neural Network and MobileNetV2-Based Transfer Learning. Wirel. Commun. Mob. Comput. 2022, 2022, 1536318. [Google Scholar] [CrossRef]
  60. Hagui, I.; Msolli, A.; Helali, A.; Fredj, H. Face Mask Detection using CNN: A Fusion of Cryptography and Blockchain. Eng. Technol. Appl. Sci. Res. 2024, 14, 17156–17161. [Google Scholar] [CrossRef]
  61. Umer, M.; Sadiq, S.; Alhebshi, R.M.; Alsubai, S.; Al Hejaili, A.; Eshmawi, A.A.; Nappi, M.; Ashraf, I. Face mask detection using deep convolutional neural network and multi-stage image processing. Image Vis. Comput. 2023, 133, 104657. [Google Scholar] [CrossRef]
  62. Benifa, J.V.B.; Chola, C.; Muaad, A.Y.; Bin Hayat, M.A.; Bin Heyat, B.; Mehrotra, R.; Akhtar, F.; Hussein, H.S.; Vargas, D.L.R.; Castilla, Á.K.; et al. FMDNet: An Efficient System for Face Mask Detection Based on Lightweight Model during COVID-19 Pandemic in Public Areas. Sensors 2023, 23, 6090. [Google Scholar] [CrossRef]
  63. Bania, R.K. Ensemble of deep transfer learning models for real-time automatic detection of face mask. Multimed. Tools Appl. 2023, 82, 1. [Google Scholar] [CrossRef]
  64. Habeeb, Z.Q.; Al-Zaydi, I. Incorrect facemask-wearing detection using image processing and deep learning. Bull. Electr. Eng. Inform. 2023, 12, 2212–2219. [Google Scholar] [CrossRef]
  65. Kumar, A.; Kalia, A.; Kalia, A. ETL-YOLO v4: A face mask detection algorithm in era of COVID-19 pandemic. Optik 2022, 259, 169051. [Google Scholar] [CrossRef]
  66. Hosny, K.M.; Ibrahim, N.A.; Mohamed, E.R.; Hamza, H.M. Artificial intelligence-based masked face detection: A survey. Intell. Syst. Appl. 2024, 22, 200391. [Google Scholar] [CrossRef]
  67. Mbunge, E.; Simelane, S.; Fashoto, S.G.; Akinnuwesi, B.; Metfula, A.S. Application of deep learning and machine learning models to detect COVID-19 face masks—A review. Sustain. Oper. Comput. 2021, 2, 235–245. [Google Scholar] [CrossRef]
  68. Mulani, A.O.; Kulkarni, T.M. Face Mask Detection System Using Deep Learning: A Comprehensive Survey. Commun. Comput. Inf. Sci. 2025, 2439, 25–33. [Google Scholar] [CrossRef]
  69. Jayaswal, R.; Dixit, M. AI-based face mask detection system: A straightforward proposition to fight with Covid-19 situation. Multimed. Tools Appl. 2022, 82, 13241–13273. [Google Scholar] [CrossRef]
  70. Vukicevic, A.M.; Petrovic, M.; Milosevic, P.; Peulic, A.; Jovanovic, K.; Novakovic, A. A systematic review of computer vision-based personal protective equipment compliance in industry practice: Advancements, challenges and future directions. Artif. Intell. Rev. 2024, 57, 319. [Google Scholar] [CrossRef]
  71. Benitez-Baltazar, V.H.; Pacheco-Ramírez, J.H.; Moreno-Ruiz, J.R.; Núñez-Gurrola, C. Autonomic Face Mask Detection with Deep Learning: An IoT Application. Rev. Mex. De Ing. Biomédica 2021, 42, 160–170. [Google Scholar] [CrossRef]
  72. Han, Z.; Huang, H.; Fan, Q.; Li, Y.; Li, Y.; Chen, X. SMD-YOLO: An efficient and lightweight detection method for mask wearing status during the COVID-19 pandemic. Comput. Methods Programs Biomed. 2022, 221, 106888. [Google Scholar] [CrossRef] [PubMed]
  73. Biswas, A.K.; Roy, K. A comparative study on ‘face mask detection’ using machine learning and deep learning algorithms. In Artificial Intelligence in e-Health Framework, Volume 1: AI, Classification, Wearable Devices, and Computer-Aided Diagnosis; Academic Press: Cambridge, MA, USA, 2025; pp. 193–200. [Google Scholar] [CrossRef]
  74. Masud, U.; Siddiqui, M.; Sadiq, M.; Masood, S. SCS-Net: An efficient and practical approach towards Face Mask Detection. Procedia Comput. Sci. 2023, 218, 1878–1887. [Google Scholar] [CrossRef]
  75. Sahoo, M.P.; Sridevi, M.; Sridhar, R. Covid prevention based on identification of incorrect position of face-mask. Procedia Comput. Sci. 2024, 235, 1222–1234. [Google Scholar] [CrossRef]
  76. Koklu, M.; Cinar, I.; Taspinar, Y.S. CNN-based bi-directional and directional long-short term memory network for determination of face mask. Biomed. Signal Process. Control 2022, 71, 103216. [Google Scholar] [CrossRef]
  77. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  78. Mao, Y.; Lv, Y.; Zhang, G.; Gui, X. Exploring Transformer for Face Mask Detection. IEEE Access 2024, 12, 118377–118388. [Google Scholar] [CrossRef]
  79. Kuriakose, B.; Shrestha, R.; Sandnes, F.E. DeepNAVI: A deep learning based smartphone navigation assistant for people with visual impairments. Expert Syst. Appl. 2023, 212, 118720. [Google Scholar] [CrossRef]
  80. Tomiło, P.; Oleszczuk, P.; Laskowska, A.; Wilczewska, W.; Gnapowski, E. Effect of Architecture and Inference Parameters of Artificial Neural Network Models in the Detection Task on Energy Demand. Energies 2024, 17, 5417. [Google Scholar] [CrossRef]
  81. Lahmer, S.; Khoshsirat, A.; Rossi, M.; Zanella, A. Energy Consumption of Neural Networks on NVIDIA Edge Boards: An Empirical Model. In Proceedings of the 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Torino, Italy, 19–23 September 2022. [Google Scholar] [CrossRef]
  82. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. Proc. Mach. Learn. Res. 2021, 139, 10347–10357. [Google Scholar]
  83. d’Ascoli, S.; Touvron, H.; Leavitt, M.; Morcos, A.; Biroli, G.; Sagun, L. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. J. Stat. Mech. Theory Exp. 2021, 2022, 139. [Google Scholar] [CrossRef]
  84. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. Available online: http://arxiv.org/abs/2103.15808 (accessed on 8 February 2026).
  85. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022; Available online: https://arxiv.org/pdf/2110.02178 (accessed on 8 February 2026).
  86. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5260–5269. Available online: http://arxiv.org/abs/2108.05895 (accessed on 8 February 2026).
Figure 1. Overview of the framework adopted in this study.
Figure 2. Taxonomy of deep learning approaches for face mask detection systems.
Table 1. Summary of Included Face Mask Detection Studies.

| Ref. | Experiment | Goal | Materials | Methods | Results | Conclusion |
|---|---|---|---|---|---|---|
| [3] | Face mask detection | To propose a novel face mask detection framework, FMD-Yolo, to monitor whether people wear masks correctly in public, an effective way to block virus transmission. | Im-Res2Net-101 feature extractor, enhanced path aggregation network En-PAN, localization loss, Matrix NMS method | Im-Res2Net-101 used for feature extraction with En-PAN for feature fusion; localization loss applied during training and Matrix NMS used at inference. | FMD-Yolo achieved the best precision, with AP50 of 92.0% and 88.4% on the two datasets; AP75 at Intersection over Union (IoU) = 0.75 improved by 5.5% and 3.9%, respectively, over the second-best model. | The results demonstrate the superiority of FMD-Yolo in face mask detection, with both theoretical and practical significance. |
| [10] | Object detection | To develop multi-class mask compliance detection on Raspberry Pi 4 using SSDLite MobileNetV3. | Raspberry Pi 4 Model B 4 GB, Raspberry Pi 4 Camera V1, monitor, non-momentary push-button switch, fan, 1N4001 diode, 3× 470 Ω resistors, 2N2222 transistor | 1. Trained the SSDLite MobileNetV3 Small model with and without fine-tuning. 2. Compared its detection performance with other models. 3. Evaluated detection, FPS, and power consumption. | SSDLite MobileNetV3 Small achieved the highest FPS but showed limited accuracy for incorrect mask detection; overall accuracy was 70%. | SSDLite MobileNetV3 Small offers faster detection than the others but is less effective than SSDLite MobileNetV2 in identifying incorrect mask usage. |
| [10] | Object detection model comparison | To compare SSDLite MobileNetV3 Small, SSDLite MobileNetV3 Large, and SSDLite MobileNetV2. | Raspberry Pi 4 Model B 4 GB, Raspberry Pi 4 Camera V1, dataset of face images with and without masks | 1. Trained the different object detection models on the face mask dataset. 2. Evaluated detection accuracy, FPS, and power consumption. | SSDLite MobileNetV2 with fine-tuning performed best; SSDLite MobileNetV3 Small had the highest FPS but limited detection. | SSDLite MobileNetV2 is the most suitable model for face mask detection on Raspberry Pi 4. |
| [11] | Empirical study | To study model scaling and balance depth, width, and resolution for improved performance. | Convolutional Neural Networks (ConvNets) | Systematically studied scaling up ConvNets by adjusting network depth, width, and resolution. | Scaling up any dimension of width, depth, or resolution improves accuracy, but the gain diminishes for bigger models. | Carefully balancing network width, depth, and resolution is an important but missing piece, preventing better accuracy and efficiency. |
| [11] | Methodology development | To propose a scaling method that uniformly scales depth, width, and resolution using a simple yet highly effective compound coefficient. | Convolutional Neural Networks (ConvNets) | Proposed a compound scaling method that uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. | The compound scaling method achieves better accuracy and efficiency than conventional single-dimension scaling. | Compound scaling enables scaling a baseline ConvNet to any target resource constraint in a more principled way while maintaining efficiency. |
| [11] | Neural architecture search and model scaling | To design a new baseline network and scale it up into a family of models, called EfficientNets. | Convolutional Neural Networks (ConvNets) | Used neural architecture search to develop the EfficientNet-B0 baseline, then applied the compound scaling method to obtain the EfficientNet family. | The scaled EfficientNet models significantly outperform other ConvNets in accuracy and efficiency. | EfficientNet models, built with the proposed compound scaling method, achieve much better accuracy and efficiency than previous ConvNets. |
| [12] | Image classification using MobileNetV2 | To develop a face mask detection model that authorities can use for mitigation, evaluation, prevention, and action planning against COVID-19. | 1916 images of people wearing masks, 1930 images of people not wearing masks, image size of 224 × 224 pixels | Data collection and preprocessing; MobileNetV2-based model training with augmentation; evaluation using accuracy, precision, recall, and F1-score. | The model detects people wearing and not wearing face masks with an accuracy of 96.85%. | Supports monitoring and enforcement of face mask policies for COVID-19 mitigation. |
| [12] | Application of the model to real-world data | To apply the model to images from 25 cities in Indonesia and analyze the percentage of people wearing face masks in each city. | Images from public-place CCTV, shops, and traffic cameras in 25 Indonesian cities, selected based on data availability | Applied the trained model to images from the 25 cities and calculated the percentage of people wearing and not wearing masks in each city. | The percentage of people not wearing face masks ranged from 64.14% (Surabaya) to 82.76% (Jambi). | Mask usage differs across cities, with some showing notably lower compliance; this helps authorities target interventions and allocate resources where mask-wearing is weakest. |
| [12] | Correlation analysis | To validate the mask-wearing percentage data by correlating it with the COVID-19 vigilance index. | Mask-wearing percentages for the 25 cities, COVID-19 vigilance index data | Conducted a bivariate correlation analysis between mask-wearing percentage and the COVID-19 vigilance index. | The two variables show a strong, negative, and significant correlation of −0.62. | The model’s mask-wearing data align with the vigilance index: cities with lower mask-wearing rates require higher vigilance against transmission. |
| [13] | Algorithm development | To propose a novel lightweight object detector, LFMD-YOLO (lightweight FMD through You Only Look Once), achieving an excellent balance of precision and speed. | Cross Stage Partial bottleneck with three convolutions and Efficient Channel Attention (C3E), Max-pooling Efficient Channel Attention Pyramid Fast (MECAPF) module, custom backbone, Enhanced Bidirectional Feature Pyramid Network (E-BiFPN), detection heads, IoU | Designed C3E and MECAPF modules, proposed a custom backbone, integrated E-BiFPN for multi-scale feature fusion, and enhanced detection heads with improved IoU. | LFMD-YOLO achieves higher detection accuracy, with mAPs of 68.7% and 60.1%, respectively, while using fewer parameters and giga floating-point operations (GFLOPs). | LFMD-YOLO achieves an excellent balance of precision and speed for lightweight face mask detection. |
| [14] | Deep learning-based face mask detection | To develop a deep learning system for real-time face mask detection to enhance public health monitoring in environments where mask compliance is critical. | Convolutional Neural Network (CNN) built with TensorFlow and Keras, diverse input images, Google Colab, Google Drive | Used a CNN to classify individuals as mask-wearing or non-mask-wearing; applied data preprocessing and augmentation for robustness and generalizability; leveraged cloud resources for training and deployment. | High training and validation accuracy, consistent loss reduction, and strong real-time detection; reliable despite minor validation fluctuations, demonstrating suitability for varied environments. | The system detects mask usage in real time; data augmentation improves generalization, allowing reliable performance across varied scenarios and image conditions. |
| [15] | Face mask detection system development | To develop a rapid real-time face mask detection system (RRFMDS) for effective COVID-19 monitoring. | Single-shot multi-box detector based on ResNet-10, fine-tuned MobileNetV2, custom dataset of 14,535 images (5000 incorrect mask, 4789 with mask, 4746 without mask) | Used the single-shot multi-box detector for face detection and fine-tuned MobileNetV2 for mask classification; trained on the custom dataset. | Detects all three classes (incorrect mask, with mask, without mask) with average accuracies of 99.15% on training data and 97.81% on testing data; processes a frame in 0.142 s on average. | RRFMDS is a lightweight, efficient approach for real-time face mask detection from video, outperforming state-of-the-art models in accuracy and processing speed. |
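Several of the detection results above are reported as AP50 and AP75, i.e., average precision where a prediction counts as a true positive only if its Intersection over Union (IoU) with a ground-truth box exceeds 0.5 or 0.75. As a minimal illustration (the two boxes below are hypothetical, not taken from any cited study), IoU for axis-aligned boxes can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical ground-truth and predicted face boxes.
gt = (10, 10, 60, 60)
pred = (20, 15, 65, 60)
score = iou(gt, pred)
# Counts as a true positive at the AP50 threshold, but not at the stricter AP75.
print(f"IoU = {score:.3f}, AP50 hit: {score >= 0.5}, AP75 hit: {score >= 0.75}")
```

This is why a model can score well at AP50 yet drop sharply at AP75: looser localization still clears the 0.5 threshold but fails the 0.75 one.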
Table 2. Major Conventional CNN Architectures.

| Architecture | Year | Key Innovation | Parameter Count | Strengths | Limitations | Ref. |
|---|---|---|---|---|---|---|
| LeNet-5 | 1998 | Early CNN architecture (convolution + pooling) | ~60 K | Simple, stable | Too shallow for modern tasks | [16] |
| AlexNet | 2012 | ReLU, dropout, GPU training | ~60 M | Started modern deep learning | Heavy; not edge-friendly | [17] |
| VGG16/VGG19 | 2014 | Deep stacks of 3 × 3 convolution layers | ~138 M | Strong features | Extremely large and slow | [18] |
| Inception-v1 | 2015 | Multi-branch convolutions | ~6.8 M | Efficient, flexible | Complex structure | [19] |
| Inception-ResNet | 2017 | Residual + inception blocks | 23–55 M | Very accurate | Heavy | [20] |
| ResNet (18–101) | 2016 | Skip connections | 11–44 M | Deep and stable | Still heavy for edge | [21] |
| DenseNet121 | 2017 | Dense connectivity | ~8 M | High feature reuse | Slow inference | [22] |
| Xception | 2017 | Depthwise separable convolution | ~22 M | Good efficiency | Not lightweight enough | [23] |
| Faster R-CNN | 2015 | Two-stage region detector | Backbone-dependent | Accurate | Slow without GPU | [27] |
| Mask R-CNN | 2017 | Adds segmentation branch | Backbone-dependent | Detects improper masks | Heavy for edge | [28] |
| RegNet | 2020 | Regular network design space | 10–50 M | Strong accuracy | Rarely used in mask detection | [24] |
Table 3. Lightweight Architectures in Face Mask Detection.

| Model Type | Key Architectural Concept | Approx. Parameters/Complexity | Typical Usage in Mask Detection |
|---|---|---|---|
| MobileNetV2 | Depthwise separable convolutions with inverted residual bottlenecks | ~3.4 M parameters (α = 1.0) | Most widely adopted lightweight backbone; real-time mask/no-mask or 3-class classification on embedded devices. |
| EfficientNet-B0 | Compound scaling of depth, width, and resolution | ~5.3 M parameters | Used in high-accuracy systems (e.g., EfficientMask-Net); suitable for improper mask detection with slightly higher computational needs. |
| ShuffleNet | Grouped 1 × 1 convolution with channel shuffle | ~2.3 M parameters (1.0×) | Limited adoption; tested in low-resource conditions but less consistent than MobileNet. |
| SqueezeNet/SqueezeMaskNet | Fire module (1 × 1 squeeze + expand) with attention extensions | ~1.2 M (SqueezeNet), ~1.5 M (SqueezeMaskNet) | Designed for real-time multi-class classification; high FPS on Jetson-class edge hardware. |
| EfficientMask-Net | EfficientNet-B0 backbone with large-margin piecewise-linear classifier (LMPL) | ~5.3 M parameters | Achieves up to 99.6% accuracy; offers detailed detection of improper mask positioning (nose/chin uncovered). |
| Hybrid CNN–YOLO variants (e.g., MobileNetV2 + YOLO) | Lightweight backbone with optimised detection head | Varies (<8 M total) | Used for real-time detection + localisation in surveillance and compliance monitoring; effective for streaming environments. |
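The parameter savings behind MobileNet- and Xception-style backbones in Table 3 come from factorizing a standard k × k convolution into a per-channel depthwise filter plus a 1 × 1 pointwise projection. A back-of-the-envelope sketch (weights only; biases and batch-normalization parameters are ignored for simplicity) shows the roughly 8–9× reduction for 3 × 3 kernels:

```python
def standard_conv_params(k, c_in, c_out):
    # A k x k standard convolution: every output channel mixes all input channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # One k x k depthwise filter per input channel, then a 1 x 1 pointwise projection.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 128          # a typical mid-network layer shape
std = standard_conv_params(k, c_in, c_out)        # 147,456 weights
dws = depthwise_separable_params(k, c_in, c_out)  # 17,536 weights
print(f"standard: {std}, depthwise separable: {dws}, reduction: {std / dws:.1f}x")
```

Repeated across every layer, this factorization is the main reason the backbones above land in the 1 M–5 M parameter range rather than the tens of millions typical of Table 2.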
Table 4. Hybrid Architectures for Face Mask Detection.

| Hybrid Architecture | Backbone Type | Detection/Classification Head | Key Idea | Reported Strengths | Ref. |
|---|---|---|---|---|---|
| YOLOv3-Based Hybrid Detector | CSPDarknet-style backbone | YOLOv3 detection head | Full detector tailored to mask usage | Real-time performance with strong localization | [30] |
| YOLOv2–ResNet50 | ResNet50 (heavy backbone) | YOLOv2 one-stage detector | Combine high-level semantic features with fast one-stage detection | High accuracy in medical mask detection; good robustness | [44] |
| MobileNetV2 + SSD | MobileNetV2 (lightweight) | SSD one-stage detector | Lightweight backbone with efficient localization | Real-time mask detection on edge devices | [45] |
| YOLOv5 + Coordinate Attention | YOLOv5 backbone | Attention-enhanced detection head | Spatial refinement + auto-labelling | Strong mean Average Precision (mAP) improvement; suitable for embedded devices | [46] |
| CNN Feature Extractor + SVM/ML Classifier | VGG19, ResNet, MobileNet | SVM/KNN/RF classifiers | Deep features + classical ML | Good performance on small datasets; simpler deployment | [47] |
| Smart-City System-Level Hybrid | CNN/YOLO backbone | IoT + Edge-tier inference pipeline | Combines DL, transfer learning, and IoT | Scalable deployment across large environments | [53] |
Table 5. Evaluation Metrics Used in Face Mask Detection Studies.

| Study (Ref.) | Accuracy | Precision | Recall | F1-Score | AP | mAP | ROC/AUC | Use Case/Interpretation in Mask Detection |
|---|---|---|---|---|---|---|---|---|
| [2] (single-stage and two-stage object detectors) | | | | | | | | Binary classifier; strong balanced metrics on curated datasets |
| [3] (FMD-YOLO) | | | | | ✓ | ✓ | | YOLO detection; AP/mAP used for bounding-box evaluation |
| [5] (DeepMaskNet) | | | | | | | | Detection + masked-face recognition; reports full metric suite |
| [18] (VGG16/VGG19) | ✓ (ImageNet) | | | | | | | Backbone for early mask-classification pipelines |
| [21] ResNet | ✓ (ImageNet) | | | | | | | Backbone widely reused in mask detection & compliance tasks |
| [25] R-CNN | | | | | | | | Basis for two-stage detectors adapted for mask detection |
| [27] Faster R-CNN | | | | | ✓ | ✓ | | Used in early mask detectors assessing region-level AP/mAP |
| [33] MobileNetV2 | | | | | | | | Lightweight backbone for fast mask/no-mask classification |
| [37] (SSDMNV2) | | | | | | | | SSD + MobileNetV2; used in real-time mask detection systems |
| [44] (YOLOv2–ResNet50) | | | | | | | | Hybrid YOLO-based medical mask detector |
| [63] (Ensemble Classification Model) | | | | | | | ✓ | Ensemble ResNet50/Inception/VGG; includes ROC curve & AUC ≈ 0.99 |
| [71] (IoT Mask Detection) | | | | | | | ✓ | IoT access-control system; explicitly reports ROC curve & AUC ≈ 0.96 |
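The classification metrics in Table 5 all derive from the confusion matrix. A minimal sketch for the common three-class protocol (with mask, without mask, incorrect mask); the counts below are hypothetical, not taken from any cited study:

```python
def per_class_metrics(cm, classes):
    """Accuracy plus per-class precision/recall/F1 from a square confusion
    matrix where cm[i][j] = samples of true class i predicted as class j."""
    n = len(classes)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    report = {}
    for i, name in enumerate(classes):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # predicted i, true class differs
        fn = sum(cm[i]) - tp                        # true i, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[name] = (precision, recall, f1)
    return accuracy, report

classes = ["with_mask", "without_mask", "incorrect_mask"]
cm = [[95, 2, 3],      # hypothetical counts
      [4, 90, 6],
      [10, 5, 85]]
acc, rep = per_class_metrics(cm, classes)
print(f"accuracy = {acc:.3f}")
for name, (p, r, f1) in rep.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Per-class recall matters here because overall accuracy can stay high even when the minority "incorrect mask" class, which several studies report as the hardest, is poorly detected.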
Table 6. Architecture Families Used in Face Mask Detection.
Table 6. Architecture Families Used in Face Mask Detection.
Architecture FamilyRepresentative ModelsParameter ScaleFPS on Edge Devices (Jetson/RPi/Low-Power GPU)Memory/Deployment CharacteristicsSuitability for Real-Time Mask Detection
| Architecture Category | Representative Models | Parameters | Typical Speed | Memory/Compute Footprint | Deployment Suitability |
|---|---|---|---|---|---|
| Conventional CNN Backbones | VGG16/19, ResNet50 [18,21], DenseNet121 [22], InceptionV3 [19], classical transfer-learning approaches | High (8 M–140 M+) | Low–Moderate (<10–15 FPS without optimization) | Require GPU-class memory; heavy compute | High accuracy under controlled datasets but generally not suitable for real-time edge deployment |
| Two-Stage Detectors (R-CNN Family) | R-CNN [25], Fast R-CNN [26], Faster R-CNN [27], Mask R-CNN [28] | High + region-proposal overhead | Low (<5–10 FPS on Jetson; often <5 FPS on RPi) | Large VRAM usage; very slow on CPUs | Excellent detection accuracy, but too slow for practical edge-device mask monitoring |
| Single-Stage Detectors (Heavy Backbones) | YOLOv2–ResNet50 [44], ETL-YOLOv4 [65], drone-based YOLO [58] | Moderate–High (40 M–60 M+) | Moderate (10–30 FPS on Jetson Xavier; <15 FPS on Nano/RPi) | Need GPU acceleration; moderate memory | Suitable for edge devices only with optimization; strong accuracy but mixed speed |
| Lightweight CNN Backbones (Classification) | MobileNetV1/V2/V3 [32,33,34], EfficientNet-B0 [11], ShuffleNet [38], SqueezeNet [39], mask-detection works [42] | Low (1 M–5 M) | High (30–60 FPS on Jetson Nano; usable on RPi) | Very small footprint; easy to quantize and prune; CPU-friendly | Excellent for fast mask classification once faces are detected; ideal for edge and mobile deployment |
| Lightweight Single-Stage Detectors | SSD-MobileNetV2 (SSDMNV2) [37,45], EfficientMask-Net [41], YOLOv4-tiny/YOLOv5-s variants | Low–Moderate (2 M–10 M) | High (25–90 FPS depending on platform) | Optimized for low memory; fits into IoT/embedded systems | Best trade-off between accuracy and speed; preferred choice for real-time mask detection on edge devices |
| Hybrid & Attention-Enhanced Architectures | YOLOv5 + CoordAttention [46], IoT-optimized deep learning [50] | Low–Moderate (slightly higher due to attention modules) | High (25–60 FPS with optimized pipelines) | Slightly heavier than lightweight CNNs but still edge-deployable | Very promising direction: improved robustness (occlusion, clutter) while remaining efficient |
| Extreme Lightweight/Frugal/Deployment-Engineered Models | Frugal object detectors [56], augmentation-resilient object detectors [55] | Very Low (<1 M–3 M) | Very High (60+ FPS even on modest devices) | Minimal memory; optimized for microcontrollers, Neural Processing Units (NPUs), or minimal-GPU boards | Ideal for massive IoT, smart-city nodes, or hundreds of camera feeds with strict power limits; slight accuracy trade-off |
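The parameter-count ranges in the table above can serve as a coarse screening rule when shortlisting candidate models for a target device. The following sketch maps a model's parameter count to the closest architecture family from the table; the function name and the exact thresholds are illustrative (the table's ranges overlap, so boundaries here are a simplification), and the example parameter counts for MobileNetV2 (~3.4 M) and SCS-Net (0.12 M) are taken from widely reported figures and Table 8, respectively.

```python
# Coarse feasibility screen based on the parameter-count ranges reported
# in the architecture comparison table. Thresholds are illustrative, not
# normative: the published ranges overlap, so this picks one cut per band.

def architecture_category(params_millions: float) -> str:
    """Map a model's parameter count (in millions) to the closest
    architecture family from the comparison table."""
    if params_millions < 1:
        return "extreme lightweight / frugal"
    if params_millions <= 5:
        return "lightweight CNN backbone"
    if params_millions <= 10:
        return "lightweight single-stage detector"
    if params_millions <= 60:
        return "single-stage detector (heavy backbone)"
    return "conventional CNN / two-stage detector"

# Examples drawn from the reviewed models:
print(architecture_category(0.12))  # SCS-Net
print(architecture_category(3.4))   # MobileNetV2 (approx. parameter count)
print(architecture_category(6.0))   # YOLOv4-tiny
```

Such a screen is only a first filter; actual deployment feasibility also depends on input resolution, quantization, and the target device's memory and accelerator support, as the FPS columns above make clear.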
Table 7. Comparison of deployment platforms, model performance (FPS/latency), and power/energy in FMD systems.
| Ref | Device/Platform | Model/Backbone | Task | FPS/Latency | Power (W)/Energy (J) | Measured vs. Estimated |
|---|---|---|---|---|---|---|
| [4] | Raspberry Pi 4 | MobileNetV3 | Image Classification | 19.2 ms latency | 9 W (max) | Latency measured; power estimated |
| [4] | Intel NCS2 + Raspberry Pi 4 | MobileNetV3 | Image Classification | 9.5 ms latency | 2 W (max) | Latency measured; power estimated |
| [4] | Jetson Nano | MobileNetV3 | Image Classification | 5.09 ms latency | 10 W (max) | Latency measured; power estimated |
| [4] | Jetson Xavier NX | MobileNetV3 | Image Classification | 1.22 ms latency | 15 W (max) | Latency measured; power estimated |
| [4] | Raspberry Pi 4 | SSDLite MobileNetV3 | Object Detection | 47 ms latency | 9 W (max) | Latency measured; power estimated |
| [4] | Jetson Xavier NX | SSDLite MobileNetV3 | Object Detection | 2.9 ms latency | 15 W (max) | Latency measured; power estimated |
| [10] | Raspberry Pi 4 | SSDLite MobileNetV3 Small | Object Detection | 8.67–9.79 FPS | 7.4–8.0 W | Measured |
| [10] | Raspberry Pi 4 | SSDLite MobileNetV3 Large | Object Detection | 3.81–4.26 FPS | 7.3–8.0 W | Measured |
| [10] | Raspberry Pi 4 | SSDLite MobileNetV2 | Object Detection | 3.33–3.57 FPS | 7.2–7.9 W | Measured |
| [37] | Laptop (i7-8750H + GTX1050Ti) | SSDMNV2 (SSD-ResNet10 + MobileNetV2) | Object Detection | 15.71 FPS | Not reported | Measured |
| [43] | Jetson Orin NX | SqueezeMaskNet | Object Detection | 96 FPS | Not reported | Measured |
| [43] | Jetson Xavier NX | SqueezeMaskNet | Object Detection | 84 FPS | Not reported | Measured |
| [43] | Jetson Orin Nano | SqueezeMaskNet | Object Detection | 74 FPS | Not reported | Measured |
| [43] | RTX 2080 Super GPU | SqueezeMaskNet | Object Detection | 297 FPS | Not reported | Measured |
| [80] | RTX 3090 GPU | YOLOv8n/YOLOv9t/YOLOv10n | Object Detection | 1.78–3.16 min inference time | ~144 W | Power measured |
| [80] | Jetson Xavier NX | YOLOv8n | Object Detection | 3.82 min inference | 7.29 W | Measured |
| [81] | Jetson TX2 | Conv & Fully Connected NN layers | Neural Network Inference | Not reported | Energy per inference (J) | Measured + modeled |
| [81] | Jetson Xavier NX | Conv & Fully Connected NN layers | Neural Network Inference | Not reported | Energy per inference (J) | Measured + modeled |
Note: Devices reported in this table include Raspberry Pi 4 (Raspberry Pi Ltd., Cambridge, UK), Intel NCS2 (Intel Corporation, Santa Clara, CA, USA), and NVIDIA Jetson platforms and GPUs (NVIDIA Corporation, Santa Clara, CA, USA). Manufacturer information refers to official product developers; procurement or sourcing details were not reported in the original studies.
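When a study reports per-frame latency alongside a power figure, as in the rows for [4] above, the two can be combined into an approximate energy cost per inference, which is often the more relevant metric for battery-powered or massively replicated deployments. The sketch below applies the standard relation energy = power × time; because the power values in [4] are device maxima rather than measured draw, the results should be read as upper-bound estimates. The function name is illustrative.

```python
def energy_per_inference_joules(latency_ms: float, power_w: float) -> float:
    """Energy (J) = power (W) x time (s). When the power figure is a
    device maximum rather than a measured draw, this is an upper bound."""
    return power_w * (latency_ms / 1000.0)

# Figures from Table 7, ref [4] (power values are device maxima):
rpi4 = energy_per_inference_joules(19.2, 9)    # Raspberry Pi 4, MobileNetV3
xnx = energy_per_inference_joules(1.22, 15)    # Jetson Xavier NX, MobileNetV3
print(f"RPi 4: <= {rpi4:.4f} J/inference")     # <= 0.1728 J
print(f"Xavier NX: <= {xnx:.4f} J/inference")  # <= 0.0183 J
```

Note how the faster accelerator can also be the more energy-efficient option per inference despite its higher power ceiling, a pattern visible throughout Table 7.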
Table 8. Accuracy–efficiency trade-offs in face mask detection models.
| Model | Parameters (Approx.) | Accuracy (%) | Speed/Resource Use | Notes |
|---|---|---|---|---|
| YOLOv4-tiny | ~6 M | Lower than YOLOv4 | Fast, low resource | 1/10th the parameters of YOLOv4 [72] |
| MobileNetV2 | Lightweight | ~92.6 | Real-time, embedded devices | Robust for real-time use [33] |
| DenseMaskNet (DenseNet201) | Heavyweight | 99 | Slower, high resource | Highest accuracy in comparison [75] |
| Mask R-CNN | Heavyweight | Highest | Not suitable for real-time | Best accuracy, poor efficiency [28] |
| Custom Lightweight Net (SCS-Net) | 0.12 M | ~95.5 | Highly efficient | Up to 496× parameter reduction [74] |
| Ensemble of Single-Stage and Two-Stage Detectors | – | 98.2 | 0.05 s/image | High accuracy and speed [2] |
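The trade-offs in Table 8 can be framed as a Pareto analysis: a model is worth shortlisting only if no other candidate is simultaneously smaller and at least as accurate. The sketch below applies this to three of the table's entries; the accuracy figures come from the table, while the parameter counts for MobileNetV2 (~3.4 M) and DenseNet201 (~20 M) are commonly cited approximations, not values reported in the reviewed studies.

```python
# Pareto-front check over (parameters, accuracy): a model is retained
# only if no other model is both smaller and at least as accurate.
# Accuracies are from Table 8; parameter counts for MobileNetV2 and
# DenseMaskNet are commonly cited approximations (assumptions).

models = {
    "MobileNetV2":  (3.4, 92.6),   # ~3.4 M params (assumed)
    "DenseMaskNet": (20.0, 99.0),  # DenseNet201 backbone, ~20 M params (assumed)
    "SCS-Net":      (0.12, 95.5),  # params and accuracy from Table 8
}

def pareto_front(candidates: dict) -> set:
    front = set()
    for name, (params, acc) in candidates.items():
        dominated = any(
            (p < params and a >= acc) or (p <= params and a > acc)
            for other, (p, a) in candidates.items() if other != name
        )
        if not dominated:
            front.add(name)
    return front

print(sorted(pareto_front(models)))  # -> ['DenseMaskNet', 'SCS-Net']
```

Under these (approximate) numbers, SCS-Net dominates MobileNetV2 on both axes, which matches the table's observation that extreme parameter reduction need not sacrifice accuracy; in practice, latency, robustness, and dataset differences would also enter the comparison.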
Table 9. Architecture selection across deployment environments for face mask detection.
| Deployment Environment | Typical Hardware | Recommended Architecture | Key Rationale |
|---|---|---|---|
| Ultra-low-power edge | Microcontrollers, ARM CPUs | MobileNetV2, ShuffleNet | Minimal parameter count |
| Edge AI devices | Jetson Nano, Raspberry Pi + NCS2 | SSD-MobileNetV2, YOLO-tiny | Real-time detection |
| Embedded GPU platforms | Jetson Xavier NX, Orin | YOLOv5s, SqueezeMaskNet | Balanced speed and accuracy |
| Cloud/server systems | GPU clusters | Faster R-CNN, ConvNeXt | Maximum accuracy |
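For deployment tooling, Table 9 can be expressed directly as a lookup so that a provisioning script selects a sensible default architecture from a coarse hardware tier. The tier identifiers and helper function below are illustrative; the recommendations themselves mirror the table.

```python
# Table 9 expressed as a lookup table mapping a coarse hardware tier to
# the architectures recommended in this review. Tier keys are illustrative.

RECOMMENDED = {
    "ultra-low-power": ["MobileNetV2", "ShuffleNet"],
    "edge-ai":         ["SSD-MobileNetV2", "YOLO-tiny"],
    "embedded-gpu":    ["YOLOv5s", "SqueezeMaskNet"],
    "cloud":           ["Faster R-CNN", "ConvNeXt"],
}

def recommend(tier: str) -> list:
    """Return the recommended architectures for a hardware tier,
    failing loudly on an unrecognized tier name."""
    try:
        return RECOMMENDED[tier]
    except KeyError:
        raise ValueError(f"unknown hardware tier: {tier!r}") from None

print(recommend("edge-ai"))  # -> ['SSD-MobileNetV2', 'YOLO-tiny']
```

A real system would refine the tier choice with measured constraints (available RAM, accelerator support, required FPS) rather than a fixed label, but the mapping captures the review's headline guidance.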

Rasheed, S. Lightweight Deep Learning Models for Face Mask Detection in Real-Time Edge Environments: A Review and Future Research Directions. Mach. Learn. Knowl. Extr. 2026, 8, 102. https://doi.org/10.3390/make8040102
