Review

Review of Masked Face Recognition Based on Deep Learning

by Bilal Saoud 1,2,*, Abdul Hakim H. M. Mohamed 3,*, Ibraheem Shayea 4, Ayman A. El-Saleh 5 and Abdulaziz Alashbi 3

1 Department of Electrical Engineering, Faculty of Applied Sciences, University of Bouira, Bouira 10000, Algeria
2 LISEA Laboratory, Faculty of Applied Sciences, University of Bouira, Bouira 10000, Algeria
3 Department of Information Systems and Business Analytics, A’Sharqiyah University (ASU), Ibra 400, Oman
4 Department of Electronics & Communications Engineering, Faculty of Electrical and Electronics Engineering, Istanbul Technical University (ITU), Istanbul 34469, Turkey
5 Department of Electrical Engineering and Computer Science, College of Engineering, A’Sharqiyah University (ASU), Ibra 400, Oman
* Authors to whom correspondence should be addressed.
Technologies 2025, 13(7), 310; https://doi.org/10.3390/technologies13070310
Submission received: 24 April 2025 / Revised: 13 June 2025 / Accepted: 19 June 2025 / Published: 21 July 2025

Abstract

With the widespread adoption of face masks due to global health crises and heightened security concerns, traditional face recognition systems have struggled to maintain accuracy, prompting significant research into masked face recognition (MFR). Although various models have been proposed, a comprehensive and systematic understanding of recent deep learning (DL)-based approaches remains limited. This paper addresses this research gap by providing an extensive review and comparative analysis of state-of-the-art MFR techniques. We focus on DL-based methods due to their superior performance in real-world scenarios, discussing key architectures, feature extraction strategies, datasets, and evaluation metrics. This paper also introduces a structured methodology for selecting and reviewing relevant works, ensuring transparency and reproducibility. As a contribution, we present a detailed taxonomy of MFR approaches, highlight current challenges, and suggest potential future research directions. This survey serves as a valuable resource for researchers and practitioners seeking to advance the field of robust facial recognition in masked conditions.

1. Introduction

Face recognition (FR) systems typically analyze major facial features like the eyes, nose, and mouth on unobstructed faces. Various events and circumstances require people to wear masks that partially conceal or obscure their features [1,2]. Common scenarios encompass pandemics, laboratories, medical procedures, and excessive pollution. The World Health Organization (WHO) and Centers for Disease Control and Prevention (CDC) recommend wearing face masks and implementing social distancing as the most effective ways to guard against and prevent the transmission of the COVID-19 virus [3]. Countries worldwide have mandated the wearing of protective face masks in public areas, prompting a need to study and understand the performance of FR systems on masked faces. Implementing these safety requirements poses a significant challenge to current security and authentication systems that depend on facial recognition technology. Recent algorithms have focused on detecting whether a face is occluded, specifically in the context of masked face detection. While preserving lives is important, there is a pressing need to verify individuals wearing masks without requiring them to remove the masks. At sites such as premises access control and immigration checkpoints, individuals present themselves to a camera, which poses a challenge for FR due to obscured facial features that are crucial for detection and identification [4].
Numerous organizations have created and implemented the required databases internally for facial recognition to authenticate or identify individuals [5]. Facial authentication, often called one-to-one matching, is the process of confirming a person’s identification by verifying if they are who they claim to be. During the process of secure authentication, a person’s facial picture is collected in order to create a biometric template. This template is then compared to an already existing facial signature [6]. Facial identification, often called one-to-many matching, is a biometric recognition method where an individual is identified by comparing their unique facial pattern with a vast database of known faces [6]. Obscured faces make it difficult to reliably identify participants, which compromises the reliability of existing datasets and renders in-house facial recognition systems unusable.
The National Institute of Standards and Technology (NIST) has reported on the effectiveness of a new set of FR algorithms that were developed and optimized after the COVID-19 pandemic [7]. The study [8] follows their initial research on algorithms established before the pandemic. They determined that most recognition algorithms assessed post-pandemic still exhibit a decrease in performance when faces are covered with masks. Moreover, identification accuracy decreases further when both the enrollment and verification images are concealed. This necessitates addressing authentication issues by implementing stronger and more dependable facial recognition technologies in various scenarios. For instance, critical facial recognition deployments, such as screening individuals at immigration checkpoints, are left vulnerable. As a result of the coronavirus pandemic, top biometric technology companies have had to modify their algorithms to enhance the accuracy of facial recognition systems for those wearing masks [9].
Recently, deep learning (DL) systems have achieved significant advancements in both theoretical development and practical implementation [10]. Most FR systems now rely on DL models, and MFR has emerged as a leading research area in computer vision. Research is being conducted on how DL can enhance the effectiveness of recognition systems in the presence of masks or occlusions. The task of occluded FR (OFR) has garnered significant interest, with various DL methods proposed, such as sparse representations, autoencoders, video-based object tracking, bidirectional deep networks, and dictionary learning [11,12].
The issue of occluded face images, such as those with masks, remains unresolved despite its importance in recognition systems. Several issues are currently being thoroughly examined and analyzed, including high computational costs, resilience to image alterations and occlusions, and the development of distinctive representations of obscured faces [12]. Utilizing DL architectures and algorithms effectively is crucial for achieving practical face detection and identification systems. FR on obscured images will remain an open challenge for the foreseeable future, with a growing number of research studies focusing on MFR and OFR. Additional implementations will continue to be improved to monitor the real-time mobility of individuals wearing masks. In recent years, there has been a significant increase in the volume of research conducted in the field of MFR [13].
MFR and OFR have been used in several applications, including secure authentication at checkpoints and monitoring individuals wearing face masks [4,5]. The algorithms, structures, models, datasets, and technologies proposed in research for handling occluded or masked faces lack a unified mainstream framework for development and evaluation. The variety of DL methods for identifying individuals wearing masks is advantageous, although there is a need to assess the impact of these technologies in this area.
We conducted this review study to provide a comprehensive resource for anyone interested in the task of MFR and OFR, considering the significant achievements and similar obstacles. This paper examines the latest advancements in FR systems that are based on DL. The primary contributions of this work are as follows:
  • We provide a unique taxonomy of MFR methods, organized by deep learning architecture types (CNN, GAN, Transformer), which is not explicitly presented in prior surveys.
  • A comprehensive comparative table is included, summarizing state-of-the-art MFR models based on architecture, dataset, performance, and special characteristics.
  • We offer a new visual framework of the review methodology, outlining the selection criteria, filtering stages, and thematic categorization of 167 papers from 2001 to 2024.
  • This review highlights underexplored areas such as masked face recognition under real-world unconstrained conditions, and it provides forward-looking research directions.
  • Unlike prior reviews, this study integrates both OFR and MFR methods, bridging two closely related but separately treated areas.
Following this introduction, this paper is structured as outlined below: Section 2 introduces the study scope and provides statistics on existing works. Section 3 outlines the common pipeline of MFR/OFR found in the literature. Section 4 presents methods for mask detection in MFR. Section 5 covers face unmasking methods. Section 6 presents the most advanced and current techniques for MFR/OFR. Section 7 provides examples of common datasets that can be used to test MFR algorithms. Section 8 outlines the measures that are frequently employed in the literature to assess MFR/OFR algorithms. Section 9 discusses the primary obstacles and potential future paths in this area, providing valuable perspectives to stimulate future investigations. Ultimately, this study is concluded in Section 10.

2. Study Scope and Relevant Research

FR, which has been thoroughly researched in the realm of computer vision [14], is a crucial task. Facial features are more effective at identifying a person’s identity than other biometric methods like iris scans and fingerprints [15,16]. Many recognition systems have utilized FR features for forensic and security purposes. FR algorithms are hindered by face disturbances such as occlusions, changes in illumination, and variations in facial expressions [17]. The classic methods of FR are challenged by complex and obscured faces in the task of MFR. They must be adapted in order to develop efficient representations for masked faces.
Research efforts in the field of MFR/OFR have risen significantly since the COVID-19 pandemic, leading to advancements in existing FR and OFR approaches and significantly improved accuracy outcomes [18]. DL methods are being used more frequently to address the difficulties of MFR/OFR. The significant advancements in research related to OFR/MFR are emphasized here. Previous efforts on OFR focused on general objects that obscure significant facial features, including scarves, hairstyles, spectacles, and face masks.
Many comprehensive studies of FR, MFR, and OFR have been published in the past few years. These investigations have established standard algorithmic processes and emphasized numerous significant obstacles and avenues for research [1,2,10,12,15,16]. They concentrated on standard techniques and DL algorithms designed to identify faces even when partially covered. Furthermore, surveys on OFR have concentrated on specific concerns, difficulties, and technologies. The efficacy of FR algorithms was assessed in [8]. The authors evaluated algorithms that were in place before the pandemic and adjusted them to handle faces that were obscured or not fully visible. They demonstrated that these algorithms continue to function yet underperform relative to the desired level. The study [8], which is a quantitative study, focuses on evaluating the accuracy of FR algorithms. The study utilized two datasets from U.S. federal applications, including border crossing images. Studies related to the field are presented in the following.
The authors of [19] conducted a comprehensive study of facial expression analysis algorithms focusing on faces that are partially covered. Face masks were solely utilized in this survey as instances of items that pose a challenge to the facial expression identification system. DL methods were one of six techniques assessed in the context of partial occlusion.
In 2019 [20], the authors identified three problems that impact FR systems: face occlusion, facial expressions, and dataset variances. They categorized the most advanced methods into comprehensive and component-oriented methodologies. The significance of many datasets and contests in addressing these difficulties has also been deliberated.
In 2021 [21], the challenge of recognizing faces under occlusions was addressed, acknowledging its long-standing status as a hurdle for both FR systems and human perception. Despite its significance, research on occluded FR has received less attention compared to other challenges like pose variation or different expressions. Nonetheless, effectively recognizing occluded faces is crucial for leveraging the full potential of FR in real-world applications. This study focused on providing a systematic categorization of methods tailored to occluded FR, outlining both new approaches and those rooted in established methodologies. Initially, the authors delved into the various types of occlusion problems and the inherent difficulties they present. They then examined how existing FR methods tackle occlusion, classifying them into three distinct categories: occlusion-robust feature extraction approaches, occlusion-aware FR approaches, and occlusion recovery-based FR approaches. Additionally, the study analyzed the motivations, innovations, advantages, and limitations of representative approaches for comparison purposes.
In 2023 [22], the authors discussed the evolution of FR systems from feature-based approaches to modern DL techniques, highlighting their extensive use in various applications and the privacy concerns they raise due to mass surveillance and data accumulation. They emphasized the importance of protecting individuals’ biometric data as a fundamental human right and presented solutions to privacy challenges, categorizing them into post-presentation-level methods controlled by system design or operators and presentation-level methods where data subjects control privacy protection using wearable devices, patches, masks, etc. The paper provided a comprehensive review, assessment, and outlook on presentation-level facial privacy protection techniques, identifying challenges and opportunities for future research.
In 2023 [12], the study focused on occluded person re-identification (Re-ID), aiming to tackle the challenge of identifying individuals across multiple cameras when they are partially occluded. With the rise of DL and the growing need for intelligent video surveillance, occluded person Re-ID has garnered significant attention from researchers due to its prevalence in real-world scenarios. This systematic study addressed the lack of comprehensive studies on occluded person Re-ID methods and categorized existing approaches based on the specific issues they tackle, such as position misalignment, scale misalignment, noisy information, and missing information. Additionally, the paper evaluated the performance of recent occluded person Re-ID methods on popular datasets and provided insights into potential future research directions in the field.
In 2023 [23], the study provided an extensive examination of periocular biometrics, highlighting its significance for human recognition, particularly in scenarios such as the COVID-19 pandemic where faces are often masked. It delved into various aspects of periocular biometrics, including anatomical cues for recognition, feature extraction and matching techniques, recognition across different spectra, fusion with other biometric modalities, utilization on mobile devices, and applicability in diverse contexts. Additionally, the paper discussed periocular datasets and competitions aimed at evaluating the effectiveness of this biometric modality. It concluded by outlining challenges and suggesting future research directions in the field of periocular biometrics.
While several prior reviews have explored face recognition or facial analysis under occlusion, most either focus broadly on occlusion types (e.g., sunglasses, hands, or hats) or do not emphasize the technical evolution specific to MFR. Unlike these studies, our review systematically examines recent MFR methods developed in response to the COVID-19 pandemic and other real-world scenarios, offering a fine-grained taxonomy that distinguishes between preprocessing-based, architecture-based, and loss-based approaches. In addition, we provide a comparative analysis of key methods, including network design choices, evaluation on standard datasets, and observed performance trends. These distinctions aim to give both new and experienced researchers a comprehensive, focused, and up-to-date overview of the MFR landscape. The flowchart in Figure 1 outlines the systematic steps followed, from initial paper collection to the final categorization based on architectural type and thematic focus.

3. MFR Framework

This section explains the development of MFR systems through a series of phases. Figure 2 illustrates these phases in MFR. The technology relies on DL models to identify distinguishing traits of obscured faces. This pipeline demonstrates many essential phases involved in creating the final recognition system, which will be detailed in the following sections.
At the beginning of the MFR framework, a set of masked images together with their associated ground-truth images is gathered. These are typically organized into separate directories for training, validating, and testing the model architecture. Preprocessing tasks are then carried out; common examples include data augmentation and image segmentation [24,25,26]. Next, specific facial characteristics are identified using DL models that are typically pretrained on common images and fine-tuned on a new dataset, such as masked faces [27,28]. The extracted traits should be sufficiently discriminative to reliably recognize faces despite the facial covering. A face unmasking process is then applied to reveal the masked face and provide an approximation of the original face [29,30]. The predicted face is compared with the original ground-truth faces to determine whether a specific individual is recognized or authenticated. Finally, the DL model for MFR can be evaluated using standard evaluation metrics.
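To make the flow concrete, the sketch below strings these phases together in Python. Every function body is a simplified placeholder (the normalization, the 128-dimensional embedding, and the L2 gallery search are illustrative assumptions), standing in for the DL components detailed in the following sections.

```python
import numpy as np

def preprocess(img: np.ndarray) -> np.ndarray:
    """Placeholder for the preprocessing phase; data augmentation and
    segmentation would also happen here. We only normalize intensities."""
    return (img - img.mean()) / (img.std() + 1e-8)

def extract_features(img: np.ndarray) -> np.ndarray:
    """Stand-in for a DL feature extractor pretrained on common images
    and fine-tuned on masked faces; returns a fixed-length embedding."""
    return img.reshape(-1)[:128]

def identify(probe: np.ndarray, gallery: list) -> int:
    """Recognition step: return the index of the enrolled ground-truth
    embedding closest to the probe (L2 distance)."""
    return int(np.argmin([np.linalg.norm(probe - g) for g in gallery]))

# Evaluation would then compare predicted indices against true labels
# using the metrics discussed in Section 8.
```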

3.1. Preprocessing of Images

Preprocessing images plays a crucial role in enhancing the performance and robustness of FR DL models [31]. By carefully preprocessing images before feeding them into the model, various issues such as illumination variations, pose variations, facial expressions, and noise can be effectively mitigated. Techniques like normalization, resizing, and cropping ensure that the input images are standardized and aligned, facilitating easier feature extraction by the neural network [32]. Furthermore, preprocessing steps like histogram equalization or contrast adjustment can enhance the overall quality of images, making subtle facial features more discernible [33,34]. Addressing these preprocessing steps helps to improve the accuracy and reliability of FR systems, enabling them to perform well across diverse conditions and environments. Thus, proper preprocessing of images is essential for optimizing the performance of FR DL models and ensuring their practical viability in real-world applications.
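As a concrete illustration of these steps, the snippet below applies grayscale conversion, histogram equalization, resizing, and normalization with OpenCV; the 112 × 112 target size is an assumption, chosen because it is common in FR pipelines.

```python
import cv2
import numpy as np

def preprocess_face(path: str, size=(112, 112)) -> np.ndarray:
    # Load as grayscale so histogram equalization applies directly.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Histogram equalization makes subtle facial features more discernible.
    img = cv2.equalizeHist(img)
    # Standardize input dimensions for the network.
    img = cv2.resize(img, size, interpolation=cv2.INTER_AREA)
    # Scale pixel values to [0, 1] for numerically stable training.
    return img.astype(np.float32) / 255.0
```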
The effectiveness of facial recognition systems, whether masks are worn or not, depends significantly on the characteristics of the face images utilized during training, validation, and testing. There are limited publicly accessible datasets including pairs of facial images with and without mask items used to effectively train the MFR system in a progressive way. Thus, it is necessary to enhance the test environment by incorporating more synthetic images featuring different face masks types, as well as enhancing the ability of DL models to generalize [4,5,35,36].
Methods have been developed for producing face masks over the years [37,38]. The study [39], addressed the challenge posed by the widespread use of face masks during the COVID-19 pandemic on the accuracy of facial recognition systems. It introduced a methodology to augment existing facial datasets with masked faces, enabling recognition with low false positive rates and high overall accuracy. The authors presented an open-source tool, MaskTheFace, to effectively mask faces and create a large dataset of masked faces for training facial recognition systems. They reported a significant increase in the true positive rate for the system and demonstrated similar accuracy on a custom real-world dataset after retraining the system with the augmented dataset. MaskedFace-Net was presented in [40]. It addressed the need for efficient recognition systems to detect individuals wearing face masks in regulated areas for COVID-19 prevention. It introduced three types of masked face detection datasets: Correctly Masked Face Dataset (CMFD), Incorrectly Masked Face Dataset (IMFD), and their combination for global masked face detection (MaskedFace-Net). These datasets aimed to classify faces based on whether they are masked or not and whether the masks are correctly or incorrectly worn. Additionally, the study [40] presented an image editing approach and a mask-to-face deformable model for generating realistic masked face images, contributing to the creation of a large dataset (137,016 images) available for research purposes. The proposed datasets offer a granular classification for mask-wearing analysis, which is not available in existing large masked face datasets.
In 2020 [41], authors investigated the efficacy of face de-identification algorithms, or “masking,” in preserving anonymity in video recordings of individuals, particularly in public spaces. Eight de-identification algorithms were evaluated for their ability to obscure driver identities in low-resolution videos, with humans tested on their recognition of active drivers. The results showed that most masks significantly reduced identification performance immediately after learning, with two masks maintaining effectiveness even after a delay of 7 or 28 days. The participants exhibited stringent decision criteria and low confidence in recognition, indicating the masks’ effectiveness. Additionally, a Deep Convolutional Neural Network (DCNN) was tested on identity-matching tasks, revealing insights into the information encoded in faces and highlighting the importance of evaluating masking techniques for both human and machine perception.
In the study [42], authors introduced a novel approach for image-to-image translation tasks, where the mapping between input and output images was learned without paired training data. By employing adversarial loss, the model aimed to generate images from the source domain that are indistinguishable from the target domain. To address the under-constrained nature of this mapping, the authors introduced inverse mapping and incorporated cycle consistency loss, ensuring that the reconstructed images were close to the originals. The proposed method (CYCLE-GAN) was evaluated on various tasks, including collection style transfer, object transfiguration, and season transfer, demonstrating superior performance compared to prior methods based on both qualitative and quantitative analyses.
Authors of [43] aimed to address the challenge of MFR during the COVID-19 pandemic, given the poor generalization of existing FR models in this scenario. Two main challenges were identified: the lack of large-scale training and testing data and the significant intra-class variation between masked and full faces. To tackle these challenges, the paper introduced a new dataset, MFSR, comprising over 9700 masked face images with segmentation annotations and more than 11,600 images of 1004 identities, encompassing various orientations, lighting conditions, and mask types. Additionally, a novel Identity-Aware Mask GAN (IAMGAN) was proposed to generate synthetic masked face images from full face images, and Domain Constrained Ranking (DCR) loss was introduced to address intra-class variation.
In 2018 [44], authors addressed the challenge of scalability and robustness in image-to-image translation across multiple domains by introducing StarGAN, a novel and scalable approach. Unlike existing methods, which require separate models for each pair of image domains, StarGAN utilizes a unified model architecture to handle multiple domains within a single network. This approach enables simultaneous training of multiple datasets with different domains, resulting in superior quality translated images compared to existing models. Additionally, StarGAN offers the capability to flexibly translate input images to any desired target domain.
Data augmentation plays a pivotal role in DL methods by significantly enhancing model performance and generalization capabilities [24,25,26]. This is achieved by artificially expanding the size and diversity of the training dataset through techniques such as rotation, translation, scaling, and flipping. Data augmentation helps mitigate overfitting and improves the robustness of the model to variations in input data. Moreover, it helps the model learn invariant features by presenting it with diverse instances of the same class, thereby enabling it to better capture the underlying patterns in the data. Additionally, data augmentation is particularly valuable in scenarios where labeled data is scarce or expensive to acquire, as it allows for the creation of a larger and more representative training dataset without the need for additional manual labeling efforts. Data augmentation is a fundamental component of DL methods, contributing significantly to their effectiveness in various tasks across different domains. Furthermore, image enhancement can be achieved by adjusting sharpness, with the Laplacian variance [45] being a commonly used measure.
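A hedged example of these augmentation operations using torchvision is shown below; the specific ranges (10° rotation, 5% translation, ±10% scaling) are illustrative choices, not values prescribed by the cited works.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),          # flipping
    transforms.RandomAffine(degrees=10,              # rotation
                            translate=(0.05, 0.05),  # translation
                            scale=(0.9, 1.1)),       # scaling
    transforms.ToTensor(),
])
# Applying `augment` to each PIL training image yields a new random
# variant every epoch, artificially expanding the dataset.
```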

3.2. Deep Learning Models

Several established techniques have been suggested and tested for identifying human faces using manually designed local or global characteristics, including LBP [46], SIFT [47], and Gabor features [48]. Yet, these methods struggle to handle unanticipated facial alterations that differ from their original assumptions [17,20]. Subsequently, shallow image representations such as learning-based dictionary descriptors were introduced to address the uniqueness and compactness issues of earlier techniques. Despite achieving accuracy improvements, these shallow representations still exhibit low robustness in real-world applications and instability against facial appearance fluctuations.
DL has proven exceptionally effective in the field of image processing, particularly in tasks like FR, due to its ability to automatically learn hierarchical representations of data directly from raw input [49,50]. In FR, DL models can efficiently extract intricate features from facial images, capturing complex patterns and variations in facial characteristics such as shape, texture, and color. Convolutional Neural Networks (CNNs), a popular class of DL architectures, excel in processing spatial data like images by employing layers of learnable filters that automatically detect and hierarchically combine features at different levels of abstraction. This hierarchical feature learning enables DL models to achieve remarkable performance in FR tasks, even under challenging conditions such as variations in pose, lighting, expression, and occlusions. Moreover, the scalability and adaptability of DL models allow them to continuously improve with larger datasets and fine-tuning, further enhancing their accuracy and robustness in FR applications. Additionally, advancements in DL techniques, such as attention mechanisms and adversarial training, have further boosted the performance of FR systems, making them increasingly reliable and practical for real-world deployments. Overall, DL’s ability to automatically learn and extract discriminative features from images, coupled with its scalability and adaptability, makes it a highly effective and powerful tool for FR and image processing tasks [50].
In the following subsections, prevalent DL models utilized for MFR are illustrated.

3.2.1. CNN

CNNs are a foundational architecture in deep learning, known for their success in image classification, object detection, and facial recognition tasks [50,51]. In the context of MFR, CNNs are especially valuable due to their ability to learn the spatial hierarchies of features that can remain robust even when parts of the face are occluded by masks.
CNNs consist of convolutional, pooling, and fully connected layers that extract features from input images through progressively abstract representations [52,53,54,55]. In MFR, these features help isolate discriminative facial cues from the upper face regions (e.g., eyes, eyebrows, forehead), which remain visible despite occlusion.
Several CNN-based architectures have been adapted for MFR tasks. For instance, ResNet [56,57,58,59] has been frequently used due to its residual connections, which help train deep models that can generalize across mask variations. Lightweight models such as MobileNet [60,61] are preferred in real-time or mobile-based MFR systems due to their low computational cost. Additionally, Xception [62] and VGGNet [58] have been utilized in combination with face segmentation or occlusion-aware training strategies to boost recognition performance under partial visibility. DenseNet [63] has also been applied in MFR due to its dense connectivity pattern, which promotes feature reuse and strengthens gradient flow, enabling efficient learning from partially occluded facial features.
Recent MFR solutions fine-tune CNNs on masked datasets or incorporate attention mechanisms to focus on unoccluded facial regions. Moreover, CNN-based models are often used as backbone feature extractors in hybrid MFR systems that integrate multi-modal inputs or downstream modules such as GANs for unmasking. Table 1 outlines the key features of prevalent CNN-based models employed in the MFR field.
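As a minimal sketch of the fine-tuning strategy just described, the snippet below freezes an ImageNet-pretrained ResNet-50 backbone and replaces its classification head for masked-face identities; the number of identities (`num_ids`) is a hypothetical placeholder.

```python
import torch.nn as nn
from torchvision import models

num_ids = 1000  # identities in the masked-face training set (assumed)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Freeze the backbone so generic low-level features are kept.
for p in model.parameters():
    p.requires_grad = False
# Replace the classification head to output identity logits.
model.fc = nn.Linear(model.fc.in_features, num_ids)
# Only the new head (and optionally the later blocks) is then trained
# on masked-face images.
```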

3.2.2. Autoencoders

Autoencoders are unsupervised neural networks designed to learn compact representations of input data by encoding it into a lower-dimensional latent space and reconstructing the original data from this representation [64,65]. In MFR, this ability is valuable for recovering obscured facial features, enhancing robustness to occlusion.
In particular, image denoising, one of the prominent applications of autoencoders, becomes relevant when face masks introduce structured “noise” or occlusion. By training autoencoders on paired masked and unmasked images, models can learn to infer and reconstruct key facial structures, improving recognition accuracy under occlusion [66,67]. Similarly, image compression capabilities support efficient feature extraction for MFR pipelines with limited computational resources or in edge devices [68].
Recent autoencoder variants have been directly applied to MFR and occluded face recognition (OFR). For example:
  • LSTM-autoencoders [69] capture temporal dependencies, which is useful in video-based MFR, enabling models to learn from sequences of partially occluded frames.
  • DC-SSDA [70], or Double Channel Stacked Denoising Autoencoders, enhance feature robustness by learning from both clean and noisy inputs; useful for handling diverse mask types and positions.
  • De-corrupt autoencoders [71] are tailored to restore occluded regions of the face, such as those covered by masks or hands, making them effective in recovering key facial features lost due to occlusion.
  • 3D landmark-based VAEs [72] generate plausible 3D face structures from partial inputs, offering a path forward for reconstructing occluded geometry in MFR systems, particularly under varying head poses.
These variants highlight the adaptability of autoencoders in enhancing masked face recognition by reconstructing occluded regions, improving feature representation, and enabling recovery of spatial consistency in masked conditions.
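A compact sketch of such a denoising autoencoder is given below, assuming masked images act as the corrupted input and paired unmasked images as the reconstruction target; the layer sizes are illustrative.

```python
import torch.nn as nn

class MaskDenoisingAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # compress to a latent feature map
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # reconstruct the unmasked face
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1,
                               output_padding=1), nn.Sigmoid(),
        )

    def forward(self, masked):
        return self.decoder(self.encoder(masked))

# Training would minimize a pixel loss, e.g. nn.MSELoss(), between the
# reconstruction and the paired unmasked ground-truth image.
```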

3.2.3. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [73] are a deep learning framework consisting of two competing neural networks: a generator that produces synthetic data and a discriminator that evaluates its authenticity. Through adversarial training, GANs can generate highly realistic data samples that mimic complex distributions.
In the context of masked face recognition (MFR), GANs are particularly valuable for data augmentation. Training MFR models often suffers from a lack of diverse, annotated masked face images. GANs can generate synthetic masked face samples with varied mask styles, positions, facial expressions, and lighting conditions, thereby increasing dataset diversity without manual labeling [74]. This improves generalization and reduces overfitting, especially in cross-domain or in-the-wild settings.
Beyond simple augmentation, conditional GANs can generate identity-preserving masked or unmasked face variants, enabling the training of models that are robust to occlusion while maintaining discriminative identity features. Some works also use GANs for mask removal or face de-occlusion, where the generator learns to hallucinate or recover the unmasked face, aiding downstream recognition tasks [75,76]. Furthermore, GANs facilitate the simulation of realistic operational conditions, such as extreme poses, partial occlusions, or low-light environments, all factors that challenge traditional MFR models [77,78]. By enriching the training data with such variations, GAN-enhanced pipelines can better adapt to real-time surveillance and cross-domain deployment [77,78].
GANs serve as a core technology in MFR for both training data enhancement and preprocessing, helping to bridge the gap between constrained datasets and unconstrained, real-world recognition requirements.
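The toy generator/discriminator pair below illustrates the adversarial setup in its simplest form; the fully connected architectures and the 32 × 32 patch size are placeholders, not any published MFR model.

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 32 * 32), nn.Tanh(),  # synthetic RGB patch
        )
    def forward(self, z):
        return self.net(z).view(-1, 3, 32, 32)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # real/fake logit
        )
    def forward(self, x):
        return self.net(x)

# In each adversarial step, the discriminator learns to separate real
# masked faces from generated samples (nn.BCEWithLogitsLoss), while the
# generator is updated to fool it, yielding increasingly realistic
# synthetic masked faces for augmentation.
```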

3.2.4. Deep Belief Network

Deep Belief Networks (DBNs) [79] are generative deep learning models composed of stacked Restricted Boltzmann Machines (RBMs), capable of learning hierarchical and abstract representations of complex data. Their layered architecture enables DBNs to capture structural patterns even when parts of the input are missing or occluded.
In MFR, DBNs have shown promise for learning robust facial representations despite the presence of occlusions such as surgical or cloth masks. By training on masked face datasets, DBNs can automatically extract multi-level features that correspond to both visible and partially obscured facial regions [80]. This is particularly useful for MFR, where lower facial features are often hidden and traditional shallow models may struggle to generalize.
DBNs are also resilient to variations in mask type, position, and orientation, as their hierarchical learning structure allows them to focus on stable features across training examples. In practice, DBN-extracted features can be fed into a classifier, such as a softmax layer, a support vector machine, or another neural network, to enable reliable identity prediction from masked face inputs [81,82]. Additionally, DBNs can be integrated with autoencoders or other architectures in hybrid models, allowing for improved denoising or reconstruction of occluded regions. These capabilities make DBNs a useful component in MFR systems, especially under conditions where labeled data is limited or input images are partially degraded.
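The DBN pattern of stacked RBMs feeding a classifier can be approximated with scikit-learn, as sketched below; the layer widths and learning rates are assumptions, and `X`/`y` stand for flattened masked-face images scaled to [0, 1] and their identity labels.

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Greedy layer-wise feature learning: each RBM is fitted on the
# hidden activations of the previous one, then a softmax-style
# classifier predicts identities from the top-level features.
dbn_like = Pipeline([
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20)),
    ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Usage (with data prepared as described above):
# dbn_like.fit(X_train, y_train); preds = dbn_like.predict(X_test)
```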

3.3. Extraction of Features

Feature extraction is an essential stage in the FR process that focuses on obtaining a group of distinctive features capable of representing and capturing important facial characteristics including eyes, mouth, nose, and texture [4,5]. Face occlusions and masks complicate the procedure, requiring existing FR systems to be adjusted to extract reliable facial data. When it comes to MFR, feature extraction methods could be categorized as shallow and deep representation approaches [83,84].
Shallow feature extraction refers to the process of extracting low-level features from input data using simple and often handcrafted methods [84,85]. In the context of MFR, shallow feature extraction techniques aim to capture discriminative information from facial images, particularly when faces are partially obscured by masks. These techniques typically operate on the raw pixel values of the input images and extract basic visual cues, such as edges, textures, and color distributions, which can be indicative of identity even when parts of the face are occluded. One common approach to shallow feature extraction in MFR is to use traditional image processing techniques, such as histogram of oriented gradients (HOG) [86], Local Binary Patterns (LBP) [46], and Haar-like features [87]. These methods analyze the spatial distribution of pixel intensities or texture patterns within the facial region to extract distinctive features that are robust to variations caused by masks. By focusing on low-level image characteristics, shallow feature extraction methods can effectively capture facial information even in the presence of occlusions, making them suitable for mask FR tasks. Additionally, shallow feature extraction techniques can be combined with machine learning algorithms, such as support vector machines (SVMs) or k-nearest neighbors (KNN), to build classification models for mask FR. These models learn to classify faces based on the extracted shallow features, enabling them to recognize individuals even when their faces are partially obscured by masks. Moreover, shallow feature extraction methods are computationally efficient and require relatively small amounts of training data, making them suitable for applications where real-time processing and limited training samples are common [88,89,90].
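A brief sketch of the HOG + SVM combination described above is given below, using scikit-image and scikit-learn; the cell and block sizes are illustrative defaults rather than values from the cited studies.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(gray_image: np.ndarray) -> np.ndarray:
    # Encode the spatial distribution of gradient orientations: a
    # low-level cue that remains informative when the lower face
    # is occluded by a mask.
    return hog(gray_image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Usage sketch (train_images and identities are assumed to exist):
# X = np.stack([hog_features(img) for img in train_images])
# clf = SVC(kernel="rbf").fit(X, identities)
```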
While shallow feature extraction techniques may not capture as much semantic information as DL approaches, they offer simplicity, interpretability, and efficiency, making them valuable tools for MFR, particularly in scenarios with limited computational resources or data availability. Additionally, shallow feature extraction methods can complement DL architectures by providing robust and discriminative features that enhance the performance of MFR systems. Finally, shallow feature extraction plays a crucial role in addressing the challenges of recognizing faces obscured by masks and contributes to the development of effective and reliable MFR solutions.
Many methods have been proposed based on DL in order to extract features [27,91,92,93,94]. Deep graph convolutional networks (GCNs) have been used for graph-based image representations in masked face detection, reconstruction, and identification [95,96]. GCNs have demonstrated strong proficiency in learning and processing facial images through spatial or spectral filters designed for a common or fixed graph structure. Yet, learning graph representations is sometimes limited by the number of GCN layers and by high computational complexity. Researchers have also explored 3D spatial characteristics for the purpose of recognizing obscured or masked 3D faces. Three-dimensional FR algorithms simulate the true vision and understanding of human facial characteristics, potentially enhancing the effectiveness of current 2D recognition systems. Three-dimensional facial characteristics are resistant to many alterations in the face, including changes in lighting, facial expressions, and face orientation.

4. Mask Detection

In MFR, mask detection methods are employed to identify whether an individual is wearing a mask in a given facial image. These methods play a crucial role in preprocessing and enhancing the performance of FR systems, particularly in scenarios where the presence or absence of masks significantly affects the accuracy of identity recognition [97,98]. Several approaches and techniques are commonly used for mask detection in mask FR. Traditional computer vision techniques, such as image segmentation, edge detection, and template matching, can be applied to detect the presence of masks in facial images. These methods analyze visual cues such as color, texture, and shape to identify regions of the image that correspond to masks [38]. DL, particularly CNNs, has shown promise for mask detection tasks [99,100]. CNN architectures can be trained on labeled datasets containing images of individuals wearing and not wearing masks to learn discriminative features for mask detection. In addition, hybrid approaches combine traditional computer vision techniques with DL methods to improve mask detection accuracy. For example, a CNN may be used to detect faces in an image, followed by traditional image processing techniques to localize and classify regions corresponding to masks [101,102,103].
Pre-trained models trained on large-scale datasets, such as ImageNet, can be fine-tuned or used as feature extractors for mask detection tasks. Transfer learning techniques enable the adaptation of existing models to the specific task of mask detection, even when labeled data is limited. In addition, ensemble methods combine multiple mask detection models to improve overall performance and robustness [104]. By aggregating predictions from diverse models, ensemble methods can mitigate the weaknesses of individual models and achieve better generalization on unseen data. Real-time mask detection techniques leverage efficient algorithms and architectures to achieve fast and accurate detection of masks in live video streams or camera feeds [105]. These methods are essential for applications requiring immediate response, such as access control systems and public safety monitoring [103,106,107].
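The transfer-learning recipe described here can be sketched as follows: an ImageNet-pretrained MobileNetV2 is frozen and given a new two-class (mask/no-mask) head. This is an illustrative setup, not a specific published detector.

```python
import torch.nn as nn
from torchvision import models

detector = models.mobilenet_v2(
    weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
# Reuse ImageNet features; only the new head will be trained.
for p in detector.features.parameters():
    p.requires_grad = False
# Replace the final linear layer with a mask / no-mask classifier.
detector.classifier[1] = nn.Linear(detector.last_channel, 2)
# Fine-tune on labeled masked/unmasked face crops with
# nn.CrossEntropyLoss(), even when labeled data is limited.
```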
Among these methods, we can find R-CNN, which stands for Regions with CNN features and has been widely used in the field of object detection [108]. It involves using a deep ConvNet to classify object proposals. R-CNN processes occluded faces by extracting numerous facial regions using a CNN and a selective search technique, resulting in a feature vector for each region. A support vector machine (SVM) then classifies the existence of an object within the proposed facial region based on the retrieved feature. Fast R-CNN [109] and Faster R-CNN [110] were developed to improve performance by modifying the R-CNN architecture. Yet, these systems have significant disadvantages, including the fact that the training process is a multi-stage pipeline, making it costly in terms of both space and time. Additionally, R-CNN conducts a ConvNet forward pass for each object proposal individually, without reusing computation. In [111], the authors introduced a context-attention R-CNN for detecting individuals wearing face masks. This framework aims to decrease the distance between items within the same class while increasing the distance between items from different classes by extracting unique features. In addition, the authors of [112] presented an enhanced YOLOv7 model for mask-wearing detection, addressing challenges such as identifying small targets and achieving high accuracy amid the COVID-19 pandemic. The approach involves augmenting the dataset using a GAN, integrating the Convolutional Block Attention Module (CBAM) to enhance small-target detection capabilities, and employing the Funnel Rectified Linear Unit (FReLU) as the activation function to improve overall performance. A simple method was proposed in [113] to identify masked faces. The technique successfully recognized faces in images or videos and determined the presence of masks, even in motion or within video footage, demonstrating excellent accuracy. An automatic face mask position recognition system is proposed in [114], leveraging a dataset of face mask images collected from 391 individuals and evaluating six pre-trained DL architectures. The research [115] identified challenges in mask detection, such as missed and false detections due to obscured face features and varying target scales. To address these challenges, the paper introduced MFMDet, a novel face mask detection model that employed a recursive feature pyramid and modulated deformable RoI pooling to enhance multi-scale feature representation and adapt to target variations effectively.
Overall, mask detection methods play a vital role in mask face recognition systems by accurately identifying the presence or absence of masks in facial images. By leveraging various approaches and techniques, these methods contribute to the development of reliable and effective mask FR solutions for a wide range of applications, including security, healthcare, and public safety.

5. Face Unmasking

In the literature, a multitude of methodologies have been proposed for object removal, a task that aligns closely with the face unmasking focus of this section. These methodologies encompass a diverse range of approaches, each offering unique strengths and limitations. Learning-based techniques leverage machine learning algorithms, such as CNNs, to automatically identify and remove objects from images. These methods often require large annotated datasets for training and may exhibit high computational complexity during inference. Conversely, non-learning-based algorithms rely on handcrafted features and heuristics to detect and inpaint objects, offering simplicity and efficiency but potentially limited adaptability to diverse scenarios. Additionally, hybrid approaches that combine learning-based and non-learning-based techniques have emerged, aiming to leverage the strengths of both paradigms. Despite the abundance of methodologies, selecting the most suitable approach depends on factors such as the specific application requirements, available computational resources, and desired trade-offs between accuracy and efficiency. Hence, exploring and understanding the diverse landscape of object removal techniques is crucial for advancing face unmasking research and addressing real-world challenges effectively.
The authors of [116] introduced a GAN model for learning-based methods. This model takes an input image and automatically eliminates the specified object. The study [117] presented two distinct models that aim to achieve global coherence by filling in missing areas in images through the removal of specific objects and subsequent reconstruction of the damaged sections using a GAN framework. The authors of [118] employed a coarse-to-fine GAN-based method to eliminate items from facial pictures. In [119], an Embedding Unmasking Model (EUM) for removing masks was introduced. The model utilizes a feature embedding obtained from the masked face as its input. It creates a distinct feature representation equivalent to an unmasked face embedding of the same person, possessing unique characteristics.
The study [77] presented a two-stage approach to address the removal of mask objects from facial images, involving mask object detection and image completion. Leveraging a GAN-based network with dual discriminators, the model automatically segmented the mask region and synthesized the affected region with fine details while maintaining the global coherence of the face structure. The study [120] investigated the potential vulnerabilities of FR systems equipped with mask detectors on large-scale masked faces, which posed risks of suspects evading identity detection. The authors proposed three main contributions: firstly, they analyzed the challenges of a naive Delaunay-based masking method (DM) for simulating facial mask wearing; secondly, they introduced an adversarial noise attack on DM, creating the AdvNoise-DM method, effective in fooling both FR and mask detection, albeit with less natural faces; thirdly, they enhanced AdvNoise-DM with adversarial filtering to generate more natural-looking faces while remaining undetected by state-of-the-art facial mask detectors, leading to significant performance deterioration of DL-based FR systems. The study [76] introduced a novel two-stage network designed to address the intricacies of unmasking faces, including concealed facial features such as mouths, noses, and chins. In the first stage, an autoencoder-based network was employed to perform binary segmentation of the face mask. A GAN-based network was then utilized in the second stage. However, it is worth noting that there is a paucity of linked datasets containing both masked and unmasked face images.
Figure 3 displays an example of revealing faces and reconstructing absent facial features.
Utilizing deep features for face matching in FR and MFR can be framed as a face identification or verification problem. To complete this task, a series of images of known subjects is first input into the system during the training and validation phases. During the testing phase, a new subject is introduced to the system for recognition. To efficiently learn a set of deep features or descriptors, it is essential to use an appropriate loss function. Two common matching approaches used in the MFR community are one-to-one (1:1) and one-to-many (1:N). Both methods often utilize common distance metrics such as Euclidean-based L2 and cosine distance [121]. Face verification utilizes 1:1 similarity matching to compare a test image with a ground-truth image collection to confirm whether they depict the same person. Face identification, on the other hand, uses 1:N similarity matching to determine the identity of a masked face.
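The two matching modes can be expressed directly in terms of cosine similarity, as in the sketch below; the verification threshold of 0.6 is an illustrative assumption.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verify(probe, reference, threshold=0.6):
    # 1:1 matching: same person if similarity clears the threshold.
    return cosine_sim(probe, reference) >= threshold

def identify(probe, gallery: dict):
    # 1:N matching: gallery maps identity -> enrolled embedding;
    # return the identity with the highest similarity to the probe.
    return max(gallery, key=lambda k: cosine_sim(probe, gallery[k]))
```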
Various methods have been developed to improve the distinctiveness of deep features in order to improve the accuracy and efficiency of face matching, such as metric learning and sparse representations. DL algorithms commonly utilize softmax loss-based and triplet loss-based models for matching face identities. Softmax loss-based models build a multi-class classifier with a softmax function for each identity in the training dataset. Triplet loss-based models learn embeddings that minimize intra-class distance and maximize inter-class distance. The performance of both softmax loss-based and triplet loss-based models is negatively impacted by face mask occlusions.
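For reference, the triplet objective can be written compactly as below, where the 0.2 margin is an illustrative choice; PyTorch's built-in nn.TripletMarginLoss implements the same idea.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-identity (anchor, positive) embeddings together and push
    # different-identity (anchor, negative) embeddings apart by >= margin.
    d_ap = F.pairwise_distance(anchor, positive)  # intra-class distance
    d_an = F.pairwise_distance(anchor, negative)  # inter-class distance
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```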

6. Review of Face Recognition Techniques

Previous research on FR with obscured parts, such as face masks, has been explored extensively in the literature. Studies have investigated various approaches to tackle the challenges posed by obscured facial features, including occlusion detection, feature extraction, and robust matching algorithms. These efforts have aimed to enhance the performance of FR systems in scenarios where facial attributes are partially concealed. Furthermore, recent advancements in DL and computer vision have paved the way for more sophisticated methods capable of handling masked faces with greater accuracy and efficiency. In this context, the scientific contributions related to the task of MFR have garnered significant attention. Researchers have proposed novel techniques and algorithms tailored specifically to address the unique complexities associated with identifying individuals wearing face masks. These contributions range from dataset creation and model development to evaluation methodologies, all aimed at improving the effectiveness of MFR systems in real-world applications. By reviewing these scientific advancements, we gain insights into the evolving landscape of facial recognition technology and its adaptation to the challenges posed by the widespread use of face masks.

6.1. Face Recognition in the Presence of Occlusions

In recent years, the proliferation of FR technology has underscored the need to address its limitations when faced with obscured facial features, such as masks, sunglasses, or partial obstructions. This section critically examines various methodologies, algorithms, and techniques proposed by researchers to enhance the robustness and accuracy of facial recognition systems amidst occlusions. By synthesizing insights from diverse studies, this review aims to elucidate the current state-of-the-art approaches.
In 2020 [122], the authors presented a face de-occlusion method for facial images in which the user selects the object to be removed. They created high-quality content without visual artifacts by combining vanilla and partial convolutions in the same network. Additionally, they addressed the issue of insufficient data by creating a comprehensive synthetic dataset of faces with occlusions, utilizing the publicly available CelebA and CelebA-HQ datasets. They determined that a model trained on a synthetic face-occluded dataset effectively eliminates non-face items and generates facial content that is structurally and perceptually realistic in difficult real images.
The authors of [123] presented an approach that is both computationally efficient and effective for feature extraction, depth calculation, and 3D image generation. The SIFT algorithm was employed to densely capture the facial features. Subsequently, the depth of the image was computed utilizing a multivariate Gaussian distribution. The shape was recovered by employing a shading-based technique that relies on the Lambertian reflectance rule, enabling the capture of fine details such as dimples and wrinkles.
The authors of [124] introduced an FR technique for occluded faces using a single comprehensive deep neural network named Face Recognition with Occlusion Masks (FROM). It is utilized for training precise feature masks, identifying faulty features through deep Convolutional Neural Networks (CNNs), and subsequently rectifying them utilizing dynamically acquired masks. Moreover, the authors effectively trained the network utilizing large, obscured facial images. They analyzed numerous datasets containing obscured or concealed faces, including LFW, MegaFace Challenge 1, RMF2, and AR.
Pairwise self-contrastive attention-aware (PSCA) models for extracting various local features were introduced in [125]. The proposed attention sparsity loss (ASL) aims to enhance sparse responses in attention maps, reducing attention on distracting areas and emphasizing discriminative facial features. The recognition performance was assessed on various datasets, such as LFW, VGGFace2, MS-Celeb-1M, and RMFRD.
In the study [126], the authors presented a perceptual hashing technique called the one-shot frequency dominating neighborhood structure (OSF-DNS). This approach showed improvements in obstructed face verification and face classification tasks. Matching obscured faces with their unobscured counterparts is beneficial for verifying obscured faces. Moreover, employing a classifier that has been trained using unobstructed faces and perceptual hash codes as feature vectors can aid in recognizing the identity of a face that is partially concealed.
In 2022 [127], the authors evaluated different approaches for both MFR and OFR, revealing conceptual similarities and suggesting future research directions. Through an analysis of occluded and general FR algorithms, they demonstrated the interoperability of MFR methods on OFR datasets, indicating the potential for effective deployment across both domains.
In 2022 [128], the authors introduced two dynamic feature subset selection (DFSS) methods aimed at improving recognition for occluded faces by minimizing the negative impact of confusing and low-quality features. Leveraging resilient algorithms, these methods dynamically adjust feature representation through exclusion or weight reduction, leading to enhanced recognition performance. Experimental validation was conducted on the AR database and the Extended Yale Face Database B.
In 2022 [129], the authors introduced a model for detecting occluded or masked faces using fused convolutional graphs, leveraging a deep neural architecture with spatial-based graphs capturing key facial features. Transfer learning was employed with a pre-trained deep architecture as a baseline, followed by discriminant graph convolutions based on the fusion of distance and correlation graphs.
In 2022 [130], the authors developed the Optimal Face Recognition Network (OPFaceNet), specifically designed to recognize face images affected by high noise and occlusion. By extracting feature patterns sensitive to noise and employing a Convolutional Neural Network (CNN) classifier, the proposed model achieved notable improvements. Notably, the CNN model was further enhanced by optimizing hyperparameters using the Fitness Sorted Rider Optimization Algorithm (FS-ROA).
In 2022 [131], various recognition tasks were examined within two realistic scenarios, both involving faces under significant occlusion. One scenario focused on recognizing facial expressions of individuals wearing virtual reality headsets, while the other aimed to estimate age and identify gender among individuals wearing surgical masks, with half of their faces obscured. CNNs trained solely on fully visible faces exhibited notably low performance in these challenging settings. However, fine-tuning these networks on occluded faces proved beneficial, and further gains were achieved through knowledge distillation from models trained on fully visible faces. Two distillation methods were explored: conventional teacher–student training and a novel triplet-loss-based formulation, with the latter generalizing well across models and tasks.
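To make the triplet-based distillation idea concrete, the following minimal PyTorch sketch illustrates the general technique under our own simplifying assumptions (batch-hard negative mining, one sample per identity per batch); it is not the exact loss formulation of [131].

```python
import torch
import torch.nn.functional as F

def triplet_distillation_loss(student_emb, teacher_emb, margin=0.2):
    """Triplet-style distillation: the student's embedding of an occluded face
    (anchor) should lie closer to the teacher's embedding of the same identity
    (positive) than to teacher embeddings of other identities (negatives).
    Assumes one sample per identity in the batch."""
    s = F.normalize(student_emb, dim=1)   # (B, D), computed on occluded faces
    t = F.normalize(teacher_emb, dim=1)   # (B, D), computed on visible faces
    pos_dist = (s - t).pow(2).sum(dim=1)  # squared distance to own teacher embedding
    dist = torch.cdist(s, t).pow(2)       # (B, B) squared pairwise distances
    dist.fill_diagonal_(float("inf"))     # exclude the positive pair
    neg_dist = dist.min(dim=1).values     # hardest in-batch negative
    return F.relu(pos_dist - neg_dist + margin).mean()

# Usage: loss = triplet_distillation_loss(student(x_masked), teacher(x_full).detach())
```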
In 2023 [132], the focus was on enhancing face detection and tracking technology for crime detection in surveillance systems. Recognizing the complexity involved, the research aimed to streamline the preprocessing layer and improve data quality. To address these challenges, a novel crow search-based recurrent neural scheme was developed to enhance prediction performance for occluded faces and improve classification results. Implemented using Python 3.13.2, the model was trained on the COFW dataset, leveraging crow search fitness to enhance prediction accuracy and classify individuals accurately.
In this study [11], a novel handcrafted feature descriptor, the Radial Derivative Gaussian Feature (RDGF), was proposed for disguised thermal FR. The feature encoding was designed to minimize the impact of noise and to perform effectively across challenging datasets. A cascaded framework was introduced, combining two modules, BoCNN and the RDGF descriptor, to estimate performance before classification. Additionally, a dynamic classifier selector was implemented to choose between the handcrafted features and the CNN framework at runtime, enhancing overall performance.
In 2023 [133], a novel approach to address the limitations of current FR models caused by occlusion factors like masks and glasses is introduced. The proposed Occlusion-Aware Module Network (OAM-Net) aims to enhance the accuracy of occluded FR. OAM-Net consists of two sub-networks: an occlusion-aware sub-network and a key region-aware sub-network. The occlusion-aware sub-network employs an attention module to dynamically adjust the weights of convolutional kernels, optimizing the processing of occluded face images. Meanwhile, the key region-aware sub-network integrates a Spatial Attention Residual Block (SARB) for precise identification and localization of key facial regions. Additionally, a meta learning-based strategy is implemented to further enhance the network’s generalization performance and accuracy.
In 2025 [134], the authors tackled the problem of facial expression recognition (FER) under facial occlusion, which hampers model accuracy and robustness. They proposed a Multi-Angle Feature Extraction (MAFE) method that integrates global, fine-grained, and region-specific features. MAFE comprises three main modules: multi-feature extraction using PTIR-50 and Swin Transformer, regional detail feature fusion, and consistent feature recognition. Key innovations include a Regional Bias Loss and a Consistent Feature Loss that help the model focus on expressive and informative facial regions.
The ID-Inpainter, an identity-guided face inpainting model, was introduced in [135]. By employing a highly accurate identity sampling technique and a GAN-based fusion network, ID-Inpainter achieves state-of-the-art identity preservation during inpainting.
Occlusion scenarios presented a formidable obstacle to the person re-identification (ReID) task, compromising discriminative features and introducing interference. Transformer-based networks emerged as a promising solution, leveraging their ability to adaptively aggregate features across image patches. However, existing methods faced challenges in effectively extracting local features due to the diffusion of disturbing occlusion features during self-attention block processing. To address this, the study [136] proposed the Occlusion Suppression and Repairing Transformer (OSRTrans), which predicted occlusion situations before feature extraction and guided the Transformer encoder to focus on visible regions, suppressing interference from occlusion. Additionally, it introduced a novel approach to reconstruct pseudo-holistic features for more robust retrieval.
The reviewed approaches for face recognition under occlusion demonstrate a diverse array of strategies, from conventional handcrafted features to advanced deep learning and transformer-based architectures. Several models, such as FROM [124], OSF-DNS [126], and OAM-Net [133], incorporate deep learning mechanisms to directly handle occlusion through feature masking, attention modules, or occlusion-aware sub-networks. These approaches typically show improved generalization and robustness when trained with large, occluded datasets. On the other hand, methods like RDGF [11] and DFSS [128] adopt handcrafted or traditional machine learning principles, offering lower computational complexity and interpretability but often at the expense of scalability and flexibility in more complex occlusion scenarios.
Transformer-based solutions, such as OSRTrans [136], provide superior performance in learning long-range dependencies and suppressing occlusion noise, although they require more computational resources and training data. In contrast, segmentation- and mask-based approaches [134,135] focus on isolating or reconstructing unoccluded features, often enhancing accuracy but depending heavily on precise segmentation quality. Additionally, approaches such as [122,123,129] leverage synthetic datasets or fused graphs, addressing data scarcity and enhancing feature representation from diverse viewpoints. Finally, while attention and mask learning-based models dominate recent advancements with strong performance in occlusion-robust FR, hybrid strategies that combine traditional descriptors with modern deep learning (e.g., BoCNN + RDGF [11]) and cross-domain learning (e.g., knowledge distillation [131]) present promising directions for balancing efficiency and accuracy. However, the effectiveness of each approach often depends on the occlusion type, available training data, and application context.

6.2. Reviewing Methods for MFR

In this section, we delve into a comprehensive review of methods for MFR. With the widespread adoption of face masks in response to public health crises and related circumstances, the need for robust facial recognition systems capable of identifying individuals even when their faces are partially obscured has become increasingly imperative. Our review examines a range of approaches proposed in the literature to address this challenge, encompassing both traditional and DL-based methods, and provides insights into their strengths and limitations. It aims to contribute to the advancement of MFR technology, facilitating its deployment across various real-world applications.
During the pandemic, the challenge of recognizing individuals wearing masks persisted worldwide. FR, integral to various applications like attendance systems and security checks, faced hurdles in identifying masked faces. To address this, the study [137] proposed a system employing Haar-cascade face detection, MobileNet, and the cosine distance method, achieving accurate identification even with masks. By leveraging multi-threading and transfer learning techniques, the system achieved high accuracy rates of 100% and 82.20%, with real-time implementation speeds of 4 FPS and 22 FPS, respectively, while successfully generating person identification numbers.
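The cosine-distance matching step used by such pipelines can be summarized in a few lines. The sketch below is a generic illustration: the 0.4 decision threshold and the source of the embeddings are our assumptions, not values reported in [137].

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (1 - cosine similarity)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(probe_emb, gallery_embs, gallery_ids, threshold=0.4):
    """Return the gallery identity closest to the probe embedding, or None
    if no match falls under the (assumed) decision threshold."""
    dists = [cosine_distance(probe_emb, g) for g in gallery_embs]
    best = int(np.argmin(dists))
    return gallery_ids[best] if dists[best] < threshold else None
```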
In 2022 [138], authors proposed a system leveraging deep metric learning and the FaceMaskNet-21 DL network for MFR from various sources. Achieving a testing accuracy of 88.92% with an execution time of under 10 ms, the system enabled real-time recognition in diverse settings such as CCTV footage in malls, banks, and schools. Its rapid performance facilitated applications in attendance systems and high-security areas without requiring individuals to remove their masks.
Drawing inspiration from recent advancements in amodal perception, an end-to-end de-occlusion distillation framework was introduced in [139]. The framework comprised two modules: a de-occlusion module leveraging a generative adversarial network for face completion and a distillation module transferring knowledge from a pre-trained FR model to train a student network for completed faces.
The novel task of MFR was the focus of [140], aiming to match masked faces with common faces, particularly during the global outbreak of COVID-19. Two datasets were collected for MFR: MFV for verification and MFI for identification, while a data augmentation method was introduced to generate synthetic masked face images. Additionally, a novel latent part detection (LPD) model was proposed to locate robust latent facial parts and extract discriminative features. The LPD model, trained in an end-to-end manner using original and synthetic training data, demonstrated superior generalization on both realistic and synthetic masked data, outperforming other methods significantly.
In crime scenes, the concealment of facial identity through face-masked disguise presents a significant challenge to identity recognition, rendering existing disguised FR techniques ineffective for face-masked identification. To address this issue, an MFR method based on person re-identification association was proposed in [141], transforming the MFR problem into one of uncovering associations between the masked face and the full face of the same person. Because person re-identification techniques do not rely solely on facial information, they were leveraged to establish associations between face-masked pedestrians and face-unveiled pedestrians. An effective face image quality assessment was then applied to select the most identifiable faces from the candidate faces, and these selected high-quality faces were used in place of the masked faces for identification.
In the study [142], the recognition of faces wearing surgical masks was addressed. An end-to-end approach was presented for training FR models based on the ArcFace architecture, incorporating various modifications to the backbone and loss computation. Data augmentation techniques were utilized to generate a masked version of the original dataset, which was dynamically combined during training. Additionally, the chosen network was adapted to output the likelihood of wearing a mask without incurring additional computational costs, resulting in the creation of a new function termed Multi-Task ArcFace (MTArcFace), where the FR loss and the mask-usage loss were merged. Experimental results demonstrated that the proposed method surpassed the baseline model performance in recognizing faces with masks while achieving comparable metrics on the original dataset. Furthermore, it achieved a mean accuracy of 99.78 % in mask-usage classification.
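The following PyTorch sketch illustrates how an identity margin loss and a mask-usage loss can be merged into a single objective in the spirit of MTArcFace; the scale s, margin m, and loss weighting lam are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Sketch of a multi-task head: one branch produces ArcFace-style
    identity logits, another predicts mask usage from the same embedding."""
    def __init__(self, emb_dim, n_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.mask_head = nn.Linear(emb_dim, 1)   # mask / no-mask logit
        self.s, self.m = s, m

    def forward(self, emb, labels, mask_labels, lam=0.1):
        # Additive angular margin on the target class (ArcFace formulation).
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        id_loss = F.cross_entropy(logits, labels)
        mask_loss = F.binary_cross_entropy_with_logits(
            self.mask_head(emb).squeeze(1), mask_labels.float())
        return id_loss + lam * mask_loss         # merged multi-task objective
```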
A method based on occlusion removal and DL-based features was proposed in [143]. Initially, the masked face region was eliminated, followed by the extraction of deep features using three pre-trained deep CNNs, namely VGG-16, AlexNet, and ResNet-50, primarily focusing on regions such as the eyes and forehead. Subsequently, the bag-of-features paradigm was applied to quantize the feature maps of the last convolutional layer, providing a concise representation compared to the fully connected layer of a classical CNN. Finally, Multilayer Perceptron (MLP) was employed for the classification process.
In 2021 [144], authors investigated MFR by developing a DL-based model capable of accurately identifying people wearing face masks. Specifically, the authors trained a ResNet-50-based architecture that demonstrates proficiency in recognizing masked faces. The findings of this study offered potential integration into existing FR programs designed for detecting faces for security verification purposes.
In 2021 [145], authors proposed a heterogeneous training method to maximize mutual information between FR across domains using semi-Siamese networks. Additionally, a 3D face reconstruction-based approach synthesized masked faces from existing NIR images. These strategies yield domain-invariant and mask-robust face representations.
In 2021 [146], an MFR algorithm based on an attention mechanism was proposed to enhance recognition rates for masked face images. Initially, the masked face image was separated using a locally constrained dictionary learning method to isolate the face image portion. Subsequently, dilated convolution was employed to mitigate the resolution reduction caused by subsampling. Finally, an attention-based neural network was utilized to prioritize important feature information of the face image, reducing information loss during subsampling and improving FR rates. Experimental comparisons were conducted on the RMFRD and SMFRD databases from Wuhan University.
A masked-FR algorithm based on large margin cosine loss (MFCosface) was proposed in [147]. Given the scarcity of masked-face data for training, an algorithm for generating masked-face images was developed, leveraging key facial feature detection. Furthermore, an Att-inception module was designed to enhance model attention on unoccluded areas, thereby improving recognition accuracy for masked subjects.
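Since several of the surveyed methods synthesize masked training images from facial landmark detections, the sketch below shows one simple variant of the idea. The landmark indices and the flat-colored polygon are our illustrative assumptions; practical generators typically warp textured mask templates onto the face instead.

```python
import cv2
import numpy as np

def overlay_synthetic_mask(image, landmarks, color=(210, 220, 230)):
    """Paint a simple mask-shaped polygon over the lower face.
    `landmarks` is assumed to be an (N, 2) array of keypoints in the
    68-point convention; the chosen indices (jawline points 2-14 plus
    nose-bridge point 28) are an illustrative assumption."""
    jaw = landmarks[2:15]              # lower jaw contour, left to right
    nose_bridge = landmarks[28:29]     # closes the polygon over the nose
    polygon = np.concatenate([jaw, nose_bridge]).astype(np.int32)
    masked = image.copy()
    cv2.fillPoly(masked, [polygon], color)
    return masked
```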
A new method for MFR was proposed by authors of [148], integrating a cropping-based approach with the Convolutional Block Attention Module (CBAM). Optimal cropping strategies were explored for each scenario, while the CBAM module was employed to prioritize regions around the eyes. Additionally, two special application scenarios were investigated, involving training on unmasked faces to recognize masked faces and vice versa.
In 2022 [149], authors introduced a method aimed at addressing the challenges posed by face masks in FR systems, particularly in the context of the COVID-19 pandemic. Firstly, a technique for synthesizing masked faces, known as mask transfer, was proposed to augment the dataset at a low cost with high accuracy. Secondly, an attention-aware MFR model (AMaskNet) was presented, consisting of a feature extractor and a contribution estimator, which refines feature representation by learning the contribution of feature elements through matrix multiplications. Additionally, an end-to-end training strategy was employed to optimize the entire model. Finally, a mask-aware similarity matching strategy (MS) was utilized to enhance performance during the inference stage.
A novel DeepMaskNet framework was proposed in [150], enabling both face mask detection and masked facial recognition. Furthermore, to address the lack of unified datasets for evaluation, a large-scale and diverse Unified Mask Detection and Masked Facial Recognition (MDMFR) dataset was developed.
In 2022 [151], a method combining DL and Local Binary Pattern (LBP) features was proposed, leveraging RetinaFace as an efficient encoder. LBP features from masked face areas were extracted and integrated with RetinaFace features into a unified framework for recognition. Additionally, the COMASK20 dataset, comprising data from 300 subjects, was collected. Experimental comparisons with state-of-the-art methods on the Essex dataset and COMASK20 demonstrated superior performance, with an 87 % F1 score on COMASK20 and a 98 % F1 score on Essex.
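A minimal sketch of this handcrafted-plus-deep fusion idea is shown below, assuming scikit-image for the LBP computation and an embedding produced by any deep encoder; it is a simplified stand-in for the unified framework of [151], not its exact design.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_patch, P=8, R=1.0):
    """Uniform LBP histogram of an unmasked region (e.g., eyes/forehead)."""
    lbp = local_binary_pattern(gray_patch, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def fused_descriptor(deep_embedding, gray_patch):
    """Concatenate an L2-normalized deep embedding with the LBP histogram,
    giving a joint handcrafted-plus-deep representation."""
    deep = deep_embedding / np.linalg.norm(deep_embedding)
    return np.concatenate([deep, lbp_histogram(gray_patch)])
```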
In 2022 [152], the authors addressed a gap in MFR by developing a masked face dataset covering individuals who do not wear masks or wear them incorrectly. A real-time mask detection service and an FR mobile application were developed based on an ensemble of fine-tuned lightweight deep CNNs. The proposed model achieved a validation accuracy of 90.40 % using 1849 face samples from 12 individuals.
In 2022 [153], a novel DL approach was introduced based on the Convolutional Block Attention Module (CBAM) and the angular margin ArcFace loss. CBAM was integrated with CNNs to extract feature maps that focus on the region around the eyes, while ArcFace served as the training loss to optimize the feature embedding and improve the discriminative power of features for MFR. To address the shortage of masked face photos for model training, data augmentation techniques were utilized to create masked face images from a widely used FR dataset. The efficacy of the method was assessed on established masked versions of the LFW, AgeDB-30, and CFP-FP verification datasets, as well as on the real-mask MFR2 dataset.
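For reference, the following is a compact PyTorch rendering of the standard CBAM design (channel attention followed by spatial attention); the layer sizes are generic defaults rather than the configuration used in [153].

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal Convolutional Block Attention Module: channel attention
    followed by spatial attention, as in the original CBAM design."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: conv over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```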
The method proposed in [30] involved transfer learning based on MobileNet V2, employing deep feature extraction and DL models to address the challenge of masked face identification. Initially, a face mask detector was applied to identify masks, after which the proposed approach was applied to datasets from various sources. The proposed model achieved a recognition accuracy of 99.82 % on the authors' dataset. Additionally, the study utilized pre-trained models including VGG16, VGG19, ResNet50, and ResNet101: deep features extracted with VGG16 yielded 99.30 % accuracy, VGG19 achieved 99.54 %, ResNet50 achieved 78.70 %, and ResNet101 achieved 98.64 % on the same dataset.
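A typical MobileNetV2 transfer-learning setup of the kind described above can be sketched as follows, assuming torchvision's pretrained weights; the number of identities and the hyperparameters are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained MobileNetV2 and reuse its convolutional
# trunk as a frozen deep-feature extractor.
backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                  # freeze pretrained weights

n_identities = 100                           # hypothetical number of subjects
backbone.classifier = nn.Sequential(         # replace the ImageNet head
    nn.Dropout(0.2),
    nn.Linear(backbone.last_channel, n_identities))

# Only the new head is trained on (masked) face crops.
optimizer = torch.optim.Adam(backbone.classifier.parameters(), lr=1e-3)
```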
A novel Progressive Learning Loss (PLFace) was proposed in [154], which implemented a progressive training strategy for deep FR to achieve balanced performance in recognizing masked and mask-free faces based on margin losses. PLFace dynamically adjusted the relative importance of masked and mask-free samples during various training stages. Initially, PLFace focused on learning feature representations of mask-free samples, with regular sample embeddings shrinking to corresponding prototypes stored in the last linear layer. Subsequently, PLFace converged on mask-free samples and progressively emphasized masked samples until their embeddings were also centralized in the class center. Throughout the training process, the paradigm emphasized the initial shrinking of normal samples followed by the gathering of masked samples.
In 2023 [155], authors addressed challenges of MFR by proposing an enhancement to Facenet. The ConvNeXt-T was employed as the backbone of the network model. This enhancement facilitated better feature extraction from the unobscured part of the face, capturing more useful information without increasing model complexity or causing dimensionality reduction. The study investigated the effects of different attention mechanisms on face mask recognition models and explored various dataset ratios to optimize experimental outcomes. Additionally, a comprehensive dataset of faces wearing masks was constructed to facilitate efficient model training. Experimental results demonstrated a high accuracy of 99.76 % for recognizing real faces wearing masks and a combined accuracy of 99.48 % in challenging environments characterized by extreme contrast and brightness conditions.
In 2023 [156], authors proposed a local eyebrow feature attention network for MFR, comprising feature extraction, eyebrow region pooling, and feature fusion. To accentuate the eyebrow region, eyebrow region pooling was initially utilized to isolate local eyebrow features from the overall facial features. Subsequently, the symmetry of left and right eyebrows was exploited to enhance their discriminative capability, compensating for the lack of fine information in low-resolution eyebrows. Notably, considering the symmetrical similarity between eyebrow pairs and the hierarchical relationship between facial components and the whole, a feature fusion model based on a graph convolutional network (GCN) was proposed to learn the feature association structure of eye features, brow features, and global facial features. Benchmark datasets for MFR, including the real-world MFR dataset (RMFRD) and the synthetic MFR dataset (SMFRD), were constructed to validate the proposed approach.
The authors of [157] introduced a new periocular recognition framework called Masked Mobile Lightweight Thermo-visible Face Recognition (MmLwThV). This framework utilizes thermo-visible features and an ensemble subspace network classifier to enhance the performance of existing periocular recognition systems, improving precision over a single visible modality by reducing the impact of noise in the thermo-visible features. The MmLwThV framework is lightweight and can readily be deployed on mobile devices equipped with both visible and infrared cameras.
In 2024 [158], authors introduced a novel method called MaskDUF for learning data uncertainty in MFR tasks. This method can adjust the optimization weight based on modeling uncertainty and measuring sample recognizability. As a result, it learns a sample distribution that is compact within classes, different between classes, and distant from noise. The suggested method, known as Hard Kullback–Leibler Divergence (H-KLD), serves as an adaptive variance regularizer for masked faces. It helps in learning more precise uncertainty representations and prevents overfitting noise. In addition, the concept of Mask Uncertainty Fluctuation (MUF) was introduced to quantify sample recognizability by considering both the magnitude of features and the uncertainty of their variance. This approach improves the learning preference for masked faces and results in a more condensed intra-class distribution resembling a cone shape. MaskDUF outperformed other advanced models with an average accuracy improvement ranging from 1.33 % to 13.28 % . Its effectiveness and strong resilience were confirmed through an ablation study, noise experiment, and parametric analysis.
Table 2 presents a comparative analysis of prominent MFR methods. The comparison highlights the core techniques, datasets used, accuracy or performance metrics, and notable remarks. This analysis provides insights into the diversity of approaches and their effectiveness in addressing the challenges of recognizing masked faces.

7. Datasets

This section provides a comprehensive overview of the datasets commonly used in the field to evaluate the performance of MFR algorithms. These datasets are essential for training and testing MFR models, offering researchers a diverse range of masked face images captured under various conditions. The datasets typically include both real-world images of individuals wearing masks in different environments and synthetic images generated with controlled parameters. By utilizing these datasets, researchers can assess the robustness, accuracy, and generalization capabilities of MFR algorithms across different scenarios and settings. Additionally, this section discusses the challenges associated with dataset collection, annotation, and distribution, as well as strategies for addressing these challenges to ensure the quality and reliability of MFR research.
The Synthetic CelebFaces Attributes (Synthetic CelebA) dataset [77] comprises 10,000 publicly accessible synthetic images. The underlying CelebA dataset is a comprehensive collection of facial attributes containing over 200,000 images of celebrities. The construction utilized 50 different types of synthetic masks with varying sizes, forms, colors, and constructions, and the synthetic samples were created by aligning each face before applying a mask.
The Synthetic Face-Occluded Dataset, referenced as [122], was developed by leveraging the publicly available CelebA and CelebA-HQ datasets, as noted in [159]. The CelebA-HQ dataset contains an extensive collection of over 30,000 images showcasing the faces of celebrities. These facial images are carefully cropped and roughly aligned based on the position of the eyes. To simulate occlusions, the dataset incorporates five common non-facial objects: hands, masks, sunglasses, eyeglasses, and microphones. Each of these items is represented by more than 40 variations, spanning a wide range of sizes, shapes, colors, and designs. Moreover, these non-facial objects are randomly placed on the faces, adding complexity and diversity to the dataset.
The MFSR dataset [43] consists of two components: masked face segmentation and recognition. For the first component, 9742 photos of individuals wearing masks were obtained from the Internet, with manually annotated segmentation of the masked regions. The second component consists of 11,615 photos depicting 1004 different identities; 704 photographs were obtained from real-world sources, while the rest were sourced from the Internet. Every identity is represented by at least one masked and one unmasked image.
The Masked Face Detection Dataset (MFDD), cited as [5], comprises a collection of 24,771 images depicting faces adorned with masks. This dataset aims to improve the precision of masked face detection models by providing ample examples of masked faces. On the other hand, the Real-World Masked Face Recognition Dataset (RMFRD), also referenced as [5], is a comprehensive dataset for Masked Face Recognition (MFR). It includes 5000 images portraying 525 individuals wearing masks, alongside a staggering 90,000 images of the same individuals without masks. This dataset stands out as one of the largest and most extensive resources available for MFR research, facilitating thorough evaluation and benchmarking of MFR algorithms.
The Simulated Masked Face Recognition Dataset (SMFRD) [5] includes 500,000 images of 10,000 individuals with synthetically applied masks, built from face images collected online to enhance diversity.
The AgeDB dataset [160] is a real-world dataset comprising 16,488 images of 568 celebrities across a wide age range. It includes four verification protocols with age gaps of 5, 10, 20, and 30 years. While not specifically created for MFR, it is relevant for evaluating the impact of age-related appearance changes on face recognition systems. When combined with occlusion challenges such as face masks, age variation can significantly affect recognition accuracy, making AgeDB useful for assessing the robustness of MFR models in realistic scenarios.
Celebrities in Frontal-Profile (CFP) [161] contains face images of 500 celebrities captured in both frontal and extreme profile views. It offers two standard verification protocols, frontal–frontal (FF) and frontal–profile (FP), each with 7000 comparison pairs. Although CFP was not created for MFR, its large pose variation directly addresses a challenge highlighted in Section 9: when a face is simultaneously turned sideways and partially covered by a mask, the visible facial area shrinks dramatically, making recognition harder. Evaluating models on CFP therefore helps to gauge an MFR system’s robustness to real-world pose and occlusion combinations.
In [123], a dataset was created by aligning the data with a three-dimensional Morphable Model. The collection comprises three-dimensional scans of 100 females and 100 males, from which 200 images were generated, categorized, and used by the authors' model for MFR experiments on two datasets.
The MS1MV2 dataset [162] is an improved iteration of the MS-Celeb-1M dataset [163], comprising 5.8 million images of 85,000 different identities. A masked version of MS1MV2, referred to as MS1MV2-Masked, was presented in [119]; the mask type and color were selected randomly for each image to diversify mask appearance and widen the range of variation in the training dataset.
The Extended Masked Face Recognition (EMFR) dataset, documented as [164], was meticulously compiled through the capture of video footage featuring 48 individuals utilizing their webcams across three distinct sessions: reference, session 2, and session 3 (probes). These sessions were conducted over the course of three separate days to ensure diversity and variability in the dataset. The baseline reference (BLR) subset encompasses a total of 480 images extracted from the initial session’s starting video. Meanwhile, the mask reference segment comprises 960 pictures drawn from the second and third videos of the reference session, depicting individuals wearing masks. In contrast, the baseline probe category includes 960 unmasked facial photos extracted from the opening videos of sessions 2 and 3. Finally, the mask probe section features a total of 1920 images obtained from the second and third videos of sessions 2 and 3, presenting individuals wearing masks. This dataset offers a comprehensive array of facial images under various conditions, serving as a valuable resource for extended masked face recognition research.
The WebFace dataset, sourced from IMDb, constitutes a vast collection comprising 500,000 images representing 10,000 distinct identities, as documented in [165]. This extensive dataset offers a diverse array of facial images, providing a rich resource for training and evaluating face recognition algorithms.
The Extended Yale B dataset [166] has 16,128 images of 28 individuals captured in 9 poses and under 64 different illumination conditions. It is commonly utilized in FR applications.
The Labeled Faces in the Wild (LFW) dataset [167] contains 13,233 images of 5749 individuals. The LFW-SM variant, presented in [39], expands LFW to around 50,000 images by adding simulated masks.
Several MFR techniques utilized the VGGFace2 dataset for training, containing 3 million photos of 9131 individuals, with an average of 362 images per person. The Masked Faces in the Wild (MFW) micro dataset [37] comprises 3000 photographs of 300 individuals sourced from the Internet. Each person is represented by five images of them wearing masks and five images of them without masks. The Masked Face Database (MFD) has 45 people and a total of 990 photos of both women and men.
VGG-Face2_m, referenced in [147], represents an enhanced version of the initial VGG-Face dataset, encompassing over 3.3 million images spanning 9131 distinct identities. Additionally, CASIA-FaceV5_m, as detailed in the same reference, is an upgraded rendition of CASIA-FaceV5, containing 2500 images depicting 500 Asian individuals, each person represented by five photographs.
Two datasets pertinent to MFR were introduced in [140]. The Masked Face Verification (MFV) dataset comprises 400 pairs for 200 identities, while the Masked Face Identification (MFI) dataset includes 4916 photos depicting 669 distinct identities.
Table 3 summarizes the key attributes of the major datasets used in MFR and general face recognition research. While general face recognition datasets (e.g., AgeDB, CFP, MS1MV2) do not contain masked faces, they are used to evaluate complementary challenges such as aging, pose variation, and large-scale identity coverage. These variations are critical in understanding the performance boundaries of MFR systems, especially under complex real-world conditions, as discussed in Section 9.
While several popular MFR datasets have advanced the field by providing large-scale image collections, many of these rely heavily on synthetically generated mask overlays. This approach often lacks the variability and complexity found in real-world occlusions such as partial coverage from scarves, hands, sunglasses, hair, or different types and positions of masks. As a result, models trained on such datasets may struggle to generalize to uncontrolled or in-the-wild scenarios. This highlights a crucial gap in dataset realism and diversity that needs to be addressed for robust MFR deployment.
Most existing MFR datasets consist of static images captured under controlled conditions. These datasets often fail to represent dynamic scenarios that occur in real-time video streams, such as slipping or readjusted masks, hand gestures covering faces, or crowd interactions involving multiple individuals. Such real-world temporal complexities are crucial for applications like surveillance and access control, where face visibility may change over time. The absence of annotated video datasets with frame-level occlusion dynamics limits the ability of current models to generalize effectively to live monitoring tasks. Incorporating sequential data with labeled temporal occlusion events is an essential direction for future dataset development.

8. Performance Evaluation Metrics

In this section, we delve into the various criteria and methodologies used to assess the effectiveness and accuracy of MFR systems. Through rigorous evaluation metrics, researchers gauge the performance of MFR algorithms under diverse conditions and datasets. This section provides an in-depth exploration of the metrics employed, shedding light on their significance in benchmarking MFR methods. From traditional metrics like accuracy, precision, and recall to more specialized measures tailored for MFR, such as mask detection rates and false acceptance rates, this section offers an overview of the evaluation metrics in MFR.
Accuracy serves as a prevalent metric for assessing recognition and classification tasks, representing the ratio of correct predictions to the total sample size. It is calculated by dividing the number of accurate forecasts by the total number of samples, providing a straightforward measure of overall performance.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The performance of DNNs in computer vision is often evaluated using Rank-1, Rank-5, and Rank-N accuracy metrics. Rank-1 accuracy measures the proportion of samples whose top predicted label matches the ground truth. Rank-5 accuracy, commonly employed in multi-class classification, checks whether the top five predicted labels include the correct ground-truth label. Rank-N accuracy generalizes this to the top N predictions and is typically employed with larger datasets.
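Rank-k accuracy is straightforward to compute from a matrix of class scores, as the following NumPy sketch shows (a generic illustration, not tied to any surveyed system).

```python
import numpy as np

def rank_k_accuracy(scores, labels, k=5):
    """Fraction of queries whose true label appears among the k
    highest-scoring predictions (Rank-1 when k=1).
    `scores` is an (N, C) array; `labels` holds N ground-truth indices."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]   # top-k class indices
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])
```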
Precision is the proportion of true positive forecasts among all positive predictions:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Mean Average Precision (mAP) is a commonly employed performance metric in computer vision, specifically for tasks such as object detection, categorization, and localization. The calculation entails averaging the precision across all classes and, in detection settings, across intersection-over-union (IoU) thresholds.
$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$$
where $Q$ is the total number of queries or classes and $AP(q)$ is the Average Precision (AP) for an individual query or class. The AP for a single query or class is calculated as follows:
$$AP(q) = \frac{\sum_{k=1}^{N} P(k) \times rel(k)}{\text{total relevant items}}$$
where $N$ is the total number of retrieved items, $P(k)$ is the precision at cut-off $k$ (with $k$ ranging from 1 to $N$), and $rel(k)$ is an indicator function equaling 1 if the item at rank $k$ is relevant and 0 otherwise.
The mAP is essentially the average of the AP values across all queries or classes, providing a single metric to evaluate the overall performance of a retrieval or classification system.
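A direct NumPy implementation of the AP definition above is shown below for a single ranked result list; mAP then simply averages this value over all queries or classes.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query: `relevance` is a binary array ordered by
    decreasing retrieval score (1 = relevant item at that rank)."""
    relevance = np.asarray(relevance)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

# mAP is the mean of AP over all queries:
# map_score = np.mean([average_precision(r) for r in per_query_relevance])
```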
The Structural Similarity Index (SSIM) is utilized to quantify the perceived quality of digital images and videos, and to measure the resemblance between two images. Quality evaluation relies on an uncompressed or undistorted image as the point of reference; SSIM is thus a full-reference metric, defined as follows:
$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu$ denotes the mean value of an image and $\sigma$ its standard deviation, $x$ and $y$ are the two images being compared, $\sigma_{xy}$ is their covariance, and $c_1$ and $c_2$ are constants that stabilize the division when the denominator approaches zero.
The peak signal-to-noise ratio (PSNR) is a measure of the ratio between the maximum power of a signal and the power of the noise that affects its accuracy. PSNR is often expressed logarithmically on the decibel scale, since many signals have a wide dynamic range, and is commonly used to measure the reconstruction quality of images and videos that undergo lossy compression. The original image matrix and the degraded image matrix must have identical dimensions. PSNR is defined as follows:
$$PSNR = 20 \log_{10}\left(\frac{MAX_f}{\sqrt{MSE}}\right)$$
where $MAX_f$ is the maximum signal value in the original image, and the mean squared error ($MSE$) is determined as follows:
$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left\| f(i,j) - g(i,j) \right\|^2$$
where $f$ is the matrix data of the original image, $g$ is the matrix data of the degraded image, $m$ is the number of pixel rows, $i$ is the row index, $n$ is the number of pixel columns, and $j$ is the column index.
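The PSNR and MSE definitions translate directly into code; the sketch below assumes 8-bit images (maximum value 255).

```python
import numpy as np

def psnr(original, degraded, max_val=255.0):
    """PSNR in dB between two same-sized images, per the equations above."""
    diff = original.astype(np.float64) - degraded.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))
```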
The error rate, also referred to as the misclassification rate, stands as the complement of the accuracy metric. It measures the proportion of misclassified samples across both positive and negative classes. Imbalanced datasets can impact the accuracy metric’s reliability. The error rate is computed by subtracting the accuracy from one.
$$ERR = 1 - \mathrm{Accuracy}$$
The Equal Error Rate (EER) is a biometric performance measure that identifies the operating threshold at which the rate of falsely accepting an impostor (False Acceptance Rate, FAR) equals the rate of falsely rejecting a genuine user (False Rejection Rate, FRR); in practice, the EER is taken as the average of the two rates at the threshold where they are closest. A lower EER signifies a more precise biometric system. The false positive rate (FPR) quantifies the ratio of negative samples erroneously classified as positive to the overall number of negative samples, while the false negative rate (FNR) quantifies the proportion of positive samples that were inaccurately classified.
$$FAR = FPR = \frac{FP}{FP + TN}$$
$$FRR = FNR = \frac{FN}{FN + TP}$$
$$EER = \frac{FAR + FRR}{2}$$
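In practice, the EER is found by sweeping the decision threshold over the match scores until FAR and FRR cross, as in the following NumPy sketch (which assumes higher scores indicate genuine pairs).

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep the decision threshold and return the point where the false
    acceptance and false rejection rates are (nearly) equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # closest FAR/FRR crossing
    return (far[i] + frr[i]) / 2.0
```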
The False Discovery Rate (FDR) refers to the anticipated ratio of false positive classifications to all positive classifications. The total number of rejections of the null hypothesis encompasses both false positives (FP) and true positives (TP). The FDR can be estimated in the following manner:
$$FDR = \frac{FP}{FP + TP}$$
The geometric mean (G-Mean) assesses the balance of classification performance across the majority and minority classes. A low G-Mean indicates poor performance in classifying positive cases, even when negative cases are classified accurately; the measure thus helps prevent overfitting to the negative class and underfitting of the positive class. Sensitivity (also known as the True Positive Rate, TPR) measures the proportion of actual positives correctly identified by the system.
$$TPR\ (\mathrm{Sensitivity}) = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\text{G-Mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}$$
The false alarm rate (FAR), alternatively termed the false positive rate (FPR), denotes the proportion of negative samples erroneously identified as positive within the entire negative sample pool; it is the complement of the specificity measure. Conversely, the true negative rate (TNR) is the counterpart of recall for the negative class and signifies the percentage of accurately classified negative samples within the overall negative sample set. Both the FAR and TNR serve as pivotal measures of verification accuracy, offering insights into the system’s ability to discern between positive and negative instances.
$$TNR = \frac{TN}{FP + TN}$$
$$FAR = 1 - TNR = \frac{FP}{TN + FP}$$

9. Challenges and Future Research Directions

This section outlines significant obstacles and complexities associated with MFR in the current literature. Its purpose is to guide researchers in identifying areas that require particular attention.
While there exist datasets that are both effective and efficient for MFR, they still have certain limitations. For instance, some datasets contain images of faces with and without masks but include no, or only very few, images of improperly worn masks; others cover properly and improperly masked faces but lack unmasked faces; and datasets that include all three categories remain insufficient in quantity. Certain databases also exhibit regional bias, preventing impartial evaluation and hampering model training and classification. Hence, a suitable dataset is required that is extensive in size, varied in orientation, and free of regional bias. In addition, a collection should include individuals of different ages with diverse physical features and textures, along with a variety of face masks. An ideal benchmark dataset should include faces in diverse orientations, allowing for accurate classification or detection of face masks even when faces are turned to the side or at an angle. Furthermore, a critical limitation is that many current datasets rely on artificially synthesized mask overlays, which do not adequately reflect real-world occlusions [112,115]. In actual scenarios, faces may be partially covered by scarves, hands, hoods, or incorrectly worn or non-standard masks, introducing irregular occlusion patterns that are rarely captured in synthetic datasets. This lack of natural occlusion complexity reduces the generalization ability of MFR models trained on such datasets. Thus, future work must focus on collecting more realistic and diverse datasets that reflect actual use cases, including spontaneous occlusions and environmental variability. Moreover, to enhance real-time MFR capabilities, datasets should move beyond static image collections and include annotated video sequences that capture the temporal evolution of occlusions, such as slipping masks, natural face movements, or interactions among individuals. Because face masking only became widespread with COVID-19, datasets containing images of people wearing masks remain scarce, which complicates the effective training and deployment of face mask identification algorithms. In this scenario, simulated masked face dataset creation methods are advantageous, given the far larger quantity of general face datasets compared to masked face datasets.
There are tools available for data augmentation and face masking that can be used to generate synthetic face masks. It is important to evaluate MFR algorithms using several types of genuine masks, including those with textures, and to assess the performance of real-time MFR algorithms on real-world images gathered with realistic masks. Furthermore, MFR methods must be developed to handle scenarios in which numerous faces or people are present in the same image or scene, in addition to the variations found in dataset photographs; MFR concepts and implementations mostly focus on analyzing a single face with a mask-sensitive algorithm [135]. More diverse datasets containing various mask types and participants are anticipated to become available to provide a reliable evaluation of the accuracy of MFR algorithms. It may also be advantageous to enhance training and benchmarking datasets by including images with different facial expressions to challenge MFR systems with real-life participants [158].
An inherent issue with wearing a face mask is that it obscures a significant portion of the human face, posing a limitation for face detection in different systems and organizations. Given the wide range of applications for face detection and recognition, the process becomes challenging when an individual wears a mask, as masks cover crucial facial regions [136]. Moreover, a concealed face undermines security, as it prevents a system from identifying or granting access to an individual. Therefore, recreating complete faces from partially visible regions could be a significant area of research. Amid the COVID-19 pandemic, additional research is needed on matters of safety and security: wearing face masks is obligatory for safety, yet it may potentially undermine security. Only a limited number of experiments have utilized computer vision and DL to explore masked face identification or reconstruction. Hence, the reconstruction of a masked face remains a significant and complex problem in the context of masked face detection.
MFR systems will likely prioritize the use of 3D facial reconstruction over traditional 2D approaches. Two-dimensional facial recognition is constrained by its susceptibility to changes in facial position, lighting conditions, and occlusions. In contrast, three-dimensional representations offer greater resilience by capturing depth information and facial geometry. Various methods have employed 3D modeling to reconstruct masked or occluded facial regions more accurately. Potential strategies that could be investigated for the MFR challenge include masked adaptive projection, which projects visible features into an estimated 3D space; multi-view recognition, which aggregates face information from different angles; and 3D morphable models (3DMMs), which use a parametric face model to fit and reconstruct the full face from a partially visible image. Additionally, deep learning-based inverse rendering and GAN-based 3D face synthesis have recently emerged as promising directions, enabling plausible 3D face recovery even from single masked 2D images [123]. These approaches can help MFR systems infer the occluded facial structure and improve recognition performance in challenging real-world conditions.
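At its core, the 3DMM strategy mentioned above reduces to a linear generative model of face geometry; the sketch below shows this parametric form as a conceptual illustration only, with the mean shape and basis matrices assumed to come from a trained morphable model.

```python
import numpy as np

def reconstruct_3dmm(mean_shape, id_basis, exp_basis, alpha, beta):
    """Linear 3D Morphable Model: a full face shape is the mean shape plus
    identity and expression deformations. Fitting alpha/beta to the visible
    (unmasked) landmarks lets the occluded geometry be inferred.
    Shapes are flattened (3N,) vertex arrays; each basis has one column
    per coefficient."""
    return mean_shape + id_basis @ alpha + exp_basis @ beta
```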
One of the key limitations of current MFR systems is their reduced performance in unconstrained environments. Existing models often show significant degradation under low-light conditions, extreme or large pose variations, and cross-domain settings where the training and testing data originate from different distributions (e.g., different cameras, backgrounds, or cultural contexts). This lack of robustness limits the practical deployment of MFR in real-world scenarios, such as nighttime surveillance or mobile device authentication. Current datasets and training pipelines do not adequately account for these factors [43,167]. Therefore, future research should focus on improving the adaptability of MFR systems by incorporating advanced domain adaptation techniques, low-light image enhancement, and pose-invariant feature extraction methods. Benchmarking models under these challenging conditions is essential to gauge true generalization capability.
To overcome limitations of visible-spectrum MFR—such as poor performance in low-light conditions or with occluded faces—integrating multiple modalities like infrared (IR) and thermal imaging is a promising direction. Multimodal approaches leverage complementary information from different sensors, enhancing robustness and generalization. A specific integration framework may involve early fusion (combining raw pixel data), feature-level fusion (combining features extracted from each modality), or decision-level fusion (combining final predictions from unimodal classifiers). For instance, CNN-based encoders can process each modality independently, followed by feature concatenation or attention-based fusion for joint learning. Despite its potential, real-time deployment is limited by the scarcity of aligned multimodal datasets and the complexity of sensor synchronization. Future research should aim to design unified architectures that effectively learn cross-modal representations while minimizing the need for extensive calibration [50,67].
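A feature-level fusion architecture of the kind outlined here can be sketched as follows in PyTorch; the two encoders, the feature dimension, and the plain concatenation are illustrative assumptions standing in for the attention-based fusion variants discussed above.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Sketch of feature-level fusion for visible + thermal inputs:
    two independent CNN encoders, concatenated embeddings, and a joint
    classifier over the fused representation."""
    def __init__(self, encoder_vis, encoder_ir, feat_dim, n_classes):
        super().__init__()
        self.encoder_vis, self.encoder_ir = encoder_vis, encoder_ir
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, x_vis, x_ir):
        # Each modality is encoded independently, then fused by concatenation.
        f = torch.cat([self.encoder_vis(x_vis), self.encoder_ir(x_ir)], dim=1)
        return self.classifier(f)
```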

10. Conclusions

This study has provided an extensive overview of the latest developments on MFR utilizing DL techniques. It examined the standard MFR pipeline used in recent years and highlighted the latest advancements that have enhanced the efficiency of MFR systems. These systems have addressed several essential components, including image preprocessing, feature extraction, face identification, localization, and face unmasking. Recent innovative strategies have been proposed to overcome persistent challenges in MFR, and they are expected to inspire continued research efforts.
The MFR task is anticipated to remain an active area of investigation, with ongoing projects and studies continually enriching the literature. It is evident that existing FR models adapted for MFR can suffer from degraded performance, especially under varying degrees of occlusion. Thus, exploring novel strategies that enhance the learning focus and generalization capacity of DL models remains crucial. For example, CNN-based architectures like ResNet and DenseNet are often employed for high-accuracy recognition due to their strong feature representation, whereas lightweight models like MobileNet are better suited for real-time or mobile scenarios where computational efficiency is critical. GAN-based frameworks excel in data augmentation and mask reconstruction tasks, making them ideal for forensic and surveillance applications. Transformer-based models are emerging as powerful alternatives for handling spatial relationships and long-range dependencies in masked face data, especially in complex occlusion settings.
Analyzing image feature variations and test conditions remains essential for robust generalization. Hybrid deep neural networks capable of performing multiple tasks (such as mask detection, identity classification, and face reconstruction) are key to achieving high recognition accuracy. Additionally, integrating metric learning techniques may enhance identity verification and further improve recognition robustness.

Author Contributions

Methodology, B.S. and I.S.; formal analysis, A.H.H.M.M., A.A.E.-S. and A.A.; investigation, B.S. and I.S.; writing—original draft preparation, B.S. and I.S.; writing—review and editing, A.H.H.M.M., A.A.E.-S. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by A’Sharqiyah University in the Sultanate of Oman under grant number (BFP/RGP/ICT/22/490).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, W.; Chellappa, R.; Phillips, P.J.; Rosenfeld, A. Face recognition: A literature survey. ACM Comput. Surv. 2003, 35, 399–458. [Google Scholar] [CrossRef]
  2. Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face recognition systems: A survey. Sensors 2020, 20, 342. [Google Scholar] [CrossRef] [PubMed]
  3. World Health Organization. Infection Prevention and Control During Health Care When Coronavirus Disease (COVID-19) Is Suspected or Confirmed: Interim Guidance, 12 July 2021; WHO/2019-nCoV/IPC/2021.1; World Health Organization: Geneva, Switzerland, 2021. [Google Scholar]
  4. Hsu, G.S.J.; Wu, H.Y.; Tsai, C.H.; Yanushkevich, S.; Gavrilova, M.L. Masked face recognition from synthesis to reality. IEEE Access 2022, 10, 37938–37952. [Google Scholar] [CrossRef]
  5. Wang, Z.; Huang, B.; Wang, G.; Yi, P.; Jiang, K. Masked face recognition dataset and application. IEEE Trans. Biom. Behav. Identity Sci. 2023, 5, 298–304. [Google Scholar] [CrossRef]
  6. Kaur, P.; Krishan, K.; Sharma, S.K.; Kanchan, T. Facial-recognition algorithms: A literature review. Med. Sci. Law 2020, 60, 131–139. [Google Scholar] [CrossRef]
  7. Utegen, D.; Rakhmetov, B.Z. Facial recognition technology and ensuring security of biometric data: Comparative analysis of legal regulation models. J. Digit. Technol. Law 2023, 1, 825–844. [Google Scholar] [CrossRef]
  8. Ngan, M.L.; Grother, P.J.; Hanaoka, K.K. Ongoing Face Recognition Vendor Test (FRVT) Part 6B: Face Recognition Accuracy with Face Masks Using Post-COVID-19 Algorithms; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2020. [Google Scholar]
  9. Talahua, J.S.; Buele, J.; Calvopiña, P.; Varela-Aldás, J. Facial recognition system for people with and without face mask in times of the COVID-19 pandemic. Sustainability 2021, 13, 6900. [Google Scholar] [CrossRef]
  10. Liu, F.; Chen, D.; Wang, F.; Li, Z.; Xu, F. Deep learning based single sample face recognition: A survey. Artif. Intell. Rev. 2023, 56, 2723–2748. [Google Scholar] [CrossRef]
  11. Kumar, S.; Singh, S.K.; Peer, P. Occluded thermal face recognition using BoCNN and radial derivative Gaussian feature descriptor. Image Vis. Comput. 2023, 132, 104646. [Google Scholar] [CrossRef]
  12. Peng, Y.; Wu, J.; Xu, B.; Cao, C.; Liu, X.; Sun, Z.; He, Z. Deep learning based occluded person re-identification: A survey. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 1–27. [Google Scholar] [CrossRef]
  13. Fang, Z.; Lei, Y.; Yuan, G. Research advanced in occluded face recognition. In Proceedings of the Fifth International Conference on Computer Information Science and Artificial Intelligence (CISAI 2022), Chongqing, China, 16–18 September 2022; Volume 12566, pp. 624–633. [Google Scholar]
  14. Shree, M.; Dev, A.; Mohapatra, A.K. Review on facial recognition system: Past, present, and future. In Proceedings of the International Conference on Data Science and Applications: ICDSA 2022, Volume 1; Springer Nature: Singapore, 2023; pp. 807–829. [Google Scholar]
  15. Makrushin, A.; Uhl, A.; Dittmann, J. A survey on synthetic biometrics: Fingerprint, face, iris and vascular patterns. IEEE Access 2023, 11, 33887–33899. [Google Scholar] [CrossRef]
  16. Abdulrahman, S.A.; Alhayani, B. A comprehensive survey on the biometric systems based on physiological and behavioural characteristics. Mater. Today Proc. 2023, 80, 2642–2646. [Google Scholar] [CrossRef]
17. Ramasundaram, B.A.; Gurusamy, R.; Jayakumar, D. Facial recognition technologies in human resources: Uses and challenges. J. Inf. Technol. Teach. Cases 2023, 13, 165–169. [Google Scholar] [CrossRef]
  18. Al-Nabulsi, J.; Turab, N.; Owida, H.A.; Al-Naami, B.; De Fazio, R.; Visconti, P. IoT solutions and AI-based frameworks for masked-face and face recognition to fight the COVID-19 pandemic. Sensors 2023, 23, 7193. [Google Scholar] [CrossRef] [PubMed]
  19. Zhang, L.; Verma, B.; Tjondronegoro, D.; Chandran, V. Facial expression analysis under partial occlusion: A survey. ACM Comput. Surv. 2018, 51, 1–49. [Google Scholar] [CrossRef]
  20. Lahasan, B.; Lutfi, S.L.; San-Segundo, R. A survey on techniques to handle face recognition challenges: Occlusion, single sample per subject and expression. Artif. Intell. Rev. 2019, 52, 949–979. [Google Scholar] [CrossRef]
  21. Zeng, D.; Veldhuis, R.; Spreeuwers, L. A survey of face recognition techniques under occlusion. IET Biom. 2021, 10, 581–606. [Google Scholar] [CrossRef]
  22. Hasan, M.R.; Guest, R.; Deravi, F. Presentation-level privacy protection techniques for automated face recognition—A survey. ACM Comput. Surv. 2023, 55, 1–27. [Google Scholar] [CrossRef]
  23. Sharma, R.; Ross, A. Periocular biometrics and its relevance to partially masked faces: A survey. Comput. Vis. Image Underst. 2023, 226, 103583. [Google Scholar] [CrossRef]
  24. Duong, H.T.; Nguyen-Thi, T.A. A review: Preprocessing techniques and data augmentation for sentiment analysis. Comput. Soc. Netw. 2021, 8, 1. [Google Scholar] [CrossRef]
  25. Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transit. Proc. 2022, 3, 91–99. [Google Scholar] [CrossRef]
  26. Liu, X.; Zou, Y.; Kuang, H.; Ma, X. Face image age estimation based on data augmentation and lightweight convolutional neural network. Symmetry 2020, 12, 146. [Google Scholar] [CrossRef]
  27. Charoqdouz, E.; Hassanpour, H. Feature extraction from several angular faces using a deep learning based fusion technique for face recognition. Int. J. Eng. Trans. B Appl. 2023, 36, 1548–1555. [Google Scholar] [CrossRef]
  28. Riaz, Z.; Mayer, C.; Beetz, M.; Radig, B. Model based analysis of face images for facial feature extraction. In Proceedings of the Computer Analysis of Images and Patterns: 13th International Conference, CAIP 2009, Münster, Germany, 2–4 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 99–106. [Google Scholar]
  29. Feihong, L.; Hang, C.; Kang, L.; Qiliang, D.; Jian, Z.; Kaipeng, Z.; Hong, H. Toward high-quality face-mask occluded restoration. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23. [Google Scholar] [CrossRef]
  30. Shukla, R.K.; Tiwari, A.K. Masked face recognition using MobileNet V2 with transfer learning. Comput. Syst. Sci. Eng. 2023, 45, 293–309. [Google Scholar] [CrossRef]
  31. Fan, Y.; Guo, C.; Han, Y.; Qiao, W.; Xu, P.; Kuai, Y. Deep-learning-based image preprocessing for particle image velocimetry. Appl. Ocean Res. 2023, 130, 103406. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Safdar, M.; Xie, J.; Li, J.; Sage, M.; Zhao, Y.F. A systematic review on data of additive manufacturing for machine learning applications: The data quality, type, preprocessing, and management. J. Intell. Manuf. 2023, 34, 3305–3340. [Google Scholar] [CrossRef]
  33. Hayashi, T.; Cimr, D.; Fujita, H.; Cimler, R. Image entropy equalization: A novel preprocessing technique for image recognition tasks. Inf. Sci. 2023, 647, 119539. [Google Scholar] [CrossRef]
  34. Murcia-Gomez, D.; Rojas-Valenzuela, I.; Valenzuela, O. Impact of Image Preprocessing Methods and Deep Learning Models for Classifying Histopathological Breast Cancer Images. Appl. Sci. 2022, 12, 11375. [Google Scholar] [CrossRef]
  35. Nazarbakhsh, B.; Manaf, A.A. Image pre-processing techniques for enhancing the performance of real-time face recognition system using PCA. In Bio-Inspiring Cyber Security and Cloud Services: Trends and Innovations; Springer: Berlin/Heidelberg, Germany, 2014; pp. 383–422. [Google Scholar]
  36. Abbas, A.; Khalil, M.I.; Abdel-Hay, S.; Fahmy, H.M. Expression and illumination invariant preprocessing technique for face recognition. In Proceedings of the 2008 International Conference on Computer Engineering & Systems, Cairo, Egypt, 25–27 November 2008; pp. 59–64. [Google Scholar]
  37. Mohammed Ali, F.A.; Al-Tamimi, M.S. Face mask detection methods and techniques: A review. Int. J. Nonlinear Anal. Appl. 2022, 13, 3811–3823. [Google Scholar]
  38. Nowrin, A.; Afroz, S.; Rahman, M.S.; Mahmud, I.; Cho, Y.Z. Comprehensive review on facemask detection techniques in the context of COVID-19. IEEE Access 2021, 9, 106839–106864. [Google Scholar] [CrossRef]
  39. Anwar, A.; Raychowdhury, A. Masked face recognition for secure authentication. arXiv 2020, arXiv:2008.11104. [Google Scholar]
  40. Cabani, A.; Hammoudi, K.; Benhabiles, H.; Melkemi, M. MaskedFace-Net–A dataset of correctly/incorrectly masked face images in the context of COVID-19. Smart Health 2021, 19, 100144. [Google Scholar] [CrossRef] [PubMed]
  41. Hooge, K.D.O.; Baragchizadeh, A.; Karnowski, T.P.; Bolme, D.S.; Ferrell, R.; Jesudasen, P.R.; O’Toole, A.J. Evaluating automated face identity-masking methods with human perception and a deep convolutional neural network. ACM Trans. Appl. Percept. 2020, 18, 1–20. [Google Scholar] [CrossRef]
  42. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  43. Geng, M.; Peng, P.; Huang, Y.; Tian, Y. Masked face recognition with generative data augmentation and domain constrained ranking. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2246–2254. [Google Scholar]
  44. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8789–8797. [Google Scholar]
  45. Hong, J.H.; Kim, H.; Kim, M.; Nam, G.P.; Cho, J.; Ko, H.S.; Kim, I.J. A 3D model-based approach for fitting masks to faces in the wild. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 235–239. [Google Scholar]
  46. Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [Google Scholar] [CrossRef] [PubMed]
  47. Geng, C.; Jiang, X. Face recognition using SIFT features. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 3313–3316. [Google Scholar]
  48. Liu, C.; Wechsler, H. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Process. 2002, 11, 467–476. [Google Scholar]
  49. Yi, Z. Researches advanced in image recognition based on deep learning. Highlights Sci. Eng. Technol. 2023, 39, 1309–1316. [Google Scholar] [CrossRef]
  50. Sharifani, K.; Amini, M. Machine Learning and Deep Learning: A Review of Methods and Applications. World Inf. Technol. Eng. J. 2023, 10, 3897–3904. [Google Scholar]
  51. Krichen, M. Convolutional neural networks: A survey. Computers 2023, 12, 151. [Google Scholar] [CrossRef]
  52. Murphy, J. An Overview of Convolutional Neural Network Architectures for Deep Learning; Microway Inc.: Plymouth, MA, USA, 2016; pp. 1–22. [Google Scholar]
  53. Shah, A.; Shah, M.; Pandya, A.; Sushra, R.; Sushra, R.; Mehta, M.; Patel, K. A comprehensive study on skin cancer detection using artificial neural network (ANN) and convolutional neural network (CNN). Clin. eHealth 2023, 6, 76–84. [Google Scholar] [CrossRef]
  54. Salehi, A.W.; Khan, S.; Gupta, G.; Alabduallah, B.I.; Almjally, A.; Alsolai, H.; Mellit, A. A Study of CNN and Transfer Learning in Medical Imaging: Advantages, Challenges, Future Scope. Sustainability 2023, 15, 5930. [Google Scholar] [CrossRef]
  55. Indira, D.N.V.S.L.S.; Goddu, J.; Indraja, B.; Challa, V.M.L.; Manasa, B. A review on fruit recognition and feature evaluation using CNN. Mater. Today Proc. 2023, 80, 3438–3443. [Google Scholar] [CrossRef]
  56. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  57. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Fei-Fei, L. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  58. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  60. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  61. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  62. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  63. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  64. Pinaya, W.H.L.; Vieira, S.; Garcia-Dias, R.; Mechelli, A. Autoencoders. In Machine Learning; Academic Press: London, UK, 2020; pp. 193–208. [Google Scholar]
  65. Sewak, M.; Sahay, S.K.; Rathore, H. An overview of deep learning architecture of deep neural networks and autoencoders. J. Comput. Theor. Nanosci. 2020, 17, 182–188. [Google Scholar] [CrossRef]
  66. Bajaj, K.; Singh, D.K.; Ansari, M.A. Autoencoders based deep learner for image denoising. Procedia Comput. Sci. 2020, 171, 1535–1541. [Google Scholar] [CrossRef]
  67. Chen, S.; Guo, W. Auto-Encoders in Deep Learning-A Review with New Perspectives. Mathematics 2023, 11, 1777. [Google Scholar] [CrossRef]
  68. Sebai, D.; Shah, A.U. Semantic-oriented learning-based image compression by Only-Train-Once quantized autoencoders. Signal Image Video Process. 2023, 17, 285–293. [Google Scholar] [CrossRef]
  69. Zhao, F.; Feng, J.; Zhao, J.; Yang, W.; Yan, S. Robust LSTM-autoencoders for face de-occlusion in the wild. IEEE Trans. Image Process. 2017, 27, 778–790. [Google Scholar] [CrossRef]
  70. Cheng, L.; Wang, J.; Gong, Y.; Hou, Q. Robust deep auto-encoder for occluded face recognition. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1099–1102. [Google Scholar]
  71. Zhang, J.; Kan, M.; Shan, S.; Chen, X. Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3428–3437. [Google Scholar]
  72. Sharma, S.; Kumar, V. 3D landmark-based face restoration for recognition using variational autoencoder and triplet loss. IET Biom. 2021, 10, 87–98. [Google Scholar] [CrossRef]
  73. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  74. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
  75. Chen, C.; Kurnosov, I.; Ma, G.; Weichen, Y.; Ablameyko, S. Masked Face Recognition Using Generative Adversarial Networks by Restoring the Face Closed Part. Opt. Mem. Neural Netw. 2023, 32, 1–13. [Google Scholar] [CrossRef]
  76. Mahmoud, M.; Kang, H.S. Ganmasker: A two-stage generative adversarial network for high-quality face mask removal. Sensors 2023, 23, 7094. [Google Scholar] [CrossRef] [PubMed]
  77. Din, N.U.; Javed, K.; Bae, S.; Yi, J. A novel GAN-based network for unmasking of masked face. IEEE Access 2020, 8, 44276–44287. [Google Scholar] [CrossRef]
  78. Li, X.; Shao, C.; Zhou, Y.; Huang, L. Face mask removal based on generative adversarial network and texture network. In Proceedings of the 2021 4th International Conference on Robotics, Control and Automation Engineering (RCAE), Wuhan, China, 4–6 November 2021; pp. 86–89. [Google Scholar]
  79. Hua, Y.; Guo, J.; Zhao, H. Deep belief networks and deep learning. In Proceedings of the 2015 International Conference on Intelligent Computing and Internet of Things, Harbin, China, 17–18 January 2015; pp. 1–4. [Google Scholar]
  80. Zhang, N.; Ding, S.; Zhang, J.; Xue, Y. An overview on restricted Boltzmann machines. Neurocomputing 2018, 275, 1186–1199. [Google Scholar] [CrossRef]
  81. Chu, J.L.; Krzyźak, A. The recognition of partially occluded objects with support vector machines, convolutional neural networks and deep belief networks. J. Artif. Intell. Soft Comput. Res. 2014, 4, 5–19. [Google Scholar] [CrossRef]
  82. Naskath, J.; Sivakamasundari, G.; Begum, A.A.S. A study on different deep learning algorithms used in deep neural nets: MLP, SOM and DBN. Wirel. Pers. Commun. 2023, 128, 2913–2936. [Google Scholar] [CrossRef]
  83. Kong, X.; Zhang, X. Understanding masked image modeling via learning occlusion invariant feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6241–6251. [Google Scholar]
  84. Alzu’bi, A.; Albalas, F.; Al-Hadhrami, T.; Younis, L.B.; Bashayreh, A. Masked face recognition using deep learning: A review. Electronics 2021, 10, 2666. [Google Scholar] [CrossRef]
  85. Rasti, B.; Hong, D.; Hang, R.; Ghamisi, P.; Kang, X.; Chanussot, J.; Benediktsson, J.A. Feature extraction for hyperspectral imagery: The evolution from shallow to deep: Overview and toolbox. IEEE Geosci. Remote Sens. Mag. 2020, 8, 60–88. [Google Scholar] [CrossRef]
  86. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  87. Mita, T.; Kaneko, T.; Hori, O. Joint haar-like features for face detection. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05), Washington, DC, USA, 17–20 October 2005; Volume 2, pp. 1619–1626. [Google Scholar]
  88. Babu, K.N.; Manne, S. An Automatic Student Attendance Monitoring System Using an Integrated HAAR Cascade with CNN for Face Recognition with Mask. Trait. Signal 2023, 40, 743. [Google Scholar] [CrossRef]
  89. Oztel, I.; Yolcu Oztel, G.; Akgun, D. A hybrid LBP-DCNN based feature extraction method in YOLO: An application for masked face and social distance detection. Multimed. Tools Appl. 2023, 82, 1565–1583. [Google Scholar] [CrossRef] [PubMed]
  90. Chong, W.J.L.; Chong, S.C.; Ong, T.S. Masked Face Recognition Using Histogram-Based Recurrent Neural Network. J. Imaging 2023, 9, 38. [Google Scholar] [CrossRef] [PubMed]
  91. Şengür, A.; Akhtar, Z.; Akbulut, Y.; Ekici, S.; Budak, Ü. Deep feature extraction for face liveness detection. In Proceedings of the 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 28–30 September 2018; pp. 1–4. [Google Scholar]
  92. Lin, C.H.; Wang, Z.H.; Jong, G.J. A de-identification face recognition using extracted thermal features based on deep learning. IEEE Sens. J. 2020, 20, 9510–9517. [Google Scholar] [CrossRef]
  93. Li, X.; Niu, H. Feature extraction based on deep-convolutional neural network for face recognition. Concurr. Comput. Pract. Exp. 2020, 32, 1. [Google Scholar] [CrossRef]
  94. Wang, H.; Hu, J.; Deng, W. Face feature extraction: A complete review. IEEE Access 2017, 6, 6001–6039. [Google Scholar] [CrossRef]
  95. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
  96. Jin, X.; Lai, Z.; Jin, Z. Learning dynamic relationships for facial expression recognition based on graph convolutional network. IEEE Trans. Image Process. 2021, 30, 7143–7155. [Google Scholar] [CrossRef]
  97. Balaji, S.; Balamurugan, B.; Kumar, T.A.; Rajmohan, R.; Kumar, P.P. A brief survey on AI-based face mask detection system for public places. Ir. Interdiscip. J. Sci. Res. 2021, 5, 108–117. [Google Scholar]
  98. Chu, P.; Li, Z.; Lammers, K.; Lu, R.; Liu, X. Deep learning-based apple detection using a suppression mask R-CNN. Pattern Recognit. Lett. 2021, 147, 206–211. [Google Scholar] [CrossRef]
  99. Qin, B.; Li, D. Identifying facemask-wearing condition using image super-resolution with classification network to prevent COVID-19. Sensors 2020, 20, 5236. [Google Scholar] [CrossRef]
  100. Tomás, J.; Rego, A.; Viciano-Tudela, S.; Lloret, J. Incorrect facemask-wearing detection using convolutional neural networks with transfer learning. Healthcare 2021, 9, 1050. [Google Scholar] [CrossRef] [PubMed]
  101. Ristea, N.C.; Ionescu, R.T. Are you wearing a mask? Improving mask detection from speech using augmentation by cycle-consistent GANs. arXiv 2020, arXiv:2006.10147. [Google Scholar]
  102. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 2021, 167, 108288. [Google Scholar] [CrossRef]
  103. Suganthalakshmi, R.; Hafeeza, A.; Abinaya, P.; Devi, A.G. COVID-19 facemask detection with deep learning and computer vision. Int. J. Eng. Res. Technol. 2021, 9, 73–75. [Google Scholar]
  104. Hussain, A.; Hosseinimanesh, G.; Naeimabadi, S.; Al Kayed, N.; Alam, R. WearMask in COVID-19: Identification of Wearing Facemask Based on Using CNN Model and Pre-trained CNN Models. In Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 3; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 588–601. [Google Scholar]
  105. Inamdar, M.; Mehendale, N. Real-time face mask identification using facemasknet deep learning network. SSRN 2020, 3663305. [Google Scholar] [CrossRef]
  106. Farman, H.; Khan, T.; Khan, Z.; Habib, S.; Islam, M.; Ammar, A. Real-time face mask detection to ensure COVID-19 precautionary measures in the developing countries. Appl. Sci. 2022, 12, 3879. [Google Scholar] [CrossRef]
  107. Sheikh, B.U.H.; Zafar, A. RRFMDS: Rapid Real-Time Face Mask Detection System for Effective COVID-19 Monitoring. SN Comput. Sci. 2023, 4, 288. [Google Scholar] [CrossRef]
  108. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  109. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  110. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  111. Zhang, J.; Han, F.; Chun, Y.; Chen, W. A novel detection framework about conditions of wearing face mask for helping control the spread of COVID-19. IEEE Access 2021, 9, 42975–42984. [Google Scholar] [CrossRef] [PubMed]
  112. Luo, F.; Zhang, Y.; Xu, L.; Zhang, Z.; Li, M.; Zhang, W. Mask wearing detection algorithm based on improved YOLOv7. Meas. Control 2024, 57, 751–762. [Google Scholar] [CrossRef]
  113. Samreen, S.; Arpana, C.; Spandana, D.; Sheetal, D.; Sandhya, D. Real-Time Face Mask Detection System for COVID-19 Applicants. Turk. J. Comput. Math. Educ. 2023, 14, 1–14. [Google Scholar]
  114. Rahman, M.H.; Jannat, M.K.A.; Islam, M.S.; Grossi, G.; Bursic, S.; Aktaruzzaman, M. Real-time face mask position recognition system based on MobileNet model. Smart Health 2023, 28, 100382. [Google Scholar] [CrossRef]
  115. Cao, R.; Mo, W.; Zhang, W. MFMDet: Multi-scale face mask detection using improved Cascade R-CNN. J. Supercomput. 2024, 80, 4914–4942. [Google Scholar] [CrossRef]
  116. Shetty, R.R.; Fritz, M.; Schiele, B. Adversarial scene editing: Automatic object removal from weak supervision. Adv. Neural Inf. Process. Syst. 2018, 31, 7717–7727. [Google Scholar]
  117. Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3911–3919. [Google Scholar]
  118. Khan, M.K.J.; Ud Din, N.; Bae, S.; Yi, J. Interactive removal of microphone object in facial images. Electronics 2019, 8, 1115. [Google Scholar] [CrossRef]
  119. Boutros, F.; Damer, N.; Kirchbuchner, F.; Kuijper, A. Self-restrained triplet loss for accurate masked face recognition. Pattern Recognit. 2022, 124, 108473. [Google Scholar] [CrossRef]
  120. Zhu, J.; Guo, Q.; Juefei-Xu, F.; Huang, Y.; Liu, Y.; Pu, G. Masked faces with face masks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 360–377. [Google Scholar]
  121. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5265–5274. [Google Scholar]
  122. Din, N.U.; Javed, K.; Bae, S.; Yi, J. Effective removal of user-selected foreground object from facial images using a novel GAN-based network. IEEE Access 2020, 8, 109648–109661. [Google Scholar] [CrossRef]
  123. Afzal, H.R.; Luo, S.; Afzal, M.K.; Chaudhary, G.; Khari, M.; Kumar, S.A. 3D face reconstruction from single 2D image using distinctive features. IEEE Access 2020, 8, 180681–180689. [Google Scholar] [CrossRef]
  124. Qiu, H.; Gong, D.; Li, Z.; Liu, W.; Tao, D. End2end occluded face recognition by masking corrupted features. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6939–6952. [Google Scholar] [CrossRef] [PubMed]
  125. Wang, Q.; Guo, G. DSA-Face: Diverse and sparse attentions for face recognition robust to pose variation and occlusion. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4534–4543. [Google Scholar] [CrossRef]
  126. Biswas, R.; González-Castro, V.; Fidalgo, E.; Alegre, E. A new perceptual hashing method for verification and identity classification of occluded faces. Image Vis. Comput. 2021, 113, 104245. [Google Scholar] [CrossRef]
  127. Neto, P.C.; Pinto, J.R.; Boutros, F.; Damer, N.; Sequeira, A.F.; Cardoso, J.S. Beyond masks: On the generalization of masked face recognition models to occluded face recognition. IEEE Access 2022, 10, 86222–86233. [Google Scholar] [CrossRef]
  128. Alsaedi, N.H.; Jaha, E.S. Dynamic Feature Subset Selection for Occluded Face Recognition. Intell. Autom. Soft Comput. 2022, 31, 407. [Google Scholar] [CrossRef]
  129. Albalas, F.; Alzu’bi, A.; Alguzo, A.; Al-Hadhrami, T.; Othman, A. Learning discriminant spatial features with deep graph-based convolutions for occluded face detection. IEEE Access 2022, 10, 35162–35171. [Google Scholar] [CrossRef]
  130. Lokku, G.; Reddy, G.H.; Prasad, M.G. OPFaceNet: OPtimized Face Recognition Network for noise and occlusion affected face images using hyperparameters tuned Convolutional Neural Network. Appl. Soft Comput. 2022, 117, 108365. [Google Scholar] [CrossRef]
  131. Georgescu, M.I.; Duţǎ, G.E.; Ionescu, R.T. Teacher-student training and triplet loss to reduce the effect of drastic face occlusion: Application to emotion recognition, gender identification and age estimation. Mach. Vis. Appl. 2022, 33, 12. [Google Scholar] [CrossRef]
  132. Polisetty, N.K.; Sivaprakasam, T.; Sreeram, I. An efficient deep learning framework for occlusion face prediction system. Knowl. Inf. Syst. 2023, 65, 5043–5063. [Google Scholar] [CrossRef]
  133. Wang, D.; Li, R. Enhancing accuracy of face recognition in occluded scenarios with OAM-Net. IEEE Access 2023, 11, 117297–117307. [Google Scholar] [CrossRef]
  134. Li, Y.; Liu, H.; Liang, J.; Jiang, D. Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Appl. Sci. 2025, 15, 5139. [Google Scholar] [CrossRef]
  135. Li, H.; Zhang, Y.; Wang, W.; Zhang, S.; Zhang, S. Recovery-Based Occluded Face Recognition by Identity-Guided Inpainting. Sensors 2024, 24, 394. [Google Scholar] [CrossRef] [PubMed]
  136. Zhang, Z.; Han, S.; Liu, D.; Ming, D. Focus and imagine: Occlusion suppression and repairing transformer for occluded person re-identification. Neurocomputing 2024, 127442. [Google Scholar] [CrossRef]
  137. Maharani, D.A.; Machbub, C.; Rusmin, P.H.; Yulianti, L. Improving the capability of real-time face masked recognition using cosine distance. In Proceedings of the 2020 6th International Conference on Interactive Digital Media (ICIDM), Virtual, 14–15 December 2020; pp. 1–6. [Google Scholar]
  138. Golwalkar, R.; Mehendale, N. Masked-face recognition using deep metric learning and FaceMaskNet-21. Appl. Intell. 2022, 52, 13268–13279. [Google Scholar] [CrossRef]
  139. Li, C.; Ge, S.; Zhang, D.; Li, J. Look through masks: Towards masked face recognition with de-occlusion distillation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3016–3024. [Google Scholar]
  140. Ding, F.; Peng, P.; Huang, Y.; Geng, M.; Tian, Y. Masked face recognition with latent part detection. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2281–2289. [Google Scholar]
  141. Hong, Q.; Wang, Z.; He, Z.; Wang, N.; Tian, X.; Lu, T. Masked face recognition with identification association. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; pp. 731–735. [Google Scholar]
  142. Montero, D.; Nieto, M.; Leskovsky, P.; Aginako, N. Boosting masked face recognition with multi-task arcface. In Proceedings of the 2022 16th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Dijon, France, 19–21 October 2022; pp. 184–189. [Google Scholar]
  143. Hariri, W. Efficient masked face recognition method during the COVID-19 pandemic. Signal Image Video Process. 2022, 16, 605–612. [Google Scholar] [CrossRef] [PubMed]
  144. Mandal, B.; Okeukwu, A.; Theis, Y. Masked face recognition using ResNet-50. arXiv 2021, arXiv:2104.08997. [Google Scholar]
  145. Du, H.; Shi, H.; Liu, Y.; Zeng, D.; Mei, T. Towards NIR-VIS masked face recognition. IEEE Signal Process. Lett. 2021, 28, 768–772. [Google Scholar] [CrossRef]
  146. Wu, G. Masked face recognition algorithm for a contactless distribution cabinet. Math. Probl. Eng. 2021, 2021, 5591020. [Google Scholar] [CrossRef]
  147. Deng, H.; Feng, Z.; Qian, G.; Lv, X.; Li, H.; Li, G. MFCosface: A masked-face recognition algorithm based on large margin cosine loss. Appl. Sci. 2021, 11, 7310. [Google Scholar] [CrossRef]
  148. Li, Y.; Guo, K.; Lu, Y.; Liu, L. Cropping and attention based approach for masked face recognition. Appl. Intell. 2021, 51, 3012–3025. [Google Scholar] [CrossRef] [PubMed]
  149. Zhang, M.; Liu, R.; Deguchi, D.; Murase, H. Masked face recognition with mask transfer and self-attention under the COVID-19 pandemic. IEEE Access 2022, 10, 20527–20538. [Google Scholar] [CrossRef]
  150. Ullah, N.; Javed, A.; Ghazanfar, M.A.; Alsufyani, A.; Bourouis, S. A novel DeepMaskNet model for face mask detection and masked facial recognition. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 9905–9914. [Google Scholar] [CrossRef] [PubMed]
  151. Vu, H.N.; Nguyen, M.H.; Pham, C. Masked face recognition with convolutional neural networks and local binary patterns. Appl. Intell. 2022, 52, 5497–5512. [Google Scholar] [CrossRef]
  152. Kocacinar, B.; Tas, B.; Akbulut, F.P.; Catal, C.; Mishra, D. A real-time CNN-based lightweight mobile masked face recognition system. IEEE Access 2022, 10, 63496–63507. [Google Scholar] [CrossRef]
  153. Pann, V.; Lee, H.J. Effective attention-based mechanism for masked face recognition. Appl. Sci. 2022, 12, 5590. [Google Scholar] [CrossRef]
  154. Huang, B.; Wang, Z.; Wang, G.; Jiang, K.; Han, Z.; Lu, T.; Liang, C. PLFace: Progressive learning for face recognition with mask bias. Pattern Recognit. 2023, 135, 109142. [Google Scholar] [CrossRef] [PubMed]
  155. Wang, Y.; Li, Y.; Zou, H. Masked face recognition system based on attention mechanism. Information 2023, 14, 87. [Google Scholar] [CrossRef]
  156. Huang, B.; Wang, Z.; Wang, G.; Han, Z.; Jiang, K. Local eyebrow feature attention network for masked face recognition. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–19. [Google Scholar] [CrossRef]
  157. Mishra, N.K.; Kumar, S.; Singh, S.K. MmLwThV framework: A masked face periocular recognition system using thermo-visible fusion. Appl. Intell. 2023, 53, 2471–2487. [Google Scholar] [CrossRef]
  158. Zhong, M.; Xiong, W.; Li, D.; Chen, K.; Zhang, L. MaskDUF: Data uncertainty learning in masked face recognition with mask uncertainty fluctuation. Expert Syst. Appl. 2024, 238, 121995. [Google Scholar] [CrossRef]
  159. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  160. Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. Agedb: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–59. [Google Scholar]
  161. Sengupta, S.; Chen, J.C.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [Google Scholar]
  162. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  163. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part III 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 87–102. [Google Scholar]
  164. Damer, N.; Grebe, J.H.; Chen, C.; Boutros, F.; Kirchbuchner, F.; Kuijper, A. The effect of wearing a mask on face recognition performance: An exploratory study. In Proceedings of the 2020 International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 16–18 September 2020; pp. 1–6. [Google Scholar]
  165. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923. [Google Scholar]
  166. Georghiades, A.S.; Belhumeur, P.N.; Kriegman, D.J. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 643–660. [Google Scholar] [CrossRef]
  167. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17–20 October 2008; pp. 1–9. [Google Scholar]
Figure 1. A flowchart illustrating the review methodology employed in this study. The process includes database selection, keyword-based search, initial screening, application of inclusion and exclusion criteria, and the final classification of selected papers based on architecture types and key MFR themes.
Figure 2. An illustration of MFR framework phases.
Figure 3. Overview of the GAN-based masked face restoration and recognition framework. The left part illustrates the generative mask-removal pipeline, which includes a map generator, editing module, and dual discriminators. The right part shows the Embedding Unmasking Model (EUM), which enhances face embeddings using a triplet loss strategy to improve recognition performance. Adapted from [77,119].
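To make the right-hand side of Figure 3 concrete, the sketch below shows a plain triplet objective of the kind used to train an embedding-unmasking model: the embedding of a masked face (anchor) is pulled toward the unmasked embedding of the same identity (positive) and pushed away from another identity (negative). This is a generic PyTorch illustration, not the self-restrained triplet loss of [119]; the cosine distance and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet objective for embedding unmasking: pull the masked-face
    embedding (anchor) toward the unmasked embedding of the same identity
    (positive) and away from a different identity (negative)."""
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)  # distance to same identity
    d_an = 1.0 - F.cosine_similarity(anchor, negative)  # distance to other identity
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

# Toy usage with random 512-D embeddings (batch of 8)
a, p, n = (torch.randn(8, 512) for _ in range(3))
print(triplet_loss(a, p, n))
```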
Table 1. An overview of convolutional neural network models.
Model | Summary | Trainable Parameters | Convolutional Layers
AlexNet | Introduced in 2012, one of the pioneering deep convolutional neural networks for image classification. Consists of eight layers: five convolutional and three fully connected. Employed techniques such as ReLU activation, dropout, and local response normalization. Won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, sparking the resurgence of interest in DL. | 62 million | 5
VGGNet | Developed by the Visual Geometry Group (VGG) at the University of Oxford. Known for its simplicity, stacking convolutional layers with 3 × 3 filters and max-pooling layers. Offers several configurations (e.g., VGG16, VGG19) with varying depths and parameter counts. Achieves strong performance on image classification tasks but is computationally expensive due to its large number of parameters. | 138–143 million | 13–16
ResNet | Introduced residual connections to address the vanishing gradient problem in very deep networks, enabling the training of networks with hundreds of layers by learning residual mappings. Significantly improves accuracy and convergence speed by mitigating degradation in deeper networks. Configurations include ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. | 25–44 million | 48–99
GoogLeNet | Developed by Google researchers; introduced the inception module, whose parallel convolutional pathways with different filter sizes capture features at multiple scales while reducing computational complexity. Achieves high accuracy with fewer parameters than traditional architectures. Uses global average pooling and auxiliary classifiers to aid convergence and regularization during training. | 4.2 million | 28
MobileNet | Designed for mobile and embedded devices with limited computational resources. Uses depthwise separable convolutions to reduce model size and computational cost while preserving performance. Offers different model sizes and complexities to trade accuracy against efficiency, making it well suited to image classification, object detection, and semantic segmentation on resource-constrained devices. | 3.5–13 million | 28
Xception | An extension of the Inception architecture that replaces standard convolutional layers with depthwise separable convolutions, decoupling spatial and channel-wise convolutions to improve computational efficiency. Achieves state-of-the-art image classification performance with significantly fewer parameters than previous architectures; well suited to applications where resources are limited. | 7–56 million | 22
DenseNet | Densely connected convolutional network in which each layer receives the feature maps of all preceding layers, promoting feature reuse and efficient learning. | 7.98 million (DenseNet-121) | 121
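The backbones in Table 1 typically enter an MFR pipeline through transfer learning: the ImageNet-pretrained convolutional stack is kept as a frozen feature extractor and only a small head or metric-learning loss is trained on masked-face data. Below is a minimal sketch using torchvision; the choice of ResNet-50, the fully frozen backbone, and the 224 × 224 input are illustrative assumptions rather than a prescription from any surveyed paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained backbone from Table 1 as a frozen
# feature extractor for masked-face embeddings (illustrative setup).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # drop the 1000-way ImageNet classifier
for p in backbone.parameters():
    p.requires_grad = False          # freeze the convolutional features

backbone.eval()
with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)  # one normalized face crop
    emb = backbone(x)                # 2048-D feature vector
print(emb.shape)                     # torch.Size([1, 2048])
```

Swapping in MobileNet or DenseNet from Table 1 changes only the constructor and the name of the final classifier attribute.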
Table 2. Comparative analysis of Masked Face Recognition (MFR) methods.
Ref. | Method/Model | Core Techniques | Dataset/Type | Accuracy/Performance | Remarks
[137] | Haar + MobileNet | Haar cascade, cosine distance, transfer learning | Custom | 100%, 82.20%; 4–22 FPS | Real-time, high accuracy
[138] | FaceMaskNet-21 | Deep metric learning | Custom (real-time CCTV) | 88.92%, <10 ms | Real-time public surveillance
[139] | De-Occlusion Distillation | GAN, knowledge distillation | Not specified | Not reported | Face completion + distillation
[140] | LPD | Latent part detection, data augmentation | MFV, MFI (real + synthetic) | Not reported | High generalization performance
[141] | Re-ID Association | Person re-ID + face quality ranking | Surveillance-like scenes | Not reported | Matches masked to unmasked appearances
[142] | MTArcFace | ArcFace + multitask mask detection | Augmented ArcFace | 99.78% (mask detection) | Joint FR and mask classification
[143] | CNN + MLP | VGG16, AlexNet, ResNet50, BoF | Eyes/forehead focus | Not reported | Occlusion removal and pooling
[144] | ResNet-50 | DL training with masked faces | Not specified | Not reported | Practical for security systems
[145] | Semi-Siamese + 3D | Mutual info maximization, 3D synthesis | NIR images | Not reported | Domain-invariant feature learning
[146] | Attention + Dictionary | Dictionary learning, dilated conv, attention | RMFRD, SMFRD | Not reported | Preserves resolution, boosts accuracy
[147] | MFCosface | Large margin cosine loss, Att-Inception | Synthetic masked faces | Not reported | Focuses on unmasked areas
[148] | Cropping + CBAM | Attention to eye regions, cropping | Custom | Not reported | Cross-condition learning (mask/no mask)
[149] | AMaskNet | Mask transfer, attention-aware model | Augmented data | Not reported | End-to-end + mask-aware inference
[150] | DeepMaskNet | Face mask detection + MFR | MDMFR | Not reported | Unified benchmark dataset and model
[151] | RetinaFace + LBP | LBP + DL feature fusion | COMASK20, Essex | 87% (COMASK20), 98% (Essex) | Hybrid handcrafted + DL method
[152] | Ensemble MobileNet | Lightweight CNN, mobile deployment | 1849 samples, 12 subjects | 90.4% | Real-time FR mobile app
[153] | CBAM + ArcFace | Attention module + ArcFace loss | LFW, AgeDB-30, CFP-FP, MFR2 | Not reported | High precision on eye-region features
[30] | MobileNetV2 + TL | VGG16/19, ResNet variants, TL | Custom datasets | Up to 99.82% | Transfer learning performance analysis
[154] | PLFace | Progressive training, margin loss | ArcFace-based | Not reported | Adaptive masked/unmasked training
[155] | ConvNeXt-T + Attention | Lightweight attention backbones | Custom masked dataset | 99.76% (masked), 99.48% (combined) | Robust to lighting variation
[156] | Eyebrow GCN | Eyebrow pooling, GCN fusion | RMFRD, SMFRD | Not reported | Leverages symmetry and component hierarchy
[157] | MmLwThV | Thermo-visible fusion, ensemble classifier | Visible + IR data | Not reported | Mobile-ready, dual-modal input
[158] | MaskDUF | Uncertainty modeling, H-KLD, MUF | Custom + standard datasets | +1.33–13.28% over baselines | Learns sample recognizability distribution
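Several methods in Table 2 build on large-margin cosine objectives: CosFace [121] and MFCosface [147] subtract a fixed margin m from the cosine similarity of the true class before a scaled softmax, which forces embeddings of the same identity, masked or not, into a tighter angular region. The sketch below shows the general form; the defaults s = 30 and m = 0.35 are common choices, not values taken from the surveyed implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginCosineLoss(nn.Module):
    """Sketch of a CosFace-style large-margin cosine objective, as used by
    several MFR methods in Table 2; s scales logits, m is the margin."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.W))
        # Subtract the margin only on the true-class logit
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)

# Toy usage: 4 embeddings of dimension 128, 10 identities
loss_fn = LargeMarginCosineLoss(128, 10)
loss = loss_fn(torch.randn(4, 128), torch.randint(0, 10, (4,)))
print(loss)
```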
Table 3. Summary of MFR and related face recognition datasets.
Dataset Name | Type | Size | Masking Type | Notes/Significance
Synthetic CelebA [77] | Synthetic | 10,000 images | 50 mask types | Based on CelebA; mask types vary by size, shape, and color
Synthetic Face-Occluded [122] | Synthetic | Not reported | Occlusions (hands, masks, etc.) | Based on CelebA-HQ; includes 5 object types with 40+ variations
MFSR [43] | Real-world | 21,357 images | Real | Segmentation + recognition; 1004 identities; manual mask annotations
MFDD [5] | Real-world | 24,771 images | Real | Focused on face detection with masks
RMFRD [5] | Real-world | 95,000 images | Real | 5000 masked + 90,000 unmasked images of 525 individuals
SMFRD [5] | Synthetic | 500,000 images | Synthetic | 10,000 individuals; improves training diversity
EMFR [164] | Real-world | 4320 images | Real | Captured in sessions over several days; includes reference and probe sets
AgeDB [160] | Real-world | 16,488 images | No mask | Faces at different ages; shows impact of aging on recognition
CFP [161] | Real-world | 7000 pairs | No mask | Frontal/profile views; evaluates pose variation
MS1MV2 [162]/MS1MV2-Masked [119] | Real/synthetic | 5.8 M images | Synthetic masked version exists | Widely used large-scale FR dataset; synthetic masking adds robustness
WebFace [165] | Real-world | 500,000 images | No mask | Faces from IMDb; identity-level diversity
Extended Yale B [166] | Real-world | 16,128 images | No mask | Pose + illumination variations
LFW [167]/LFW-SM [39] | Real/synthetic | 13,233 (LFW)/50,000 (LFW-SM) images | Simulated masks | Classical dataset; LFW-SM adds mask simulation
VGGFace2/VGGFace2_m [147] | Real/synthetic | 3.3 M+ images | Simulated masks | High intra-class variation; VGGFace2_m adds masks for MFR
CASIA-FaceV5_m [147] | Real/synthetic | 2500 images | Simulated masks | Asian faces; upgraded with masking
MFV/MFI [140] | Real-world | 400 pairs/4916 images | Real | Designed specifically for MFR verification and identification
3D Landmark MFR dataset [123] | Real/synthetic | 200 images | Real/simulated | Based on 3D Morphable Model; useful for 3D MFR evaluation
Masked Face Database (MFD) | Real-world | 990 images | Real | 45 individuals, gender balanced
Masked Faces in the Wild (MFW) [39] | Real-world | 3000 images | Real | 300 people with 5 masked + 5 unmasked images each
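Many synthetic entries in Table 3 (e.g., SMFRD [5], LFW-SM [39], VGGFace2_m [147]) are built by compositing mask templates onto unmasked images. The sketch below conveys the idea with a crude fixed polygon drawn over an aligned crop; real pipelines instead fit textured mask templates to detected facial landmarks, and every coordinate and color here is an illustrative assumption.

```python
from PIL import Image, ImageDraw

def add_synthetic_mask(face: Image.Image) -> Image.Image:
    """Crude illustration of synthetic masking: draw an opaque 'mask'
    polygon over the lower part of an aligned face crop. Production
    pipelines fit mask templates to detected facial landmarks."""
    w, h = face.size
    out = face.copy()
    draw = ImageDraw.Draw(out)
    polygon = [(int(0.12 * w), int(0.55 * h)),   # left cheek
               (int(0.88 * w), int(0.55 * h)),   # right cheek
               (int(0.78 * w), int(0.95 * h)),   # right jaw
               (int(0.22 * w), int(0.95 * h))]   # left jaw
    draw.polygon(polygon, fill=(70, 110, 180))   # surgical-mask blue
    return out

# Example (hypothetical file name):
# masked = add_synthetic_mask(Image.open("aligned_face.jpg").convert("RGB"))
```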