1. Introduction
Facial recognition is important for security, user authentication, and human–computer interaction. Under uncontrolled conditions, identifying a person from their face is challenging because lighting varies at capture time (e.g., in angle and brightness). Illumination is a common source of poor performance: the direction and intensity of light can significantly alter the appearance of a person’s face and reduce the separability of its features from those of other people in facial-recognition algorithms.
In addition to face recognition, recent advances in related visual recognition areas focus on methods that enhance illumination and domain robustness. For instance, spatial-channel collaborative multi-scale graph interaction deep transfer learning has been successfully used for unsupervised fault detection in rotating machinery, showing strong ability to transfer features across domains [
1]. Similarly, adaptive fused domain-cycling variational Generative Adversarial Networks (GANs) have addressed domain shift challenges in aerial image segmentation [
2]. Additionally, multimodal deep learning techniques have been developed for arc detection in railway systems, providing robust results even with limited training data [
3].
The rapid expansion of digital media and surveillance technologies has made face recognition an integral component of contemporary security, authentication, and social interaction systems. Nonetheless, achieving reliable recognition in unconstrained settings remains a formidable challenge due to variations in illumination, pose, expression, and background clutter. Among these factors, illumination changes remain among the most consistent causes of recognition failure, as variations in light direction, intensity, and color temperature can considerably alter facial appearance, reducing the discriminative power of learned features [
4].
Although deep learning-based systems, such as FaceNet and ArcFace, have achieved substantially higher recognition accuracy than conventional handcrafted techniques (e.g., Discrete Cosine Transform (DCT), Singular Value Decomposition (SVD), Hidden Markov Model (HMM)), these methods lack explicit mechanisms to normalize illumination or improve robustness under illumination variations [
5]. This results in unstable face embeddings and poor performance under non-uniform lighting or demographic imbalance. Previous efforts to address these issues are largely hand-engineered, using preprocessing, color-space thresholding, and classification based on demographic descriptors. These approaches entail no novel normalization and introduce generalization bias [
6].
To address these shortcomings, we propose a unified deep learning framework that integrates illumination normalization, skin-aware segmentation, and margin-based metric learning into a single end-to-end pipeline [
7]. The framework includes RetinaFace for robust face detection and alignment, a U-Net-based skin segmentation module that extracts facial regions, and ResNet and Vision Transformer (ViT) backbones enhanced with ArcFace, AdaFace, and MagFace losses to enable discriminative, quality-aware embedding learning. Complementing these components, a dual illumination-handling technique that combines photometric augmentation during training with neural relighting during inference ensures consistent feature extraction across variable illumination conditions [
8].
Although deep learning models such as FaceNet, CosFace, and ArcFace have significantly improved accuracy over traditional handcrafted features, they still lack effective methods for handling illumination changes or background clutter. Additionally, many existing segmentation and normalization methods rely on demographic assumptions, which can inadvertently introduce bias into recognition. To address these issues, we introduce an end-to-end face recognition system that integrates illumination normalization, background bias reduction, skin-color-aware segmentation, and quality-based metric learning into a single, efficient framework. This paper presents three main advancements beyond simple component integration:
- (1) Mask-guided embedding modulation (Equation (2)), which employs the predicted skin mask to directly reweight convolutional features or ViT tokens during training.
- (2) A dual illumination strategy that combines photometric augmentation during training with neural relighting at inference, an approach not previously used in face recognition.
- (3) Joint spatial–quality margin learning that combines mask-guided features with quality-adaptive losses such as AdaFace and MagFace.
These innovations are evaluated on challenging datasets with pose and illumination variations, as well as on saturated benchmarks.
In addition to conventional recognition frameworks, recent studies have explored domain adaptation and adversarial learning strategies to improve robustness under distribution shift. Conditional adversarial transfer learning models aim to align feature distributions across source and target domains to mitigate domain bias. Multi-domain GANs have also been proposed to synthesize cross-domain variations and enhance generalization of representations. Furthermore, reinforcement learning-based adaptive systems dynamically adjust model parameters to changing environmental conditions. While these approaches primarily address domain discrepancy, they do not explicitly integrate illumination-aware spatial modulation within the face embedding learning process, which is the focus of this paper.
The contribution lies not in any individual component but in the end-to-end integration of illumination normalization, segmentation-guided masking, and quality-aware metric learning within a single differentiable pipeline. Unlike previous approaches that treat relighting and segmentation as preprocessing steps, our framework jointly optimizes these modules during training. The skin mask directly influences feature representations (Equation (2)), and the dual illumination strategy (
Section 3.2) establishes a unique synergy between training and inference. The inference pipeline is fully differentiable, but segmentation and relighting are pre-trained independently and then integrated.
Unlike prior papers that address illumination, segmentation, or margin learning separately, the proposed framework unifies these components within a differentiable inference architecture with multi-stage training. The remainder of the paper is organized as follows:
Section 2 reviews related work; Section 3 presents the proposed design; Section 4 presents experimental results and analysis; and Section 5 summarizes our findings and outlines future work.
2. Related Work
2.1. Illumination Handling in Face Recognition
Although traditional methods for normalizing illumination (e.g., DCT filtering, SVD decomposition, Independent Component Analysis (ICA)-based methods) offer some robustness in controlled settings, they generalize poorly to real-world lighting conditions. By contrast, recent neural network-based approaches that relight images or apply photometric augmentation are highly effective at reducing illumination variations, yet they have been used primarily as standalone preprocessing steps rather than as integrated components of the method [
9,
10,
11].
Recently, research has focused on learning-based illumination normalization. Neural relighting models [
12] and photometric data augmentation approaches show promise in generating illumination-invariant embeddings. However, most methods apply normalization as a preprocessing step rather than integrating it directly into embedding or recognition backbones. As a result, although the embeddings are generally less affected by illumination variations, some sensitivity remains, underscoring the need to incorporate end-to-end illumination handling within the face recognition model.
2.2. Skin Segmentation and Demographic-Prior-Free Face Analysis
Color-space thresholding methods, such as those based on the luminance–chrominance (YCbCr) and hue, saturation, value (HSV) color spaces, have limited robustness and often produce poor segmentations, particularly across different skin tones and lighting conditions. Even model-based classifiers, such as Gaussian or Bayesian methods [15, 16, 17], were unable to capture the full range of skin chrominance. Deep learning-based segmentation methods, including stacked autoencoders and U-Net models, improve segmentation reliability but often embed demographic assumptions. A demographic-prior-free segmentation approach that avoids race classification is needed to reduce bias [13, 14].
Deep learning-based methods have improved performance by learning adaptive features from large skin-pixel datasets. For example, ref. [
14] implemented stacked autoencoders to produce an illumination-robust segmentation method. In more recent studies, researchers built on the work of [
4] by using U-Net or Transformer-based architectures [
18,
19,
20], thereby improving generalizability under more complex lighting conditions. Nevertheless, these methods, along with others, often rely on race classification or demographic priors, resulting in bias and accessibility issues.
In this paper, we introduce a U-Net-based skin segmentation module designed to reduce background-induced bias. It isolates facial skin regions without relying on demographic assumptions. This approach sharpens the focus on facial features and produces skin representations that promote stable matching across diverse populations.
2.3. Deep Metric Learning and Quality-Aware Embedding
Margin-based softmax losses, such as SphereFace [
21], CosFace [
22], and ArcFace [
23], improve intra-class compactness and inter-class separation, while quality-aware variants (AdaFace, MagFace) adapt margins to image quality [
23]. However, these methods do not explicitly handle illumination or background interference.
Recent research, including AdaFace [23], MagFace [16], and approaches based on ViT backbones [24], has proposed quality-aware margin adaptation, enabling embeddings to adapt to varying image reliability. Despite these improvements in modeling image quality, such methods do not formally account for the effects of illumination and background clutter. They show promising generalization of the learned embedding representations, but they do not yet integrate illumination normalization or address other potential sources of bias.
Table 1 compares the proposed framework with representative prior methods.
Table 1 summarizes the methodological differences between the proposed SEM framework and key prior approaches. Margin-based methods such as ArcFace and AdaFace enhance the discriminability of embeddings but do not specifically address illumination variation or spatial bias. Illumination normalization techniques are typically applied as separate preprocessing steps, without integration with embedding supervision. Conversely, segmentation-based methods reduce background influence but are not directly linked to metric learning.
In contrast, the proposed framework integrates illumination normalization, skin-aware spatial modulation, and quality-adaptive margin learning into a single inference process. Rather than preprocessing alone, the segmentation mask directly shapes feature representations, enabling spatially guided embedding learning. This unified approach distinguishes itself from prior work and enhances robustness under challenging lighting and background conditions.
Existing methods fall into three categories: illumination normalization techniques, segmentation-based preprocessing, and margin-based metric learning. However, earlier illumination techniques are typically unsupervised, segmentation is rarely combined with metric-learning layers, and margin-based losses often lack spatial-bias suppression. To our knowledge, no existing work simultaneously optimizes illumination normalization, spatial skin masking, and quality-adaptive angular margins within a single recognition framework.
3. Model Architecture
The proposed model comprises five modules: (1) detection and alignment, (2) illumination robustness, (3) skin-aware segmentation, (4) metric learning, and (5) verification. Each module in the overall inference pipeline is fully differentiable. However, the segmentation and relighting modules are trained in a multi-stage manner and then jointly integrated into the recognition framework, enabling the model to learn adaptively from raw data without any handcrafted preprocessing. Unlike conventional pipelines, the proposed model incorporates adaptive feature modulation guided by image-specific cues, enabling condition-aware representation learning. The overall flow is shown in
Figure 1.
3.1. Face Detection and Alignment
Reliable localization is the foundation of robust face recognition. The input face image is denoted by $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ represent the image height and width, respectively. We employ RetinaFace, a single-stage detector that jointly predicts bounding boxes and five facial landmarks (eye centers, nose tip, and mouth corners). Landmark-based alignment normalizes pose and scale, ensuring geometrically consistent inputs for subsequent modules. Aligned faces are then resized to 112 × 112 resolution.
3.2. Photometric Augmentation
Changes in lighting conditions can lead to severe distortion in appearance. To reduce the impact of lighting changes, we implement a dual illumination-handling strategy:
Photometric augmentation (during training): It simulates extreme lighting variations (brightness, contrast, gamma, and color temperature).
Neural relighting (during inference): A lightweight neural relighting model maintains consistent luminance across facial regions.
This trains the embedding model to separate identity features from illumination changes.
The proposed model adjusts feature modulation based on image-dependent cues, enabling flexible reweighting of feature representations rather than uniform scaling. Although photometric augmentation is a common technique, combining it with inference-time relighting offers a complementary approach for training and inference that enhances robustness to illumination changes.
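As an illustration of the training-time half of this strategy, the following is a minimal NumPy sketch of photometric augmentation. It is not the implementation used in the experiments: the function name `photometric_augment`, the parameter ranges, and the order of operations are assumptions chosen for the example, and color-temperature shifts are omitted for brevity.

```python
import numpy as np

def photometric_augment(img, rng, brightness=0.3, contrast=0.3, gamma_range=(0.5, 2.0)):
    """Randomly perturb brightness, contrast, and gamma of an RGB image in [0, 1].

    All ranges are illustrative assumptions, not values from the paper.
    """
    out = img.astype(np.float64)
    # Brightness: add a random global offset.
    out = out + rng.uniform(-brightness, brightness)
    # Contrast: scale deviations from the mean intensity.
    mean = out.mean()
    out = (out - mean) * rng.uniform(1.0 - contrast, 1.0 + contrast) + mean
    # Gamma: non-linear tone curve; clip first so the power is well defined.
    out = np.clip(out, 0.0, 1.0) ** rng.uniform(*gamma_range)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
face = rng.random((112, 112, 3))        # stand-in for an aligned 112 x 112 face
aug = photometric_augment(face, rng)
```

Each call draws new parameters, so the same face is seen under a different simulated lighting condition at every training step.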
While the formulation might look like basic scaling, the framework actually involves image-dependent transformations via segmentation and relighting modules. These elements apply condition-aware changes to the input before feature extraction, setting this approach apart from simple uniform scaling.
The U-Net segmentation module creates a soft skin-probability mask $M(x, y)$. This mask is resized and multiplied element-wise with the intermediate feature maps $F(x, y)$ from the backbone (ResNet or ViT). The modulated features $F'(x, y) = \alpha\, M(x, y)\, F(x, y) + (1 - \alpha)\, F(x, y)$ are then passed to the next layers. For ViT backbones, patch-level mask scores $t_k$ are obtained by averaging pixel masks within each patch and are used to scale token embeddings before multi-head self-attention. This method helps reduce background interference while preserving important skin region details.
3.3. Skin Segmentation with U-Net
A U-Net model generates a continuous skin-probability mask. The mask is applied to the feature map or ViT tokens, suppressing non-skin regions. Segmentation is supervised with binary cross-entropy, Dice loss, total variation regularization, and an area-prior constraint.
The U-Net architecture comprises four encoder–decoder levels with channel dimensions of 64, 128, 256, and 512. Each block contains two 3 × 3 convolutional layers, followed by Batch Normalization and ReLU activation. Skip connections link the encoder and decoder stages. The final layer applies a 1 × 1 convolution followed by a sigmoid activation to produce a continuous skin probability mask at 112 × 112 resolution.
3.4. Neural Relighting Network
The relighting model is a lightweight five-layer convolutional network with residual connections. It predicts an illumination-adjusted RGB output using a combination of L1 reconstruction and structural similarity (SSIM) losses. The network contains approximately 1.2 M parameters and runs in real time.
The U-Net produces a soft mask $M$ indicating the likelihood of each pixel being skin:
$$M(x, y) = \sigma\!\left(\frac{w^{\top} f(x, y) + b}{\tau}\right) \tag{1}$$
where
$M(x, y)$ denotes the predicted skin probability at pixel $(x, y)$,
$\tau$: temperature parameter,
$\sigma$: the sigmoid function,
$f(x, y)$ denotes the feature activation from the decoder layer at position $(x, y)$,
$w$ and $b$: learnable weights and bias, with the bias applied before activation.
Dice and binary cross-entropy losses supervise segmentation, while total-variation and area-ratio regularizers promote smooth, realistic masks, as shown in
Figure 2.
The proposed mask-guided embedding mechanism is defined by Equation (2), where the segmentation mask $M$ highlights skin regions while suppressing irrelevant background. The balance factor $\alpha$ controls the mask’s influence on the feature map, allowing the network to focus on more reliable facial regions while retaining contextual cues.
$$F' = \alpha\, (\tilde{M} \odot F) + (1 - \alpha)\, F \tag{2}$$
where
$F'$: mask-weighted feature map,
$F$: original feature map from the backbone,
$\tilde{M}$: skin mask $M$ resized to match the spatial dimensions of $F$,
$\alpha$: a learnable gating parameter (initialized to 0.7).
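The modulation in Equation (2) can be sketched in a few lines of NumPy. This is an illustrative sketch only: `resize_nearest` and `modulate` are hypothetical helper names, nearest-neighbour resizing stands in for whatever interpolation the model uses, and $\alpha$ is treated as a fixed scalar rather than a learned parameter.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask (assumed stand-in for the real resizing)."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows][:, cols]

def modulate(features, mask, alpha=0.7):
    """Apply F' = alpha * M * F + (1 - alpha) * F to features of shape (C, H, W)."""
    m = resize_nearest(mask, features.shape[1], features.shape[2])
    # Broadcast the (H, W) mask over the channel axis.
    return alpha * m[None, :, :] * features + (1.0 - alpha) * features

features = np.ones((4, 7, 7))           # toy feature map
skin = np.zeros((112, 112))
skin[:, :56] = 1.0                      # left half is "skin"
out = modulate(features, skin)
```

With $\alpha = 0.7$, features under a mask value of 1 pass through unchanged, while features under a mask value of 0 are attenuated to 30% of their magnitude rather than removed, which is how background context is retained.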
For the Vision Transformer backbone, an average mask score is computed for each image patch and used to scale the corresponding token embedding before the Multi-Head Self-Attention layer. As a result, skin regions have a greater influence on the global representation than background regions, which contribute only limited context.
3.5. Intuitive Effect
By focusing on skin regions, the model learns to ignore confounding backgrounds and occlusions, thereby improving robustness to lighting variations and across skin tones. The segmentation and embedding modules work together: U-Net sharpens spatial attention, while the backbone learns high-level identity features within this attentional context.
3.5.1. Margin-Based Metric Learning
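The final embedding vector is L2-normalized and trained using three margin-based softmax losses (ArcFace, AdaFace, and MagFace), which impose angular constraints that tighten intra-class clusters while enforcing inter-class separability:
$$L_{\mathrm{Arc}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}} \tag{3}$$
where
$L_{\mathrm{Arc}}$: ArcFace loss,
$m$: additive margin,
$N$: batch size,
$s$: scaling factor,
$\theta_{y_i}$: the angle between the feature and its corresponding class center,
$\theta_j$: the angle between the embedding and class $j$.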
The ArcFace loss in Equation (3) uses an additive angular margin $m$ to improve inter-class separation in the embedding space. By penalizing small angular distances between classes, it promotes more discriminative facial representations that are less sensitive to illumination. AdaFace and MagFace extend this idea by adapting the margin to image quality, enabling stable embeddings from low-quality and poorly lit images.
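To make the role of the margin concrete, the sketch below computes the standard ArcFace loss for a batch in NumPy. It follows the published ArcFace formulation rather than any code from this work; the function name `arcface_loss` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=64.0, m=0.5):
    """Additive angular margin softmax loss.

    embeddings: (N, D) features, weights: (C, D) class centers, labels: (N,) ints.
    """
    # L2-normalize features and class centers so logits are cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                  # (N, C)
    theta = np.arccos(cos)
    idx = np.arange(len(labels))
    logits = s * cos
    # Add the margin m to the angle of the ground-truth class only.
    logits[idx, labels] = s * np.cos(theta[idx, labels] + m)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, labels].mean())

centers = np.eye(4)                     # four toy classes in four dimensions
feats = np.eye(4)                       # each sample sits on its class center
labels = np.arange(4)
loss_with_margin = arcface_loss(feats, centers, labels, s=10.0, m=0.5)
loss_no_margin = arcface_loss(feats, centers, labels, s=10.0, m=0.0)
```

Even for samples that lie exactly on their class centers, the margin lowers the target logit and therefore raises the loss, which is what pushes the network toward tighter clusters than plain softmax would require.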
3.5.2. Regularization on the Mask
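To prevent trivial masks (all-ones or all-zeros), we introduce weak regularizers, as shown in Equations (4)–(6).
$$L_{TV} = \sum_{x, y} \left( \left| M(x+1, y) - M(x, y) \right| + \left| M(x, y+1) - M(x, y) \right| \right) \tag{4}$$
where
$M$: the predicted skin mask,
$L_{TV}$: total variation loss promoting spatial smoothness in the mask; differences across neighboring pixels penalize abrupt transitions.
$$L_{area} = \left( \bar{M} - \alpha_{area} \right)^{2} \tag{5}$$
where
$\bar{M}$: mean of the predicted mask (average skin area),
$L_{area}$: penalizes deviation from the target skin-area ratio,
$\alpha_{area}$: target skin-area ratio (typically 0.6 for frontal faces), distinct from the gating parameter $\alpha$ in Equation (2).
The total loss is thus:
$$L_{seg} = L_{BCE} + L_{Dice} + \lambda_{TV} L_{TV} + \lambda_{area} L_{area} \tag{6}$$
where
$L_{seg}$: total segmentation objective,
$\lambda_{TV}$ = 0.01 and $\lambda_{area}$ (selected via grid search),
$L_{Dice}$: the Dice loss,
$L_{BCE}$: the binary cross-entropy loss.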
The overall segmentation objective in Equation (6) is a weighted sum of binary cross-entropy, Dice, and regularization terms. This approach enables precise pixel-level predictions while promoting smoothness and anatomical plausibility in the resulting skin masks.
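The two mask regularizers can be illustrated as follows. This is a sketch under stated assumptions: the total-variation term is taken to be the anisotropic (absolute-difference) form averaged over pixels, and the area term is taken to be a squared deviation from the 0.6 target ratio; the function names are hypothetical.

```python
import numpy as np

def tv_loss(mask):
    """Mean absolute difference between neighbouring pixels (anisotropic TV, assumed form)."""
    dx = np.abs(mask[:, 1:] - mask[:, :-1])
    dy = np.abs(mask[1:, :] - mask[:-1, :])
    return float(dx.mean() + dy.mean())

def area_loss(mask, target=0.6):
    """Squared deviation of the mean mask value from the target skin-area ratio (assumed form)."""
    return float((mask.mean() - target) ** 2)

all_ones = np.ones((112, 112))                      # trivial mask
checker = np.indices((112, 112)).sum(axis=0) % 2    # maximally noisy 0/1 mask
```

An all-ones mask is perfectly smooth, so only the area term penalizes it; a checkerboard has a plausible mean but a large total variation. The two terms therefore rule out different kinds of degenerate masks.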
3.6. Identity Feature Extractor
We adopt two backbones:
ResNet-100/iResNet for strong convolutional baselines.
Vision Transformer (ViT-Base), which leverages global self-attention.
For the transformer backbone, the input image is partitioned into non-overlapping patches of size $p \times p$. For each patch, we compute a patch-level mask by averaging the pixel masks, as shown in Equation (7).
$$t_k = \frac{1}{|\mathcal{P}_k|} \sum_{(x, y) \in \mathcal{P}_k} M(x, y) \tag{7}$$
where
$t_k$: the mask score for image patch $k$,
$|\mathcal{P}_k|$: number of pixels in the patch (e.g., 16 × 16 = 256),
$\mathcal{P}_k$: the set of pixel coordinates in the patch,
$M$: skin probability mask.
Each token embedding $z_k$ is then modulated by its mask score, as shown in Equation (8).
$$\tilde{z}_k = t_k\, z_k \tag{8}$$
where
$z_k$: original token embedding from the ViT patch projection layer,
$\tilde{z}_k$: mask-scaled embedding,
$t_k$: mask weight determining the importance of patch $k$.
The masked tokens $\tilde{Z}$ are then passed through the Multi-Head Self-Attention (MSA) and Feed-Forward Network (FFN) blocks, as shown in Equation (9).
$$Z_{out} = \mathrm{FFN}\left(\mathrm{MSA}(\tilde{Z})\right) \tag{9}$$
where
MSA: Multi-Head Self-Attention,
FFN: a two-layer feed-forward network,
$\tilde{Z}$: the sequence of mask-scaled token embeddings,
$Z_{out}$: output representation from the encoder block.
This ensures that skin-relevant tokens contribute more strongly to the global representation, while still allowing non-skin tokens to provide contextual cues.
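The patch-level averaging and token scaling described above can be sketched in NumPy as follows. The function names and the toy token dimension are assumptions for illustration; a 112 × 112 mask with 16 × 16 patches gives 7 × 7 = 49 tokens.

```python
import numpy as np

def patch_mask_scores(mask, patch=16):
    """Average an (H, W) mask over non-overlapping patch x patch blocks.

    Returns one score per patch in row-major order, shape (H/patch * W/patch,).
    """
    h, w = mask.shape
    blocks = mask.reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3)).reshape(-1)

def scale_tokens(tokens, scores):
    """Scale each token embedding (K, D) by its patch score (K,)."""
    return tokens * scores[:, None]

mask = np.zeros((112, 112))
mask[:16, :16] = 1.0                    # only the top-left patch is skin
scores = patch_mask_scores(mask)
tokens = np.ones((49, 8))               # toy token embeddings, D = 8
scaled = scale_tokens(tokens, scores)
```

Because the scores come from a soft mask, real patches usually receive intermediate values between 0 and 1, so background tokens are attenuated rather than zeroed out.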
Embeddings are optimized using margin-based softmax losses:
ArcFace: hypersphere angular separation,
AdaFace: adaptive quality-aware margins,
MagFace: embedding quality calibration.
These losses yield compact intra-class clusters and well-separated inter-class boundaries, outperforming standard softmax and handcrafted descriptors.
3.7. Classifier and Scoring
Embeddings are normalized using the L2 norm, and recognition is performed using cosine similarity. Verification thresholds are calibrated on a validation set, and performance is reported using ROC, EER, and TPR@FPR. AdaFace and MagFace evaluations yield consistent scores across image quality levels.
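The scoring and threshold-calibration steps can be illustrated with a short NumPy sketch. It is a generic verification routine, not the evaluation code used here; `cosine_scores` and `tpr_at_fpr` are hypothetical names, and the scores in the example are made up.

```python
import numpy as np

def cosine_scores(a, b):
    """Row-wise cosine similarity between two (N, D) embedding arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def tpr_at_fpr(genuine, impostor, target_fpr):
    """Calibrate a threshold on impostor scores, then measure TPR on genuine scores.

    A pair is accepted when its score is strictly greater than the threshold,
    so at most floor(target_fpr * n_impostor) impostor pairs are accepted.
    """
    imp = np.sort(impostor)[::-1]                   # descending
    k = int(np.floor(target_fpr * len(imp)))        # allowed false accepts
    thr = imp[k] if k < len(imp) else -np.inf
    return float(thr), float(np.mean(genuine > thr))

genuine = np.array([0.9, 0.8, 0.7, 0.2])            # made-up same-identity scores
impostor = np.array([0.6, 0.3, 0.1, 0.0])           # made-up different-identity scores
thr, tpr = tpr_at_fpr(genuine, impostor, target_fpr=0.25)
```

In practice the threshold would be chosen on a validation split and then applied unchanged to the test pairs, as described above.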
3.8. Summary of Novel Contributions
While the individual components (RetinaFace, U-Net, photometric augmentation, neural relighting, and margin-based losses) are not new, the novelty of this work lies in their joint differentiable integration within a single face recognition pipeline. In other words, the novelty is in the integration, not in any individual component. Specifically:
Mask-guided embedding modulation (Equation (2)): The predicted skin mask is used as a spatially varying, learned attention directly on feature maps and ViT tokens during training, not as a preprocessing step.
Dual illumination strategy (
Section 3.2): Combining training-time photometric augmentation with inference-time neural relighting—a complementary training–inference synergy not previously shown in face recognition.
Joint spatial–quality margin learning: The mask-guided features are combined with quality-adaptive losses (AdaFace, MagFace) that adapt angular margins to image quality.
To our knowledge, no prior work simultaneously optimizes illumination normalization, spatial skin masking, and quality-adaptive angular margins in an end-to-end differentiable pipeline.
Table 1 highlights these differences against prior methods.
3.9. Training Strategy
The network is trained on the MS1M-ArcFace dataset and then fine-tuned on target datasets. Optimization uses AdamW with a cosine-annealed learning rate (decaying from its initial value to zero) and Partial-FC sampling to improve efficiency.
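For reference, a cosine-annealed schedule of the kind described here can be written as below. This is the standard cosine decay formula; the function name `cosine_lr` is hypothetical, and the default initial rate of 0.1 is taken from the hyperparameters reported in Section 4.1.3.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(epoch, 100) for epoch in range(101)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near zero, which avoids the abrupt drops of a step schedule.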
At inference time, faces pass through the detection and alignment, relighting, segmentation, and embedding modules. Verification scores are reported as the cosine similarity between normalized embeddings. Figure 3 illustrates the architecture of the proposed model and how its components fit together.
The U-Net and the relighting network are pre-trained separately on skin segmentation and illumination correction tasks, respectively. During face recognition training, their weights are initially frozen and then fine-tuned together with the backbone.
Figure 3 presents a flowchart of the components and processes of the proposed system. The first stage is RetinaFace, which uses a ResNet-50 backbone, feature pyramid networks, and context modules, among other techniques, to detect face bounding boxes and landmarks for alignment, ensuring precise localization even under pose variations or occlusions. The illumination module then adds photometric augmentation during training and neural relighting at inference, normalizing lighting conditions and producing illumination-invariant face representations. Next, a U-Net segmentation network with an encoder–decoder architecture and skip connections produces a soft skin mask that emphasizes skin regions while suppressing background and non-skin areas, thereby reducing clutter and bias in feature extraction. The embedding module then applies this mask to the features of either a ResNet-100 or a Vision Transformer (ViT-Base) backbone. While ResNet generates hierarchical features via residual bottleneck blocks, ViT employs patch embeddings and self-attention layers to capture global dependencies. Both backbones produce compact 512-dimensional embeddings guided by skin-aware attention. Training uses margin-based metric learning losses, such as ArcFace, AdaFace, and MagFace, which optimize intra-class compactness and inter-class separation while incorporating quality-awareness to ensure stable embeddings. The final recognition step employs cosine similarity, enabling fast and efficient verification and user identification across large datasets.
Unlike previous methods that treat segmentation, relighting, or margin losses as separate preprocessing or post-processing steps, the proposed model integrates them into a single, differentiable pipeline for joint optimization. The key innovations include: (a) the modulation in Equation (2) employs a spatially varying learned mask instead of a fixed attention map; (b) the dual illumination approach (
Section 3.2) creates a synergy between training and inference; (c) to our knowledge, no other work combines mask-guided spatial modulation with quality-adaptive angular margins.
Table 1 (included earlier) illustrates these differences.
4. Experiments
4.1. Experimental Setup
All experiments were conducted under standardized verification settings to ensure a fair comparison with prior works.
4.1.1. Training Data and Cleaning Procedure
Data cleaning used image-quality assessment tools and removed duplicate images and identities. After this process, 4.67 million images of 85,200 unique individuals remained in the MS1M-ArcFace dataset. The images were aligned with the RetinaFace model and resized to 112 × 112 pixels.
4.1.2. Pre-Training and Model Initialization
Public ArcFace MS1M pre-trained weights were used to initialize all backbone models (ResNet-100 and ViT-Base) rather than random initialization. The U-Net segmentation model and the neural relighting model were trained separately and then integrated into the main pipeline for evaluation. All backbone models and baseline configurations were initialized using identical MS1M-ArcFace pretrained weights to ensure fair comparison.
4.1.3. Training Hyperparameters
The hyperparameters were selected via grid search using a 5% MS1M validation split.
Optimizer: Stochastic Gradient Descent (SGD) with momentum = 0.9,
Learning Rate: 0.1 with cosine decay,
Weight Decay: 5,
Margin (ArcFace): m = 0.5,
Scale Factor: s = 64,
Batch Size: 512,
Epochs: 100,
Loss: ArcFace, AdaFace, or MagFace, depending on the experiment.
4.1.4. Partial-Fully Connected (FC) Training
To address the large number of identities, Partial-FC was used with a sampling rate of 0.3, in accordance with the official InsightFace training protocol.
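The idea behind Partial-FC sampling can be sketched as follows: in each step, only a subset of class centers enters the softmax, made up of every class present in the batch plus randomly drawn negatives. The code below is a simplified single-device illustration, not the InsightFace implementation; the function name and the toy sizes are assumptions.

```python
import numpy as np

def sample_class_centers(labels, num_classes, rate, rng):
    """Select the class centers used for one training step.

    Keeps all classes present in the batch (positives) and fills the rest of the
    budget, rate * num_classes, with randomly chosen negative classes.
    Returns the sorted sampled class ids and the labels remapped into that subset.
    """
    positives = np.unique(labels)
    num_keep = max(int(round(rate * num_classes)), len(positives))
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    extra = rng.choice(negatives, size=num_keep - len(positives), replace=False)
    sampled = np.sort(np.concatenate([positives, extra]))
    remapped = np.searchsorted(sampled, labels)     # index of each label in `sampled`
    return sampled, remapped

rng = np.random.default_rng(0)
labels = np.array([3, 7, 3])
sampled, remapped = sample_class_centers(labels, num_classes=20, rate=0.3, rng=rng)
```

With a rate of 0.3, the softmax in each step is computed over roughly 30% of the class centers, which reduces both memory and compute for the final layer when the number of identities is large.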
4.1.5. Use of Relighting and Segmentation During the Training Process
Segmentation: A U-Net was trained independently and used during both training and inference as a mask on the backbone features.
Photometric augmentation: Applied only during training of the backbone.
Neural relighting: Used only at inference, not during training.
4.1.6. Component Analysis
To quantify contributions of each module, ablation experiments were conducted:
Baseline (ResNet-100 + ArcFace): 98.7% LFW accuracy,
+Photometric Augmentation: +2.1% CFP-FP improvement,
+Neural Relighting: +3.4% improvement under side lighting,
+U-Net Skin Segmentation: +1.6% Rank-1 identification improvement,
+ViT & MagFace: Best overall accuracy, with moderate computational overhead.
As shown in Table 2, the baseline model shows reduced robustness under side-lit and low-light conditions. Adding photometric augmentation during training significantly improves performance, particularly on CFP-FP. The best illumination robustness is achieved when neural relighting is applied at inference, yielding the highest TPR@FPR = 1 × 10⁻⁴. This confirms that both augmentation and relighting contribute to substantial gains under challenging lighting.
Without segmentation, as noted in
Table 3, the model remains sensitive to background noise and non-skin artifacts. Incorporating the U-Net skin mask improves Rank-1 accuracy and enhances consistency across different backgrounds. The best results are achieved when segmentation is applied directly to ViT tokens, indicating that mask-guided token weighting yields more discriminative and stable embeddings and avoids explicit demographic modeling by reducing background bias.
ArcFace provides a strong baseline, but AdaFace improves performance on low-quality or noisy images through its adaptive margin. MagFace achieves the best overall performance across all benchmarks, with the lowest EER and the highest verification accuracy. These findings, as shown in
Table 4, indicate that quality-aware margin modeling yields measurable benefits, particularly when combined with the proposed illumination and segmentation modules.
4.2. Datasets
We used the MS1M-ArcFace dataset, a large-scale, refined dataset comprising millions of face images with high inter- and intra-class variability, for training. To assess generalization, we conducted testing on a variety of benchmark datasets. The description of the datasets used is provided below, and their technical specifications are listed in
Table 5.
MS1M-ArcFace (Training Dataset). A subset of the MS-Celeb-1M dataset cleaned by [
25] and made available to the public. It contains approximately 5.8 million images of around 85,000 distinct identities after noise removal and alignment. This large dataset is commonly used to train margin-based softmax losses, including ArcFace, AdaFace, and MagFace [
16].
LFW. The LFW dataset comprises 13,233 face images of 5749 individuals, primarily obtained from the internet and captured in uncontrolled settings. It is widely used as a benchmark for face verification; hence, it is employed to draw conclusions about system performance in unconstrained environments [
26].
CFP-FP. The CFP-FP dataset comprises 7000 image pairs from 500 people and is designed to verify frontal and profile views. It is specifically designed to assess the performance of algorithms under extreme pose changes [
27].
AgeDB-30. The AgeDB dataset comprises images of celebrities from different age groups over an extended period. AgeDB-30 comprises 12,240 images from 440 participants with a maximum age difference of 30 years; thus, it is a benchmark for age-invariant face recognition [
28].
Custom Illumination Dataset. To evaluate robustness under challenging lighting, we constructed a dataset from two sources:
Extracted images: A total of 5000 face images were randomly sampled from LFW, CFP-FP, and AgeDB-30, ensuring no identity overlap with the MS1M-ArcFace training set. The sampling preserved the original pose and expression variations. The extracted set contains 3500 distinct subjects, each with 1–3 images.
Synthetically generated images: Another 5000 images were created by applying controlled illumination transformations to the extracted faces using the following operations (implemented in OpenCV):
- Gamma correction to simulate low-light and overexposure.
- Contrast adjustment (contrast factor ).
- Directional lighting masks (side-light from left/right, top-down, bottom-up) using a radial gradient overlay.
- Mixed lighting: combinations of the above.
The final dataset contains 10,000 images from 3500 subjects, with approximately 30% low-light, 40% side-light, and 30% overexposed/front-light conditions. All images were aligned with RetinaFace and resized to pixels. This dataset is used solely to evaluate illumination robustness; it does not replace standard benchmarks.
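The illumination transformations listed above can be sketched as follows. This is a minimal NumPy illustration rather than the OpenCV implementation used to build the dataset; the function names, the `strength` parameter, and the exact shape of the radial falloff are assumptions introduced here for illustration.

```python
import numpy as np


def gamma_correct(img, gamma):
    """Gamma correction on a uint8 image: gamma > 1 darkens, gamma < 1 brightens."""
    x = img.astype(np.float32) / 255.0
    return np.clip(np.power(x, gamma) * 255.0, 0, 255).astype(np.uint8)


def adjust_contrast(img, factor):
    """Scale each pixel's deviation from the mean intensity by `factor`."""
    x = img.astype(np.float32)
    mean = x.mean()
    return np.clip((x - mean) * factor + mean, 0, 255).astype(np.uint8)


def directional_light(img, direction="left", strength=0.6):
    """Radial gradient overlay: brightness falls off with distance from a
    hypothetical light source placed at the middle of one image edge."""
    h, w = img.shape[:2]
    sources = {
        "left": (h / 2.0, 0.0),
        "right": (h / 2.0, w - 1.0),
        "top": (0.0, w / 2.0),
        "bottom": (h - 1.0, w / 2.0),
    }
    cy, cx = sources[direction]
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    # Gain is 1 at the source and (1 - strength) at the farthest pixel.
    gain = 1.0 - strength * dist / dist.max()
    if img.ndim == 3:
        gain = gain[..., None]  # broadcast over colour channels
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```

Mixed lighting can then be produced by composing these functions, for example `directional_light(gamma_correct(img, 1.8), "left")`, where the parameter values are again illustrative.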
The datasets were selected to test the proposed model across a wide range of challenging conditions. MS1M-ArcFace enables large-scale training, from which the backbone networks and margin-based losses learn highly discriminative embeddings. LFW is a classic benchmark for unconstrained verification; thus, the results can be directly compared with those of previous studies. CFP-FP is devoted to assessing pose robustness, which is particularly important in real-world face recognition, given the prevalence of profile views. AgeDB-30 assesses the model’s ability to handle large age gaps, thereby directly addressing age-invariant recognition. Finally, the custom illumination dataset was designed to directly test the proposed photometric augmentation and neural relighting modules under very poor or uneven lighting conditions. In combination, these datasets constitute a comprehensive evaluation framework that accounts for generalization, pose variation, age progression, and illumination robustness.
4.3. Evaluation Metrics
The proposed model is assessed using both threshold-dependent and threshold-independent evaluation metrics. Threshold-dependent metrics such as True Positive Rate (TPR), False Positive Rate (FPR), and overall Accuracy describe the system’s performance at specific decision thresholds. Threshold-independent measures such as the Receiver Operating Characteristic (ROC) curve, Area under the Curve (AUC), and Equal Error Rate (EER) summarize the model’s ability to separate matching from non-matching pairs across operating points. In addition, the True Negative Rate (TNR) is considered to evaluate specificity, giving a balanced view of positive and negative verification outcomes. Finally, runtime efficiency is measured to check that the proposed architecture maintains an inference speed suitable for real-time or embedded deployment.
For a given decision threshold τ, the True Positive Rate and the False Positive Rate are defined in Equations (10) and (11).
The True Positive Rate (TPR), also known as sensitivity or recall, is defined in Equation (10):
$$\mathrm{TPR}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)} \tag{10}$$
TPR is the percentage of correctly identified positive (matching) face pairs among all actual positive samples. The higher the TPR, the better the ability to recognize bona fide identities.
The False Positive Rate (FPR) is defined in Equation (11) as the ratio of false-positive predictions (false matches) to all negative pairs:
$$\mathrm{FPR}(\tau) = \frac{FP(\tau)}{FP(\tau) + TN(\tau)} \tag{11}$$
where TP, FP, TN, and FN denote the number of true positives, false positives, true negatives, and false negatives at threshold τ, respectively. Minimizing FPR is essential for reliable verification, particularly in security-sensitive applications.
The ROC curve is obtained by plotting TPR against FPR as the threshold τ varies. It indicates the model’s overall ability to discriminate between classes, regardless of the threshold used. The Area under the Curve (AUC) is computed as shown in Equation (12):
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR}) \tag{12}$$
The AUC summarizes overall recognition performance across all thresholds. In general, the larger the AUC, the better the model’s ability to distinguish between genuine and impostor pairs with high confidence.
The Equal Error Rate (EER) is the operating point where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR), as shown in Equations (13) and (14):
$$\mathrm{EER} = \mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*}) \tag{13}$$
with
$$\tau^{*} = \arg\min_{\tau} \left| \mathrm{FAR}(\tau) - \mathrm{FRR}(\tau) \right| \tag{14}$$
where FAR(τ) = FPR(τ) and FRR(τ) = 1 − TPR(τ). This single-point measure is commonly used to express the trade-off between security and usability in verification systems.
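Verification accuracy is often reported at strict operating points such as TPR at FPR = 1 × 10⁻⁴, where the threshold τ is chosen so that the FPR equals the stated value. The overall verification accuracy is the fraction of pairs that are either correctly accepted or correctly rejected (Equation (15)):
$$\mathrm{Accuracy}(\tau) = \frac{TP(\tau) + TN(\tau)}{TP(\tau) + FP(\tau) + TN(\tau) + FN(\tau)} \tag{15}$$
It provides a direct estimate of the total classification accuracy for face verification tasks.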
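The verification metrics defined above can be computed from pair similarity scores as in the following sketch. It assumes that a pair is accepted when its score is at or above the threshold, that labels are 1 for genuine pairs and 0 for impostor pairs, and that tied scores need not be merged; the function names are illustrative and are not taken from the evaluation code used in this study.

```python
import numpy as np


def roc_points(scores, labels):
    """Sweep the threshold over all scores and return (fpr, tpr) arrays."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")  # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)       # genuine pairs accepted so far
    fp = np.cumsum(1 - labels)   # impostor pairs accepted so far
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    fpr = np.concatenate(([0.0], fp / (1 - labels).sum()))
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def eer(fpr, tpr):
    """Equal error rate: the point where FAR (= FPR) is closest to FRR (= 1 - TPR)."""
    frr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - frr)))
    return float((fpr[i] + frr[i]) / 2.0)


def tpr_at_fpr(fpr, tpr, target_fpr):
    """Highest TPR among operating points whose FPR does not exceed target_fpr."""
    return float(tpr[fpr <= target_fpr].max())
```

Note that resolving an operating point such as FPR = 1 × 10⁻⁴ requires at least 10,000 impostor pairs; with fewer pairs, `tpr_at_fpr` simply returns the TPR at the strictest threshold that admits no more false matches than the target allows.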
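For identification tasks, the Cumulative Match Characteristic (CMC) curve measures the probability that the correct identity appears among the top-k ranked candidates, as shown in Equation (16):
$$\mathrm{CMC}(k) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\, y_i \in \mathrm{top}_k(\hat{y}_i) \,\right] \tag{16}$$
where,
N: number of queries,
$y_i$: ground-truth identity of query i,
$\hat{y}_i$: ranked predictions for query i.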
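The CMC definition above can be illustrated with a short sketch. It assumes a query-by-gallery similarity matrix in which higher values mean greater similarity; the function and argument names are hypothetical.

```python
import numpy as np


def cmc_curve(similarity, query_ids, gallery_ids, max_rank=5):
    """Return CMC(k) for k = 1..max_rank.

    similarity: (num_queries, num_gallery) matrix, higher = more similar.
    A query whose identity is absent from the gallery never counts as a hit.
    """
    similarity = np.asarray(similarity, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(-similarity, axis=1)   # best match first
    ranked_ids = gallery_ids[order]           # identities in ranked order
    hits = ranked_ids == query_ids[:, None]
    # 0-based rank of the first correct match; gallery size if there is none.
    first_hit = np.where(hits.any(axis=1), hits.argmax(axis=1), hits.shape[1])
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])
```

Rank-1 identification accuracy is the first element of the returned array.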
The True Negative Rate (TNR), which equals specificity, is the percentage of non-matching pairs that are correctly rejected, that is, TNR = TN/(TN + FP) = 1 − FPR. A high TNR, together with a high TPR, keeps the false-match probability low even when lighting and population conditions change.
We additionally report computational efficiency in terms of average inference latency and model size. The overall efficiency of the proposed pipeline is evaluated using Equation (17), which quantifies the number of test images processed per second:
$$\mathrm{Throughput} = \frac{N_{\mathrm{test}}}{T_{\mathrm{total}}} \tag{17}$$
where $N_{\mathrm{test}}$ is the number of test images and $T_{\mathrm{total}}$ is the total processing time in seconds; the average latency per image is the reciprocal, $T_{\mathrm{total}} / N_{\mathrm{test}}$. This metric reflects the performance of the entire system and, hence, indicates whether it can be applied in real-time or near-real-time situations. In addition to ROC visualization, we report Area Under the Curve (AUC) values with 95% confidence intervals.
4.4. Quantitative Results and Statistical Robustness
Table 6 highlights the gains from adding our modules one by one (same backbone/loss)—rows correspond to baseline, then + photometric augmentation, then + neural relighting, then + U-Net mask (all modules).
Table 6 demonstrates the impact of stronger backbones and losses when all modules are used together. Each experiment was conducted five times with different random seeds. For each configuration, we report the mean and standard deviation to indicate stability, with standard deviations below 0.3% across all benchmarks. To compare two configurations (e.g., baseline versus proposed), an unpaired two-sample t-test was used; the proposed method achieved statistically significant improvements (p < 0.05) on CFP-FP and the custom illumination dataset. No paired t-test was conducted across different model architectures because the runs were not paired by the same random seed.
The small standard deviations (<0.3%) indicate very high consistency. Moreover, performance increased relative to ArcFace across all datasets (p < 0.05).
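The unpaired comparison of two configurations over five runs each can be sketched as follows. The text does not state whether a pooled-variance or Welch test was used, so this sketch assumes the pooled-variance Student's t-test with 8 degrees of freedom and checks significance against the two-sided 5% critical value (2.306) instead of computing an exact p-value. The function name and the example accuracies in the usage note are hypothetical.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t for 5 + 5 - 2 = 8 degrees of freedom.
T_CRIT_95_DF8 = 2.306


def unpaired_t_test(a, b):
    """Pooled-variance two-sample t-test for two groups of five runs.

    Returns (t_statistic, significant_at_5_percent).
    """
    if len(a) != 5 or len(b) != 5:
        raise ValueError("the critical value above is only valid for n = 5 per group")
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, abs(t) > T_CRIT_95_DF8
```

For example, `unpaired_t_test([95.0, 95.2, 94.9, 95.1, 95.0], [93.0, 93.3, 92.9, 93.1, 93.2])` reports a significant difference, whereas comparing a set of runs with itself gives t = 0.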
To demonstrate the effectiveness of the proposed model, we compare it against baseline systems commonly used in face recognition. These baselines follow the standard pipeline of face detection → embedding extraction → metric learning, without the extra modules introduced in this paper.
Baseline-1 (ArcFace with ResNet-100) is the highest-performing reference system, using RetinaFace for detection, ResNet-100 for embedding extraction, and ArcFace loss for training. This standard baseline is the state-of-the-art configuration used by many recent studies.
Baseline-2 (AdaFace with ResNet-100) is a stronger baseline that uses the AdaFace loss, which adapts the margin according to image quality. This setting is particularly effective for low-quality images but does not employ segmentation or relighting.
Baseline-3 (MagFace with ResNet-100) is a quality-aware baseline that uses embedding magnitudes to represent face image quality, which improves verification performance under distribution shifts. Like the other baselines, it lacks skin segmentation and illumination modules.
Baseline-4 (ViT-Base with ArcFace) is a transformer-based baseline that employs ViT-Base as the embedding backbone and ArcFace for supervision. Although it benefits from long-range attention, it lacks illumination handling and skin segmentation.
The proposed model differs from these baselines in that it includes three additional components that specifically address common failure modes in unconstrained recognition.
Lighting normalization (photometric augmentation + neural relighting) makes the model less sensitive to lighting changes.
Skin-aware segmentation helps the model focus on the face by removing the background and other non-facial areas.
Flexible backbone networks (ResNet-100/ViT-Base) paired with quality-aware losses (MagFace, AdaFace) lead to the highest discriminability and generalization.
The experimental results clearly show that our proposed model consistently outperforms the baseline models across all datasets, including LFW, CFP-FP, AgeDB-30, and our custom illumination dataset, particularly at low false-positive rate (FPR) operating points. These improvements suggest that synergy among detection, preprocessing, and embedding quality is more effective than simply enhancing the loss function.
All experiments were repeated five times (n = 5). Confidence intervals were computed as shown in Equation (18), where n = 5 is the number of independent experimental runs.
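As an illustration of how a 95% confidence interval can be obtained from five runs, the sketch below uses the Student's t interval, mean ± t · s/√n, with the two-sided critical value for 4 degrees of freedom (2.776). This is an assumption made for the example and is not necessarily the exact form of Equation (18); the function name and the AUC values in the usage note are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776


def mean_ci95(runs):
    """Return (mean, (lower, upper)) for exactly five independent runs."""
    if len(runs) != 5:
        raise ValueError("the critical value above is only valid for n = 5")
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(runs))
    return mean, (mean - half_width, mean + half_width)
```

For five hypothetical AUC values such as `[0.990, 0.991, 0.989, 0.990, 0.990]`, the interval is centred on 0.990 with a half-width of roughly 0.0009.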
The results in
Table 7 demonstrate that the proposed method consistently achieves the highest AUC values across all evaluated datasets. The performance improvement is particularly noticeable on CFP-FP and the custom illumination dataset, indicating enhanced robustness under pose and lighting variations. Moreover, the narrow confidence intervals suggest stable convergence and low variance across independent runs. Compared with baseline methods, the proposed framework exhibits statistically consistent gains while maintaining high performance in low false-positive operating regions, further supporting its robustness and reliability. Improvements on saturated benchmarks like LFW are minimal, but more significant gains are seen under difficult lighting conditions.
4.5. Component Contribution Analysis
To investigate how each module contributes to the overall performance of the proposed model, we conducted controlled experiments in which components were selectively enabled or disabled. These experiments show that the system’s performance is significantly affected by design choices.
4.5.1. Illumination Handling
Three settings were compared: (1) no illumination normalization, (2) training with photometric augmentation, and (3) augmentation combined with neural relighting. Results indicate that photometric augmentation plays a major role in mitigating performance drops due to uneven illumination, whereas neural relighting further increases accuracy on datasets with challenging illumination conditions.
Skin Segmentation
Using U-Net segmentation to create skin-aware masks improved recognition accuracy by reducing the effects of background noise and hair regions. Without segmentation, embeddings were less compact, leading to higher error rates in busy backgrounds.
Backbone Architecture
ResNet-100 provides a solid low-latency baseline, whereas Vision Transformer (ViT-Base) achieves slightly higher accuracy at the cost of greater computational resources. This trade-off allows the model to be tailored for either real-time or high-accuracy applications.
Loss Functions
ArcFace remains a strong baseline for margin-based losses; however, AdaFace and MagFace further incorporate quality-aware margins. MagFace, in particular, shows greater stability in low-quality and profile-face scenarios, making it the most reliable loss for unconstrained recognition.
4.5.2. Detection and Alignment
In this section, RetinaFace and MTCNN are compared. Owing to its more accurate landmark localization and greater robustness to occlusion, RetinaFace achieved consistently higher recognition accuracy, making it suitable as a detection front end. The results are illustrated in
Figure 4.
Evaluation of each component indicates that the modules—illumination normalization, skin-aware segmentation, backbone selection, and quality-aware margin learning—substantially contribute to the system’s robustness and precision.
Photometric augmentation + relighting improved illumination robustness by up to 6%,
U-Net segmentation improved Rank-1 identification by 1–2%,
ViT + MagFace yielded the best results, but with higher latency.
4.5.3. Comparison with Alternative Attention Mechanisms
To determine whether the proposed mask-guided modulation is merely a form of feature scaling, we compared it with three alternative designs that use the same ResNet-100 backbone and ArcFace loss on the custom illumination dataset.
No masking: baseline,
Channel attention (SENet): global average pooling → two FC layers + sigmoid → channel-wise scaling,
Spatial attention (CBAM): channel-wise average + max pooling → 7 × 7 conv + sigmoid → spatial scaling,
Proposed (U-Net mask modulation).
As demonstrated in
Table 8, the proposed method surpasses both attention mechanisms, suggesting that the explicit skin prior learned by U-Net offers additional information that data-driven attention alone does not capture. This comparison is not exhaustive (e.g., transformer-based attention is not tested), but it shows that our explicit skin mask outperforms SENet and CBAM on the custom illumination dataset.
4.5.4. Alternative Segmentation Designs
We compared U-Net against: (i) no segmentation, (ii) threshold-based skin detection in YCbCr space, and (iii) a simple CNN segmenter (3 conv layers). Results on Rank-1 identification are shown in
Table 9.
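The threshold-based baseline (ii) can be sketched as a fixed box in the Cb–Cr plane. The RGB-to-YCbCr conversion below follows the full-range BT.601 formulas, and the threshold ranges (Cb in [77, 127], Cr in [133, 173]) are values often quoted in the skin-detection literature; they are assumptions for this example and not necessarily the thresholds used in our comparison.

```python
import numpy as np


def skin_mask_ycbcr(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Boolean skin mask from an RGB uint8 image of shape (H, W, 3)."""
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Full-range BT.601 chroma components (luma Y is not needed for the rule).
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (
        (cb >= cb_range[0]) & (cb <= cb_range[1])
        & (cr >= cr_range[0]) & (cr <= cr_range[1])
    )
```

Because the rule depends only on colour, it cannot distinguish skin from skin-coloured background and is sensitive to the illumination shifts studied here, which is one reason a learned segmenter is compared against it.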
4.5.5. Alternative Backbone and Loss Combinations
We evaluated all pairings of ResNet-50, ResNet-100, ViT-Small, and ViT-Base with ArcFace, AdaFace, and MagFace. The combination of ViT-Base with MagFace achieved the highest accuracy, but ResNet-100 with MagFace provided a better balance of speed and accuracy (18 ms vs. 25 ms).
5. Results and Analysis
5.1. Verification and Identification Performance
The proposed system achieves accuracy competitive with the state of the art across all benchmarks. We achieved 99.8% accuracy on LFW, closely matching ArcFace performance, and on CFP-FP we observed an improvement of more than 2% in TPR at FPR = 1 × 10⁻⁴, demonstrating robustness to pose variation. Finally, on AgeDB-30, quality-aware losses provided a 1.5% gain over ArcFace alone.
Although LFW is close to saturation, the improvements are more evident at strict operating points. For instance, on CFP-FP at FPR = 1 × 10⁻⁵, the proposed method reaches a 92.1% TPR compared to ArcFace’s 88.3%, an absolute increase of 3.8 percentage points. This difference is significant for high-security applications such as border control.
5.2. Effect of Illumination Handling
Photometric augmentation substantially reduced performance degradation under uneven lighting. Neural relighting increased verification accuracy by 4–6% on the custom illumination dataset, further demonstrating the effectiveness of neural photometric feedback in extreme lighting conditions.
5.3. Effect of Skin-Aware Embedding
The introduction of the U-Net mask produced a 1–2% improvement in Rank-1 identification across all datasets. Qualitative visualizations showed that the embeddings became denser and less sensitive to background clutter.
5.4. Ablation Study
The Vision Transformer backbone marginally outperformed ResNet-100 on AgeDB-30, but ResNet was faster at inference. The current ablation examines the step-by-step addition of components. Alternative attention mechanisms, such as spatial and channel attention, offer adaptive feature weighting, whereas the proposed method explicitly incorporates spatial priors via segmentation masks. Section 4.5.3 compares the proposed masking with SENet and CBAM on the custom illumination dataset; a comparison with transformer-based attention is left for future research.
Unlike attention-based models that adaptively learn which features are most important, the proposed approach uses predetermined transformations derived from specific image cues.
5.5. Hardware Configuration
The model (detection, relighting, and segmentation) was estimated to run at ~25 ms per image on an NVIDIA RTX GPU, making it suitable for real-time applications. Our experiments show that each proposed component (RetinaFace detection, relighting, U-Net segmentation, and margin-based embedding learning) significantly improves robustness under challenging conditions. Compared with conventional models based on DCT, SVD, and HMM, the redesigned model offers improved accuracy, avoids explicit demographic priors, and has greater potential for real-world scalability in face recognition tasks, as shown in
Table 10.
Table 11 highlights differences in verification and identification performance across benchmark datasets. Classical methods based on DCT, SVD, and HMM exhibit limited generalization, with LFW accuracy below 90% and an EER above 8%, whereas deep learning baselines (e.g., FaceNet) show significant improvement but still do not perform well under challenging variations. Methods trained with margin-based softmax losses (e.g., ArcFace) improve further, achieving a 99.7% verification rate on LFW and reducing the EER to 1.1%. AdaFace and MagFace, which use quality-aware learning, improve upon ArcFace, achieving over 96% on AgeDB-30 and a lower EER. Notably, MagFace with a Vision Transformer backbone shows the best overall results, reaching 99.8% accuracy on LFW, 95.0% TPR at FPR = 1 × 10⁻⁴ on CFP-FP, and an EER of 0.7%, which validates the model proposed in this paper.
Verification performance on the CFP-FP dataset is shown in
Table 12 at the strict operating threshold of
. This is a particularly difficult task due to the large degree of pose variation (frontal and profile views). The results show that the proposed model achieves the highest
TPR among the baselines. Although ArcFace is a strong baseline, it performs poorly on extreme profile views. AdaFace improves robustness by accounting for image quality, whereas MagFace improves separability through magnitude-based, quality-aware margins. The proposed model outperforms all baselines because it adds illumination normalization and skin segmentation on top of the MagFace loss. This shows that the benefits of skin segmentation and illumination normalization are complementary to those of quality-aware learning.
Table 12 evaluates the speed of various backbone–loss combinations. ResNet-100 with ArcFace achieves 18 ms per image and a model size of 104 MB, making it suitable for real-world applications. MagFace slightly increases computational costs but offers greater resilience to low-quality images. The Vision Transformer backbone achieves the highest classification rate but requires 25 ms per image and consumes 120 MB of memory. The trade-offs in speed and classification accuracy of the ResNet-based combinations are better suited to real-time implementations with more stringent speed requirements. ViT-based combinations may be preferable for offline implementations or applications where speed is less important and higher accuracy is preferred.
The ROC curves in Figure 5 compare the verification performance of ArcFace, AdaFace, and MagFace on the test datasets. As expected, all three deep metric-learning approaches show strong separability between positives and negatives, with the largest performance differences at lower false-positive rates (FPRs). ArcFace provides a strong and consistent baseline, yielding a large increase in TPR that tapers at more constrained thresholds. AdaFace outperforms ArcFace by leveraging quality-aware margins, resulting in a higher TPR at FPR = 1 × 10⁻⁴ and suggesting greater resilience to fluctuations in pose and image quality. MagFace maintains higher performance than both ArcFace and AdaFace across the entire ROC curve and consistently achieves the highest area under the curve (AUC) values.
For decision points relevant to operational deployment, specifically very low FPR thresholds (1 × 10⁻⁴ to 1 × 10⁻⁵), MagFace still achieves the greatest stability and reliability for verification, significantly increasing TPR while reducing false acceptances. This effect is particularly pronounced in datasets with greater intra-class variability (e.g., CFP-FP), where MagFace exhibits broader tolerance to variations in illumination and profile views. Overall, the ROC evaluation indicates that quality-aware margin learning, in conjunction with the proposed illumination and segmentation modules, reliably yields more discriminative and stable embeddings under unconstrained conditions.
As presented in Figure 6, the Detection Error Tradeoff (DET) plots further emphasize the trade-off between false positives and false negatives. ArcFace exhibits a larger error region, whereas AdaFace reduces the False Negative Rate (FNR) at reasonable False Positive Rates (FPRs). MagFace has the lowest error envelope, confirming its effectiveness in balancing sensitivity and specificity. Overall, the Equal Error Rate (EER) is much lower for MagFace than for the other methods, demonstrating its stability for real-life applications.
As illustrated in Figure 7, bar plots of TPR at three FPR thresholds reveal sharp differences in strict security settings. ArcFace performance declines sharply as thresholds tighten, whereas AdaFace remains relatively stable in TPR performance. MagFace outperforms both approaches, maintaining strong verification rates even at an FPR of 1 × 10⁻⁵, which is particularly relevant in high-security applications such as border control and financial authentication.
Figure 8 illustrates the Cumulative Match Characteristic (CMC) curves, which assess how well identification performs as the rank order increases. While ArcFace shows impressive Rank-1 accuracy, it falls short compared to AdaFace and MagFace at higher ranks. AdaFace performs well in the mid-range, but MagFace leads overall, achieving the best Rank-1 and Rank-5 rates. This suggests that the model using MagFace achieves the highest identification accuracy in both closed- and open-set conditions.
The efficiency analysis in
Figure 9 highlights the trade-off between accuracy and inference time across various backbone-loss configurations. ResNet-100 paired with ArcFace achieves a runtime of approximately 18 ms per image and has a compact model size, making it well-suited for real-time applications. By contrast, ResNet-100 with MagFace has a slightly higher latency of 20 ms but achieves higher accuracy. Meanwhile, the Vision Transformer with MagFace achieves the highest accuracy of approximately 99.9%, though it requires a longer inference time of approximately 25 ms and consumes more memory. This comparison shows that the proposed model can be readily adjusted to prioritize either real-time performance or high accuracy, depending on deployment requirements.
5.6. Qualitative Visualization and Analysis
To demonstrate the impact of illumination normalization and skin-sensitive segmentation, the images in
Figure 10 are exemplary samples from our custom illumination dataset.
As shown in Figure 10, visual inspection indicates that the relighting module provides consistent illumination across facial surfaces, and the segmentation mask effectively excludes unwanted background information. Moreover, the final normalized embedding focuses primarily on consistent skin-texture representations rather than on lighting changes and shadows, supporting the quantitative improvements discussed earlier. These qualitative examples indicate that illumination normalization and skin-aware attention work together to improve embedding stability under challenging conditions.
In Table 13, we compare the proposed model with various methods. Ref. [29] introduced a legacy method that used DCT-II normalization, SVD for feature extraction, and KNN/HMM classifiers. While this approach performs well in controlled settings, it degrades under changes in lighting and pose, resulting in error rates exceeding 8% on more challenging datasets. FaceNet [30] is a deep metric learning technique that relies on triplet loss and established end-to-end embedding learning. Although FaceNet is more robust than traditional methods, it does not specifically address lighting and background noise, making it vulnerable to low-quality inputs. The angular-margin methods CosFace and SphereFace [31] enhance inter-class separability and generally outperform FaceNet. However, they lack quality-aware mechanisms and preprocessing steps such as relighting or segmentation. Finally, ArcFace [32], which is often used as a strong baseline, introduces an additive angular margin loss to enhance discriminability, but, like the others, it does not explicitly account for image quality or include components such as skin-aware masking, which can limit its effectiveness in challenging environments.
The proposed model integrates advances in face recognition into an end-to-end system that addresses the shortcomings of previous methods. First, it employs RetinaFace for face detection and landmark-based alignment, thereby maintaining accuracy across varying poses and occlusions. To handle lighting variations, it employs a two-pronged approach: photometric augmentation during training and neural relighting during inference, yielding face images that are less sensitive to lighting changes. To minimize background noise without relying on explicit demographic priors, a U-Net-based segmentation model generates skin-aware masks that guide feature extraction. Next, identity embeddings are generated using either a ResNet or a Vision Transformer backbone and fine-tuned with margin-based softmax losses such as ArcFace, AdaFace, and MagFace. These losses increase inter-class separation while keeping embeddings of the same identity close together, and the quality-aware variants calibrate the margin to image quality. Finally, recognition is achieved through cosine similarity on normalized embeddings. This design replaces traditional handcrafted models (such as DCT, SVD, and HMM) with a scalable, robust solution that maintains high accuracy in challenging conditions.
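The final matching step, cosine similarity on L2-normalized embeddings, can be sketched as follows. The function names and the default decision threshold of 0.3 are illustrative assumptions; in practice the threshold is chosen on a validation set to meet a target FPR.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale an embedding (or a batch of embeddings) to unit length."""
    x = np.asarray(x, dtype=np.float32)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def verify(emb_a, emb_b, threshold=0.3):
    """Return (cosine_similarity, is_match) for two embedding vectors."""
    sim = float(np.dot(l2_normalize(emb_a), l2_normalize(emb_b)))
    return sim, sim >= threshold
```

Because both vectors are normalized first, the dot product equals the cosine of the angle between them and is unaffected by embedding magnitude.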
5.7. Runtime and Efficiency Analysis
To assess deployment feasibility, we measured the average inference latency of each module on an NVIDIA RTX 3090 GPU, using 1000 test images with a batch size of 1. A summary of the results is provided in
Table 14.
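A per-module latency measurement of this kind can be sketched with a simple timing loop. The helper below is a hypothetical illustration and not the benchmarking script used for Table 14; `stage_fn` stands for any single pipeline stage called on one image at a time, and the warm-up count is an assumption.

```python
import statistics
import time


def measure_latency_ms(stage_fn, inputs, warmup=10):
    """Return (mean latency in ms, throughput in images per second).

    stage_fn is called on one input at a time (batch size of 1). For GPU
    stages, stage_fn must block until the device has finished its work,
    otherwise only the kernel launch time is measured.
    """
    for x in inputs[:warmup]:  # warm-up calls are not timed
        stage_fn(x)
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        stage_fn(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(timings_ms)
    return mean_ms, 1000.0 / mean_ms
```

The second return value corresponds to the images-per-second throughput described in Section 4.3.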
The findings show that relighting and segmentation together add only about 10 ms of overhead while significantly improving accuracy. The entire pipeline runs at roughly 25 ms per image, making it suitable for real-time applications on high-end GPUs and nearly real-time on modern CPUs. The ResNet backbone is more efficient, whereas the ViT model achieves the highest accuracy, albeit with a slight increase in latency.
6. Discussion
Although photometric augmentation is a common method, combining it with inference-time relighting offers a complementary training and inference approach that enhances robustness to changes in illumination.
The experimental results demonstrate that integrating illumination normalization, skin-aware segmentation, and quality-aware metric learning improves recognition robustness under challenging visual conditions. In particular, the dual illumination-handling strategy—photometric augmentation during training combined with neural relighting during inference—substantially mitigates performance degradation caused by uneven lighting. The observed improvements under low-light and side-lit conditions indicate that separating identity features from illumination variations enhances embedding stability.
The proposed U-Net-based skin segmentation module reduces background interference by suppressing non-facial regions such as hair, clothing, and cluttered environments. Rather than relying on threshold-based skin-color rules or explicit race-dependent modeling, the segmentation module operates through learned spatial attention. This architectural choice avoids the use of demographic priors that are present in some classical color-space approaches. However, it should be noted that this paper does not include quantitative subgroup-level fairness evaluation (e.g., FAR-gap or TPR-gap across ethnicity or gender).
The comparison between ResNet-100 and Vision Transformer backbones highlights a trade-off between efficiency and representational capacity. ResNet-100 provides lower latency and smaller memory overhead, making it more suitable for real-world deployment. The Vision Transformer backbone achieves slightly higher accuracy, particularly under pose and age variations, at the cost of increased computational complexity. This flexibility allows the framework to be adapted depending on deployment requirements.
Statistical evaluation across repeated runs shows low standard deviation (<0.3%), confirming the stability of the proposed architecture. Although improvements on LFW are marginal due to dataset saturation, more noticeable gains are observed on CFP-FP and the custom illumination dataset. This suggests that the main contribution lies in robustness under difficult lighting and pose conditions rather than incremental improvement on already saturated benchmarks.
While the segmentation-guided masking offers explicit spatial filtering, it is unlike attention-based methods such as spatial, channel, and transformer-based attention, which learn feature importance in a dynamic manner. Instead, this method uses fixed transformations guided by segmentation and illumination normalization. Although this enhances interpretability and stability, it might be less adaptable than fully learnable attention models.
Despite these benefits, several limitations persist. The relighting and segmentation modules need well-aligned, clean training data, and their performance may decline with severe occlusions, misalignments, or heavy noise. Although explicit demographic modeling is not employed, simply removing demographic inputs does not ensure bias elimination, as implicit bias can still stem from unbalanced training data.
Finally, while the proposed pipeline achieves real-time inference on high-end GPUs (~25 ms per image), it remains computationally demanding for low-power embedded or edge devices. Lightweight optimization and model compression techniques are necessary for broader deployment scenarios.
7. Limitations
Despite the promising results, several limitations of the proposed framework should be acknowledged.
First, the proposed approach primarily relies on combining existing components, including face detection, segmentation, and deep feature extraction modules. Although the main contribution lies in how these components are jointly designed and interact within a unified system, the framework does not present a completely new, standalone algorithm. Future research could investigate new architectures or theoretically grounded models to enhance the methodological contribution.
Second, the segmentation-guided masking strategy is compared only with a no-masking baseline and two attention mechanisms, SENet channel attention and CBAM spatial attention, on the custom illumination dataset (Section 4.5.3). Transformer-based attention has not been tested empirically. Consequently, it is unclear how the proposed masking compares with the full range of adaptive feature selection approaches.
Third, while photometric augmentation is commonly used to enhance robustness to illumination changes, this work focuses on combining it with inference-time relighting. However, a thorough analysis of how these two techniques interact has not been included and is a subject for future research.
Fourth, although the method does not include explicit demographic modeling, we have not conducted any quantitative fairness evaluations at the subgroup level (such as TPR gaps across ethnicity or gender) because the datasets do not contain demographic annotations. Consequently, we do not claim that the method is fair or unbiased. A formal fairness assessment will be addressed in future work.
Fifth, the performance enhancements observed on standard benchmarks like LFW are modest, as expected given the saturation of these datasets. While more significant improvements appear under difficult lighting conditions, their practical impact in real-world applications still needs validation.
Sixth, the ablation study primarily evaluates the incremental effect of each component sequentially. However, it does not investigate alternative architectural configurations or compare different strategies, limiting the comprehensiveness of the design space exploration.
Seventh, several components of the framework, including the segmentation and relighting modules, rely on accurately aligned, high-quality input data. Performance may decrease in scenarios involving severe occlusion, inaccurate face alignment, or low-resolution inputs. Importantly, errors in the segmentation masks can negatively impact the subsequent feature extraction stage.
Eighth, the full pipeline’s computational complexity remains relatively high due to multiple processing steps. While real-time performance is possible on high-end GPUs, deploying it on resource-constrained or edge devices remains difficult without additional optimization.
Ninth, the existing modulation mechanism is less complex than fully learnable attention-based methods. The current formulation does not incorporate adaptive attention mechanisms, may have limited flexibility, and may be interpreted as a simplified feature modulation strategy.
Finally, this paper does not provide a detailed analysis of failure cases. In practical use, the method may face difficulties in situations with severe lighting imbalance, significant occlusion, motion blur, or faulty preprocessing outputs. Exploring these failure modes systematically and developing more robust mitigation strategies are crucial areas for future research.
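As an illustration of the subgroup-level evaluation mentioned in the fourth limitation, the following is a minimal sketch of how TPR gaps could be computed if demographic annotations were available. The function name `subgroup_tpr_gap`, the `groups` array, and the fixed `threshold` are hypothetical and are not part of the proposed framework; the datasets used in this work do not provide such labels.

```python
import numpy as np

def subgroup_tpr_gap(scores, labels, groups, threshold):
    """Per-group TPR at a fixed threshold, plus the max-min gap.

    scores    : similarity score of each verification pair
    labels    : 1 for genuine (same identity) pairs, 0 for impostor pairs
    groups    : hypothetical demographic label of each pair
    threshold : a pair is accepted when its score exceeds this value
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    groups = np.asarray(groups)

    tpr = {}
    for g in np.unique(groups):
        genuine_in_group = (groups == g) & labels
        # Groups without any genuine pair are skipped: their TPR is undefined.
        if genuine_in_group.any():
            tpr[str(g)] = float(np.mean(scores[genuine_in_group] > threshold))

    gap = max(tpr.values()) - min(tpr.values()) if tpr else float("nan")
    return tpr, gap
```

A gap close to zero at the operating threshold would indicate similar true-positive rates across the annotated groups; it would not by itself establish that the method is fair.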
8. Conclusions
This paper presents a unified deep learning framework for illumination-robust face recognition. By integrating RetinaFace detection, photometric training augmentation, neural relighting, U-Net segmentation, and quality-aware metric learning, the system achieves performance competitive with the state of the art across multiple benchmarks.
Experimental results on standard benchmarks (LFW, CFP-FP, and AgeDB-30) and on a custom illumination dataset show that the model is competitive with the state of the art, with 99.8% accuracy on LFW and a TPR of 95.0% at FPR = 1 × 10⁻⁴ on CFP-FP. The integration of neural relighting and skin-aware segmentation consistently and significantly improves performance: relighting improves illumination uniformity by 1.35 ± 0.12, and segmentation boosts verification accuracy by up to 2%.
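For readers less familiar with the verification metric quoted above, the following is a minimal sketch of how a TPR at a fixed FPR can be computed from pairwise similarity scores. The function name `tpr_at_fpr` and the strict "score greater than threshold" acceptance rule are assumptions made for illustration; the exact evaluation protocol used in the experiments may differ.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=1e-4):
    """True-positive rate at the strictest threshold whose FPR <= target_fpr.

    scores : similarity score of each verification pair (higher = more similar)
    labels : 1 for genuine (same identity) pairs, 0 for impostor pairs
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    genuine = scores[labels]
    impostor = np.sort(scores[~labels])[::-1]  # impostor scores, descending

    # At most k impostor pairs may be accepted without exceeding target_fpr.
    k = int(np.floor(target_fpr * impostor.size))

    # Accepting only scores strictly above the (k+1)-th highest impostor score
    # lets through at most k impostor pairs.
    threshold = impostor[k] if k < impostor.size else -np.inf

    tpr = float(np.mean(genuine > threshold))
    return tpr, float(threshold)
```

Note that at FPR = 1 × 10⁻⁴ the threshold is determined by very few impostor pairs unless the evaluation set contains well over 10⁴ of them, so the estimate is sensitive to the size of the impostor set.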
In addition to accuracy, the proposed method offers practical versatility. The ResNet backbone strikes a strong balance between high accuracy and fast inference, whereas the Vision Transformer backbone delivers only slightly higher accuracy and is therefore better suited to offline applications or settings with ample computational resources. The modular structure also allows independent upgrades to the relighting, segmentation, or embedding components, ensuring compatibility with future work or deployment environments. Future work will apply this framework to video and multimodal face recognition, incorporating temporal coherence, depth, and thermal cues to further increase robustness under uncontrolled conditions. Another objective is to reduce dataset bias and improve generalization across demographic and environmental domains through self-supervised pre-training and demographic-prior-free loss regularization.