Skin Classification for Face Recognition Based on Deep Learning with U-Net and ResNet

Karamizadeh, Sasan; Shojae Chaeikar, Saman

doi:10.3390/electronics15091950


by Sasan Karamizadeh ¹,* and Saman Shojae Chaeikar ²,*

¹ Ershad Damavand Institute of Higher Education, Tehran 1416834311, Iran
² Department of Cybersecurity, Sydney International School of Technology and Commerce, Sydney, NSW 2000, Australia
* Authors to whom correspondence should be addressed.

Electronics 2026, 15(9), 1950; https://doi.org/10.3390/electronics15091950

Submission received: 9 February 2026 / Revised: 29 April 2026 / Accepted: 30 April 2026 / Published: 4 May 2026

(This article belongs to the Special Issue Advanced Face Recognition Technology in Computer Vision)


Abstract

Face recognition under uncontrolled lighting remains challenging due to variations in brightness, background noise, and low-quality features. This paper presents a unified deep learning model that integrates illumination normalization, skin-aware spatial modulation, and quality-based margin learning within a single inference process. Unlike earlier methods that treat relighting or segmentation as preprocessing, this approach directly integrates mask-guided feature modulation into embedding learning. The system comprises RetinaFace detection, photometric augmentation during training, lightweight neural relighting at inference, U-Net-based skin segmentation, and identity embeddings trained with ArcFace, AdaFace, or MagFace losses, with angular margins adapted to feature quality. Experiments on Labeled Faces in the Wild (LFW), Celebrities in Frontal-Profile (CFP-FP), Age Database 30 (AgeDB-30), and a custom illumination dataset demonstrate steady enhancements in difficult lighting conditions. The model reaches a competitive 99.8% accuracy on LFW and shows notable improvements on pose-hard CFP-FP and the custom dataset, such as a +2.6% increase in TPR at 1 × 10⁻⁴ FPR. The key innovations include: (i) mask-guided embedding modulation that embeds segmentation into feature learning, (ii) a dual strategy combining training-time photometric data augmentation with inference-time neural relighting, and (iii) joint spatial–quality margin learning via AdaFace/MagFace. Finally, results confirm consistent gains under challenging illumination and pose variations.

Keywords:

face recognition; illumination robustness; RetinaFace; U-Net segmentation; ArcFace; AdaFace; MagFace; Vision Transformer; skin-aware embedding

1. Introduction

Facial recognition is important for security, user authentication, and human–computer interaction. Under uncontrolled conditions, identifying a person from their face is challenging due to variations in lighting during photography (e.g., angle and brightness). In addition, illumination is a common source of poor performance, as the direction and intensity of light can significantly affect the appearance of a person’s face and reduce the separability of facial features from those of others in facial-recognition algorithms.

In addition to face recognition, recent advances in related visual recognition areas focus on methods that enhance illumination and domain robustness. For instance, spatial-channel collaborative multi-scale graph interaction deep transfer learning has been successfully used for unsupervised fault detection in rotating machinery, showing strong ability to transfer features across domains [1]. Similarly, adaptive fused domain-cycling variational Generative Adversarial Networks (GANs) have addressed domain shift challenges in aerial image segmentation [2]. Additionally, multimodal deep learning techniques have been developed for arc detection in railway systems, providing robust results even with limited training data [3].

The rapid expansion of digital media and surveillance technologies has made face recognition an integral component of contemporary security, authentication, and social interaction systems. Nonetheless, achieving reliable recognition in unconstrained settings remains a formidable challenge due to variations in illumination, pose, expression, and background clutter. Among these factors, illumination changes remain among the most consistent causes of recognition failure, as variations in light direction, intensity, and color temperature can considerably alter facial appearance, reducing the discriminative power of learned features [4].

Although deep learning-based systems, such as FaceNet and ArcFace, have achieved substantially higher recognition accuracy than conventional handcrafted techniques (e.g., Discrete Cosine Transform (DCT), Singular Value Decomposition (SVD), Hidden Markov Model (HMM)), these methods lack explicit mechanisms to normalize illumination or improve robustness under illumination variations [5]. This results in unstable face embeddings and poor performance under non-uniform lighting or demographic imbalance. Previous efforts to address these issues are largely hand-engineered, using preprocessing, color-space thresholding, and classification based on demographic descriptors. These approaches entail no novel normalization and introduce generalization bias [6].

To address these shortcomings, we propose a unified deep learning framework that integrates illumination normalization, skin-aware segmentation, and margin-based metric learning into a single end-to-end pipeline [7]. The framework includes RetinaFace for robust face detection and alignment, a U-Net-based skin segmentation module that extracts facial regions, and ResNet and Vision Transformer (ViT) backbones enhanced with ArcFace, AdaFace, and MagFace losses to enable discriminative, quality-aware embedding learning. Complementing these components, a dual illumination-handling technique that combines photometric augmentation during training with neural relighting during inference ensures consistent feature extraction across variable illumination conditions [8].

Although deep learning models such as FaceNet, CosFace, and ArcFace have significantly improved accuracy over traditional handcrafted features, they still lack effective methods for handling illumination changes or background clutter. Additionally, many existing segmentation and normalization methods rely on demographic assumptions, which can inadvertently introduce bias into recognition. To address these issues, we introduce an end-to-end face recognition system that integrates illumination normalization, background bias reduction, skin-color-aware segmentation, and quality-based metric learning into a single, efficient framework. This paper presents three main advancements beyond simple component integration:

(1): Mask-guided embedding modulation (Equation (2)), which employs the predicted skin mask to directly reweight convolutional features or ViT tokens during training.
(2): A dual illumination strategy that combines photometric augmentation during training with neural relighting at inference, an approach not previously used in face recognition.
(3): Joint spatial–quality margin learning that combines mask-guided features with quality-adaptive losses such as AdaFace and MagFace. These innovations are evaluated on challenging datasets with pose and illumination variations, as well as on saturated benchmarks.

In addition to conventional recognition frameworks, recent studies have explored domain adaptation and adversarial learning strategies to improve robustness under distribution shift. Conditional adversarial transfer learning models aim to align feature distributions across source and target domains to mitigate domain bias. Multi-domain GANs have also been proposed to synthesize cross-domain variations and enhance generalization of representations. Furthermore, reinforcement learning-based adaptive systems dynamically adjust model parameters to changing environmental conditions. While these approaches primarily address domain discrepancy, they do not explicitly integrate illumination-aware spatial modulation within the face embedding learning process, which is the focus of this paper.

The contribution lies not in any individual component but in the end-to-end integration of illumination normalization, segmentation-guided masking, and quality-aware metric learning within a single differentiable pipeline. Unlike previous approaches that treat relighting and segmentation as preprocessing steps, our framework jointly optimizes these modules during training. The skin mask directly influences feature representations (Equation (2)), and the dual illumination strategy (Section 3.2) establishes a unique synergy between training and inference. The inference pipeline is fully differentiable, but segmentation and relighting are pre-trained independently and then integrated.

Unlike prior papers that address illumination, segmentation, or margin learning separately, the proposed framework unifies these components within a differentiable inference architecture with multi-stage training. The remainder of the paper is organized as follows: Section 2 reviews related works; Section 3 presents the proposed design; Section 4 presents experimental results and analysis; and Section 5 summarizes our findings and outlines future work.

2. Related Work

2.1. Illumination Handling in Face Recognition

Although traditional methods for normalizing illumination (e.g., DCT filtering, SVD decomposition, Independent Component Analysis (ICA)-based methods) offer some robustness in controlled settings, they generalize poorly to real-world lighting conditions. By contrast, recent neural network-based approaches that relight images or apply photometric augmentation are highly effective at reducing illumination variations, yet they have been used primarily as standalone preprocessing steps rather than as integrated components of the method [9,10,11].

Recently, research has focused on learning-based illumination normalization. Neural relighting models [12] and photometric data augmentation approaches show promise in generating illumination-invariant embeddings. However, most methods apply normalization as a preprocessing step rather than integrating it directly into embedding or recognition backbones. As a result, although the embeddings are generally less affected by illumination variations, some sensitivity remains, underscoring the need to incorporate end-to-end illumination handling within the face recognition model.

2.2. Skin Segmentation and Demographic-Prior-Free Face Analysis

Color-space thresholding methods based on the luminance–chrominance (YCbCr) and hue, saturation, value (HSV) color spaces have limited robustness across skin tones and lighting conditions. Deep learning-based segmentation methods, including stacked autoencoders and U-Net models, improve segmentation reliability but often embed demographic assumptions. A demographic-prior-free segmentation approach that avoids race classification is needed to reduce bias [13,14]. Thresholding, in particular, often produced poor segmentations across different skin tones and lighting conditions. Even model-based classifiers, such as Gaussian or Bayesian methods [15,16,17], were unable to capture the full range of skin chrominance.

Deep learning-based methods have improved performance by learning adaptive features from large skin-pixel datasets. For example, ref. [14] implemented stacked autoencoders to produce an illumination-robust segmentation method. In more recent studies, researchers built on the work of [4] by using U-Net or Transformer-based architectures [18,19,20], thereby improving generalizability under more complex lighting conditions. Nevertheless, these methods, along with others, often rely on race classification or demographic priors, resulting in bias and accessibility issues.

In this paper, we introduce a U-Net-based skin segmentation module designed to reduce background-induced bias. It isolates facial skin regions without relying on demographic assumptions. This approach enhances focus on facial features and produces skin representations that promote stability in matching across diverse populations.

2.3. Deep Metric Learning and Quality-Aware Embedding

Margin-based softmax losses, such as SphereFace [21], CosFace [22], and ArcFace [23], improve intra-class compactness and inter-class separation; quality-aware variants (AdaFace, MagFace) adapt margins to image quality [23]. However, these methods do not explicitly handle illumination or background interference.

Recent research, including AdaFace [23], MagFace [16], and approaches based on ViT backbones [24], has proposed quality-aware margin adaptation, enabling embeddings to adapt to varying image reliability. Despite improvements in propagating quality, these methods do not formally account for the effects of illumination and background clutter. Research has also demonstrated promising generalization of embedding representations, but it has not yet fully integrated illumination normalization or other potential sources of bias. Table 1 compares the proposed framework with representative prior methods.

Table 1 summarizes the methodological differences between the proposed SEM framework and key prior approaches, each of which focuses on a different aspect of the problem. Margin-based methods such as ArcFace and AdaFace enhance the discriminability of embeddings but do not specifically address illumination variation or spatial bias. Illumination normalization techniques are typically applied as separate preprocessing steps, without integration with embedding supervision. Conversely, segmentation-based methods reduce background influence but are not directly linked to metric learning.

In contrast, the proposed framework integrates illumination normalization, skin-aware spatial modulation, and quality-adaptive margin learning into a single inference process. Rather than preprocessing alone, the segmentation mask directly shapes feature representations, enabling spatially guided embedding learning. This unified approach distinguishes itself from prior work and enhances robustness under challenging lighting and background conditions.

Previously proposed methods fall into three categories: illumination normalization techniques, segmentation-based preprocessing, and margin-based metric learning. However, earlier illumination techniques are typically unsupervised, segmentation is rarely combined with metric-learning layers, and margin-based losses often lack spatial-bias suppression. To our knowledge, no existing work simultaneously optimizes illumination normalization, spatial skin masking, and quality-adaptive angular margins within a single recognition framework.

3. Model Architecture

The proposed model comprises five modules: (1) detection and alignment, (2) illumination robustness, (3) skin-aware segmentation, (4) metric learning, and (5) verification. Each module in the overall inference pipeline is fully differentiable. However, the segmentation and relighting modules are trained in a multi-stage manner and then jointly integrated into the recognition framework, enabling the model to learn adaptively from raw data without any handcrafted preprocessing. Unlike conventional pipelines, the proposed model incorporates adaptive feature modulation guided by image-specific cues, enabling condition-aware representation learning. The overall flow is shown in Figure 1.

3.1. Face Detection and Alignment

Reliable localization is the foundation of robust face recognition. We employ RetinaFace, a single-stage detector that jointly predicts bounding boxes and five facial landmarks (eye centers, nose tip, and mouth corners). Landmark-based alignment normalizes pose and scale, ensuring geometrically consistent inputs for subsequent modules. The input face image is denoted by $x \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ represent the image height and width, respectively. Faces are then aligned and normalized to 112 × 112 resolution.

3.2. Photometric Augmentation

Changes in lighting conditions can lead to severe distortion in appearance. To reduce the impact of lighting changes, we implement a dual illumination-handling strategy:

Photometric augmentation (during training): It simulates extreme lighting variations (brightness, contrast, gamma, and color temperature).
Neural relighting (during inference): A lightweight neural relighting model maintains consistent luminance across facial regions.

This trains the embedding model to separate identity features from illumination changes.

The proposed model adjusts feature modulation based on image-dependent cues, enabling flexible reweighting of feature representations rather than uniform scaling. Although photometric augmentation is a common technique, combining it with inference-time relighting offers a complementary approach for training and inference that enhances robustness to illumination changes.

While the formulation might look like basic scaling, the framework actually involves image-dependent transformations via segmentation and relighting modules. These elements apply condition-aware changes to the input before feature extraction, setting this approach apart from simple uniform scaling.

The U-Net segmentation module creates a soft skin-probability mask M(x, y). This mask is resized and multiplied element-wise with the intermediate feature maps F(x, y) from the backbone (ResNet or ViT). The modified features F′(x, y) = α M(x, y) F(x, y) + (1 − α) F(x, y) are then passed to the next layers. For ViT backbones, patch-level mask scores t_k are obtained by averaging pixel masks within each patch and are used to scale token embeddings before multi-head self-attention. This method helps reduce background interference while preserving important skin region details.

3.3. Skin Segmentation with U-Net

A U-Net model generates a continuous skin-probability mask. The mask is applied to the feature map or ViT tokens, suppressing non-skin regions. Segmentation is supervised with binary cross-entropy, Dice loss, total variation regularization, and an area-prior constraint.

The U-Net architecture comprises four encoder–decoder levels with channel dimensions of 64, 128, 256, and 512. Each block contains two 3 × 3 convolutional layers, followed by Batch Normalization and ReLU activation. Skip connections link the encoder and decoder stages. The final layer applies a 1 × 1 convolution followed by a sigmoid activation to produce a continuous skin probability mask at 112 × 112 resolution.

3.4. Neural Relighting Network

The relighting model is a lightweight five-layer convolutional network with residual connections. It predicts an illumination-adjusted RGB output using a combination of L1 reconstruction and structural similarity (SSIM) losses. The network contains approximately 1.2 M parameters and runs in real time.

Segmentation stage.

The U-Net produces a soft mask $M(x, y) \in [0, 1]$ indicating the likelihood of each pixel being skin.

$$M(x, y) = \sigma\left(W_{f} \cdot F(x, y) + b\right) \tag{1}$$

where
$M(x, y)$: the predicted skin probability at pixel $(x, y)$,
$\sigma(\cdot)$: the sigmoid function,
$F(x, y)$: the feature activation from the decoder layer at position $(x, y)$,
$W_{f}$ and $b$: learnable weights and bias, with $b$ applied before activation,
$\tau$: temperature parameter.

Dice and binary cross-entropy losses supervise segmentation, while total-variation and area-ratio regularizers promote smooth, realistic masks, as shown in Figure 2.
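As an illustration of Equation (1), the following minimal NumPy sketch applies a per-pixel linear map followed by a sigmoid to a decoder feature map. The function names, the array layout (height, width, channels), and the random inputs are assumptions made for the example; the temperature τ listed among the symbols does not appear in Equation (1) and is therefore left out.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def skin_mask(features, w_f, b):
    """Per-pixel skin probability M(x, y) = sigmoid(W_f . F(x, y) + b).

    features: (H, W, C) decoder activations (layout is an assumption).
    w_f:      (C,) weights of the final 1 x 1 convolution.
    b:        scalar bias.
    """
    logits = features @ w_f + b  # (H, W)
    return sigmoid(logits)


# Hypothetical toy input: a 4 x 4 feature map with 8 channels.
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 4, 8))
w_f = rng.normal(size=8)
M = skin_mask(F, w_f, b=0.0)
```

Every entry of `M` lies strictly between 0 and 1, so it can be read as a soft mask rather than a hard skin/non-skin decision.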

Mask-guided embedding

The proposed mask-guided embedding mechanism is defined by Equation (2), where the segmentation mask M highlights skin regions while suppressing irrelevant background. The balance factor α controls the mask’s influence on the feature map, allowing the network to focus on more reliable facial regions while retaining contextual cues.

$$F'(x, y) = \alpha\, M(x, y)\, F(x, y) + (1 - \alpha)\, F(x, y) \tag{2}$$

where
$F'(x, y)$: mask-weighted feature map,
$F(x, y)$: original feature map from the backbone,
$M(x, y)$: skin mask resized to match the spatial dimensions of $F$,
$\alpha \in [0, 1]$: a learnable gating parameter (initialized to 0.7).

For the Vision Transformer backbone, an average mask score is computed for each image patch and used to scale the corresponding token embedding before the Multi-Head Self-Attention layer. In other words, skin regions have a greater influence on the global representation than background regions, which contribute little to the context.
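The convolutional case of Equation (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the feature layout is (channels, height, width), α is treated as a fixed scalar rather than a learned parameter, and nearest-neighbour resizing is used because the text does not specify the interpolation.

```python
import numpy as np


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask (interpolation choice is an assumption)."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows[:, None], cols[None, :]]


def mask_modulate(features, mask, alpha=0.7):
    """Equation (2): F' = alpha * M * F + (1 - alpha) * F.

    features: (C, H, W) backbone feature map.
    mask:     (h, w) soft skin mask with values in [0, 1].
    alpha:    gating parameter (0.7 is the initial value given in the text).
    """
    _, height, width = features.shape
    m = resize_nearest(mask, height, width)
    # The mask is broadcast over the channel axis.
    return alpha * m[None, :, :] * features + (1.0 - alpha) * features


feats = np.ones((2, 4, 4))
background = mask_modulate(feats, np.zeros((8, 8)))  # pure background
skin = mask_modulate(feats, np.ones((8, 8)))         # pure skin
```

With α = 0.7, background positions keep 30% of their activation instead of being zeroed out, which is how the formulation retains contextual cues.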

3.5. Intuitive Effect

By focusing on skin regions, the model learns to ignore confounding backgrounds and occlusions, thereby improving robustness to lighting variations and across skin tones. The segmentation and embedding modules work together: U-Net sharpens spatial attention, while the backbone learns high-level identity features within this attentional context.

3.5.1. Margin-Based Metric Learning

The final embedding vector $z$ is $L_{2}$-normalized and trained using three different margin-based softmax losses (ArcFace, AdaFace, and MagFace) that enforce angular constraints, compacting intra-class clusters while enforcing inter-class separability:

$$L_{ArcFace} = -\frac{1}{N} \sum_{i} \log \frac{e^{s \cos(\theta_{y_{i}} + m)}}{e^{s \cos(\theta_{y_{i}} + m)} + \sum_{j \neq y_{i}} e^{s \cos \theta_{j}}} \tag{3}$$

where
$L_{ArcFace}$: ArcFace loss,
$m$: additive margin,
$N$: batch size,
$s$: scaling factor,
$\theta_{y_{i}}$: the angle between the feature and its corresponding class center,
$\theta_{j}$: the angle between the embedding and class $j$.

The ArcFace loss in Equation (3) uses an additive angular margin m to improve inter-class separation in the embedding space. By penalizing small angular distances between classes, it promotes more discriminative facial representations that are invariant to illumination. AdaFace and MagFace extend this idea by adapting the margin to image quality, enabling stable embeddings from low-quality and poorly lit images.
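To make Equation (3) concrete, the sketch below evaluates the ArcFace loss for a small batch in NumPy. It is a forward-pass illustration only: the function name and toy inputs are assumptions, gradients are not computed, and the special handling that practical implementations apply when θ + m exceeds π is omitted.

```python
import numpy as np


def arcface_loss(embeddings, centers, labels, s=64.0, m=0.5):
    """Forward pass of Equation (3).

    embeddings: (N, D) features; centers: (K, D) class centers;
    labels: (N,) integer class indices; s: scale; m: additive angular margin.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = np.clip(z @ w.T, -1.0, 1.0)  # cos(theta_j) for every class
    theta = np.arccos(cos)
    idx = np.arange(len(labels))
    logits = s * cos
    # The margin is added only to the angle of the ground-truth class.
    logits[idx, labels] = s * np.cos(theta[idx, labels] + m)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, labels].mean())


# Toy case: each embedding coincides with its own class center.
emb = np.eye(3)
cen = np.eye(3)
lab = np.array([0, 1, 2])
loss_no_margin = arcface_loss(emb, cen, lab, s=4.0, m=0.0)
loss_margin = arcface_loss(emb, cen, lab, s=4.0, m=0.5)
```

Even for perfectly aligned embeddings the margin raises the loss, which is the pressure that pushes classes further apart on the hypersphere.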

3.5.2. Regularization on the Mask

To prevent trivial masks (all-ones or all-zeros), we introduce weak regularizers as shown in Equations (4)–(6).

Total Variation (TV): encourages smoothness

$$L_{TV}(M) = \sum_{i, j} \sqrt{(M_{i + 1, j} - M_{i, j})^{2} + (M_{i, j + 1} - M_{i, j})^{2}} \tag{4}$$

where
$M_{i, j}$: the predicted skin mask at pixel $(i, j)$,
$L_{TV}$: total variation loss promoting spatial smoothness in the mask.

Differences across neighboring pixels penalize abrupt transitions.

Area Prior: constrains expected skin ratio

$$L_{area} = (\bar{M} - \alpha)^{2} \tag{5}$$

where
$\bar{M}$: mean of the predicted mask (average skin area),
$L_{area}$: penalizes deviation from the target skin-area ratio,
$\alpha$: target skin-area ratio (typically 0.6 for frontal faces).

The total loss is thus:

$$L_{seg} = L_{BCE} + L_{Dice} + \lambda_{1} L_{TV} + \lambda_{2} L_{area} \tag{6}$$

where
$L_{seg}$: total segmentation objective,
$\lambda_{1} = 0.01$ and $\lambda_{2} = 0.005$ (selected via grid search),
$L_{Dice}$: the Dice loss,
$L_{BCE}$: the binary cross-entropy loss.

The overall segmentation objective in Equation (6) is a weighted sum of binary cross-entropy, Dice, and regularization terms. This approach enables precise pixel-level predictions while promoting smoothness and anatomical plausibility in the resulting skin masks.
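The terms of Equations (4)–(6) can be written out as a short NumPy sketch. Several details are assumptions rather than statements from the text: the Dice loss uses the common form 1 − 2|P∩T| / (|P| + |T|), the total-variation sum skips the last row and column, the epsilon values are arbitrary, and all function names are hypothetical.

```python
import numpy as np


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())


def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss in its common form (the exact form is an assumption)."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps))


def tv_loss(mask):
    """Equation (4): isotropic total variation over interior pixels."""
    dv = mask[1:, :-1] - mask[:-1, :-1]  # M[i+1, j] - M[i, j]
    dh = mask[:-1, 1:] - mask[:-1, :-1]  # M[i, j+1] - M[i, j]
    return float(np.sqrt(dv ** 2 + dh ** 2).sum())


def area_loss(mask, target_ratio=0.6):
    """Equation (5): squared deviation of the mean mask from the target ratio."""
    return float((mask.mean() - target_ratio) ** 2)


def seg_loss(pred, target, lam1=0.01, lam2=0.005, target_ratio=0.6):
    """Equation (6) with the weights reported in the text."""
    return (bce_loss(pred, target) + dice_loss(pred, target)
            + lam1 * tv_loss(pred) + lam2 * area_loss(pred, target_ratio))


# Toy ground truth: a 4 x 4 skin square inside an 8 x 8 image.
truth = np.zeros((8, 8))
truth[2:6, 2:6] = 1.0
good = seg_loss(truth.copy(), truth)  # perfect prediction
bad = seg_loss(1.0 - truth, truth)    # inverted prediction
```

The regularizers stay small next to the supervised terms at these weights, which matches their description as weak constraints against trivial masks.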

3.6. Identity Feature Extractor

We adopt two backbones:

ResNet-100/iResNet for strong convolutional baselines.
Vision Transformer (ViT-Base), which leverages global self-attention.

For the transformer backbone, the input image is partitioned into non-overlapping patches of size $P \times P$. For each patch, we compute a patch-level mask by averaging the pixel masks, as shown in Equation (7).

$$t_{k} = \frac{1}{|p_{k}|} \sum_{(x, y) \in p_{k}} M(x, y) \tag{7}$$

where
$t_{k}$: the mask score for the $k^{th}$ image patch $p_{k}$,
$|p_{k}|$: number of pixels in the patch (e.g., 16 × 16 = 256),
$p_{k}$: the set of pixel coordinates in the $k^{th}$ patch,
$M(x, y)$: skin probability mask.

Each token embedding $E_{k}$ is then modulated by its mask score, as shown in Equation (8).

$$E_{k}' = t_{k} \cdot E_{k} \tag{8}$$

where
$E_{k}$: original token embedding from the ViT patch projection layer,
$E_{k}'$: mask-scaled embedding,
$t_{k}$: mask weight determining the importance of the patch.

The masked tokens $E'$ are then passed through the Multi-Head Self-Attention (MSA) and Feed-Forward Network (FFN) blocks, as shown in Equation (9).

$$z' = \mathrm{FFN}(\mathrm{MSA}(E')) \tag{9}$$

where
MSA: Multi-Head Self-Attention,
FFN: a two-layer feed-forward network,
$E'$: the sequence of mask-scaled token embeddings,
$z'$: output representation from the encoder block.

This ensures that skin-relevant tokens contribute more strongly to the global representation, while still allowing non-skin tokens to provide contextual cues.
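Equations (7) and (8) amount to average pooling of the mask followed by a per-token scaling. The NumPy sketch below illustrates this for a 112 × 112 mask and 16 × 16 patches; the function names, the row-major patch ordering, and the omission of any class token are assumptions made for the example.

```python
import numpy as np


def patch_mask_scores(mask, patch):
    """Equation (7): mean mask value inside each non-overlapping patch.

    mask:  (H, W) soft skin mask; H and W are assumed divisible by `patch`.
    Returns a vector of length (H // patch) * (W // patch), row-major.
    """
    height, width = mask.shape
    gh, gw = height // patch, width // patch
    blocks = mask[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return blocks.mean(axis=(1, 3)).reshape(-1)


def scale_tokens(tokens, scores):
    """Equation (8): E'_k = t_k * E_k for every patch token."""
    return scores[:, None] * tokens


# Toy mask: the top half of the image is skin, the bottom half is background.
mask = np.zeros((112, 112))
mask[:56, :] = 1.0
scores = patch_mask_scores(mask, patch=16)  # 7 x 7 = 49 patches
tokens = np.ones((49, 8))                   # hypothetical 8-dim embeddings
scaled = scale_tokens(tokens, scores)
```

Patches that straddle the skin boundary receive intermediate scores (0.5 in the fourth patch row here), so their tokens are attenuated rather than removed.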

Embeddings are optimized using margin-based softmax losses:

ArcFace: hypersphere angular separation,

AdaFace: adaptive quality-aware margins,

MagFace: embedding quality calibration.

These losses yield compact intra-class clusters and well-separated inter-class boundaries, outperforming standard softmax and handcrafted descriptors.

3.7. Classifier and Scoring

Embeddings are normalized using the L2 norm, and recognition is performed using cosine similarity. Verification thresholds are calibrated on a validation set, and performance is reported using ROC, EER, and TPR@FPR. AdaFace and MagFace evaluations yield consistent scores across image quality levels.
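The scoring step can be illustrated with a small NumPy sketch that computes cosine similarity between embedding pairs and reads off the TPR at a target FPR. The function names, the thresholding convention (accept when the score is strictly above the threshold), and the toy scores are assumptions; a meaningful estimate at FPR = 1 × 10⁻⁴ would need at least ten thousand impostor pairs.

```python
import numpy as np


def cosine_scores(a, b):
    """Row-wise cosine similarity between two (N, D) embedding arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)


def tpr_at_fpr(genuine, impostor, target_fpr):
    """TPR at the strictest threshold whose FPR does not exceed target_fpr.

    genuine:  similarity scores of same-identity pairs.
    impostor: similarity scores of different-identity pairs.
    Returns (tpr, threshold); a pair is accepted when score > threshold.
    """
    imp = np.sort(impostor)[::-1]                     # descending
    allowed = int(np.floor(target_fpr * len(imp)))    # tolerated false accepts
    threshold = imp[allowed] if allowed < len(imp) else -np.inf
    tpr = float((genuine > threshold).mean())
    return tpr, float(threshold)


genuine = np.array([0.9, 0.8, 0.7, 0.2])
impostor = np.array([0.75, 0.5, 0.3, 0.25, 0.15, 0.1, 0.05, 0.0, -0.1, -0.2])
strict_tpr, strict_thr = tpr_at_fpr(genuine, impostor, target_fpr=0.0)
loose_tpr, loose_thr = tpr_at_fpr(genuine, impostor, target_fpr=0.5)
```

In practice the threshold would be fixed on the validation set, as described above, and then applied unchanged to the test pairs.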

3.8. Summary of Novel Contributions

While the individual components (RetinaFace, U-Net, photometric augmentation, neural relighting, and margin-based losses) are not new, the novelty of this work lies in their joint differentiable integration within a single face recognition pipeline. In other words, the novelty is in the integration, not in any individual component. Specifically:

Mask-guided embedding modulation (Equation (2)): The predicted skin mask is used as a spatially varying, learned attention directly on feature maps and ViT tokens during training, not as a preprocessing step.
Dual illumination strategy (Section 3.2): Combining training-time photometric augmentation with inference-time neural relighting—a complementary training–inference synergy not previously shown in face recognition.
Joint spatial–quality margin learning: The mask-guided features are combined with quality-adaptive losses (AdaFace, MagFace) that adapt angular margins to image quality.

To our knowledge, no prior work simultaneously optimizes illumination normalization, spatial skin masking, and quality-adaptive angular margins in an end-to-end differentiable pipeline. Table 1 highlights these differences against prior methods.

3.9. Training Strategy

The network is trained on the MS1M-ArcFace dataset and then fine-tuned on target datasets. Optimization uses AdamW with a cosine-annealed learning rate (reducing the learning rate from an initial value to zero) and Partial-FC sampling to improve efficiency.

At inference time, faces pass through the detection, relighting, segmentation, and embedding modules in turn. Verification scores are reported as the cosine similarity between normalized embeddings. Figure 3 illustrates the architecture of the proposed model and how its components fit together.

The U-Net and the relighting network are pre-trained separately on skin segmentation and illumination correction tasks, respectively. During face recognition training, their weights are frozen initially and then fine-tuned together with the backbone.

Figure 3 presents a flowchart of the components and processes of the proposed system. The first step is RetinaFace, which uses a ResNet-50 backbone, feature pyramid networks, and context modules, among other techniques, to detect face bounding boxes and landmarks for alignment, ensuring precise localization even under pose variations or occlusions. The illumination module then applies photometric augmentation during training and neural relighting at inference, normalizing lighting conditions and creating illumination-invariant face representations. Next, a U-Net segmentation network with an encoder–decoder architecture and skip connections produces a soft skin-probability mask that emphasizes skin regions while suppressing background and non-skin areas, thereby reducing clutter and bias in feature extraction. The embedding module then processes the mask-modulated features and employs either ResNet-100 or a Vision Transformer (ViT-Base) as the backbone. While ResNet generates hierarchical features via residual bottleneck blocks, ViT employs patch embeddings and self-attention layers to capture global dependencies. Both methods produce compact 512-dimensional embeddings guided by skin-aware attention. The last stage applies margin-based metric learning losses, such as ArcFace, AdaFace, and MagFace, which optimize intra-class compactness and inter-class separation while incorporating quality-awareness to ensure stable embeddings. The final recognition step employs cosine similarity, enabling fast and efficient verification and user identification from large datasets.

Unlike previous methods that treat segmentation, relighting, or margin losses as separate preprocessing or post-processing steps, the proposed model integrates them into a single, differentiable pipeline for joint optimization. The key innovations include: (a) the modulation in Equation (2) employs a spatially varying learned mask instead of a fixed attention map; (b) the dual illumination approach (Section 3.2) creates a synergy between training and inference; (c) to our knowledge, no other work combines mask-guided spatial modulation with quality-adaptive angular margins. Table 1 (included earlier) illustrates these differences.

4. Experiments

4.1. Experimental Setup

All experiments were conducted under standardized verification settings to ensure a fair comparison with prior works.

4.1.1. Training Data and Cleaning Procedure

Data cleaning involved using tools to assess image quality and ensure that no duplicate images or identities were present. After this process, 4.67 million images associated with 85,200 unique individuals remained in the MS1M-ArcFace dataset. The images were aligned using the RetinaFace model and resized to 112 × 112 pixels.

4.1.2. Pre-Training and Model Initialization

Public ArcFace MS1M pre-trained weights were used to initialize all backbone models (ResNet-100 and ViT-Base) rather than random initialization. The U-Net segmentation model and the neural relighting model were trained separately and then integrated into the main pipeline for evaluation. All backbone models and baseline configurations were initialized using identical MS1M-ArcFace pretrained weights to ensure fair comparison.

4.1.3. Training Hyperparameters

The hyperparameters were selected via grid search using a 5% MS1M validation split.

Optimizer: Stochastic Gradient Descent (SGD) with momentum = 0.9,
Learning Rate: 0.1 with cosine decay,
Weight Decay: 5 × 10⁻⁴,
Margin (ArcFace): m = 0.5,
Scale Factor: s = 64,
Batch Size: 512,
Epochs: 100,
Loss: ArcFace, AdaFace, or MagFace, depending on the experiment.
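For reference, the settings listed above can be collected into a small configuration object together with a cosine decay schedule. This is a sketch under assumptions: the class and field names are hypothetical, the schedule is stepped once per epoch, and it decays to zero with no warm-up, none of which is specified in the list itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from Section 4.1.3 (field names are illustrative)."""
    optimizer: str = "sgd"
    momentum: float = 0.9
    base_lr: float = 0.1
    weight_decay: float = 5e-4
    margin: float = 0.5      # ArcFace margin m
    scale: float = 64.0      # scale factor s
    batch_size: int = 512
    epochs: int = 100
    loss: str = "arcface"    # or "adaface" / "magface", per experiment


def cosine_lr(epoch, cfg):
    """Cosine decay from cfg.base_lr at epoch 0 to zero at cfg.epochs."""
    return 0.5 * cfg.base_lr * (1.0 + math.cos(math.pi * epoch / cfg.epochs))


cfg = TrainConfig()
schedule = [cosine_lr(e, cfg) for e in range(cfg.epochs + 1)]
```

Under these assumptions the learning rate is 0.1 at the start, 0.05 halfway through, and reaches zero after the final epoch.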

4.1.4. Partial-Fully Connected (FC) Training

To address the large number of identities, Partial-FC was used with a sampling rate of 0.3, in accordance with the official InsightFace training protocol.
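The idea behind Partial-FC sampling can be sketched as follows: for each batch, keep every class center that appears in the batch and fill the rest of the sampled set with randomly chosen negative classes. The code below is a single-process NumPy illustration with hypothetical names, not the distributed InsightFace implementation.

```python
import numpy as np


def sample_class_centers(labels, num_classes, rate=0.3, rng=None):
    """Pick the class centers used for one training step.

    labels:      (N,) integer identities present in the batch.
    num_classes: total number of identities.
    rate:        fraction of class centers to keep (0.3 in the text).
    Returns (sampled, remapped): sorted sampled class ids, and the batch
    labels re-indexed into the sampled set.
    """
    rng = np.random.default_rng() if rng is None else rng
    positives = np.unique(labels)
    # Positive classes are always kept, even if they exceed the quota.
    num_sample = max(int(round(rate * num_classes)), len(positives))
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    extra = rng.choice(negatives, size=num_sample - len(positives), replace=False)
    sampled = np.sort(np.concatenate([positives, extra]))
    remapped = np.searchsorted(sampled, labels)
    return sampled, remapped


batch_labels = np.array([5, 17, 5, 99])
sampled, remapped = sample_class_centers(
    batch_labels, num_classes=100, rate=0.3, rng=np.random.default_rng(0)
)
```

The margin-based softmax is then evaluated over the sampled centers only, which reduces the cost of the final layer roughly in proportion to the sampling rate.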

4.1.5. Use of Relighting and Segmentation During the Training Process

Segmentation: A U-Net was trained independently and used during both training and inference as a mask on the backbone features.
Photometric augmentation: Applied only during training of the backbone.
Neural relighting: Used only at inference, not during training.

4.1.6. Component Analysis

To quantify contributions of each module, ablation experiments were conducted:

Baseline (ResNet-100 + ArcFace): 98.7% LFW accuracy,
+Photometric Augmentation: +2.1% CFP-FP improvement,
+Neural Relighting: +3.4% improvement under side lighting,
+U-Net Skin Segmentation: +1.6% Rank-1 identification improvement,
+ViT & MagFace: Best overall accuracy, with moderate computational cost.

As shown in Table 2, the baseline model shows reduced robustness under side-lit and low-light conditions. Adding photometric augmentation during training significantly improves performance, particularly on CFP-FP. The best illumination robustness is achieved when neural relighting is applied at inference, yielding the highest TPR@1 × 10⁻⁴. This confirms that both augmentation and relighting contribute to substantial gains under challenging lighting.

Without segmentation, as noted in Table 3, the model remains sensitive to background noise and non-skin artifacts. Incorporating the U-Net skin mask improves Rank-1 accuracy and enhances consistency across different backgrounds. The best results are achieved when segmentation is applied directly to ViT tokens, indicating that mask-guided token weighting yields more discriminative and stable embeddings and avoids explicit demographic modeling by reducing background bias.

ArcFace provides a strong baseline, but AdaFace improves performance on low-quality or noisy images through its adaptive margin. MagFace achieves the best overall performance across all benchmarks, with the lowest EER and the highest verification accuracy. These findings, as shown in Table 4, indicate that quality-aware margin modeling yields measurable benefits, particularly when combined with the proposed illumination and segmentation modules.

4.2. Datasets

We used the MS1M-ArcFace dataset, a large-scale, refined dataset comprising millions of face images with high inter- and intra-class variability, for training. To assess generalization, we conducted testing on a variety of benchmark datasets. The description of the datasets used is provided below, and their technical specifications are listed in Table 5.

MS1M-ArcFace (Training Dataset). A subset of the MS-Celeb-1M dataset cleaned by [25] and made available to the public. It contains approximately 5.8 million images of around 85,000 distinct identities after noise removal and alignment. This large dataset is commonly used to train margin-based softmax losses, including ArcFace, AdaFace, and MagFace [16].
LFW. The LFW dataset comprises 13,233 face images of 5749 individuals, primarily obtained from the internet and captured in uncontrolled settings. It is widely used as a benchmark for face verification; hence, it is employed to draw conclusions about system performance in unconstrained environments [26].
CFP-FP. The CFP-FP dataset comprises 7000 image pairs from 500 people and is designed to verify frontal and profile views. It is specifically designed to assess the performance of algorithms under extreme pose changes [27].
AgeDB-30. The AgeDB dataset comprises images of celebrities from different age groups over an extended period. AgeDB-30 comprises 12,240 images from 440 participants with a maximum age difference of 30 years; thus, it is a benchmark for age-invariant face recognition [28].

Custom Illumination Dataset. To evaluate robustness under challenging lighting, we constructed a dataset from two sources:

Extracted images: A total of 5000 face images were randomly sampled from LFW, CFP-FP, and AgeDB-30, ensuring no identity overlap with the MS1M-ArcFace training set. The sampling preserved the original pose and expression variations. The extracted set contains 3500 distinct subjects, each with 1–3 images.
Synthetically generated images: Another 5000 images were created by applying controlled illumination transformations to the extracted faces using the following operations (implemented in OpenCV):
○ Gamma correction ($\gamma \in [0.3, 1.8]$) to simulate low-light and overexposure.
○ Contrast adjustment (contrast factor $\in [0.5, 2.0]$).
○ Directional lighting masks (side-light from left/right, top-down, bottom-up) using a radial gradient overlay.
○ Mixed lighting: combinations of the above.

The final dataset contains 10,000 images from 3500 subjects, with approximately 30% low-light, 40% side-light, and 30% overexposed/front-light conditions. All images were aligned with RetinaFace and resized to $112 \times 112$ pixels. This dataset is used solely to evaluate illumination robustness; it does not replace standard benchmarks.
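
The illumination transformations listed above can be sketched on a small grayscale image stored as nested lists with values in [0, 1]. This is a simplified illustration rather than the OpenCV code used to build the dataset. The power-law convention (output = input^γ, so γ > 1 darkens) is an assumption, and the side-light mask here is a linear horizontal ramp instead of the radial gradient overlay described in the text. All function names are hypothetical.

```python
def gamma_correct(img, gamma):
    """Power-law transform; with values in [0, 1], gamma > 1 darkens and gamma < 1 brightens."""
    return [[p ** gamma for p in row] for row in img]


def adjust_contrast(img, factor):
    """Scale deviations from the image mean by `factor`, clipping to [0, 1]."""
    count = sum(len(row) for row in img)
    mean = sum(sum(row) for row in img) / count
    return [[min(max(mean + factor * (p - mean), 0.0), 1.0) for p in row] for row in img]


def side_light(img, strength=0.6, from_left=True):
    """Attenuate pixels with distance from the lit side (linear ramp, a simplification)."""
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x, p in enumerate(row):
            t = x / (width - 1) if width > 1 else 0.0
            gain = 1.0 - strength * (t if from_left else 1.0 - t)
            new_row.append(p * gain)
        out.append(new_row)
    return out
```

Mixed lighting corresponds to composing these functions, for example `side_light(gamma_correct(img, 1.5))`.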

The datasets were selected to subject the proposed model to rigorous testing across a wide range of diverse and challenging conditions. MS1M-ArcFace enables large-scale training of the backbone networks with margin-based losses, resulting in highly discriminative embeddings. LFW is a classic benchmark for unconstrained verification; thus, the results can be directly compared with those of previous studies. CFP-FP is devoted exclusively to assessing pose robustness, which is particularly important in real-world face recognition, given the prevalence of profile views. AgeDB-30 has been added to the evaluation to assess the model’s ability to handle large age gaps, thereby directly addressing age-invariant recognition. Finally, the custom illumination dataset was designed to directly test the proposed photometric augmentation and neural relighting modules, thereby ensuring the system’s reliability under very poor or uneven lighting conditions. In combination, these datasets constitute a comprehensive evaluation framework that accounts for generalization, pose variation, age progression, and illumination robustness.

4.3. Evaluation Metrics

The effectiveness of the proposed model will be thoroughly assessed using both threshold-dependent and threshold-independent evaluation metrics. Threshold-dependent metrics such as True Positive Rate (TPR), False Positive Rate (FPR), and overall Accuracy determine the system’s performance at specific decision thresholds. On the other hand, threshold-independent measures such as the Receiver Operating Characteristic (ROC) curve, Area under the Curve (AUC), and Equal Error Rate (EER) provide a more comprehensive understanding of the model’s ability to distinguish among various operating conditions. In addition, the True Negative Rate (TNR) is considered to evaluate specificity, thereby enabling balanced performance across positive and negative verification outcomes. Moreover, runtime efficiency is scrutinized to ensure that the proposed architecture not only yields high accuracy but also maintains an inference speed suitable for real-time or embedded deployment environments.

True Positive Rate (TPR) and False Positive Rate (FPR)

For a given decision threshold $T$, the TPR and FPR are defined in Equations (10) and (11):

$$\mathrm{TPR}(T) = \frac{\mathrm{TP}(T)}{\mathrm{TP}(T) + \mathrm{FN}(T)}$$

(10)

The True Positive Rate (TPR), also known as sensitivity or recall, is defined in Equation (10). TPR is the percentage of correctly identified positive (matching) face pairs among all actual positive samples. The higher the TPR, the better the ability to recognize bona fide identities.

$$\mathrm{FPR}(T) = \frac{\mathrm{FP}(T)}{\mathrm{FP}(T) + \mathrm{TN}(T)}$$

(11)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.

The False Positive Rate (FPR) is defined in Equation (11) as the ratio of false-positive predictions (false matches) to all negative pairs. Minimizing FPR is essential for reliable verification, particularly in security-sensitive applications.
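
As a minimal sketch of Equations (10) and (11), the function below counts the four outcomes for a list of similarity scores and binary labels (1 for a matching pair, 0 for a non-matching pair). The convention that a pair is accepted when its score is at or above the threshold is an assumption, and the function names are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, TN, FN) when pairs with score >= threshold are accepted."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        accepted = score >= threshold
        if label == 1 and accepted:
            tp += 1
        elif label == 1:
            fn += 1
        elif accepted:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def tpr_fpr(scores, labels, threshold):
    """TPR and FPR at one decision threshold, following Equations (10) and (11)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```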

Receiver Operating Characteristic (ROC) Curve

The ROC curve is obtained by plotting TPR against FPR as the threshold $T$ varies. The Area under the Curve (AUC) is computed as shown in Equation (12).

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$$

(12)

The ROC curve is the path traced by (FPR, TPR) pairs across different decision thresholds, and Equation (12) summarizes it as a single area value. The ROC curve indicates the model’s overall ability to discriminate between classes, regardless of the threshold used.
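
The integral in Equation (12) is usually approximated numerically. The sketch below sweeps the threshold over the observed scores and applies the trapezoidal rule; it assumes that both matching and non-matching pairs are present, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold over the observed scores."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / num_neg, tp / num_pos))
    return points


def auc_trapezoid(points):
    """Trapezoidal approximation of the area under a curve given as sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

The quadratic sweep keeps the example short; a sorted single pass would be preferable for benchmark-sized pair lists.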

Equal Error Rate (EER)

The Equal Error Rate is the operating point where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR), shown in Equations (13) and (14).

The area under the ROC curve (AUC), computed using Equation (12), summarizes overall recognition performance across all thresholds. In general, the larger the AUC, the better the model’s ability to distinguish between genuine and impostor pairs with high confidence.

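$$\mathrm{EER} = \mathrm{FAR}(T^{*}) = \mathrm{FRR}(T^{*}), \quad \text{where } T^{*} \text{ is the threshold at which } \mathrm{FAR} = \mathrm{FRR}$$

(13)

with

$$\mathrm{FAR}(T) = \frac{\mathrm{FP}(T)}{\mathrm{FP}(T) + \mathrm{TN}(T)}, \qquad \mathrm{FRR}(T) = \frac{\mathrm{FN}(T)}{\mathrm{TP}(T) + \mathrm{FN}(T)}$$

(14)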

Equation (13) defines the Equal Error Rate (EER), the point at which the false acceptance and false rejection rates of Equation (14) are equal. This single-point measure is commonly used to express the trade-off between security and usability in verification systems.
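
On a finite set of pairs, FAR and FRR rarely coincide exactly, so the EER is commonly estimated at the threshold where the two rates are closest. The sketch below follows that convention, which is an assumption rather than a detail stated in the text; the function names are hypothetical.

```python
def far_frr(scores, labels, threshold):
    """FAR and FRR from Equation (14), accepting pairs with score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return fp / (fp + tn), fn / (tp + fn)


def equal_error_rate(scores, labels):
    """Return (EER estimate, threshold) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(scores)):
        far, frr = far_frr(scores, labels, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```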

Verification Rate at Fixed FPR

Verification accuracy is often reported at strict operating points such as $\mathrm{FPR} = 10^{-3}, 10^{-4}, 10^{-5}$:

$$\mathrm{VR}@\mathrm{FPR} = \mathrm{TPR}(T) \,\big|\, \mathrm{FPR}(T) = \alpha$$

(15)

where $\alpha$ is the chosen false positive rate.

Equation (15) gives the verification rate, that is, the TPR at the threshold where the FPR equals $\alpha$. Overall verification accuracy, by contrast, is the fraction of all pairs that are either correctly accepted or correctly rejected, and it provides a direct estimate of the total classification accuracy for face verification tasks.
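
A minimal sketch of the fixed-FPR operating point in Equation (15): the threshold is placed so that at most a fraction α of the non-matching pairs is accepted, and the TPR is then read off at that threshold. The strict "greater than" acceptance rule is an implementation choice made here to guarantee FPR ≤ α, and the function name is hypothetical.

```python
def tpr_at_fpr(scores, labels, alpha):
    """Return (TPR, threshold) such that the FPR does not exceed alpha."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(alpha * len(negatives))  # impostor pairs that may be accepted
    if allowed >= len(negatives):
        threshold = float("-inf")
    else:
        threshold = negatives[allowed]  # accept only scores strictly above this one
    tpr = sum(1 for s in positives if s > threshold) / len(positives)
    return tpr, threshold
```

Note that an operating point of α = 10⁻⁴ is only resolvable when the benchmark provides at least 10⁴ non-matching pairs; with fewer pairs, `allowed` is zero and the result corresponds to zero accepted impostors.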

Cumulative Match Characteristic (CMC) Curve

For identification tasks, the CMC curve measures the probability that the correct identity appears among the top-k ranked candidates, as shown in Equation (16).

$$\mathrm{CMC}(k) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ y_{i} \in \text{Top-}k(\hat{y}_{i}) \}$$

(16)

where $N$ is the number of queries, $y_{i}$ is the ground-truth identity of query $i$, and $\hat{y}_{i}$ is its ranked list of predictions.

The True Negative Rate (TNR), which equals specificity, follows from Equation (11) as $\mathrm{TNR}(T) = 1 - \mathrm{FPR}(T)$. This metric represents the percentage of non-matching pairs that are correctly rejected. A high TNR, together with a high TPR, ensures a very low false-match probability even when lighting and population conditions change.
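
The CMC definition in Equation (16) can be sketched as follows. Each query contributes a hit at rank k and at every higher rank once its true identity appears in the ranked candidate list; the function name and the list-of-lists input format are assumptions for illustration.

```python
def cmc_curve(ranked_predictions, true_ids, max_rank):
    """Return [CMC(1), ..., CMC(max_rank)] for closed-set identification."""
    num_queries = len(true_ids)
    hits = [0] * max_rank
    for ranking, true_id in zip(ranked_predictions, true_ids):
        top = ranking[:max_rank]
        if true_id in top:
            first = top.index(true_id)  # 0-based rank of the correct identity
            for k in range(first, max_rank):
                hits[k] += 1
    return [h / num_queries for h in hits]
```

Rank-1 accuracy is the first element of the returned list, and the curve is non-decreasing by construction.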

Inference Efficiency

We additionally report computational efficiency in terms of average inference latency and model size, as shown in Equation (17).

$$\text{Latency} = \frac{\text{Total Inference Time}}{\text{Number of Images}}, \qquad \text{Size} = \text{Number of Parameters} \times \text{Storage Precision}$$

(17)

The overall efficiency of the proposed pipeline is evaluated using Equation (17), which gives the average time needed to process one test image; its reciprocal is the number of images processed per second. This metric reflects the performance of the entire system and, hence, determines whether the system can be applied in real-time or near-real-time situations. In addition to ROC visualization, we report Area Under the Curve (AUC) values with 95% confidence intervals.
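
A minimal sketch of the two quantities in Equation (17), assuming wall-clock timing of a generic `infer` callable and a storage precision expressed in bytes per parameter (4 for 32-bit floats). The function names are hypothetical, and timing on a GPU would additionally require synchronizing the device before reading the clock.

```python
import time


def average_latency_ms(infer, inputs):
    """Total inference time divided by the number of images, in milliseconds."""
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    total_seconds = time.perf_counter() - start
    return 1000.0 * total_seconds / len(inputs)


def model_size_mb(num_parameters, bytes_per_parameter=4):
    """Number of parameters times storage precision, in megabytes (10^6 bytes)."""
    return num_parameters * bytes_per_parameter / 1e6


def throughput_fps(latency_ms):
    """Images per second implied by a per-image latency."""
    return 1000.0 / latency_ms
```

For example, the roughly 25 ms per image reported for the full pipeline corresponds to `throughput_fps(25.0)`, that is, 40 images per second.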

4.4. Quantitative Results and Statistical Robustness

Table 6 highlights the gains from adding our modules one by one (same backbone/loss)—rows correspond to baseline, then + photometric augmentation, then + neural relighting, then + U-Net mask (all modules). Table 6 demonstrates the impact of stronger backbones and losses when all modules are used together. Each experiment was conducted five times with different random seeds. For each configuration, we report the mean and standard deviation to indicate stability, with standard deviations below 0.3% across all benchmarks. To compare two configurations (e.g., baseline versus proposed), an unpaired two-sample t-test was used; the proposed method achieved statistically significant improvements (p < 0.05) on CFP-FP and the custom illumination dataset. No paired t-test was conducted across different model architectures because the runs were not paired by the same random seed.

The small standard deviations (<0.3%) indicate very high consistency across runs. Moreover, performance increased relative to ArcFace across all datasets (p < 0.05).
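
The unpaired comparison between two configurations can be sketched as below. The text does not state whether equal variances were assumed, so this example uses Welch's unequal-variance form as an assumption. It returns the test statistic and the degrees of freedom; converting them to a p-value requires the Student t distribution, which is left to a statistics library. The function name is hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of mean A
    vb = variance(sample_b) / len(sample_b)  # squared standard error of mean B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof
```

Here each sample would hold the five per-seed scores of one configuration on one benchmark.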

To demonstrate the effectiveness of the proposed model, we compare it against baseline systems commonly used in face recognition. These baselines depict the standard method of face detection → embedding extraction → metric learning without the extra modules that have been introduced in this paper.

Baseline-1 (ArcFace with ResNet-100) is the highest-performing reference system, using RetinaFace for detection, ResNet-100 for embedding extraction, and ArcFace loss for training. This standard baseline is the state-of-the-art configuration used by many recent studies.
Baseline-2 (AdaFace with ResNet-100) is a better baseline that uses AdaFace loss, which varies the margin according to the image quality. This setting is particularly effective for low-quality images but does not employ segmentation or relighting.
Baseline-3 (MagFace with ResNet-100) is a quality-aware baseline that uses embedding magnitudes to indicate face image quality and hence improves verification performance under distribution shifts. Like the other baselines, it lacks skin segmentation and illumination modules.
Baseline-4 (ViT-Base with ArcFace) is a transformer-based baseline that employs ViT-Base as the embedding backbone and ArcFace for supervision. Although it benefits from long-range attention, it lacks illumination handling and skin segmentation.

The proposed model differs from these baselines in that it includes three additional modules that specifically address common failure modes in unconstrained recognition.

Lighting normalization (photometric augmentation + neural relighting) makes the model less sensitive to lighting changes.
Skin-aware segmentation helps the model focus on the face by removing the background and other non-facial areas.
Flexible backbone networks (ResNet-100/ViT-Base) paired with quality-aware losses (MagFace, AdaFace) lead to the highest discriminability and generalization.

The experimental results clearly show that our proposed model consistently outperforms the baseline models across all datasets, including LFW, CFP-FP, AgeDB-30, and our custom illumination dataset, particularly at low false-positive rate (FPR) operating points. These improvements suggest that synergy among detection, preprocessing, and embedding quality is more effective than simply enhancing the loss function.

All experiments were repeated five times (n = 5). Confidence intervals were computed as shown in Equation (18).

$$\mathrm{CI} = \text{mean} \pm 1.96 \times \frac{\text{std}}{\sqrt{n}}, \quad n = 5$$

(18)

where $n = 5$ is the number of independent experimental runs.
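
Equation (18) can be computed with a few lines of standard-library Python. The sketch assumes that "std" denotes the sample standard deviation (n − 1 in the denominator), which the text does not specify; the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def confidence_interval_95(values):
    """Normal-approximation 95% interval: mean +/- 1.96 * std / sqrt(n)."""
    n = len(values)
    center = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(n)
    return center - half_width, center + half_width
```

With only five runs, the factor 1.96 is a normal approximation; the corresponding Student t critical value for four degrees of freedom is about 2.776, which would give a wider interval.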

The results in Table 7 demonstrate that the proposed method consistently achieves the highest AUC values across all evaluated datasets. The performance improvement is particularly noticeable on CFP-FP and the custom illumination dataset, indicating enhanced robustness under pose and lighting variations. Moreover, the narrow confidence intervals suggest stable convergence and low variance across independent runs. Compared with baseline methods, the proposed framework exhibits statistically consistent gains while maintaining high performance in low false-positive operating regions, further supporting its robustness and reliability. Improvements on saturated benchmarks like LFW are minimal, but more significant gains are seen under difficult lighting conditions.

4.5. Component Contribution Analysis

To thoroughly investigate how each module contributes to the overall performance of the proposed model, we conducted controlled experiments in which components were selectively enabled or disabled. These experiments demonstrate that the system’s performance is significantly affected by design choices.

4.5.1. Illumination Handling

Three settings were compared: (1) no illumination normalization, (2) training with photometric augmentation, and (3) augmentation combined with neural relighting. Results indicate that photometric augmentation plays a major role in mitigating performance drops due to uneven illumination, whereas neural relighting further increases the accuracy of datasets with challenging illumination conditions.

Skin Segmentation. Using U-Net segmentation to create skin-aware masks improved recognition accuracy by reducing the effects of background noise and hair regions. Without segmentation, embeddings were less compact, leading to higher error rates in busy backgrounds.
Backbone Architecture. ResNet-100 provides a solid low-latency baseline, whereas Vision Transformer (ViT-Base) achieves slightly higher accuracy at the cost of greater computational resources. This trade-off allows the model to be tailored for either real-time or high-accuracy applications.
Loss Functions. ArcFace remains a strong baseline for margin-based losses; however, AdaFace and MagFace further incorporate quality-aware margins. MagFace, in particular, shows greater stability in low-quality and profile-face scenarios, making it the most reliable loss for unconstrained recognition.

4.5.2. Detection and Alignment

In this section, RetinaFace and MTCNN are compared. Owing to its more accurate landmark localization and greater robustness to occlusion, RetinaFace achieved consistently higher recognition accuracy, making it suitable as a detection front end. The results are illustrated in Figure 4.

Evaluation of each component indicates that the modules—illumination normalization, skin-aware segmentation, backbone selection, and quality-aware margin learning—substantially contribute to the system’s robustness and precision.

Photometric augmentation + relighting improved illumination robustness by up to 6%,
U-Net segmentation improved Rank-1 identification by 1–2%,
ViT + MagFace yielded the best results, but with higher latency.

4.5.3. Comparison with Alternative Attention Mechanisms

To determine whether the proposed mask-guided modulation is merely a form of feature scaling, we compared it with three alternative designs that use the same ResNet-100 backbone and ArcFace loss on the custom illumination dataset.

No masking: baseline,
Channel attention (SENet): global average pooling → two FC layers + sigmoid → channel-wise scaling,
Spatial attention (CBAM): average + max pooling → 7 × 7 conv + sigmoid + spatial scaling,
Proposed (U-Net mask modulation).

As demonstrated in Table 8, the proposed method surpasses both attention mechanisms, suggesting that the explicit skin prior learned by U-Net offers additional information that data-driven attention alone does not capture. This comparison is not exhaustive (e.g., transformer-based attention is not tested), but it shows that our explicit skin mask outperforms SENet and CBAM on the custom illumination dataset.

4.5.4. Alternative Segmentation Designs

We compared U-Net against: (i) no segmentation, (ii) threshold-based skin detection in YCbCr space, and (iii) a simple CNN segmenter (3 conv layers). Results on Rank-1 identification are shown in Table 9.
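
The threshold-based YCbCr baseline in (ii) can be illustrated with the sketch below. The conversion uses the full-range BT.601 (JPEG) coefficients, and the Cb and Cr ranges are commonly cited illustrative values; the thresholds actually used in the experiments are not given in the text, so these numbers are assumptions, as are the function names.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(r, g, b, cb_range=(77.0, 127.0), cr_range=(133.0, 173.0)):
    """Classify one pixel by fixed chroma thresholds (illustrative values)."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_mask(image):
    """Binary mask for an image given as rows of (R, G, B) tuples."""
    return [[1 if is_skin(*pixel) else 0 for pixel in row] for row in image]
```

Because the rule depends only on per-pixel chroma, it has no spatial context and is sensitive to illumination and skin tone, which is consistent with its role here as a simple baseline against the learned U-Net mask.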

4.5.5. Alternative Backbone and Loss Combinations

We evaluated all pairings of ResNet-50, ResNet-100, ViT-Small, and ViT-Base with ArcFace, AdaFace, and MagFace. The combination of ViT-Base with MagFace achieved the highest accuracy, but ResNet-100 with MagFace provided a better balance of speed and accuracy (18 ms vs. 25 ms).

5. Results and Analysis

5.1. Verification and Identification Performance

The proposed system achieves competitive state-of-the-art accuracy across all benchmarks. We achieved 99.8% accuracy on LFW, closely matching ArcFace performance, and on CFP-FP we observed more than a 2% improvement in TPR at FPR = 1 × 10⁻⁴, demonstrating robustness to pose variation. Finally, on AgeDB-30, quality-aware losses provided a 1.5% gain over ArcFace alone.

Although LFW is close to saturation, the importance of improvements is more evident at strict operating points. For instance, on CFP-FP at FPR = 1 × 10⁻⁵, the proposed method reaches a 92.1% TPR compared to ArcFace’s 88.3%—a 3.8% absolute increase. This difference is significant for high-security applications like border control.

5.2. Effect of Illumination Handling

Photometric augmentation substantially reduced performance degradation under uneven lighting. Neural relighting increased verification accuracy by 4–6% on the custom illumination dataset, further demonstrating the effectiveness of neural photometric feedback in extreme lighting conditions.

5.3. Effect of Skin-Aware Embedding

The introduction of the U-Net mask produced a 1–2% improvement in Rank-1 identification across all datasets. Qualitative visualizations showed that the embeddings became denser and less sensitive to background clutter.

5.4. Ablation Study

The Vision Transformer backbone marginally outperformed ResNet-100 on AgeDB-30, but ResNet was faster at inference. Alternative attention mechanisms, such as spatial and channel attention, may offer adaptive feature weighting, whereas the proposed method explicitly incorporates spatial priors via segmentation masks. Beyond the SENet and CBAM comparison in Section 4.5.3, a thorough empirical comparison is left for future research. The current ablation examines the step-by-step addition of components.

Unlike attention-based models that adaptively learn which features are most important, the proposed approach uses predetermined transformations derived from specific image cues.

5.5. Hardware Configuration

The model (detection, relighting, and segmentation) was estimated to run at ~25 ms per image on an NVIDIA RTX GPU, making it suitable for real-time applications. Our experiments show that each proposed component (RetinaFace detection, relighting, U-Net segmentation, and margin-based embedding learning) significantly improves robustness under challenging conditions. Compared with conventional models based on DCT, SVD, and HMM, this redesigned model offers improved accuracy and reduced demographic bias (by avoiding explicit demographic priors) and greater potential for real-world scalability in face recognition tasks, as shown in Table 10.

Table 11 highlights differences in verification and identification performance across benchmark datasets. Classical methods based on DCT, SVD, and HMM exhibit limited generalization, with LFW accuracy below 90% and an EER above 8%, whereas deep learning baselines (e.g., FaceNet) show significant improvement but still do not perform well under challenging variations. Methods that use margin-based softmax losses (e.g., ArcFace) improve on these results, achieving a 99.7% verification rate on LFW and reducing the EER to 1.1%. Further improvements have been reported for AdaFace and MagFace, which utilize quality-aware learning to improve upon ArcFace, achieving over 96% on AgeDB-30 and reducing the EER. Notably, MagFace with a Vision Transformer backbone shows the best overall results, reaching 99.8% accuracy on LFW, 95.0% TPR at FPR = 1 × 10⁻⁴ on CFP-FP, and an EER of 0.7%, which validates the model proposed in this paper.

Verification performance on the CFP-FP dataset is shown in Table 12 at the strict operating threshold of $\mathrm{FPR} = 10^{-4}$. This is a particularly difficult task due to the large degree of pose variation (frontal and profile views). The results show that the proposed model achieves the highest TPR among the baselines. Although ArcFace is a strong baseline, it performs poorly on extreme profile views. AdaFace improves robustness by accounting for image quality, whereas MagFace improves separability by providing quality-aware margins. The proposed model outperforms all baselines, with illumination normalization and skin segmentation providing gains in addition to the MagFace loss. This shows that the benefits of skin segmentation and illumination normalization, together with quality-aware learning, extend beyond those of the embedding loss alone.

Table 12 evaluates the speed of various backbone–loss combinations. ResNet-100 with ArcFace achieves 18 ms per image and a model size of 104 MB, making it suitable for real-world applications. MagFace slightly increases computational costs but offers greater resilience to low-quality images. The Vision Transformer backbone achieves the highest classification rate but requires 25 ms per image and consumes 120 MB of memory. The trade-offs in speed and classification accuracy of the ResNet-based combinations are better suited to real-time implementations with more stringent speed requirements. ViT-based combinations may be preferable for offline implementations or applications where speed is less important and higher accuracy is preferred.

The ROC curves in Figure 5 compare the verification performance of ArcFace, AdaFace, and MagFace on the test datasets. As expected, all three deep metric-learning approaches show strong separability between positives and negatives, with the largest performance differences at lower false-positive rates (FPRs). ArcFace provides a strong and consistent baseline, yielding a large increase in TPR that tapers at more constrained thresholds. AdaFace outperforms ArcFace by leveraging quality-aware margins, resulting in a higher TPR at FPR = 1 × 10⁻⁴ and suggesting greater resilience to fluctuations in pose and image quality. MagFace maintains higher performance than both ArcFace and AdaFace across the entire ROC curve and consistently achieves the highest area under the curve (AUC) values.

For decision points relevant to operational deployment, specifically very low FPR thresholds (1 × 10⁻⁴ to 1 × 10⁻⁵), MagFace still achieves the greatest stability and reliability for verification, significantly increasing TPR while reducing false acceptances. This effect is particularly pronounced in datasets with greater intra-class variability (e.g., CFP-FP), where MagFace exhibits broader tolerance to variations in illumination and profile views. Overall, the ROC evaluation indicates that quality-aware margin learning, in conjunction with the proposed illumination and segmentation modules, reliably yields more discriminative and stable embeddings under unconstrained conditions.

As presented in Figure 6, the DET plots further emphasize the trade-off between false positives and false negatives. ArcFace exhibits a larger error region, whereas AdaFace reduces the False Negative Rate (FNR) at reasonable False Positive Rates (FPRs). MagFace has the lowest error envelope, confirming its effectiveness in balancing sensitivity and specificity. Overall, the Equal Error Rate (EER) is much lower for MagFace than for the other methods, demonstrating its stability for real-life applications.

As illustrated in Figure 7, bar plots of TPR at FPR thresholds of $10^{-3}$, $10^{-4}$, and $10^{-5}$ reveal sharp differences in strict security settings. ArcFace performance declines sharply as thresholds tighten, whereas AdaFace remains relatively stable in TPR performance. MagFace outperforms both approaches, maintaining strong verification rates even at an FPR of 1 × 10⁻⁵, which is particularly relevant in high-security applications such as border control and financial authentication.

Figure 8 illustrates the Cumulative Match Characteristic (CMC) curves, which assess how well identification performs as the rank order increases. While ArcFace shows impressive Rank-1 accuracy, it falls short compared to AdaFace and MagFace at higher ranks. AdaFace performs well in the mid-range, but MagFace leads overall, achieving the best Rank-1 and Rank-5 rates. This suggests that the model using MagFace achieves the highest identification accuracy in both closed- and open-set conditions.

The efficiency analysis in Figure 9 highlights the trade-off between accuracy and inference time across various backbone–loss configurations. ResNet-100 paired with ArcFace achieves a runtime of approximately 18 ms per image and has a compact model size, making it well-suited for real-time applications. By contrast, ResNet-100 with MagFace has a slightly higher latency of 20 ms but achieves higher accuracy. Meanwhile, the Vision Transformer with MagFace achieves the highest accuracy of approximately 99.9%, though it requires a longer inference time of approximately 25 ms and consumes more memory. This comparison shows that the proposed model can be readily adjusted to prioritize either real-time performance or high accuracy, depending on deployment requirements.

5.6. Qualitative Visualization and Analysis

To demonstrate the impact of illumination normalization and skin-sensitive segmentation, the images in Figure 10 are exemplary samples from our custom illumination dataset.

As shown in Table 13, a visual inspection of the embedding vectors indicates that the relighting module provides consistent illumination across facial surfaces, and the segmentation mask effectively excludes unwanted background information. Moreover, the final normalized embedding is primarily focused on consistent skin-texture representations rather than on lighting changes and shadows, supporting the quantitative improvements discussed earlier. Notably, these qualitative examples indicate that illumination normalization and skin-aware attention work synergistically to improve embedding stability under challenging conditions.

In Table 13, we compare the proposed model with various methods. Ref. [29] introduced a legacy method that used DCT-II normalization, SVD for feature extraction, and KNN/HMM classifiers. While this approach performs well in controlled settings, it degrades under changes in lighting and pose, resulting in error rates exceeding 8% on more challenging datasets. FaceNet [30] is a deep metric learning technique that relies on triplet loss and established end-to-end embedding learning. Although FaceNet is more robust than traditional methods, it does not specifically address lighting and background noise, making it vulnerable to low-quality inputs. CosFace/SphereFace [31] are angular-margin methods that enhance inter-class separability and generally outperform FaceNet; however, they lack quality-aware mechanisms and preprocessing steps such as relighting or segmentation. Finally, ArcFace [32] is often regarded as a solid baseline. It introduces an additive angular margin loss to enhance discriminability but, like the others, does not explicitly account for image quality or include additional components such as skin-aware masking, which can limit its effectiveness in challenging environments.

The proposed model integrates advances in face recognition into a seamless end-to-end system that addresses the shortcomings of previous methods. First, it employs RetinaFace for face detection and landmark-based alignment, thereby maintaining accuracy across varying poses and occlusions. To handle lighting variations, it employs a two-pronged approach: photometric augmentation during training and neural relighting during inference, yielding face images that are insensitive to lighting changes. To minimize background noise and reduce demographic bias, a U-Net-based segmentation model generates skin-aware masks that facilitate feature extraction. Next, identity embeddings are generated using either a ResNet or a Vision Transformer backbone and fine-tuned with margin-based softmax losses such as ArcFace, AdaFace, and MagFace. These techniques enhance class separation while keeping similar classes close together and ensure quality-aware calibration. Finally, recognition is achieved through cosine similarity on normalized embeddings. This innovative design replaces traditional handcrafted models (such as DCT, SVD, and HMM) with a scalable, robust solution that delivers high accuracy even in challenging conditions.
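
The final matching step, cosine similarity on normalized embeddings, can be sketched as follows. The embeddings are plain lists of floats here, and the decision threshold is a hypothetical parameter that would in practice be chosen from the ROC analysis at the desired FPR.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized embeddings."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def same_identity(a, b, threshold):
    """Accept the pair when the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(a, b) >= threshold
```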

5.7. Runtime and Efficiency Analysis

To assess deployment feasibility, we measured the average inference latency of each module on an NVIDIA RTX 3090 GPU, using 1000 test images with a batch size of 1. Table 14 summarizes the results.
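A per-module latency measurement of this kind can be sketched with the Python standard library. The harness below is an assumption about how such timing could be done, not the authors' benchmarking script; `module` is any callable and `sync` is an optional hook for a device barrier (for example, `torch.cuda.synchronize` when timing GPU code).

```python
import statistics
import time

def measure_latency_ms(module, inputs, warmup=10, sync=None):
    """Return (mean, std) of per-call latency in milliseconds at batch size 1.

    module: callable applied to one input at a time (hypothetical stand-in
            for a detection, relighting, segmentation, or embedding stage).
    sync:   optional zero-argument callable that blocks until the device is idle.
    """
    # Warm-up calls are excluded so one-time initialization does not skew results.
    for x in inputs[:warmup]:
        module(x)
    if sync is not None:
        sync()
    samples = []
    for x in inputs:
        start = time.perf_counter()
        module(x)
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.pstdev(samples)
```

Without the synchronization hook, asynchronous GPU kernels may still be running when the timer stops, which would understate latency.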

The findings show that relighting and segmentation together add about 10.7 ms of overhead (4.6 ms and 6.1 ms, respectively; Table 14) while significantly improving accuracy. The entire pipeline runs at roughly 25 ms per image, making it suitable for real-time applications on high-end GPUs and nearly real-time on modern CPUs. The ResNet backbone is more efficient, whereas the ViT model achieves the highest accuracy, albeit with higher latency.
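The latency budget can be checked directly from the per-module values in Table 14. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Per-module average latencies in ms per image, as reported in Table 14.
module_latency_ms = {
    "RetinaFace detection and alignment": 5.8,
    "Neural relighting": 4.6,
    "U-Net segmentation": 6.1,
    "Embedding extraction (ResNet-100)": 8.5,
}

total_ms = sum(module_latency_ms.values())
overhead_ms = (module_latency_ms["Neural relighting"]
               + module_latency_ms["U-Net segmentation"])
overhead_share = overhead_ms / total_ms
fps = 1000.0 / total_ms  # images per second at batch size 1

print(f"total: {total_ms:.1f} ms, overhead: {overhead_ms:.1f} ms "
      f"({overhead_share:.0%} of total), throughput: {fps:.0f} images/s")
```

The module latencies sum to the reported end-to-end figure of about 25 ms, which corresponds to roughly 40 images per second, and the two added modules account for a little over 40% of that time.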

6. Discussion

Although photometric augmentation is a common method, combining it with inference-time relighting offers a complementary training and inference approach that enhances robustness to changes in illumination.

The experimental results demonstrate that integrating illumination normalization, skin-aware segmentation, and quality-aware metric learning improves recognition robustness under challenging visual conditions. In particular, the dual illumination-handling strategy—photometric augmentation during training combined with neural relighting during inference—substantially mitigates performance degradation caused by uneven lighting. The observed improvements under low-light and side-lit conditions indicate that separating identity features from illumination variations enhances embedding stability.
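The training-time half of this strategy, photometric augmentation, can be illustrated with a minimal sketch. The specific transformations (contrast gain, brightness shift, gamma) and their ranges are assumptions chosen for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def photometric_augment(image, rng, brightness=0.3, contrast=0.3,
                        gamma_range=(0.6, 1.6)):
    """Randomly perturb the lighting of a float image with values in [0, 1].

    image: (H, W, 3) array; rng: a numpy.random.Generator.
    All ranges are hypothetical defaults.
    """
    img = image.astype(np.float32)
    gain = 1.0 + rng.uniform(-contrast, contrast)   # contrast change
    bias = rng.uniform(-brightness, brightness)     # global brightness shift
    mean = img.mean()
    img = np.clip((img - mean) * gain + mean + bias, 0.0, 1.0)
    gamma = rng.uniform(*gamma_range)               # non-linear exposure change
    return img ** gamma
```

Applying a different random perturbation to each training image exposes the network to the same identity under many lighting conditions, which is the effect the augmentation is meant to achieve.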

The proposed U-Net-based skin segmentation module reduces background interference by suppressing non-facial regions such as hair, clothing, and cluttered environments. Rather than relying on threshold-based skin-color rules or explicit race-dependent modeling, the segmentation module operates through learned spatial attention. This architectural choice avoids the use of demographic priors that are present in some classical color-space approaches. However, it should be noted that this paper does not include quantitative subgroup-level fairness evaluation (e.g., FAR-gap or TPR-gap across ethnicity or gender).

The comparison between ResNet-100 and Vision Transformer backbones highlights a trade-off between efficiency and representational capacity. ResNet-100 provides lower latency and smaller memory overhead, making it more suitable for real-world deployment. The Vision Transformer backbone achieves slightly higher accuracy, particularly under pose and age variations, at the cost of increased computational complexity. This flexibility allows the framework to be adapted depending on deployment requirements.

Statistical evaluation across repeated runs shows low standard deviation (<0.3%), confirming the stability of the proposed architecture. Although improvements on LFW are marginal due to dataset saturation, more noticeable gains are observed on CFP-FP and the custom illumination dataset. This suggests that the main contribution lies in robustness under difficult lighting and pose conditions rather than incremental improvement on already saturated benchmarks.

The segmentation-guided masking provides explicit spatial filtering, but it differs from attention-based methods (spatial, channel, and transformer-based attention), which learn feature importance dynamically. The proposed method instead applies fixed transformations guided by segmentation and illumination normalization. This enhances interpretability and stability, but it may be less adaptable than fully learnable attention models.
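The contrast with learned attention can be made concrete with a sketch of a fixed, mask-guided modulation. The formula below is an assumption used for illustration, not the exact modulation defined in the model architecture: it scales each spatial location of a feature map by the skin mask, with a hypothetical `floor` that keeps a fraction of the non-skin response.

```python
import numpy as np

def modulate_with_mask(features, skin_mask, floor=0.2):
    """Scale a feature map by a skin mask with no learned parameters.

    features:  (C, H, W) feature map.
    skin_mask: (H, W) mask in [0, 1], assumed already resized to H x W.
    floor:     hypothetical minimum weight for non-skin locations.
    """
    weights = floor + (1.0 - floor) * np.clip(skin_mask, 0.0, 1.0)
    # Broadcast the same spatial weights across all channels.
    return features * weights[None, :, :]
```

The weights here come entirely from the segmentation output, whereas channel or spatial attention modules such as SENet and CBAM compute their weights from the features themselves through trainable layers.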

Despite these benefits, several limitations persist. The relighting and segmentation modules depend on well-aligned, clean training data, and their performance may decline under severe occlusion, misalignment, or heavy noise. In addition, although explicit demographic modeling is not employed, removing demographic inputs does not by itself eliminate bias, because implicit bias can still stem from unbalanced training data.

Finally, while the proposed pipeline achieves real-time inference on high-end GPUs (~25 ms per image), it remains computationally demanding for low-power embedded or edge devices. Lightweight optimization and model compression techniques are necessary for broader deployment scenarios.

7. Limitations

Despite the promising results, several limitations of the proposed framework should be acknowledged.

First, the proposed approach primarily relies on combining existing components, including face detection, segmentation, and deep feature extraction modules. Although the main contribution lies in how these components are jointly designed and interact within a unified system, the framework does not present a completely new, standalone algorithm. Future research could investigate new architectures or theoretically grounded models to enhance the methodological contribution.

Second, the segmentation-guided masking strategy is assessed against a baseline without masking. However, alternative attention methods—such as spatial attention, channel attention, or transformer-based attention—have not been tested empirically. Consequently, it is unclear how the proposed masking compares with other adaptive feature selection approaches.

Third, while photometric augmentation is commonly used to enhance robustness to illumination changes, this work focuses on combining it with inference-time relighting. However, a thorough analysis of how these two techniques interact has not been included and is a subject for future research.

Fourth, although the method does not include explicit demographic modeling, we have not conducted any quantitative fairness evaluations at the subgroup level (such as TPR gaps across ethnicity or gender) because the datasets do not contain demographic annotations. Consequently, we do not claim that the method is fair or unbiased. A formal fairness assessment will be addressed in future work.
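For reference, a subgroup-level TPR gap of the kind mentioned above could be computed as in the following sketch once demographic annotations are available. The function names, the group labels, and the choice of a single shared threshold are assumptions for illustration; no such evaluation was performed in this paper.

```python
import numpy as np

def tpr_at_threshold(scores, is_genuine, threshold):
    """Fraction of genuine (same-identity) pairs accepted at the threshold."""
    genuine_scores = scores[is_genuine]
    return float((genuine_scores >= threshold).mean())

def tpr_gap(scores, is_genuine, groups, threshold):
    """Return (max TPR - min TPR across groups, per-group TPR dictionary).

    scores:     (N,) similarity scores for verification pairs.
    is_genuine: (N,) boolean array, True for same-identity pairs.
    groups:     (N,) hypothetical subgroup label for each pair.
    Assumes every group contains at least one genuine pair.
    """
    per_group = {}
    for g in np.unique(groups):
        sel = groups == g
        per_group[str(g)] = tpr_at_threshold(scores[sel], is_genuine[sel], threshold)
    values = list(per_group.values())
    return max(values) - min(values), per_group
```

A gap near zero at a fixed threshold would indicate similar acceptance rates across subgroups, while a large gap would point to uneven performance.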

Fifth, the performance enhancements observed on standard benchmarks like LFW are modest, as expected given the saturation of these datasets. While more significant improvements appear under difficult lighting conditions, their practical impact in real-world applications still needs validation.

Sixth, the ablation study primarily evaluates the incremental effect of each component sequentially. However, it does not investigate alternative architectural configurations or compare different strategies, limiting the comprehensiveness of the design space exploration.

Seventh, several components of the framework, including segmentation and relighting modules, rely on accurately aligned, high-quality input data. Performance may decrease in scenarios involving severe occlusion, inaccurate face alignment, or low-resolution inputs. Importantly, mistakes in segmentation masks can negatively impact the following feature extraction process.

Eighth, the full pipeline’s computational complexity remains relatively high due to multiple processing steps. While real-time performance is possible on high-end GPUs, deploying it on resource-constrained or edge devices remains difficult without additional optimization.

Ninth, the existing modulation mechanism is less complex than fully learnable attention-based methods and may have limited flexibility. The current formulation does not incorporate adaptive attention mechanisms and may be interpreted as a simplified feature modulation strategy.

Finally, this paper does not provide a detailed analysis of failure cases. In practical use, the method may face difficulties in situations with severe lighting imbalance, significant occlusion, motion blur, or faulty preprocessing outputs. Exploring these failure modes systematically and developing more robust mitigation strategies are crucial areas for future research.

8. Conclusions

This paper presents a unified deep learning framework for illumination-robust face recognition. By integrating RetinaFace detection, photometric training augmentation, neural relighting, U-Net segmentation, and quality-aware metric learning, the system achieves performance competitive with the state of the art across multiple benchmarks. Future extensions include video-based recognition, multimodal fusion, and bias-mitigation strategies.

Experimental results on LFW, CFP-FP, AgeDB-30, and a custom illumination dataset show that the model is competitive with the state of the art, with 99.8% accuracy on LFW and a TPR of 95.0% at FPR = 1 × 10⁻⁴ on CFP-FP. The integration of neural relighting and skin-aware segmentation consistently and significantly improves performance: relighting improves illumination uniformity by 1.35 ± 0.12, and segmentation boosts verification accuracy by up to 2%.

In addition to accuracy, the proposed method offers practical versatility. The ResNet backbone strikes a strong balance between high accuracy and fast inference, whereas the Vision Transformer backbone delivers slightly higher accuracy and is better suited to offline applications or settings with ample computational resources. The modular structure also allows independent upgrades to the relighting, segmentation, or embedding components, ensuring compatibility with future work or deployment environments. Future work will apply this framework to video and multimodal face recognition, incorporating temporal coherence, depth, and thermal cues to further increase robustness under uncontrolled conditions. Another objective is to reduce dataset bias and improve generalization across demographic and environmental domains through self-supervised pre-training and demographic-prior-free loss regularization.

Author Contributions

Conceptualization, S.K. and S.S.C.; methodology, S.K.; software, S.K.; validation, S.S.C.; formal analysis, S.K.; investigation, S.S.C.; writing—original draft preparation, S.K. and S.S.C.; writing—review and editing, S.K. and S.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research has received no funding.

Data Availability Statement

The datasets utilized in this research are publicly accessible. The MS1M-ArcFace dataset served for training purposes, whereas evaluation was conducted using LFW, CFP-FP, and AgeDB-30.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, X.; Jiang, H.; Dong, Y.; Mu, M. Spatial-channel collaborative multi-scale graph interaction deep transfer learning for unsupervised rotating machinery fault diagnosis. Eng. Appl. Artif. Intell. 2026, 176, 114691.
2. Benjdira, B.; Bazi, Y.; Koubaa, A.; Ouni, K. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sens. 2019, 11, 1369.
3. Yan, J.; Cheng, Y.; Zhang, F.; Li, M.; Zhou, N.; Jin, B.; Wang, H.; Yang, H.; Zhang, W. Research on multimodal techniques for arc detection in railway systems with limited data. Struct. Health Monit. 2025.
4. Cao, X.; Shen, W.; Yu, L.; Wang, Y.; Yang, J.; Zhang, Z. Illumination invariant extraction for face recognition using neighboring wavelet coefficients. Pattern Recognit. 2015, 45, 1299–1305.
5. Ahmad, F.; Khan, A.; Islam, I.U.; Uzair, M.; Ullah, H. Illumination normalization using independent component analysis and filtering. Imaging Sci. J. 2017, 65, 308–313.
6. Pan, J.; Yang, X.; Cai, H.; Mu, B. Image noise smoothing using a modified Kalman filter. Neurocomputing 2016, 173, 1625–1629.
7. Karamizadeh, S.; Pourmirzaei, M.; Zamani, M.; Shankar, A. A hybrid CNN-transformer architecture for adult image and video content recognition on the internet. Multimed. Tools Appl. 2025, 84, 49197–49217.
8. Salam, A.A.; Akram, M.U.; Yousaf, M.H.; Rao, B. DermaTransNet: Where transformer attention meets U-Net for skin image segmentation. IEEE Access 2025, 13, 64305–64329.
9. Song, Y.; Tang, H.; Meng, F.; Wang, C.; Wu, M.; Shu, Z.; Tong, G. A transformer-based low-resolution face recognition method via on-and-offline knowledge distillation. Neurocomputing 2022, 509, 193–205.
10. Sikandar, T.; Ghazali, K.H.; Mohd, I.I.; Rabbi, M. Skin color pixel classification for face detection with hijab and niqab. In Proceedings of the International Conference on Robotics, Automation and Sciences, Penang, Malaysia, 26–28 July 2017.
11. Karamizadeh, S.; Chaeikar, S.S.; Najafabadi, M.K. Enhancing facial recognition and expression analysis with unified zero-shot and deep learning techniques. IEEE Access 2025, 13, 43508–43519.
12. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
13. Karamizadeh, S.; Shojae Chaeikar, S.; Jolfaei, A. Adult content image recognition by Boltzmann machine limited and deep learning. Evol. Intell. 2023, 16, 1185–1194.
14. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
15. Kim, D.; Cho, D.; Heo, B. AdaFace: Quality adaptive margin for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
16. Meng, Y.; Zhao, X.; Huang, G.; Zhou, F. MagFace: A universal representation for face recognition and quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the Ninth International Conference on Learning Representations, Virtual, 3–7 May 2021.
18. Zhang, Q.; Guo, Q.; Gao, R.; Juefei-Xu, F.; Yu, H.; Feng, W. Adversarial relighting against face recognition. IEEE Trans. Inf. Forensics Secur. 2024, 19, 9145–9157.
19. Zhao, Z.; Lin, H.; Shi, D.; Zhou, G. A non-regularization self-supervised Retinex approach to low-light image enhancement with parameterized illumination estimation. Pattern Recognit. 2024, 146, 110025.
20. Ardiawan, M.I.; Negarara, G.P.K. A Comparative Analysis of FaceNet, VGGFace, and GhostFaceNets Face Recognition Algorithms for Potential Criminal Suspect Identification. J. Appl. Artif. Intell. 2024, 5, 34–49.
21. Jiang, D.; Jin, Y.; Zhang, F.L.; Zhu, Z.; Zhang, Y.; Tong, R.; Tang, M. Sphere face model: A 3d morphable model with hypersphere manifold latent space using joint 2d/3d training. Comput. Vis. Media 2023, 9, 279–296.
22. Zheng, J.; Gong, X. ExpFace: Exponential Angular Margin Loss for Deep Face Recognition. arXiv 2025, arXiv:2509.19753.
23. Firmansyah, A.; Kusumasari, T.F.; Alam, E.N. Comparison of face recognition accuracy of ArcFace, FaceNet and FaceNet512 models on deepface framework. In Proceedings of the 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), Jakarta, Indonesia, 16 February 2023.
24. Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
25. Thokal, V.; Patil, P.R. Face Recognition Using Multi-strategy based African Vulture Optimization Algorithm. Int. J. Intell. Eng. Syst. 2025, 18, 214–229.
26. Gunturi, S.K.; Alugubelly, M.; Jayabalan, M.; Aggarwal, S. Real-Time Masked Face Recognition in the Wild with few shots. In Proceedings of the 2023 16th International Conference on Developments in eSystems Engineering (DeSE), Istanbul, Turkey, 18–20 December 2023.
27. Djamaluddin, M.; Munir, R.; Utama, N.P.; Kistijantoro, A.I. Open-set profile-to-frontal face recognition on a very limited dataset. IEEE Access 2023, 11, 65787–65797.
28. Beaubrun, A.; Annan, J.; Wu, H.; Merino, X.; Bowyer, K.; King, M.C. The AgeDB-30M Dataset: Melanated Faces for Age-Invariant Face Recognition. In Proceedings of the 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), Tampa, FL, USA, 26–30 May 2025.
29. Karamizadeh, S.; Abdullah, S.M. Race classification using gaussian-based weight K-nn algorithm for face recognition. J. Eng. Res. 2018, 6, 103–121.
30. Suguna, G.C.; Kavitha, H.S.; Sunita, S. Face recognition system for realtime applications using SVM combined with FaceNet and MTCNN. Int. J. Electr. Eng. Technol. 2021, 12, 328–335.
31. Liu, W.; Wen, Y.; Raj, B.; Singh, R.; Weller, A. Sphereface revived: Unifying hyperspherical face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2458–2474.
32. Karamizadeh, S.; Shojae Chaeikar, S.; Salarian, H. Combining MTCNN and Enhanced FaceNet with Adaptive Feature Fusion for Robust Face Recognition. Technologies 2025, 13, 450.

Figure 1. The Proposed Pipeline Architecture.

Figure 2. Skin-Aware Embedding Modulation.

Figure 3. Architecture of the proposed model.

Figure 4. Component Analysis.

Figure 5. Illustration of the comparative verification performance of ArcFace, AdaFace, and MagFace across the evaluated datasets.

Figure 6. DET curve between false positives and false negatives.

Figure 7. TPR at different FPR thresholds.

Figure 8. CMC Curve measures identification performance across increasing rank levels.

Figure 9. Efficiency vs. Accuracy.

Figure 10. Qualitative examples of illumination and segmentation effects. (a) Original image under uneven lighting; (b) neural-relit image with consistent luminance and minimal color cast; (c) U-Net segmentation mask with background and hair suppressed; (d) final embedding activation map overlay highlighting the skin areas.

Table 1. Comparison between the proposed framework and representative prior methods. (✓: supported, ✗: not supported).

Method	Illumination Handling	Segmentation	Quality-Adaptive Margin	End-to-End Inference
ArcFace	✗	✗	✗	✓
AdaFace	✗	✗	✓	✓
Neural Relighting (prior)	✓	✗	✗	✗
Segmentation-based FR	✗	✓	✗	✗
Proposed	✓	✓	✓	✓

Table 2. Summary of the impact of different illumination-handling components on verification accuracy.

Model Variant	LFW (%)	CFP-FP (%)	TPR @ FPR = 1 × 10⁻⁴
Baseline (ResNet-100 + ArcFace)	98.7	92.1	0.89
+Photometric Augmentation	99.1	94.0	0.92
+Neural Relighting	99.3	94.8	0.94

Table 3. Evaluating the contribution of skin-aware segmentation to recognition performance.

Model Variant	Rank-1 (%)	LFW (%)
Baseline (No segmentation)	98.0	98.7
+U-Net Skin Mask	99.5	99.1
+Mask-Guided ViT Tokens	99.7	99.4

Table 4. Loss Function Comparison.

Loss Function	LFW (%)	CFP-FP (%)	EER (%)
ArcFace	99.4	94.8	1.1
AdaFace	99.6	95.0	0.9
MagFace	99.7	95.2	0.7

Table 5. Overview of datasets.

Dataset	Identities	Images/Pairs	Primary Challenge	Usage
MS1M-ArcFace	~85,000	~5.8 M images	Large-scale, noisy web data	Training
LFW	5749	13,233 images	Unconstrained web faces	Verification benchmark
CFP-FP	500	7000 pairs	Frontal vs. profile views	Pose-robust verification
AgeDB-30	440	12,240 images	Age progression (≤30 years)	Age-invariant verification
Custom Illumination (this paper)	3500	~10,000 images (5000 extracted + 5000 synthetically generated)	Diverse lighting (low-light, side-light, overexposure)	Testing illumination robustness

Table 6. (a) Controlled comparison with fixed backbone and loss (ResNet-100 + ArcFace). Modules are added row-by-row; the last row includes all proposed components. (b) Effect of backbone and loss (all with full modules). The symbol ↓ indicates that EER drops when extra components have been added to the architecture.

(a)
Method	LFW Acc. (%)	CFP-FP TPR@1 × 10⁻⁴ (%)	AgeDB-30 Acc. (%)	EER (%) ↓
Baseline (no modules)	98.7 ± 0.1	92.1 ± 0.2	92.4 ± 0.3	1.1 ± 0.1
+Photometric augmentation	99.1 ± 0.1	94.0 ± 0.2	93.8 ± 0.3	0.9 ± 0.1
+Neural relighting (inference)	99.3 ± 0.1	94.8 ± 0.2	94.5 ± 0.3	0.8 ± 0.1
+U-Net mask (full proposed)	99.5 ± 0.1	95.0 ± 0.2	95.2 ± 0.3	0.7 ± 0.1
(b)
Method	LFW Acc. (%)	CFP-FP TPR@1 × 10⁻⁴ (%)	AgeDB-30 Acc. (%)	EER (%) ↓
ResNet-100 + ArcFace (full modules)	99.5 ± 0.1	95.0 ± 0.2	95.2 ± 0.3	0.7 ± 0.1
ResNet-100 + MagFace	99.6 ± 0.1	95.1 ± 0.2	96.1 ± 0.3	0.6 ± 0.1
ViT-Base + MagFace (full proposed)	99.8 ± 0.1	95.0 ± 0.2	96.9 ± 0.2	0.7 ± 0.1

Table 7. AUC with 95% Confidence Intervals.

Method	LFW AUC (%)	CFP-FP AUC (%)	AgeDB-30 AUC (%)	Illumination Dataset AUC (%)
ArcFace	99.65 ± 0.08 (99.58–99.72)	98.12 ± 0.15 (97.99–98.25)	97.84 ± 0.18 (97.68–98.00)	94.20 ± 0.35 (93.88–94.52)
AdaFace	99.71 ± 0.07 (99.65–99.77)	98.45 ± 0.12 (98.35–98.55)	98.02 ± 0.16 (97.88–98.16)	95.10 ± 0.28 (94.86–95.34)
MagFace	99.69 ± 0.09 (99.61–99.77)	98.30 ± 0.14 (98.18–98.42)	97.95 ± 0.17 (97.80–98.10)	94.85 ± 0.31 (94.58–95.12)
Proposed Method	99.82 ± 0.05 (99.78–99.86)	98.97 ± 0.09 (98.89–99.05)	98.64 ± 0.11 (98.54–98.74)	96.75 ± 0.22 (96.56–96.94)

Table 8. Results on the illumination dataset (TPR@FPR = 1 × 10⁻⁴).

Method	TPR (%)
No masking	88.2
SENet	90.1
CBAM	91.4
Proposed model	94.8

Table 9. Segmentation method comparison.

Segmentation	Rank-1 (%)
None	98.0
YCbCr threshold	98.3
Simple CNN	98.9
U-Net (proposed)	99.5

Table 10. Verification accuracy (%) and EER (%) across benchmark datasets.

Method	LFW Acc.	CFP-FP TPR@1 × 10⁻⁴	AgeDB-30 Acc.	EER
DCT + SVD + KNN/HMM	89.2	72.5	77.8	8.1
FaceNet	98.6	87.9	92.1	1.9
ArcFace (ResNet-100)	99.7	92.4	95.5	1.1
AdaFace (ResNet-100)	99.7	94.1	96.2	0.9
MagFace (ViT-Base)	99.8	95.0	96.9	0.7

Table 11. Results on CFP-FP (TPR@FPR = 1 × 10⁻⁴).

Configuration	TPR (%)
Baseline (no augmentation, no mask)	88.7
+Photometric augmentation	91.5
+Augmentation + Relighting	94.3
+Relighting + U-Net Masking	95.0

Table 12. Efficiency comparison.

Backbone + Loss	Inference Speed (ms/img)	Model Size (MB)
ResNet-100 + ArcFace	18	104
ResNet-100 + MagFace	20	107
ViT-Base + MagFace	25	120

Table 13. Comparison of different methods.

Model	Detection/Alignment	Illumination Handling	Segmentation	Embedding Method	Reported Accuracy (LFW) (%)	EER (%)	Key Limitations
DCT + SVD + KNN/HMM	Manual crop	DCT-II filtering	N/A	SVD + Statistical Class.	~89	8.1	Illumination/pose sensitive, handcrafted features
FaceNet	Basic align (MTCNN)	N/A	N/A	Triplet loss (ResNet)	~98.6	1.9	Sensitive to low-quality inputs
CosFace/SphereFace	MTCNN	N/A	N/A	Angular-margin softmax	~99.2	1.3	No quality modeling
ArcFace	MTCNN	N/A	N/A	Additive angular margin	~99.7	1.1	Background/illumination sensitivity
Proposed Model	RetinaFace	Photometric + Relight	U-Net (skin-aware)	ArcFace/AdaFace/MagFace + ResNet/ViT	99.8	0.7	Higher complexity, requires GPU for real-time

Table 14. Runtime and memory footprint of individual modules.

Module	Function	Avg Latency (ms/img)	Memory (MB)
RetinaFace Detection & Alignment	Localization + Landmark	5.8	52
Neural Relighting	Illumination correction	4.6	28
U-Net Segmentation	Skin mask generation	6.1	35
Embedding Extraction (ResNet-100)	Feature encoding	8.5	104
Proposed model	End-to-end inference	25.0 ± 1.2	—

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Karamizadeh, S.; Shojae Chaeikar, S. Skin Classification for Face Recognition Based on Deep Learning with U-Net and ResNet. Electronics 2026, 15, 1950. https://doi.org/10.3390/electronics15091950

