Article

A Human-Centric, Uncertainty-Aware Event-Fused AI Network for Robust Face Recognition in Adverse Conditions

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea
2 Department of Computer Systems/Information and Educational Technologies, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan
3 Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
4 Department of Digital Technologies, Alfraganus University, Yukori Karakamish Street 2a, Tashkent 100190, Uzbekistan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7381; https://doi.org/10.3390/app15137381
Submission received: 14 May 2025 / Revised: 22 June 2025 / Accepted: 30 June 2025 / Published: 30 June 2025

Abstract

Face recognition systems often falter when deployed in uncontrolled settings, grappling with low light, unexpected occlusions, motion blur, and the degradation of sensor signals. Most contemporary algorithms chase raw accuracy yet overlook the pragmatic need for uncertainty estimation and multispectral reasoning rolled into a single framework. This study introduces HUE-Net—a Human-centric, Uncertainty-aware, Event-fused Network—designed specifically to thrive under severe environmental stress. HUE-Net marries the visible RGB band with near-infrared (NIR) imagery and high-temporal-resolution event data through an early-fusion pipeline, which proves more effective than late-fusion alternatives. A custom hybrid backbone that couples convolutional networks with transformers keeps the model nimble enough for edge devices. Central to the architecture is the perturbed multi-branch variational module, which distills probabilistic identity embeddings while delivering calibrated confidence scores. Complementing this, an Adaptive Spectral Attention mechanism dynamically reweights each stream to amplify the most reliable facial features in real time. Unlike previous efforts that compartmentalize uncertainty handling, spectral blending, or computational thrift, HUE-Net unites all three in a lightweight package. Benchmarks on the IJB-C and N-SpectralFace datasets show that the system not only secures state-of-the-art accuracy but also exhibits unmatched spectral robustness and reliable probability calibration. The results indicate that HUE-Net is well-positioned for forensic missions and humanitarian scenarios where trustworthy identification cannot be deferred.

1. Introduction

Face recognition is a primary application of artificial intelligence in the contemporary world, with uses ranging from border security and surveillance to smartphone unlocking and even humanitarian aid [1]. A growing body of work in deep learning has enabled face recognition models to achieve nearly human accuracy in highly controlled settings [2], yet these models tend to perform poorly in realistic settings [3]. More specifically, occlusions [4], illumination changes [5], aging [6], trauma [7], low resolution [8], and environmental artifacts such as fog or motion blur [9] pose significant challenges. In such scenarios, traditional RGB recognition systems fail to generalize because of their narrow predictive horizons and inability to measure uncertainty in their decision-making processes. In addition, most face recognition systems are built as deterministic models, which limits the flexibility of their reasoning. While offering high accuracy on standard benchmarks, they lack confidence estimation mechanisms and do not adapt to variable input quality [10]. This shortfall is critical in humanitarian applications, such as post-disaster victim identification or forensic analysis, where incorrect predictions can have severe consequences and facial input is often degraded by injury, partial occlusion, or poor lighting [11]. Furthermore, the computational footprint of most state-of-the-art models makes them infeasible to deploy on edge devices in resource-constrained environments, where high-stakes, fast decision-making with minimal power usage is important [12].
Recent studies have explored different aspects of robust facial analysis under difficult imaging conditions. For instance, HyperFace [13] presents a deep fusion model for hyperspectral face recognition using high-dimensional spectral data, while MS-EVS [14] uses asynchronous multispectral event data for face detection in dynamic scenes. These works highlight the potential of spectral and temporal cues, yet they often lack uncertainty estimation or are limited to detection tasks. In parallel, probabilistic deep models such as BPMB [15] and RobustFace [16] have introduced Bayesian or ensemble-based calibration strategies, but these are typically computationally expensive or single-modality. HUE-Net is motivated by the need to unify these advances into a single, lightweight, multi-modal recognition system that is both uncertainty-aware and edge-deployable.
To address these issues, we developed HUE-Net: a Human-centric, Uncertainty-aware, Event-fused Network tailored for face recognition in adverse conditions. HUE-Net is built upon three guiding principles. First, it incorporates Cross-Spectrum Perception, integrating visible (RGB), near-infrared (NIR), and event-based representations to enhance robustness under a wide variety of lighting and motion conditions. Second, it incorporates uncertainty modeling through a variational multi-branch structure, empowering the network not just to make predictions but also to evaluate the confidence behind those predictions. Third, HUE-Net is crafted to be resource-efficient, blending convolutional and transformer-based modules equipped with spectral attention and low-rank transformations for real-time inference on edge devices. What distinguishes HUE-Net from its predecessors is the unified cooperation of perception, reasoning, and adaptability. Unlike prior models, which utilize multispectral inputs or probabilistic outputs separately, HUE-Net integrates all of these into one architecture that selectively directs attention to the most informative spectral channels and predicts uncertainty while still operating under hardware constraints. This makes HUE-Net one of the strongest contenders for state-of-the-art recognition accuracy and a reliable solution for mission-critical tasks in uncontrolled and ethically sensitive environments.
In this work, we focus on the architectural design and empirical validation of HUE-Net. We evaluate HUE-Net extensively on different challenging datasets and demonstrate that it achieves state-of-the-art results on the IJB-C unconstrained face dataset [17] and N-SpectralFace dataset [14]. The model also outperforms strong baselines in calibration, modality robustness, and inference efficiency. Our contributions can be summarized as follows:
  • We propose a novel hybrid deep learning architecture that fuses RGB, NIR, and event-based data streams through adaptive spectral attention, enabling spectral rebalancing under dynamic visual conditions.
  • We introduce a perturbed multi-branch variational module (PMB-VM) that enables both uncertainty estimation and probabilistic ensemble learning in a compact, end-to-end-trainable form.
  • We design a spectral consistency training loss and a robustness-aware evaluation protocol that collectively improve generalization in the presence of occlusion, injury, and missing modalities.
  • We validate our model on multiple benchmark datasets and edge deployment platforms, achieving new state-of-the-art performance in both accuracy and reliability.
The remainder of this paper is structured as follows: Section 2 reviews related work in multispectral vision, uncertainty-aware deep learning, and efficient face recognition. Section 3 introduces the architecture and methodological components of HUE-Net. Section 4 outlines the experimental setup, datasets, and metrics. Section 5 presents quantitative and qualitative results. Finally, Section 6 concludes with a summary of findings and future directions.

2. Related Works

HUE-Net is developed on an interdisciplinary basis of multispectral face recognition, event-based vision [18], and deep neural network probabilistic modeling [19], along with lightweight edge-deployable architectures [20]. This section discusses pivotal advancements within these fields that guide the framework of our model, while making note of the gaps that remain in practical, resource-limited, and uncertain environments for facial recognition systems (Table 1).
In the last decade, the application of deep learning to automated face recognition has yielded systems that function with remarkable accuracy in controlled environments. Models such as ArcFace [21], ElasticFace [22], and CosFace [23] deliver exceptional performance by projecting embeddings onto hyperspheres and applying angular margin-based loss optimizations. Yet these models are mostly trained on clean, balanced datasets under optimal conditions. Their real-world performance is greatly hindered by exogenous variables such as pose variation, occlusion, lighting shifts, and sensor artifacts, which introduce tremendous uncertainty. In addition, these models are tailored to single-modality RGB inputs and lack both strategies for handling spectral variability and mechanisms for confident prediction under uncertain conditions. To enhance reliability, other studies have introduced architectural and generative modifications. For instance, LightCNN [24] pioneered the use of noise-resilient convolutional filters, while the developers of DR-GAN [25] and TP-GAN [26] devised approaches aimed at neutralizing dependence on specific pose and illumination conditions. Even with these adjustments, such models remain anchored to deterministic RGB-only pipelines in which spectral reliability and predictive uncertainty, critical aspects of dependability for deployment in condition-critical or low-visibility situations, are unaccounted for. To address the limitations of RGB systems, multispectral approaches have garnered increased attention from researchers. By adding depth, thermal, or near-infrared (NIR) data, multispectral techniques try to acquire structural and physiological facial features that are invariant to lighting changes. It has been demonstrated that vascular patterns and fine textures invisible to RGB sensors become visible with NIR imaging [27], while thermal imaging captures sub-dermal heat distributions that are robust to occlusion and insensitive to lighting conditions [28]. However, many of these approaches rely on expensive and sophisticated sensor configurations, and as a result, most do not generalize well when some modalities are absent or degraded. This lack of robustness makes broad deployment impractical.
Event-based vision sensors have appeared on the scene as a compelling alternative to traditional frame-based imaging. These sensors capture changes in pixel intensity asynchronously, enabling ultra-low-latency capture, heightened temporal resolution, and remarkable stability during motion as well as in high-dynamic-range environments [29]. MS-EVS, a multispectral event-based framework for face detection [14], demonstrates the benefits of the early fusion of spectral cues for face detection within dynamic scenes. Yet the work was confined to detection, ignoring model uncertainty and identity-level recognition, both of which are fundamental for robust decision-making in complex environments. Uncertainty modeling has become a focal point in deep learning due to the need to address unpredictability in real-world deployments [19]. Conventional neural networks are often overconfident in their predictions, and when they are wrong, that confidence exacerbates the problem; this limits their applicability to forensics, healthcare, and emergency response, where trust in the technology carries high stakes [30]. Bayesian neural networks (BNNs) model epistemic uncertainty in a principled fashion by learning distributions over the model's weights rather than committing to deterministic values [31]. Exact inference in deep models is intractable, so approximate methods such as Monte Carlo Dropout [32], Bayes by Backprop [15], and Deep Ensembles [33] have been formulated. These techniques are very effective at generating accurate predictions and are especially useful for detecting out-of-distribution samples or ambiguous inputs [34].
Motivated by these developments, we propose PMB-VM, which extends uncertainty modeling to face recognition by performing stochastic sampling across multiple lightweight branches informed by Bayesian regularization. This allows the model to produce high-confidence embeddings and accurately calibrated variance estimates without significantly increasing the cost of embedding computation. Unlike prior models that treat uncertainty estimation as an add-on built from complex ensemble architectures, PMB-VM is seamlessly integrated into HUE-Net's pipeline and provides efficient performance. While earlier attempts have individually solved the problems of recognition accuracy, spectral variability, and uncertainty quantification, none have integrated all these facets into a single streamlined deployable framework. HUE-Net bridges this gap by combining multispectral event fusion, variational uncertainty modeling, and adaptive attention within an edge-efficient architecture. This enables the model not just to attain state-of-the-art recognition performance, but also to reason about feature quality and prediction confidence, which is vital for real-world face recognition in ethically sensitive, mission-critical settings.

3. Materials and Methods

HUE-Net was developed with the aim of implementing robust and accurate deep face recognition in the wild, where conditions are rarely favorable and spectral cues fluctuate in their usefulness. To this end, HUE-Net incorporates and unifies three key innovations into a streamlined deep learning framework: (1) a multispectral input representation that integrates RGB, NIR, and event-based sensing; (2) a compact yet highly expressive hybrid convolutional–transformer backbone; and (3) a probabilistic inference strategy that accounts for epistemic uncertainty. HUE-Net's overall architecture is designed to harness the spatial granularity of CNNs alongside the global context modeling of transformers while remaining lightweight enough for real-time edge-device inference. At the input stage, the network performs early fusion of aligned spectral modalities into a 6-channel tensor that encodes visual information in both the spectral and temporal domains. This fusion is further refined with an adaptive attention mechanism that dynamically adjusts the contribution of each modality based on content relevance and spatial reliability (Figure 1).
We propose the PMB-VM for model interpretability and robustness; it creates multiple stochastic embeddings through variational sampling, enabling uncertainty estimation and a form of ensemble regularization. Concurrently, the architecture employs a spectral consistency loss that aligns identity embeddings from different input streams, which promotes generalization when some modalities are rendered unusable. The components of HUE-Net are meticulously tuned to provide high spectral discrimination, robust spectral performance, well-calibrated confidence, and computational efficiency. In the following, we walk through the proposed HUE-Net framework in detail, beginning with the multispectral event fusion input structure and progressing to the backbone architecture, the uncertainty-aware variational module, the spectral attention mechanism, and lastly the composite loss function used for end-to-end training.

3.1. Multispectral Event Fusion

To achieve reliable facial recognition in highly unconstrained contexts like those with extreme lighting, occlusions, or injuries deforming the face, it is essential to exploit imaging cues beyond the RGB spectrum. HUE-Net employs a multispectral event fusion input strategy that combines three data modalities: RGB, NIR, and simulated event representations. This fusion aims to increase the spectral bandwidth as well as the temporal responsiveness of the model. Although the RGB and NIR modalities were acquired via asynchronous sequential capture (~10 ms interval), the static nature of subjects and the controlled capture timing allowed us to treat these frames as quasi-synchronous. We performed a two-stage alignment: a coarse affine transformation using mutual information, followed by optical flow-based warping to correct local motion. This strategy avoids assuming perfect temporal registration and provides accurate alignment even in the presence of small inter-frame shifts. This enables the model to perform better under difficult imaging conditions (Figure 2). HUE-Net receives an input tensor of six channels, which are arranged as follows:
$$X = [R, G, B, NIR, E_{RGB}, E_{NIR}] \in \mathbb{R}^{H \times W \times 6}$$
where $R$, $G$, $B$ correspond to the visible channels; $NIR$ denotes the near-infrared channel, captured at a wavelength of ~850 nm; and $E_{RGB}$, $E_{NIR}$ represent the RGB-based and NIR-based simulated event channels, respectively. All channels are spatially registered and normalized to a dynamic range of [0, 1] prior to fusion.
In HUE-Net, each modality plays a specific and complementary role in preserving facial recognition under diverse and difficult conditions. The RGB input provides photometric data such as skin color, texture, and facial features that are recognizable to humans and highly effective under proper lighting. The NIR channel, on the other hand, penetrates deeper into the skin layers and is much more useful in low-light or backlit situations because it is less sensitive to illumination variation. The event-based channels, meanwhile, detect temporal contrast changes in pixel intensity over time and work well where motion blur, flickering lights, or fine-grained obstruction occurs. Together, these modalities enrich both appearance and structural perception, allowing HUE-Net to offer robust face recognition over a wide range of imaging disturbances. Because of the limited availability of real event-based face recognition datasets, we employ a simulation-based approximation influenced by earlier studies in neuromorphic vision. Given a pair of temporally adjacent frames $I_t$ and $I_{t-\delta}$, the event frame $E$ is computed via a logarithmic differencing mechanism:
$$E_c(x, y) = \begin{cases} +1, & \log I_t^c(x, y) - \log I_{t-\delta}^c(x, y) > \theta^{+} \\ -1, & \log I_t^c(x, y) - \log I_{t-\delta}^c(x, y) < -\theta^{-} \\ 0, & \text{otherwise} \end{cases}$$
where $c \in \{R, G, B, NIR\}$ is the spectral channel, $(x, y)$ are the pixel coordinates, and $\theta^{+}, \theta^{-}$ are empirically tuned contrast thresholds. The resulting sparse event maps are then smoothed using a Gaussian kernel and stacked alongside the original modalities to generate the final input tensor. This simulation paradigm captures motion-induced or photometric discontinuities that standard frame-based representations may overlook. It particularly enhances the model's ability to detect facial contours and edges in the presence of occlusion or spectral interference.
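To make the event-channel simulation concrete, the snippet below sketches the logarithmic differencing and Gaussian smoothing described above, together with the assembly of the 6-channel input tensor; the threshold and smoothing values are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_event_frame(frame_t, frame_prev, theta_pos=0.15, theta_neg=0.15,
                         sigma=1.0, eps=1e-6):
    """Approximate an event channel from two temporally adjacent frames.

    frame_t, frame_prev: float arrays in [0, 1] for one spectral channel.
    theta_pos / theta_neg: contrast thresholds (illustrative values).
    """
    delta_log = np.log(frame_t + eps) - np.log(frame_prev + eps)
    events = np.zeros_like(delta_log)
    events[delta_log > theta_pos] = 1.0     # positive contrast change
    events[delta_log < -theta_neg] = -1.0   # negative contrast change
    # Smooth the sparse event map before stacking, as described in the text.
    return gaussian_filter(events, sigma=sigma)

def build_input_tensor(rgb_t, rgb_prev, nir_t, nir_prev):
    """Assemble X = [R, G, B, NIR, E_RGB, E_NIR] with shape (H, W, 6)."""
    gray_t, gray_prev = rgb_t.mean(axis=-1), rgb_prev.mean(axis=-1)
    e_rgb = simulate_event_frame(gray_t, gray_prev)
    e_nir = simulate_event_frame(nir_t, nir_prev)
    return np.concatenate(
        [rgb_t, nir_t[..., None], e_rgb[..., None], e_nir[..., None]], axis=-1
    )
```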
Multispectral and event channels often suffer from misalignment, primarily due to differences in sensor configurations and the asynchronous nature of their data acquisition. To overcome these discrepancies, HUE-Net employs a multi-step registration strategy. First, affine alignment is performed by optimizing cross-spectral mutual information, particularly between the RGB and NIR domains, ensuring a coarse spatial alignment. This is followed by optical flow refinement, which further corrects for subtle micro-motions and dynamic texture mismatches. In the end, histogram matching is applied to harmonize the distribution of intensity values across different modalities, reducing spectral inconsistency. Once the alignment process is complete, all input channels are resized to a standard resolution of 224 × 224 and undergo z-score normalization. This preprocessing pipeline plays a crucial role in promoting training stability and reducing biases that may arise from modality-specific variations.
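The following sketch illustrates one possible implementation of this coarse-to-fine registration using OpenCV and scikit-image. Note that cv2.findTransformECC (an intensity-correlation criterion) is used here only as a stand-in for the mutual-information optimization described in the text, and all parameter values are assumptions.

```python
import cv2
import numpy as np
from skimage.exposure import match_histograms

def register_nir_to_rgb(rgb_gray, nir):
    """Align a NIR frame to an RGB reference: affine -> optical flow -> histogram matching.

    rgb_gray, nir: float32 arrays in [0, 1] with identical shapes.
    """
    h, w = rgb_gray.shape

    # 1) Coarse affine alignment (ECC criterion as a stand-in for mutual information).
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    _, warp = cv2.findTransformECC(rgb_gray, nir, warp, cv2.MOTION_AFFINE, criteria)
    nir_affine = cv2.warpAffine(nir, warp, (w, h),
                                flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)

    # 2) Dense optical flow refinement for residual micro-motions.
    ref_u8 = (rgb_gray * 255).astype(np.uint8)
    mov_u8 = (nir_affine * 255).astype(np.uint8)
    flow = cv2.calcOpticalFlowFarneback(ref_u8, mov_u8, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    nir_refined = cv2.remap(nir_affine, map_x, map_y, cv2.INTER_LINEAR)

    # 3) Histogram matching to harmonize intensity distributions across modalities.
    return match_histograms(nir_refined, rgb_gray)
```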
Unlike many existing architectures that rely on late fusion—processing each modality in isolation before merging their features at a later stage—HUE-Net adopts an early-fusion approach. This design choice stems from recognition of the fact that identity-relevant cues are often not confined to a single modality but are distributed across spectral domains and tend to manifest as correlated patterns in the early layers of a network. By integrating modalities at the input level, HUE-Net enables the learning of shared low-level filters that can jointly encode cross-spectral relationships and spatio-temporal dynamics from the outset. This strategy enhances the model’s ability to extract robust, multi-modal representations and offers practical advantages in terms of computational efficiency. Early fusion eliminates the need for separate processing branches, thereby reducing parameter complexity and improving inference speed—an essential consideration for deployment on resource-constrained edge devices.

3.2. Backbone Architecture

To achieve high recognition accuracy under complex visual conditions while maintaining computational efficiency, HUE-Net introduces a novel hybrid backbone architecture that synergistically combines the local inductive biases of Convolutional Neural Networks (CNNs) with the global contextual modeling capabilities of transformer modules. The design is tailored specifically to accommodate multispectral and event-based inputs while being lightweight enough for deployment on edge devices such as the Jetson Nano and mobile SoCs.
The HUE-Net architecture is built around a carefully calibrated sequence of four interdependent stages to optimize accuracy and efficiency. It begins with two convolutional stages built from depthwise separable convolutions, which extract local features, for example facial shapes and textures, with a greatly reduced computational burden. Transformer modules with Split Transpose Depthwise Attention (STDA) are incorporated in the latter two stages; such modules provide the global-level reasoning the model needs to interpret broader spatial relationships as well as the spectral interactions necessary for distinguishing identities in complex scenes. After this step, LoRaLin, a low-rank linear transformation layer, acts as a bottleneck that compresses the increasingly abstract features into a compact yet informative representation. The final step is an output projection layer that prepares the resulting embeddings for angular margin-based classification. All these steps combine into one flow that lets HUE-Net learn powerful facial representations while remaining lightweight and fast enough for real-time use in constrained settings. This hybrid architecture is formally denoted as
$$f(X) = \mathrm{Proj}\big(\mathrm{LoRaLin}\big(T_4(T_3(C_2(C_1(X))))\big)\big)$$
where $X \in \mathbb{R}^{H \times W \times 6}$ is the fused multispectral event input, $C_i$ are convolutional stages, $T_i$ are transformer stages, and $\mathrm{Proj}$ is the output embedding head.
The first two stages utilize inverted residual bottleneck blocks with depthwise separable convolutions, adapted from MobileNetV3, to efficiently capture spatial patterns including edges, contours, and localized texture patches. Each block is defined as
$$F_l = \sigma\big(\mathrm{BN}\big(\mathrm{DWConv}(\mathrm{PWConv}(F_{l-1}))\big)\big)$$
where $\mathrm{PWConv}$ denotes a pointwise 1 × 1 convolution, $\mathrm{DWConv}$ is a depthwise convolution, $\mathrm{BN}$ is batch normalization, and $\sigma$ is the Swish activation. These blocks significantly reduce the computational burden (FLOPs) without sacrificing representational power.
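A minimal PyTorch sketch of such an inverted residual block with depthwise separable convolutions is shown below; the expansion ratio and channel sizes are illustrative assumptions, not the exact HUE-Net configuration.

```python
import torch
import torch.nn as nn

class InvertedResidualDS(nn.Module):
    """Inverted residual bottleneck with depthwise separable convolutions."""

    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # Pointwise 1x1 expansion (PWConv).
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),  # Swish activation
            # Depthwise 3x3 convolution (DWConv).
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            # Pointwise 1x1 projection back down.
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Example: a stage-1 block applied after a hypothetical stem convolution.
block = InvertedResidualDS(in_ch=32, out_ch=32)
y = block(torch.randn(1, 32, 112, 112))
```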
To capture long-range dependencies and intricate relationships between spectral channels, the final stages of HUE-Net integrate modified transformer blocks built on the STDA mechanism. Unlike classical self-attention, whose computational cost grows quadratically with the number of tokens, STDA follows a more streamlined workflow: input tokens are split into nonoverlapping patches to allow localized processing, the patches are then transposed from the spatial to the channel dimension to enable cross-spectral mixing, and depthwise separable attention is finally applied to extract global context at minimal cost. With this design, the network can reason across space and the spectrum simultaneously, improving identity recognition in visually degraded or spectrally distorted conditions. Each transformer block carries out the following operations in order:
$$F_l = \mathrm{MLP}\big(\mathrm{STDA}(\mathrm{LN}(F_{l-1}))\big) + F_{l-1}$$
where LN refers to layer normalization, MLP is a two-stage feed-forward network, and residual connections are kept to preserve undisrupted gradient flow. This captures high-order spatial and spectral interactions, further improving model performance when identifying subjects despite extreme occlusion or distortion. To decrease the computational burden, the high-dimensional feature tensor output by Stage 4 is passed through a LoRaLin (low-rank linear) transformation. Instead of using a standard fully connected (FC) layer with a weight matrix $W \in \mathbb{R}^{d_{in} \times d_{out}}$, we decompose it into two smaller matrices:
$$W \approx AB, \quad A \in \mathbb{R}^{d_{in} \times r}, \quad B \in \mathbb{R}^{r \times d_{out}}, \quad r \ll \min(d_{in}, d_{out})$$
This approach balances the number of parameters and matrix multiplications while preserving adequate expressive power. Based on our observations, an $r$ value of 32 offers the best balance of efficiency and accuracy. The final output of the backbone is an $\ell_2$-normalized identity embedding vector $z \in \mathbb{R}^{128}$, computed as
$$z = \frac{h}{\lVert h \rVert_2}, \quad h = \mathrm{LoRaLin}(F_4)$$
The model is trained with an additive angular margin softmax loss (ArcFace or ElasticFace variants) applied to this normalized vector, maintaining strong separation between different classes and compactness within the same class. The full backbone holds around 6.4 million parameters and consumes about 180 MFLOPs per forward pass with a 224 × 224 input. Relative to traditional backbones such as ResNet-50 or ViT-B, this is a 10–15-fold reduction in memory and computational expense, making the model extremely efficient for embedded inference.
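The low-rank projection and the normalized embedding head can be sketched in a few lines of PyTorch; the sketch below assumes the rank r = 32 and the 128-dimensional embedding stated above, while the input feature dimension is an illustrative placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRaLin(nn.Module):
    """Low-rank linear layer: W ≈ A B with A in R^{d_in x r}, B in R^{r x d_out}."""

    def __init__(self, d_in, d_out, rank=32):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # d_in -> r
        self.B = nn.Linear(rank, d_out, bias=True)   # r -> d_out

    def forward(self, x):
        return self.B(self.A(x))

class EmbeddingHead(nn.Module):
    """Pools the Stage-4 feature map and emits an l2-normalized 128-d embedding."""

    def __init__(self, feat_dim=512, emb_dim=128, rank=32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = LoRaLin(feat_dim, emb_dim, rank)

    def forward(self, feat_map):                 # feat_map: (B, C, H, W)
        h = self.pool(feat_map).flatten(1)        # (B, C)
        return F.normalize(self.proj(h), dim=1)   # z = h / ||h||_2

z = EmbeddingHead()(torch.randn(2, 512, 7, 7))    # (2, 128), unit-norm rows
```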

3.3. Perturbed Multi-Branch Variational Module

Face recognition technologies pose enormous risks in forensic identification and disaster response applications when used under non-ideal conditions, such as occlusions, visual noise, partial facial trauma, or uncontrolled spectral quality. Uncertainty estimation is crucial in such high-stakes scenarios, where the workload is far more difficult than the system has seen, especially since conventional deterministic deep neural networks are extremely overconfident in their predictions on even mildly degraded or unfamiliar inputs. To mitigate this limitation, HUE-Net implements the perturbed multi-branch variational module (PMB-VM), one of its main structural elements, intended to capture and quantify predictive uncertainty. Drawing on Bayesian neural networks and stochastic dropout, PMB-VM replaces a deterministic mapping with a probabilistic one, enabling the model not only to learn robust facial representations but also to estimate confidence in its outputs through variance across multiple sampled branches. Within this framework, two main types of uncertainty are accounted for: epistemic uncertainty, arising from gaps in the model's knowledge due to insufficient data or representation limitations, and aleatoric uncertainty, caused by the signal's inherent variability and noise, such as occluded or poorly lit faces. By tackling both, PMB-VM improves model performance in the unpredictable environments commonly encountered in real-world face recognition tasks. PMB-VM introduces randomness across five slender branches at training time, stitching together depthwise separable convolutions, variational linear layers, and standard Monte Carlo Dropout. Each branch computes an identity code while simultaneously recording uncertainty as the spread of angular distances among the sampled codes. Weight posteriors are approximated through the Bayes-by-Backprop routine, with optimization directed at the Evidence Lower Bound. During inference, ten separate forward passes with dropout still engaged feed into a pooling step that produces the final embedding and its confidence score.
Data augmentation and noise modeling techniques can partially alleviate aleatoric uncertainty. On the other hand, epistemic uncertainty requires probabilistic reasoning for model parameters. Therefore, we use a variational Bayesian approximation in which every branch in PMB-VM approximates the posterior distribution over parameters by a Gaussian distribution with a mean and variance that can be adjusted through learning:
$$q_\phi(\omega) = \mathcal{N}(\omega; \mu, \sigma^2), \quad \phi = \{\mu, \sigma\}$$
This enables the network to capture multiple realizations of its parameters during inference, yielding a distribution over output embeddings corresponding to the model's level of confidence. The PMB-VM module is integrated after the final backbone stage and comprises $M$ independent stochastic branches. Each of these branches is designed as a lightweight residual block followed by a variational sampling layer. Each branch shares the same structural template but is perturbed during training via stochastic dropout and noise-injected weight sampling. $F \in \mathbb{R}^{d \times d \times C}$ denotes the high-level feature map from the backbone. Each branch $m \in \{1, \dots, M\}$ applies
$$z_m = f_m(F; \omega_m), \quad \omega_m \sim q_\phi(\omega)$$
The outputs of all branches are aggregated via expectation:
$$\bar{z} = \frac{1}{M}\sum_{m=1}^{M} z_m, \quad \mathrm{Var}(z) = \frac{1}{M}\sum_{m=1}^{M} \big(z_m - \bar{z}\big)^2$$
Here, $\bar{z}$ is the final deterministic embedding used for recognition, while the variance provides a calibrated measure of predictive uncertainty.
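A minimal PyTorch sketch of this multi-branch sampling and aggregation is given below; the branch architecture is simplified to a dropout-perturbed linear residual block, the branch count and dropout rate follow the values reported later in the paper (M = 5, p = 0.3), and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class StochasticBranch(nn.Module):
    """Simplified PMB-VM branch: a dropout-perturbed residual projection."""

    def __init__(self, dim=128, p=0.3):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.drop = nn.Dropout(p)

    def forward(self, f):
        # Dropout stays active (MC-Dropout) so each pass samples a new realization.
        return f + self.drop(self.fc(f))

class PMBVM(nn.Module):
    """Perturbed multi-branch module: mean embedding plus per-dimension variance."""

    def __init__(self, dim=128, num_branches=5, p=0.3):
        super().__init__()
        self.branches = nn.ModuleList(
            StochasticBranch(dim, p) for _ in range(num_branches)
        )

    def forward(self, f):
        samples = torch.stack([branch(f) for branch in self.branches])  # (M, B, D)
        z_mean = samples.mean(dim=0)                    # z_bar: embedding for recognition
        z_var = ((samples - z_mean) ** 2).mean(dim=0)   # Var(z): predictive uncertainty
        return z_mean, z_var

pmbvm = PMBVM()
pmbvm.train()  # keep dropout stochastic even when used for Monte Carlo inference
z_bar, z_var = pmbvm(torch.randn(4, 128))
```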
Each branch within the perturbed multi-branch variational module (PMB-VM) is meticulously designed to balance computational load against uncertainty modeling capacity. The core of each branch is a residual structure instantiated with depthwise separable convolutions, which reduce the number of parameters significantly while still retaining the network's ability to extract meaningful features. For stochastic control during training, Monte Carlo Dropout (MC-Dropout) is employed at both the spatial and channel levels, effectively mimicking a variety of realizations of the network and increasing robustness to input alterations. The last step of every branch is a Bayesian fully connected layer trained with the Bayes-by-Backprop approach, which optimizes the variational parameters by maximizing the Evidence Lower Bound (ELBO), encouraging the learned weight distributions to remain close to the prior rather than becoming too strongly data-dependent. With this design, each branch contributes to a diverse but consistent ensemble of predictions, enabling HUE-Net not only to make precise identity inferences but also to quantify prediction confidence:
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(\omega)}\big[\log p(y \mid x, \omega)\big] - \mathrm{KL}\big(q_\phi(\omega)\,\|\,p(\omega)\big)$$
In this variational framework, $p(\omega)$ represents the prior distribution over model parameters, typically assumed to be a standard normal. The Kullback–Leibler (KL) divergence serves to regularize the learned variational distribution so that it does not diverge excessively from the prior. For inference, HUE-Net performs several stochastic forward passes over the variational branches to produce a set of sampled embeddings. These embeddings are averaged to form a robust identity representation, while the variability around this average, captured by the standard deviation of angular distances between the embeddings and class centroids, serves as an empirical measure of prediction confidence. The ability to estimate uncertainty provides several important benefits. First, it enables confidence-based decision-making, allowing the system to withhold low-confidence predictions or route them for manual review in sensitive contexts. Second, it adds robustness by enabling the detection of adversarial or outlier inputs, where unusually high uncertainty may indicate occlusion, spoofing, or an unknown identity. Lastly, during training, uncertainty estimates can inform curriculum learning approaches in which samples with higher uncertainty are emphasized to enhance generalization on challenging inputs.

3.4. Adaptive Spectral Attention

In real-world face recognition scenarios, RGB, near-infrared (NIR), and event data are captured under widely varying conditions, and their informative value differs accordingly. Under harsh lighting, RGB channels may suffer from overexposure that masks facial detail. NIR is largely resilient to illumination changes, but in certain indoor conditions, it is prone to passive infrared interference. Event-based information, while temporally rich, is spatially sparse and noisy. Treating all modalities uniformly may therefore dilute, rather than strengthen, the identity representation. This challenge is addressed through the Adaptive Spectral Attention (ASA) mechanism, which allows HUE-Net to amplify or suppress each modality according to its contextual relevance. The attention mechanism acts as a smart, dynamic gate: it flexibly determines where spatial focus should be applied and which modalities should be prioritized, depending on context.
The ASA process begins by capturing global context from the fused input tensor. $X \in \mathbb{R}^{H \times W \times C}$ denotes the fused input, where the $C = 6$ channels correspond to RGB, NIR, and two event-based maps. To summarize the spectral context, global average pooling is applied across the spatial dimensions:
$$s_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{i,j,c}, \quad \text{for } c = 1, \dots, C$$
This produces a channel descriptor $s \in \mathbb{R}^{C}$ that captures modality-specific global statistics. To model non-linear interactions between these spectral channels, we apply a two-layer gating network, essentially a bottlenecked multi-layer perceptron (MLP):
$$a_{spec} = \sigma\big(W_2 \cdot \delta(W_1 \cdot s)\big)$$
Here, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, $\delta(\cdot)$ is the ReLU activation function, and $\sigma(\cdot)$ is the sigmoid function. The reduction ratio $r$ controls the bottleneck width and is typically set to 2 in our configuration. The output $a_{spec} \in \mathbb{R}^{C}$ is a learned set of attention weights that dynamically scale each modality's contribution. These attention weights are broadcast and applied to the corresponding input channels:
$$\tilde{X}_{i,j,c} = a_{spec}^{(c)} \times X_{i,j,c}$$
This operation yields a recalibrated tensor $\tilde{X} \in \mathbb{R}^{H \times W \times C}$, where less reliable spectral sources are suppressed and more informative ones are emphasized. To build on this modulation, especially when modality quality differs across facial regions, we add an optional spatial refinement stage. This block obtains a spectral–spatial attention map $\alpha \in \mathbb{R}^{H \times W \times C}$ through the use of depthwise separable convolutions followed by a channel-wise softmax at each spatial position:
$$\alpha_{i,j,c} = \frac{\exp\big((K_c * \tilde{X}_c)_{i,j}\big)}{\sum_{c'=1}^{C} \exp\big((K_{c'} * \tilde{X}_{c'})_{i,j}\big)}$$
where $*$ denotes a convolution with learned kernel $K_c$. The final attended representation is then computed via elementwise multiplication: $X_{att} = \alpha \odot \tilde{X}$.
This makes the attention both modality-aware and spatially adaptive, so the model can rely on whichever information is most dependable across the face. All parts of the ASA module are differentiable and are therefore trained jointly with the rest of the network. The module learns modality preferences that align with the statistical properties of the scene. For example, in low-light scenes, ASA progressively increases the weights assigned to the NIR and event channels while suppressing the contribution of RGB; conversely, in bright indoor conditions with stable illumination, RGB can become dominant again. This behavior is not imposed as a constraint but emerges from the interactions captured by the attention network. The ASA module remains lightweight, comprising less than 1% of the total parameters and FLOPs in HUE-Net, yet its impact is substantial: in controlled ablation studies, enabling ASA increased top-1 accuracy by over 2% while markedly improving reliability under occlusion and noise. Beyond these gains, ASA also offers explainability (XAI) benefits. By visualizing the attention weights and maps, one can infer which modalities the model relies on, shedding light on its decision-making process. For practical purposes, such as forensic identification or recognizing victims of a disaster, this transparency is important for validation and trust.
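A compact PyTorch sketch of the channel-gating part of ASA (global average pooling, bottlenecked MLP, sigmoid reweighting) is shown below; the optional spatial refinement stage is omitted, and the reduction ratio follows the r = 2 setting mentioned above.

```python
import torch
import torch.nn as nn

class AdaptiveSpectralAttention(nn.Module):
    """Channel-gating part of ASA: reweights the C = 6 fused input channels."""

    def __init__(self, channels=6, reduction=2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):                    # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))               # global average pooling -> (B, C)
        a_spec = self.gate(s)                # per-modality attention weights
        return x * a_spec[:, :, None, None]  # broadcast channel reweighting

asa = AdaptiveSpectralAttention()
x_tilde = asa(torch.randn(2, 6, 224, 224))
```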

3.5. Loss Function

A comprehensive face recognition system must learn to produce highly discriminative embeddings even in the presence of noisy labels, modality imbalance, and unpredictable visual distortions. These models undergo even more significant strain when diverse input spaces such as RGB, NIR, and event streams are utilized, especially when the model is uncertainty-aware, as in HUE-Net (Figure 3).
To accomplish these goals, we develop a composite loss function that simultaneously optimizes identity separability, uncertainty estimation regularization, and inter-modal spectral consistency reinforcement. Fundamental to our approach is the angular margin softmax loss, which has gained immense popularity in the recent face recognition literature due to its efficacy in augmenting inter-class separation and the compactness of intra-class embeddings. We enhance this with two auxiliary loss components, one based on Bayesian variational inference and the other on cross-modal embedding alignment, to provide strong guidance in the deterministic and stochastic sections of the network. The composite objective function can be stated as
$$\mathcal{L}_T = \mathcal{L}_{id} + \lambda_1 \mathcal{L}_{uncertainty} + \lambda_2 \mathcal{L}_{consistency}$$
where $\mathcal{L}_{id}$ governs the classification performance, $\mathcal{L}_{uncertainty}$ regulates variational inference in the perturbed multi-branch variational module (PMB-VM), and $\mathcal{L}_{consistency}$ encourages stability across spectral embeddings. The coefficients $\lambda_1$ and $\lambda_2$ are empirically chosen to balance the relative contributions of each component. To enforce discriminative power in the identity embedding space, we adopt the ArcFace loss (also known as additive angular margin softmax):
$$\mathcal{L}_{id} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s \cdot \cos(\theta_{y_i} + m)}}{e^{s \cdot \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cdot \cos\theta_j}}$$
Here, $N$ is the batch size, $\theta_j$ is the angle between the normalized embedding vector $z_i$ and the normalized weight vector $\omega_j$ for class $j$, $s$ is a scaling factor, and $m$ is the additive angular margin that enhances decision boundaries. This formulation encourages embeddings of the same identity to cluster tightly while imposing large angular separations between different identities. Since HUE-Net applies variational sampling in the PMB-VM to estimate epistemic uncertainty, it is crucial to constrain the process with a probabilistic objective. Under the Bayes-by-Backprop paradigm, a Kullback–Leibler divergence term is incorporated between the learned variational posterior $q(\omega) = \mathcal{N}(\mu, \sigma^2)$ and a prior $p(\omega) = \mathcal{N}(0, 1)$:
$$\mathcal{L}_{uncertainty} = \mathrm{KL}\big(q(\omega)\,\|\,p(\omega)\big) = \frac{1}{2}\sum_{i=1}^{d}\left(\log\frac{1}{\sigma_i^2} + \sigma_i^2 + \mu_i^2 - 1\right)$$
This regularization guarantees that the learned weights of each stochastic branch do not diverge too far from the prior, which helps mitigate overconfidence and stabilizes uncertainty estimation. In addition, the network encourages consistent outputs by sampling stochastic embeddings multiple times during training and averaging the predictions. While ASA ensures that the network focuses on the most relevant modalities, it is equally vital that identity embeddings do not vary across spectral domains. To achieve this, we propose a spectral consistency loss that penalizes the Euclidean distance between embeddings sourced from different spectral subsets, such as RGB-only and NIR-only:
$$\mathcal{L}_{consistency} = \frac{1}{N}\sum_{i=1}^{N}\sum_{(m,n)} \big\lVert z_i^{(m)} - z_i^{(n)} \big\rVert_2^2$$
where $z_i^{(m)}$ and $z_i^{(n)}$ are the identity embeddings computed from two distinct modality subsets for the same sample. This consistency constraint motivates the model to embed identical identities similarly irrespective of modality differences, increasing robustness to modality dropout, partial occlusion, or sensor failure. During training, all three loss components are optimized simultaneously. The parameters $\lambda_1 = 1 \times 10^{-3}$ and $\lambda_2 = 0.1$ were observed to achieve stable convergence across all datasets. Notably, the spectral consistency loss acts as a strong regularizer, considerably enhancing generalization, especially with unseen modalities and during cross-domain evaluations. To maintain efficiency and stability, we do not compute all $O(B^2 C^2)$ pairwise distances. Instead, we sample a fixed number ($K = 32$) of inter-modal pairs per batch, ensuring that each modality pair is sampled uniformly. This preserves gradient diversity without incurring quadratic memory cost. In practice, this subsampled spectral consistency loss closely approximates the full version and supports real-time training. Altogether, this multi-objective formulation lets HUE-Net pursue three critical objectives at once: high recognition precision, well-calibrated prediction uncertainty, and spectral robustness. Rather than addressing these objectives separately, the loss compels the model to fulfill them simultaneously, permitting high-certainty predictions in visually complex or spectrum-shifted scenarios.
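To make the composite objective concrete, the sketch below combines an ArcFace-style identity loss, the KL regularizer of the variational branches, and a subsampled spectral consistency term in PyTorch. The weighting coefficients follow the values stated above, while the margin and scale are common defaults rather than values reported by the authors, and the tensor arguments are placeholders for the actual model outputs.

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Additive angular margin softmax over l2-normalized embeddings and class weights."""
    cos = F.normalize(embeddings) @ F.normalize(class_weights).t()   # (B, num_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.size(1)).bool()
    logits = s * torch.where(target, torch.cos(theta + m), cos)
    return F.cross_entropy(logits, labels)

def kl_gaussian(mu, log_sigma2):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over variational parameters."""
    return 0.5 * torch.sum(-log_sigma2 + log_sigma2.exp() + mu ** 2 - 1)

def spectral_consistency(emb_by_modality, num_pairs=32):
    """Subsampled squared L2 distance between embeddings from different modality subsets."""
    names = list(emb_by_modality)
    batch = emb_by_modality[names[0]].size(0)
    loss = emb_by_modality[names[0]].new_zeros(())
    for _ in range(num_pairs):
        a, b = torch.randperm(len(names))[:2].tolist()   # two distinct modality subsets
        i = torch.randint(batch, (1,)).item()            # one shared sample
        diff = emb_by_modality[names[a]][i] - emb_by_modality[names[b]][i]
        loss = loss + diff.pow(2).sum()
    return loss / num_pairs

def total_loss(emb, class_weights, labels, mu, log_sigma2, emb_by_modality,
               lam1=1e-3, lam2=0.1):
    """L_T = L_id + lambda1 * L_uncertainty + lambda2 * L_consistency."""
    return (arcface_loss(emb, class_weights, labels)
            + lam1 * kl_gaussian(mu, log_sigma2)
            + lam2 * spectral_consistency(emb_by_modality))
```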

4. Experiments

To thoroughly assess the accuracy and dependability of HUE-Net, we focused on robustness in severe visual conditions and performed multi-faceted tests using both real-world and synthetic benchmarks. Our objectives were fourfold: (1) measure raw recognition accuracy, (2) evaluate performance under modality impairment and heavy occlusion, (3) assess the quality of uncertainty estimation, and (4) measure inference efficiency relative to peers. The assessment covered extreme identification difficulties such as illumination change, motion blur, facial disfigurement, and cross-modal identification, which are commonplace in humanitarian, forensic, and surveillance settings.

4.1. Dataset

To evaluate the cross-domain and cross-modal capabilities of HUE-Net, we worked with three benchmark datasets. The IJB-C dataset [17] is a well-known benchmark that includes face images captured in unconstrained environments, presenting challenges such as low resolution, lighting variation, and pose. It acts as a central benchmark for testing identification performance in surveillance environments (Table 2).
The N-SpectralFace [14] dataset is a compiled set merging simulated and real-world RGB, NIR, and event data captured as video streams of face images. It contains both aligned and misaligned halves as well as varying motion and lighting, and it is intended to evaluate the spectral generalization and alignment invariance of multispectral models. RGB–NIR pairs were captured using synchronized sensors, while event frames were synthesized from high-frame-rate video clips using temporal differencing filters. All datasets were divided into training, validation, and testing subsets following consistent splitting rules. No identity was shared between the training and test sets, ensuring a controlled evaluation of generalization.
Face images entered the pipeline in batches, were detected and cropped with MTCNN, and then resized to 224 × 224 pixels via five-landmark registration. Across the three input modalities, pixel values were first clipped to the [0, 1] range and then z-score-normalized with statistics drawn solely from the training subset. Temporal event frames were synthesized on the fly by applying a difference filter followed by a Gaussian blur. To enrich the training corpus, a suite of modality-consistent augmentations was employed: random horizontal flips, 5–10 random crops, and light Gaussian noise (σ between 0.01 and 0.05) in both RGB and NIR. Photometric jitter (tweaks to brightness and contrast) completed the recipe, reinforcing model generalization, particularly for HUE-Net, which is sensitive to spectral interplay.

4.2. Evaluation Metrics

To evaluate the performance of HUE-Net, we systematically used a set of quantitative metrics that measure recognition accuracy, calibration accuracy, and robustness to modality degradation. These criteria were selected to encompass traditional face recognition evaluation as well as the distinctive features of HUE-Net’s implementation, especially its probabilistic modeling and multispectral versatility. As the primary metric for classification, we present top-1 identification accuracy. This is defined as
$$\mathrm{Acc}_{top\text{-}1} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(\hat{y}_i = y_i\big)$$
where $N$ stands for the total number of test samples, $y_i$ represents the actual identity label, $\hat{y}_i$ is the predicted label, and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 when the condition is satisfied. Furthermore, we assessed the Verification Rate at a fixed false acceptance rate (VR@FAR) on IJB-C [17], as performed in open-set recognition test protocols. This metric, like the others, is derived from the ROC curve and is calculated as
$$\mathrm{VR@FAR} = \max\{\mathrm{TPR} \mid \mathrm{FPR} \le \tau\}$$
where TPR and FPR denote the true positive rate and false positive rate, respectively, and $\tau$ is the FAR threshold. The ROC AUC (Receiver Operating Characteristic Area Under the Curve) is used to summarize verification performance over all operating points. It is defined as the integral of the TPR over the FPR:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$$
This metric captures the likelihood that a randomly selected positive pair (identical identity) is scored higher than a randomly selected negative pair (different identity).
To assess how well the model's predicted confidence matches the actual likelihood of being correct, we calculate the Expected Calibration Error (ECE). Given a set of predictions with confidence values, ECE quantifies the weighted average disparity between accuracy and confidence across M bins:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|$$
Here, $B_m$ is the set of samples falling into the $m$-th confidence bin, $\mathrm{acc}(B_m)$ is the empirical accuracy in that bin, and $\mathrm{conf}(B_m)$ is the mean predicted confidence. Lower ECE values indicate better-calibrated predictions.
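For reference, a straightforward NumPy computation of ECE with equal-width confidence bins is sketched below; the bin count is an illustrative choice rather than a value reported in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, num_bins=15):
    """ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)| over equal-width bins."""
    confidences = np.asarray(confidences)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc_bin = correct[in_bin].mean()        # empirical accuracy in the bin
            conf_bin = confidences[in_bin].mean()   # mean predicted confidence
            ece += (in_bin.sum() / n) * abs(acc_bin - conf_bin)
    return ece
```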
The Brier score offers a complementary view of probabilistic accuracy, especially for softmax outputs. It measures the mean squared error between the predicted probability $\hat{p}_i$ and the true class indicator $y_i \in \{0, 1\}$:
$$\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N} \big(\hat{p}_i - y_i\big)^2$$
Smaller values indicate predictions that are both confident and correct; both overconfidence and underconfidence are penalized.
To quantify robustness under partial spectral loss, we propose a new metric, the Spectral Robustness Score (SRS), which assesses accuracy degradation when one or more modalities are blocked during inference. $\mathrm{Acc}_{full}$ denotes the accuracy using all modalities, and $\mathrm{Acc}_{masked}^{(c)}$ is the accuracy with channel $c$ dropped. The average degradation is computed as
$$\mathrm{SRS} = \mathrm{Acc}_{full} - \frac{1}{C}\sum_{c=1}^{C} \mathrm{Acc}_{masked}^{(c)}$$
Lower values of SRS indicate greater tolerance to absent or degraded spectral inputs, reflecting an enhanced capability for cross-modality generalization. To assess the feasibility of deployment, we report the model parameter count P, the floating-point operations (FLOPs) of a single forward pass, and the inference latency T on GPUs as well as edge devices. These measurements indicate the burden the model imposes under real-time responsiveness constraints:
$$P = \sum_{l=1}^{L} \dim(W_l), \quad T = \mathrm{mean}(\text{runtime per sample})$$
where $W_l$ are the trainable weight matrices in layer $l$. Together, these metrics provide a comprehensive view of HUE-Net's capabilities, from recognition accuracy and confidence calibration to operational efficiency and spectral robustness, highlighting its suitability for real-world, safety-critical applications.
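A minimal sketch of the SRS computation from per-channel masked accuracies follows; the evaluate_accuracy callback and the channel list are placeholders standing in for the actual evaluation harness.

```python
def spectral_robustness_score(evaluate_accuracy,
                              channels=("R", "G", "B", "NIR", "E_RGB", "E_NIR")):
    """SRS = Acc_full - mean_c Acc_masked(c).

    evaluate_accuracy(masked_channel=None) is a placeholder callback that runs the
    model on the test set with the given channel zeroed out and returns top-1 accuracy.
    """
    acc_full = evaluate_accuracy(masked_channel=None)
    masked = [evaluate_accuracy(masked_channel=c) for c in channels]
    return acc_full - sum(masked) / len(masked)
```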
To ensure the reproducibility and statistical validity of our results, all reported metrics—including top-1 Accuracy, the SRS, ECE, and the Brier score—are averaged over five independent training runs with different random seeds. Each run uses identical training protocols and dataset splits. For each metric, we report the mean ± standard deviation, reflecting the inherent variability due to stochastic training factors such as initialization and dropout.
All experiments were conducted using PyTorch 2.1 on an NVIDIA RTX 2080 GPU. The network was trained from scratch using the Adam optimizer with an initial learning rate of 0.001, decayed by a cosine annealing scheduler. The batch size was 64, and training was run for 150 epochs. We employed standard data augmentation techniques (random crop, horizontal flip, brightness/contrast jitter) applied uniformly across all modalities (RGB, NIR, event). We used an 80/10/10 split for training, validation, and testing, ensuring no identity overlap between sets. To validate repeatability, we performed 5-fold cross-validation on the N-SpectralFace dataset and report average metrics across runs (±standard deviation). PMB-VM was configured with 5 stochastic branches, each using a dropout rate p = 0.3 and variational priors initialized as isotropic Gaussians.
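As a rough reproduction of this training configuration in PyTorch, the optimizer and scheduler setup might look like the sketch below; the model and dataset objects are placeholders, and only the hyperparameters stated above (Adam, learning rate 0.001, cosine annealing, 150 epochs, batch size 64) are taken from the text. The scheduler would be stepped once per epoch.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

def configure_training(model, train_dataset, epochs=150, batch_size=64, lr=1e-3):
    """Optimizer/scheduler setup mirroring the reported training protocol."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                        num_workers=4, drop_last=True)
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # cosine decay over training
    return loader, optimizer, scheduler
```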

5. Results and Discussion

The evaluation results validate HUE-Net's effectiveness as a reliable and efficient face recognition model in unconstrained, multi-modal environments. Across all datasets and scenarios, HUE-Net surpassed contemporaneous baseline models in recognition accuracy, calibration, and spectral resilience, while maintaining computational efficiency fit for edge deployment. In our full implementation, we use M = 5 lightweight stochastic branches within PMB-VM. At inference time, we perform T = 5 Monte Carlo forward passes and average the resulting embeddings to compute the final prediction and uncertainty estimate (Figure 4).
On the IJB-C benchmark [17], HUE-Net achieved a top-1 identification accuracy of 95.6%, surpassing ArcFace and ElasticFace, which are regarded as the state of the art in controlled RGB-only environments and which scored 93.7% and 94.2%, respectively (Table 3). The multispectral fusion and attention mechanisms effectively compensate for extreme pose and illumination variation, suggesting strong handling of the ambiguity in facial geometry caused by visual degradation. HUE-Net again exhibited exceptional cross-modal generalization on the N-SpectralFace dataset [14]. Heavily occluded settings where other models fell short still yielded strong results for HUE-Net. Relying solely on subsets of the input, such as with the RGB channel removed, HUE-Net achieved a recognition accuracy of 92.7%, whereas ArcFace and MS-EVS RetinaFace reached 85.6% and 87.3% under analogous occlusion conditions. The Spectral Robustness Score reflects this behavior: under partial modality loss, HUE-Net's average performance dropped by only 3.2%, whereas counterparts suffered average losses exceeding 7%, a degradation typical of RGB-trained architectures. This reflects HUE-Net's ability to integrate multi-channel information without excessive dependence on any single channel for forming context (Figure 5).
Beyond accuracy, HUE-Net's uncertainty estimates further substantiate its well-calibrated predictions. It achieved the lowest Expected Calibration Error (ECE), at 2.8%, and the lowest Brier score (0.032), surpassing all other models. Such calibration matters most in scenarios where predictions must be accompanied by trustworthy confidence estimates.
Uncertainty visualizations confirmed that HUE-Net assigns high variance to ambiguous or partially occluded inputs and low predictive entropy to clean, frontal views, indicating that its uncertainty outputs are reasonable and trustworthy (Figure 6).
Where efficiency is concerned, HUE-Net yielded competitive inference latency and model size. With only 6.4 million parameters, it required approximately 180 MFLOPs per inference, grouping it with lightweight architectures like EdgeFace [34]. On edge devices such as the NVIDIA Jetson Nano, processing a single image took only 198 ms, affirming the model's suitability for low-power tasks. These performance metrics, together with its accuracy advantages, set HUE-Net apart from bulkier ensemble models that tend to add high latency or memory burden (Table 4).
An ablation analysis verified the contribution of HUE-Net's three core components. Removing the Adaptive Spectral Attention module reduced recognition accuracy on N-SpectralFace [14] by 2.1%, while the SRS increased by almost 4%, supporting the assumption that spectral modulation significantly aids scene-specific reliability tailoring (Figure 7).
The exclusion of the PMB-VM variational branch caused the ECE to worsen from 2.8% to 6.5%, illustrating the importance of predictive uncertainty modeling for decision-making and reliability. Disabling the spectral consistency loss component also caused a drop in performance during modality dropout scenarios, highlighting the regularization provided by diverse spectral inputs. These conclusions were further validated by qualitative visualizations. Attention heatmaps revealed that, in occluded or low-light scenes, the model actively shifted attention away from unreliable RGB toward NIR edges and contour-driven features of the dynamic event maps. In unobstructed frontal views, attention shifted back toward RGB texture and skin. More importantly, the uncertainty visualizations captured semantic difficulty well: samples that were distorted, blurred, or ambiguous registered significantly higher variance in their output embeddings (Table 5).
To determine whether simulated event channels contributed meaningful features or merely added redundancy, we trained a version of HUE-Net without event data. The resulting accuracy dropped by 1.9 pp, and the SRS nearly doubled, confirming that temporal contrast encoding via event simulation enhances both robustness and generalization. Importantly, we controlled for parameter count by padding the missing channels, ensuring that this gain was not due to capacity scaling (Figure 8).
Combining these results confirms our primary working hypothesis: the architecture incorporating adaptive attention, variational inference, and multispectral inputs facilitates exceptional performance in real-world face recognition tasks. HUE-Net, in contrast to earlier models, which either depend on static fusion or disregard uncertainty altogether, adapts its representational strategy as a function of the input condition, providing high-confidence predictions that are interpretable and robust across different modalities.
To evaluate how the number of branches (M) impacts performance, we tested HUE-Net with M ∈ {1, 2, 4, 8} in addition to the default M = 5. As expected, both accuracy and calibration improve with larger M, though gains saturate beyond M = 5. Notably, M = 2 already yields a strong ECE improvement over deterministic inference (6.8% to 4.2%) with a minimal latency penalty. This suggests that PMB-VM's benefits can be tuned gracefully to deployment constraints, allowing efficient trade-offs between predictive confidence and real-time responsiveness (Table 6).
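To make the role of M concrete, the sketch below shows one way M stochastic branches can be aggregated into a mean identity embedding and a per-sample uncertainty. The reparameterized-Gaussian form is an assumption consistent with the variational design, not the exact PMB-VM implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchVariationalHeadSketch(nn.Module):
    # Sketch: M perturbed branches each emit (mu, logvar); embeddings are sampled with the
    # reparameterization trick and aggregated into a mean identity vector plus a scalar uncertainty.
    def __init__(self, in_dim: int, embed_dim: int = 512, num_branches: int = 5):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Linear(in_dim, 2 * embed_dim) for _ in range(num_branches)]
        )

    def forward(self, features: torch.Tensor):
        samples = []
        for branch in self.branches:
            mu, logvar = branch(features).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
            samples.append(F.normalize(z, dim=-1))
        stacked = torch.stack(samples, dim=0)            # (M, N, D)
        embedding = F.normalize(stacked.mean(dim=0), dim=-1)
        uncertainty = stacked.std(dim=0).mean(dim=-1)    # mean per-dimension std per sample
        return embedding, uncertainty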
We also tested different reduction ratios r in the ASA block to assess their impact on model capacity and performance. While r = 2 yields the best accuracy, r = 4 maintains 94.6% accuracy with 6% fewer FLOPs and lower latency, making it a practical choice for edge deployment (Table 7).
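For readers interested in how r enters the computation, the block below sketches a squeeze-and-excitation-style spectral gate in which r sets the bottleneck width of the attention MLP; this is an illustrative reading of the ASA design, not its exact implementation.

import torch
import torch.nn as nn

class AdaptiveSpectralAttentionSketch(nn.Module):
    # Sketch of an ASA-style gate: global pooling of the fused feature map, a bottleneck MLP
    # with reduction ratio r, and softmax-normalized per-modality weights used for reweighting.
    def __init__(self, channels: int, num_modalities: int = 3, r: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, num_modalities),
        )

    def forward(self, fused: torch.Tensor, modality_feats):
        # fused: (N, C, H, W) fused feature map; modality_feats: list of per-modality maps, same shape.
        weights = torch.softmax(self.mlp(self.pool(fused).flatten(1)), dim=-1)  # (N, M)
        out = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), modality_feats))
        return out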
We applied temperature scaling (TS) to the logits of the ElasticFace and ArcFace models, using a held-out validation split to optimize the temperature (T). Then we recomputed the calibrated ECE on the IJB-C [17] and N-SpectralFace [14] test sets.
This experiment tests whether HUE-Net's superior ECE arises purely from post hoc calibration. As shown in Table 8, although TS improved the calibration of the baselines, HUE-Net with PMB-VM remained the best calibrated even without tuning, confirming that learned uncertainty outperforms static scaling.
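The baseline calibration follows standard temperature scaling; a minimal sketch of fitting T on held-out validation logits is given below (the optimizer and its settings are assumptions, not the exact configuration used here).

import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor, steps: int = 200) -> float:
    # Optimize a single temperature T > 0 by minimizing NLL on held-out logits;
    # at test time, divide logits by T before the softmax.
    log_t = torch.zeros(1, requires_grad=True)                 # T = exp(log_t) keeps T positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.05, max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()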

6. Conclusions

We proposed HUE-Net, a face recognition architecture designed to operate effectively in multi-modal, uncertain, real-world environments. HUE-Net addresses occlusion, illumination change, and spectral degradation by exploiting each modality's strengths, integrating RGB, NIR, and event-based inputs through early spectral fusion and adaptive attention. The perturbed multi-branch variational module (PMB-VM) allows the model to generate precise identity embeddings while estimating prediction uncertainty in a computationally efficient manner, which is critical for safety-sensitive applications. HUE-Net also has limitations, notably its dependence on accurate modality alignment and on simulated event data. We plan to extend the model toward real-time sensor calibration and late-fusion variants and to incorporate real-world event data at scale. In addition, we aim to address temporal modeling for video-based domains and continual adaptation under domain shift in volatile environments.
Because uncertainty estimation offers only a form of algorithmic introspection, we recognize the need for operational safeguards in deployment. Future extensions of HUE-Net will integrate human-in-the-loop (HITL) mechanisms that enable manual review of low-confidence predictions, as quantified by our variational uncertainty estimates. We also envision threshold-based fail-safe triggers that defer decision-making or flag cases for offline forensic validation, especially in sensitive applications such as humanitarian identification and post-disaster triage.
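As an operational illustration, a deferral rule of the kind described above could be expressed as follows; the 0.3 uncertainty threshold follows the empirical value in Figure 6, while the acceptance threshold and routine name are hypothetical.

def route_prediction(match_score: float, uncertainty_std: float,
                     accept_thr: float = 0.5, uncertainty_thr: float = 0.3) -> str:
    # Defer low-confidence cases to human review instead of returning an identity decision.
    if uncertainty_std > uncertainty_thr:
        return "defer_to_human_review"   # flagged for manual or offline forensic validation
    return "accept" if match_score >= accept_thr else "reject"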
HUE-Net is an important milestone toward creating sophisticated systems capable of face recognition in complex visuals and interpretable and robust biometrics, enabling socially responsible research mindful of uncertainty-sensitive identification.

Author Contributions

Methodology, A.A., S.U., E.B., D.Z., S.K., Z.T., W.J., H.C. and T.K.W.; software, A.A., S.U., E.B., D.Z. and H.C.; validation, A.A., S.U., E.B., D.Z. and S.K.; formal analysis, W.J., H.C. and T.K.W.; resources, D.Z., S.K. and Z.T.; data curation, S.K., Z.T., W.J. and H.C.; writing—original draft, A.A., S.U. and T.K.W.; writing—review and editing, S.K., Z.T., W.J., H.C. and T.K.W.; supervision, S.U., H.C. and T.K.W.; project administration, S.U. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Gyeonggido Regional Research Center (GRRC) Program of Gyeonggi Province (Development of AI-Based Medical Service Technology) under Grant GRRC-Gachon2023 (B02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Laishram, L.; Shaheryar, M.; Lee, J.T.; Jung, S.K. Toward a privacy-preserving face recognition system: A survey of leakages and solutions. ACM Comput. Surv. 2025, 57, 1–38. [Google Scholar] [CrossRef]
  2. Amirgaliyev, B.; Mussabek, M.; Rakhimzhanova, T.; Zhumadillayeva, A. A Review of Machine Learning and Deep Learning Methods for Person Detection, Tracking and Identification, and Face Recognition with Applications. Sensors 2025, 25, 1410. [Google Scholar] [CrossRef] [PubMed]
  3. Ghani, M.A.N.U.; She, K.; Rauf, M.A.; Khan, S.; Alajmi, M.; Ghadi, Y.Y.; Alkahtani, H.K. Toward robust and privacy-enhanced facial recognition: A decentralized blockchain-based approach with GANs and deep learning. Math. Biosci. Eng. 2024, 21, 4165–4186. [Google Scholar] [CrossRef] [PubMed]
  4. Abdusalomov, A.; Mirzakhalilov, S.; Umirzakova, S.; Kalandarov, I.; Mirzaaxmedov, D.; Meliboev, A.; Im Cho, Y. Optimized Lightweight Architecture for Coronary Artery Disease Classification in Medical Imaging. Diagnostics 2025, 15, 446. [Google Scholar] [CrossRef]
  5. Zhang, Q.; Guo, Q.; Gao, R.; Juefei-Xu, F.; Yu, H.; Feng, W. Adversarial relighting against face recognition. IEEE Trans. Inf. Forensics Secur. 2024, 19, 9145–9157. [Google Scholar] [CrossRef]
  6. Hiremath, J.S.; Patil, S.B. Optimizing Deep Learning for Accurate Age and Gender Classification in Real-World Applications. Int. J. Intell. Eng. Syst. 2025, 18, 3. [Google Scholar] [CrossRef]
  7. Pham, T.D.; Holmes, S.B.; Coulthard, P. A review on artificial intelligence for the diagnosis of fractures in facial trauma imaging. Front. Artif. Intell. 2024, 6, 1278529. [Google Scholar] [CrossRef]
  8. Abdusalomov, A.; Mirzakhalilov, S.; Dilnoza, Z.; Zohirov, K.; Nasimov, R.; Umirzakova, S.; Cho, Y.I. Lightweight Super-Resolution Techniques in Medical Imaging: Bridging Quality and Computational Efficiency. Bioengineering 2024, 11, 1179. [Google Scholar] [CrossRef]
  9. Thakkar, C.; Shah, A.; Himabindu, K. A Review on the Identification of Face Mask Object from Multiple Environments Using Deep Learning Techniques. In Proceedings of the 2024 4th International Conference on Sustainable Expert Systems (ICSES), Kaski, Nepal, 15–17 October 2024; pp. 1332–1339. [Google Scholar]
  10. Gururaj, H.L.; Soundarya, B.C.; Priya, S.; Shreyas, J.; Flammini, F. A Comprehensive Review of Face Recognition Techniques, Trends and Challenges. IEEE Access 2024, 12, 107903–107926. [Google Scholar] [CrossRef]
  11. Serengil, S.; Özpınar, A. A benchmark of facial recognition pipelines and co-usability performances of modules. Bilişim Teknol. Derg. 2024, 17, 95–107. [Google Scholar] [CrossRef]
  12. DeAndres-Tame, I.; Tolosana, R.; Melzi, P.; Vera-Rodriguez, R.; Kim, M.; Rathgeb, C.; Liu, X.; Morales, A.; Fierrez, J.; Ortega-Garcia, J.; et al. FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3173–3183. [Google Scholar]
  13. Shahreza, H.O.; Marcel, S. HyperFace: Generating synthetic face recognition datasets by exploring face embedding hypersphere. arXiv 2024, arXiv:2411.08470. [Google Scholar]
  14. Himmi, S.; Parret, V.; Chhatkuli, A.; Van Gool, L. MS-EVS: Multispectral Event-Based Vision for Deep Learning Based Face Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 616–625. [Google Scholar]
  15. Liu, S.; Zhao, D.; Sun, Z.; Chen, Y. BPMB: BayesCNNs with perturbed multi-branch structure for robust facial expression recognition. Image Vis. Comput. 2024, 143, 104960. [Google Scholar] [CrossRef]
  16. Xin, Y.; Zhou, Y.; Jiang, J. RobustFace: Adaptive Mining of Noise and Hard Samples for Robust Face Recognitions. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 5065–5073. [Google Scholar]
  17. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. IARPA Janus Benchmark-C: Face Dataset and Protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, Australia, 20–23 February 2018; pp. 158–165. [Google Scholar]
  18. Malayinmel Purushothaman, M.; Adapa, S.R.; Bellamkonda, S. Presentation Attack Detection for Multispectral Face Biometric System Using Federated Learning. In International Conference on Innovations and Advances in Cognitive Systems; Springer Nature: Cham, Switzerland, 2024; pp. 285–303. [Google Scholar]
  19. Adra, M.; Melcarne, S.; Mirabet-Herranz, N.; Dugelay, J.L. Event-based solutions for human-centered applications: A comprehensive review. arXiv 2025, arXiv:2502.18490. [Google Scholar]
  20. Zhang, Y.; Wu, Y.; Yin, Z.; Shao, J.; Liu, Z. Robust face anti-spoofing with dual probabilistic modeling. Pattern Recognit. 2025, 167, 111700. [Google Scholar] [CrossRef]
  21. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  22. Boutros, F.; Damer, N.; Kirchbuchner, F.; Kuijper, A. ElasticFace: Elastic Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1578–1587. [Google Scholar]
  23. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. CosFace: Large Margin Cosine Loss for Deep Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5265–5274. [Google Scholar]
  24. Wu, X.; He, R.; Sun, Z.; Tan, T. A light CNN for deep face representation with noisy labels. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2884–2896. [Google Scholar] [CrossRef]
  25. Zhou, Y.; Wang, B.; He, X.; Cui, S.; Shao, L. DR-GAN: Conditional generative adversarial network for fine-grained lesion synthesis on diabetic retinopathy images. IEEE J. Biomed. Health Inform. 2020, 26, 56–66. [Google Scholar] [CrossRef]
  26. Huang, R.; Zhang, S.; Li, T.; He, R. Beyond Face Rotation: Global and Local Perception Gan for Photorealistic and Identity Preserving Frontal View Synthesis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2439–2448. [Google Scholar]
  27. Li, W.; Cen, X.; Pang, L.; Cao, Z. HyperFace: A Deep Fusion Model for Hyperspectral Face Recognition. Sensors 2024, 24, 2785. [Google Scholar] [CrossRef]
  28. Ai, D.; Jia, K.; Wang, Y.; Liu, Y. NIR-VIS Image Translation for the Cross-Spectral and Cross-Distance Face Recognition. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  29. Ryan, C.; Elrasad, A.; Shariff, W.; Lemley, J.; Kielty, P.; Hurney, P.; Corcoran, P. Real-time multi-task facial analytics with event cameras. IEEE Access 2023, 11, 76964–76976. [Google Scholar] [CrossRef]
  30. Sivamurugan, S.; Akila, A.; Saravanan, N.S.; Niji, P.S.; Ramya, R. The Smart Multispectral Image Processing for System Based Vision Applications. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–7. [Google Scholar]
  31. Li, X.; Lai, S.; Qian, X. DBCFace: Towards pure convolutional neural network face detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 324, 1792–1804. [Google Scholar] [CrossRef]
  32. Zhao, D.; Liu, S.; Chen, Y.; Ji, W.; Ni, S. Uncertainty Learning Facial Expression Recognition Based on Monte-Carlo Dropout. In Proceedings of the 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT), Jiaxing, China, 10–12 November 2023; pp. 1529–1535. [Google Scholar]
  33. Dwivedi, R.; Kothari, P.; Chopra, D.; Singh, M.; Kumar, R. An efficient ensemble explainable AI (XAI) approach for morphed face detection. Pattern Recognit. Lett. 2024, 184, 197–204. [Google Scholar] [CrossRef]
  34. George, A.; Ecabert, C.; Shahreza, H.O.; Kotwal, K.; Marcel, S. Edgeface: Efficient face recognition model for edge devices. IEEE Trans. Biom. Behav. Identity Sci. 2024, 6, 158–168. [Google Scholar] [CrossRef]
Figure 1. HUE-Net architecture overview. Aligned multispectral inputs (RGB, NIR, and event-based data) pass through an early-fusion module and a convolutional–transformer backbone with Adaptive Spectral Attention. Feature maps are processed hierarchically and fed into the Perturbed Multi-Branch Variational Module (PMB-VM), which produces identity embeddings and uncertainty estimates that support trustworthy face recognition in challenging conditions. The final detected face is indicated by a red bounding box in the image.
Figure 2. Visualization of the alignment and fusion process of RGB, NIR, and simulated event channels into a 6-channel tensor.
Figure 3. Breakdown of the composite loss function, showing how each component of the total loss evolves over training epochs.
Figure 4. Difficult scenarios of face recognition from the IJB-C dataset [17]. Example images show high degrees of variability within the dataset, such as occlusions, extreme poses, lighting artifacts, low resolution, and motion blur. These scenarios demonstrate how face recognition performance suffers under varying conditions, and they illustrate the importance of multi-modal and uncertainty-aware methods like HUE-Net.
Figure 5. Exemplary cross-spectral face recognition results from HUE-Net. Each row shows the RGB, NIR, and simulated event frames together with their fused output. Successful face recognition and detection across all modalities are highlighted by red bounding boxes. The figure showcases HUE-Net's strengths in low-light, occluded, and motion-blur scenarios, where spectral complementarity compensates for harsh illumination extremes: the event-based modality contributes edge and motion information, while NIR provides reliable capture in dark or heavily backlit scenes.
Figure 6. Standard deviation of identity embeddings across PMB-VM branches plotted against angular distance under different visual conditions. Clean inputs show low variance (high confidence), whereas occluded or blurred conditions produce greater uncertainty. The red dashed line marks an empirical uncertainty threshold (0.3) above which predictions can be deferred or flagged as unreliable.
Figure 7. Spectral attention weights across various conditions: visual and environmental. Rows represent different input conditions (clean, low light, occlusion, motion blur), while columns show the attention weight assigned to each modality by the Adaptive Spectral Attention (ASA) module. For example, during low-light conditions, the network relies more on NIR and event-based cues, whereas in clean inputs, RGB is prioritized. This type of adaptivity improves robustness and interpretability.
Figure 8. Within-class and between-class embedding distance variation with occlusion level. The figure compares intra-class and inter-class embedding distances under increasing occlusion. Heavier occlusion increases intra-class distance and decreases inter-class distance, indicating reduced separability; nevertheless, HUE-Net maintains a sufficient separation margin, demonstrating robustness in the embedding space.
Table 1. Comparison of recent models in face recognition and related domains.
Model | Modalities | Uncertainty Modeled | Fusion Strategy | Backbone Type | Edge-Deployable
MS-EVS (2024) | RGB + Event | – | Late Fusion | CNN | –
HyperFace (2024) | Hyperspectral | – | Deep Fusion | CNN (ResNet) | –
BPMB (2024) | RGB | ✓ (Bayesian CNN) | Single Modality | Multi-Branch CNN | –
RobustFace++ | RGB | ✓ (MC-Dropout) | Single Modality | ResNet + Ensemble | –
FRCSyn (2024) | RGB + Synthetic | – | Frame-Level Fusion | ViT-B | –
EdgeFace (2024) | RGB | – | Single Modality | MobileNetV3 | –
HUE-Net (Ours) | RGB + NIR + Event | ✓ (Variational Branch) | Early Fusion + ASA | CNN–Transformer Hybrid | –
Table 2. Summary of datasets used in experiments.
Dataset | Subjects | Images | Modalities | Resolution | Conditions Present | Used For
IJB-C [17] | 3531 | 31,334 | RGB | Varying | Occlusion, Pose, Blur, Low Light | Validation/Testing
N-SpectralFace [14] | 1200 | 28,000 | RGB, NIR, Event (sim) | 224 × 224 | Motion Blur, Low Light, Alignment Shift | Training/Testing
Table 3. Face recognition performance across datasets. FLOPs and inference times reflect the computational cost of the forward pass only, excluding data preprocessing. Latency measured with batch size = 1 on NVIDIA RTX 2080 GPU.
Model | IJB-C Accuracy (%) | N-SpectralFace Accuracy (%) | Parameters (M) | FLOPs (M)
ArcFace (ResNet-100) [21] | 93.7 | x | 65.2 | 4480
ElasticFace | 94.2 | x | 60.0 | 4200
EdgeFace [34] | 92.3 | x | 1.8 | 150
BPMB [15] | 91.9 | x | 19.7 | 2100
MS-EVS RetinaFace | x | 87.3 | 18.2 | 2000
HUE-Net (Ours) | 95.6 | 94.7 | 6.4 | 180
Table 4. Robustness and calibration metrics.
Model | SRS ↓ (%) | ECE ↓ (%) | Brier Score ↓ | Inference Time (GPU, ms) | Inference Time (Jetson Nano, ms)
ArcFace [21] | 7.8 | 8.2 | 0.049 | 98 | >1000 (not suitable)
ElasticFace | 6.9 | 7.4 | 0.045 | 93 | >800
EdgeFace [34] | 5.6 | 5.9 | 0.042 | 32 | 187
BPMB [15] | 4.9 | 5.1 | 0.037 | 105 | 290
MS-EVS RetinaFace | 5.8 | 6.5 | 0.040 | 64 | 245
HUE-Net (Ours) | 3.2 | 2.8 | 0.032 | 41 | 198
Table 5. Ablation study on HUE-Net.
Model Variant | N-SpectralFace Accuracy (%) | SRS ↓ (%) | ECE ↓ (%) | Comment
HUE-Net (Full) | 94.7 | 3.2 | 2.8 | Full model with ASA, PMB-VM, and consistency
w/o Adaptive Spectral Attention (ASA) | 92.6 | 7.1 | 5.1 | Reduced robustness under spectral dropout
w/o PMB-VM | 93.2 | 4.2 | 6.5 | Poor uncertainty calibration
w/o Spectral Consistency Loss | 91.7 | 6.8 | 4.9 | Higher variance in feature embedding alignment
w/o Event Channel (RGB + NIR) | 92.8 | 6.3 | 4.1 | Verifies that event-derived maps improve robustness
Table 6. Accuracy and calibration vs. number of PMB-VM branches (M).
PMB-VM Branches (M) | Top-1 Accuracy (%) | ECE (%) | Inference Latency (Jetson Nano, ms)
M = 1 (deterministic) | 93.3 ± 0.3 | 6.8 ± 0.3 | 153
M = 2 | 93.9 ± 0.3 | 4.2 ± 0.3 | 167
M = 4 | 94.5 ± 0.2 | 3.1 ± 0.2 | 185
M = 5 (default) | 94.7 ± 0.3 | 2.8 ± 0.2 | 198
M = 8 | 94.8 ± 0.2 | 2.6 ± 0.2 | 236
Table 7. ASA reduction ratio study.
Reduction Ratio (r) | Accuracy (%) | FLOPs (M) | Latency (ms) | ΔAccuracy (vs. r = 2)
r = 2 (default) | 94.7 ± 0.3 | 180 | 198 | –
r = 4 | 94.6 ± 0.3 | 169 | 186 | –0.1 pp
r = 8 | 94.3 ± 0.3 | 163 | 179 | –0.4 pp
r = 16 | 93.8 ± 0.4 | 159 | 174 | –0.9 pp
Table 8. Temperature-scaled calibration results.
Model | Pre-TS ECE (%) | Post-TS ECE (%) | Accuracy (%)
ArcFace-R100 | 6.8 | 4.4 | 93.7
ElasticFace-R100 | 5.9 | 3.8 | 93.9
HUE-Net (w/o PMB-VM) | 6.5 | 3.7 | 93.3
HUE-Net (w/ PMB-VM) | 2.8 | 2.8 | 94.7