Age-Invariant Face Retrieval Based on Hybrid Metric Learning Framework (HMLF)

Cao, Jingtian; Zhang, Tingshuo; Wang, Ziyi; Lian, Bobo

doi:10.3390/electronics15091851

Open Access Article


School of Mathematical Sciences, Soochow University, Suzhou 215006, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1851; https://doi.org/10.3390/electronics15091851

Submission received: 25 February 2026 / Revised: 4 April 2026 / Accepted: 22 April 2026 / Published: 27 April 2026


Abstract

Cross-age face analysis has emerged as an important topic in biometric recognition due to substantial facial appearance variations caused by aging. Nevertheless, most existing approaches primarily focus on face verification (1:1 matching) and frequently rely on explicit age annotations, which limit their applicability in large-scale retrieval scenarios. In this study, large-scale cross-age face retrieval (1:N matching) is investigated, and a Hybrid Metric Learning Framework (HMLF) is proposed to learn age-invariant and retrieval-oriented facial representations without requiring age labels. The proposed framework integrates Additive Angular Margin Loss (ArcFace) with supervised contrastive learning to enhance feature discriminability. Furthermore, a mixed triplet mining strategy is introduced to improve the effectiveness of hard sample selection. A memory bank-based InfoNCE formulation is incorporated to provide a large number of negative samples, and an uncertainty-based adaptive weighting scheme is designed to automatically balance multiple loss components during optimization. To better simulate realistic retrieval scenarios, an extended cross-age retrieval evaluation protocol is established. Extensive experimental results demonstrate that the proposed framework achieves superior retrieval performance across different backbone architectures. The results further provide systematic insights into the influence of backbone design, loss formulation, and optimization strategies on cross-age retrieval accuracy.

Keywords:

age-invariant face recognition; face retrieval; hybrid metric learning framework (HMLF); deep learning

1. Introduction

Over the past few decades, identifying a suspect in a crime has proven to be a difficult task. With closed-circuit television (CCTV) widely deployed to monitor and deter illegal activity, the face has become the most commonly used biometric for identifying a person [1]. However, this biometric has its own difficulties: in particular, the large facial variations caused by aging pose challenges for age-invariant face recognition (AIFR) research [2].

In practical deployments, the process of AIFR typically involves three stages, as illustrated in Figure 1. First, millions of faces are captured from the CCTV system, composing a huge photo dataset. The second stage is face retrieval, where the suspect face, i.e., the probe image, is searched against the gallery set (existing archives) to produce a ranked list of candidate matches. The third stage is face verification, which decides whether the query and the retrieved images belong to the same individual. Face retrieval, which can also be called face identification, computes 1:N similarity to determine the specific identity of a probe face, whereas face verification computes 1:1 similarity between a gallery image and a probe image to determine whether the two images are of the same subject [3]. In criminal investigation, probe identities must appear in the gallery set, so this is closed-set identification [4].

Among these stages, most existing academic studies focus on the third stage, i.e., face verification. The existing methods can be classified into two categories: discriminative methods and generative methods. Generative approaches, such as [5,6,7,8,9], build on generative adversarial networks (GANs) [10] and model the aging process to synthesize target facial images at different ages, so that faces captured at different ages can be compared at the same age level. Discriminative approaches, in contrast, aim to directly learn age-invariant feature representations that remain consistent across the lifespan [11,12,13,14,15]. These frameworks often incorporate dedicated modules for disentangling age-related and identity-related features, enabling the model to extract intrinsic facial characteristics that persist despite temporal changes [16,17]. While both discriminative and generative approaches can be adapted to address cross-age face retrieval, they are not inherently designed for this specific task.

On the other hand, research in image retrieval (IR) has largely focused on general face retrieval, overlooking the specific complexities introduced by age progression. Existing face retrieval methods typically assume relatively consistent facial appearance over time, employing similarity metrics [18,19] and indexing structures optimized for contemporaneous comparisons [20,21]. These approaches fail to account for the non-linear morphological changes characteristic of facial aging—such as craniofacial development in youth and tissue redistribution in adulthood—which fundamentally alter feature representations across decades.

In this paper, we design a framework that explicitly targets both the cross-age challenge and the retrieval task. We first propose a Hybrid Metric Learning Framework (HMLF) that combines Additive Angular Margin Loss (ArcFace) [22] with supervised contrastive losses (Triplet Loss [23], InfoNCE Loss [24]) to learn retrieval-oriented face representations that are robust to severe appearance changes. The ArcFace Loss enforces globally separable identity decision boundaries in the angular space, providing strong inter-class discrimination, while the contrastive objectives explicitly optimize both intra-class and inter-class distances, yielding features that are invariant to age yet specific to identity. These two types of supervision are highly complementary: ArcFace stabilizes optimization and calibrates class margins, whereas contrastive learning refines the local structure of the embedding space, which is crucial for ranking-based retrieval. More specifically, we apply a mixed triplet mining scheme for the Triplet Loss and a memory bank-based sampling strategy for the InfoNCE Loss. Both can significantly enhance retrieval performance while maintaining high computational efficiency and moderate GPU memory usage. Furthermore, instead of using manually tuned fixed weights, we adopt an uncertainty-based adaptive weighting scheme to automatically balance the contributions of the ArcFace and contrastive losses during training. For evaluation, we conduct extensive experiments on five public cross-age datasets with several representative backbones (IResnet [22], FaceNet [25], MobileFaceNet [26], Swin Transformer (Swin-T) [27]) under the cross-age face retrieval protocol, using mAP and Rank-k as evaluation metrics, and demonstrate that our method achieves competitive or improved retrieval performance while providing comprehensive insights into the impact of different backbone–loss combinations. The main contributions of our work are outlined as follows:

(1) We propose a Hybrid Metric Learning Framework (HMLF) that combines ArcFace with supervised contrastive learning for face retrieval under large intra-class appearance variations. Without explicitly using age annotations, the proposed loss effectively enhances identity discriminability under age progression, pose, and illumination changes.
(2) Unlike most existing cross-age studies that focus on verification (1:1), we treat the problem as large-scale cross-age face retrieval (1:N) and utilize a unified evaluation protocol with gallery/query splits and mAP and Rank-k metrics, which better reflects practical search scenarios and provides reproducible benchmarks for future research.
(3) Extensive experiments are conducted on five public cross-age datasets (CACD [28], MORPH Album 2 [29], FG-NET [30], AgeDB [31], IMDB-clean [32]) with several representative backbones under the HMLF. Our method demonstrates consistent improvements across multiple datasets and provides insights into how different backbone–loss combinations affect cross-age retrieval precision.

2. Related Work

2.1. The Evolution of Backbone Architectures for Age-Invariant Face Recognition (AIFR)

Early attempts in AIFR often adopted generic deep convolutional neural networks (CNNs) [33] and their variants. For instance, the CNN in [25] was equipped with an Inception-based backbone, resulting in FaceNet, so that it could capture both local texture details and global structural patterns. In [34], LF-CNN was developed to tackle the AIFR task by employing latent factor-guided deep convolutional neural networks and HFA algorithms. Deng et al. [22] upgraded the conventional ResNet by introducing the improved residual (IR) block, which was more suitable for face training. After that, researchers from InsightFace added the Squeeze-and-Excitation (SE) [35] attention mechanism to IR blocks, which effectively combined the robust representational power of deep residual learning with channel-wise attention. In recent years, CNN backbones have been integrated into multi-task frameworks. A typical example is [36], in which the CNN was used for joint classification of age, gender, and facial shape from still facial images. In our research, we selected some of these representative CNN-based backbones, including IResnet, FaceNet and MobileFaceNet, to verify the effectiveness of our HMLF.

Inspired by the remarkable success of the Transformer architecture [37] in natural language processing (NLP), researchers have increasingly adapted this architecture for AIFR tasks. A significant milestone was the demonstration by Vision Transformers (ViTs) [38] that a pure Transformer architecture could outperform established convolutional networks on ImageNet when trained on sufficiently large datasets [39]. This breakthrough spurred the development of numerous Transformer variants. The authors of [40] introduced T2T-ViT, enhancing ViT’s tokenization process by recursively re-grouping neighboring tokens to recover local structures, thereby mitigating some limitations of the original design. The inventors of CvT [41] introduced convolutional token embeddings, effectively bridging convolutional inductive biases with Transformer self-attention. For AIFR, CvT’s hybrid design excels at capturing local invariant facial landmarks while modeling long-range aging effects through attention mechanisms. Due to the large number of parameters and training instability of ViT backbones on small datasets [42], we adopted Swin-T, a relatively lightweight Transformer network, to test whether our innovation can also be applied to Transformer backbones.

2.2. Contrastive Learning

Contrastive learning (CL) has evolved to address face recognition problems, with paradigms shifting from instance discrimination to identity-aware objectives. Triplet Loss [25] provided an early contrastive formulation for metric learning, where hard negative mining proved crucial for aging challenges. To address hard negative mining, ref. [23] proposed the batch-hard sampling strategy.

Simultaneously, another contrastive learning method called InfoNCE Loss [24] was invented, reframing contrastive learning as identifying the positive among many negatives. Its systematic framework, MoCo, originated from the paper written by He et al. in 2020 [43]. This paper also introduced a novel perspective where contrastive learning is treated as training an encoder for a dynamic dictionary with keys (negative samples) and queries (anchor samples). Apart from that, SimCLR [44] underscored the importance of augmentation composition, discovering that cropping combined with color distortion is crucial for learning good representations.

Recent advancements in contrastive learning began with supervised contrastive learning (SupCon) [45], which leveraged identity labels to pull together all images of the same person, forming a natural basis for age-invariant learning. To address the imbalance between positive and negative samples, Li et al. [46] proposed a negative adaptive weighting method to mine difficult negative samples. Liu et al. presented Bayesian Contrastive Loss (BCL) [47], a unified Bayesian framework that simultaneously addresses false negative debiasing and hard negative mining in self-supervised contrastive learning. However, it remains highly challenging to generate these hard samples or even to guarantee their authenticity and validity [48]. Therefore, in this paper, we propose the HMLF, which aims to make the model more robust to noisy or ambiguous negative samples by emphasizing semi-hard triplets [25] and by using a weighting parameter to balance semi-hard samples against hard samples. In addition, this framework combines ArcFace Loss with CL, which takes advantage of both discriminative classification capability and robustness to continuous age variations.

3. Approach

3.1. Overview Framework

In AIFR, age annotations in benchmark datasets are often missing or inaccurate. For example, CACD, FG-NET, IMDB-WIKI and MS-Celeb-1M [49] rely on estimated ages from metadata rather than precise ground-truth annotations, leading to inherent noise. More critically, datasets like MegaFace [50] and VGGFace2 [51] do not have age labels. This poses significant challenges for training multi-task learning frameworks such as [2,16,36,52,53]. Therefore, we propose a more generalizable and applicable loss function design approach, termed the HMLF, in which age annotations are not needed.

Our proposed HMLF is an identity-supervised loss function design and sampling strategy, which can be applied to various backbone architectures and cross-age datasets. The overall pipeline is illustrated in Figure 2. Given an input face image, we first apply a standard face preprocessing pipeline (alignment, cropping and normalization), and then feed the result into a backbone network to extract a deep feature embedding. We consider several representative backbone architectures, including FaceNet, MobileFaceNet, Swin-T and IResnet, to cover both classical and modern architectures. The backbone outputs a fixed-dimensional feature vector, which is subsequently sent to the HMLF.

The HMLF component consists of two complementary supervision heads: the first head is ArcFace, and the second head implements metric learning objectives. If the loss function leverages Triplet Loss, it is called Triplet ArcFace Loss (TAL). If the loss function uses InfoNCE Loss, it is called InfoNCE ArcFace Loss (IAL). To achieve optimal balance, the hybrid loss (either TAL or IAL) is dynamically weighted using an uncertainty-based scheme and optimized end-to-end. At test time, we obtain ranked retrieval results evaluated by mAP and Rank-k protocols, which are tailored to face retrieval tasks.

3.2. TAL

3.2.1. ArcFace Loss

ArcFace (Additive Angular Margin Loss) improves face recognition by enhancing the discriminative power of deep features through angular margin penalties. As outlined in Figure 3, the process works as follows:

Given an input feature $x_i \in \mathbb{R}^{d}$ and the fully connected (FC) layer weight $W_j \in \mathbb{R}^{d \times 1}$ (treating the bias as zero for simplicity), we first obtain the normalized feature $\frac{x_i}{\|x_i\|}$ and the normalized weight $\frac{W_j^{T}}{\|W_j\|}$ for the $j$th class, $j \in \{1, 2, \dots, y_i, \dots, n\}$. The $j$th output logit is then computed as:

$$W_j^{T} \cdot x_i = \|W_j\| \, \|x_i\| \cos\theta_j, \tag{1}$$

which represents the prediction score (original logit) of feature $x_i$ for class $j$. For feature $x_i$ with ground-truth class $y_i$, we have $W_{y_i}^{T} \cdot x_i = \|W_{y_i}\| \, \|x_i\| \cos\theta_{y_i}$. By extracting the target weight $\frac{W_{y_i}^{T}}{\|W_{y_i}\|}$, we compute the target angle:

$$\theta_{y_i} = \arccos(\cos\theta_{y_i}) = \arccos\left(\frac{W_{y_i}^{T}}{\|W_{y_i}\|} \cdot \frac{x_i}{\|x_i\|}\right), \tag{2}$$

obtaining the angle between the normalized feature $\frac{x_i}{\|x_i\|}$ and the normalized target weight $\frac{W_{y_i}^{T}}{\|W_{y_i}\|}$. An additive angular margin $m$ is added to the target angle $\theta_{y_i}$, yielding $\theta_{y_i} + m$, which adjusts the decision boundary. Computing the cosine of the adjusted target angle gives the new target logit for the ground-truth class $y_i$ of feature $x_i$: $\cos(\theta_{y_i} + m)$. A predefined feature scale $s$ is used to rescale all logits (except that the target logit becomes $\cos(\theta_{y_i} + m)$ while the others remain $\cos\theta_j$). This is equivalent to using a one-hot label mask to distinguish the new logits: $s \cdot \cos\theta_j$, $j \in \{1, 2, \dots, y_i, \dots, n\}$. The resulting new logits are used to compute the standard Softmax Loss, which can be formulated as:

$$L_{arc} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cdot \cos(\theta_{y_i} + m)}}{e^{s \cdot \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cdot \cos\theta_j}}, \tag{3}$$

where $N$ is the batch size, $n$ is the number of classes, $s$ is the feature scale, and $m$ is the angular margin penalty.
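To make the steps behind (1)–(3) concrete, the following is a minimal sketch for a single feature, written in plain Python rather than a deep learning library. The function names and the defaults `s=64.0` and `m=0.5` are illustrative assumptions (they are commonly used ArcFace settings, not values reported in this section), and the sketch does not handle the case where $\theta_{y_i} + m$ exceeds $\pi$.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def arcface_loss(x, weights, target, s=64.0, m=0.5):
    """ArcFace loss of Eq. (3) for one feature x.

    weights: list of class weight vectors W_j; target: index of class y_i.
    s and m are assumed defaults, not values taken from the paper.
    """
    x_hat = l2_normalize(x)
    logits = []
    for j, w in enumerate(weights):
        cos_theta = sum(a * b for a, b in zip(l2_normalize(w), x_hat))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == target:
            # Eq. (2) followed by the additive angular margin.
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    # Numerically stable softmax cross-entropy on the rescaled logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With `m=0.0` the function reduces to a normalized softmax loss, so comparing the two settings on the same input shows how the margin raises the loss of an already well-classified sample.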

By optimizing (3), the model learns identity-discriminative embeddings with explicitly enlarged angular margins between different classes. However, in cross-age scenarios, images of the same person often form several age-dependent sub-clusters (e.g., childhood, youth, middle age, and old age) that are not well represented by a single prototype. Forcing all age groups to share one center tends to either pull child and elderly faces aggressively toward the adult-dominated center, over-compressing the intra-class structure, or to relax the effective margin in order to accommodate extreme ages, thereby weakening inter-class separability. This limitation motivates the integration of CL with ArcFace Loss to preserve fine-grained intra-class structures.

3.2.2. Triplet Loss

To address the problems above, we employ the Triplet Loss to further refine the relative distances between genuine and impostor samples. For each anchor image $x_a$, we select a positive image $x_p$ from the same identity and a negative image $x_n$ from a different identity. Let $z_a$, $z_p$, and $z_n$ denote their $\ell_2$-normalized embeddings, and $D(\cdot, \cdot)$ be a distance function (we use the cosine distance in our implementation). The Triplet Loss is defined as

$$L_{triplet} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0, \; m_{tri} + D\left(z_a^{(i)}, z_p^{(i)}\right) - D\left(z_a^{(i)}, z_n^{(i)}\right)\right), \tag{4}$$

where $m_{tri}$ is a margin hyper-parameter. This objective explicitly enforces that the distance between an anchor and its positive counterpart is smaller than the distance to any negative sample by at least $m_{tri}$. In the cross-age setting, positives are naturally drawn from images of the same person under different ages, poses and illumination conditions, so minimizing $L_{triplet}$ encourages the model to keep such age-variant faces close while pushing away visually similar but different identities. In practice, we adopt a mixed hard and semi-hard mining strategy to construct informative and robust triplets, as detailed in Section 3.2.3.

3.2.3. Mixed Online Sampling Strategy Centering Semi-Hard and Hard Samples

In the method of Schroff et al. [25], a batch of $B$ triplets contains $3B$ images, and correspondingly $3B$ feature vectors are taken into account when calculating the loss function. Because these $3B$ images are not organized by identity label, they could in principle be matched with each other in nearly $6B^{2} - 4B$ ways. In other words, using only $B$ triplets does not take full advantage of the images in the batch.

For this reason, we turn to the P-K online sampling strategy to calculate the Triplet Loss. The images of the dataset are first divided into groups according to their identity labels. In each batch, we select $P$ groups and then draw $K$ images from each group for loss computation. The feature vectors of all images are recomputed in every epoch, and the $PK$ vectors of a batch can be used to compute pairwise cosine distances efficiently via matrix operations.
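The P-K batch construction can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and skipping identities with fewer than $K$ images is an assumption of the sketch (sampling with replacement would be another option), not a detail given in the text.

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, P, K, rng=random):
    """Return indices of a batch with P identities and K images per identity.

    labels[i] is the identity of image i. Identities with fewer than K
    images are skipped here, which is an assumption of this sketch.
    """
    by_identity = defaultdict(list)
    for index, label in enumerate(labels):
        by_identity[label].append(index)
    eligible = [ident for ident, idxs in by_identity.items() if len(idxs) >= K]
    batch = []
    for ident in rng.sample(eligible, P):
        batch.extend(rng.sample(by_identity[ident], K))
    return batch
```

Passing a seeded `random.Random` instance as `rng` makes the batches reproducible, which is convenient when comparing mining strategies.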

After the $PK$ samples are determined, the next step is to compute the loss value. One plausible way is that, for each anchor image, only the nearest negative image and the furthest positive image participate in the loss formula. This is named Batch Hard [23]:

$$L_{BH}(\theta; X) = \overbrace{\sum_{i=1}^{P} \sum_{a=1}^{K}}^{\text{all anchors}} \Big[ m + \overbrace{\max_{p = 1 \dots K} D\big(f_{\theta}(x_a^i), f_{\theta}(x_p^i)\big)}^{\text{hardest positive}} - \underbrace{\min_{\substack{j = 1 \dots P \\ n = 1 \dots K \\ j \neq i}} D\big(f_{\theta}(x_a^i), f_{\theta}(x_n^j)\big)}_{\text{hardest negative}} \Big]_{+}, \tag{5}$$

where $x_j^i$ refers to the $j$th image of the $i$th person in the batch, and $f_{\theta}(\cdot)$ denotes the embedding function parameterized by $\theta$.
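A compact way to read (5) is as a loop over anchors on a precomputed distance matrix. The sketch below assumes such a matrix is given as nested lists and that the batch contains at least two identities; the function name and the margin value are illustrative.

```python
def batch_hard_loss(dist, labels, margin=0.3):
    """Batch Hard loss of Eq. (5) from a precomputed distance matrix.

    dist[a][b] is D(f(x_a), f(x_b)); labels[a] is the identity of sample a.
    margin is a hypothetical value; the batch needs at least two identities.
    """
    size = len(labels)
    total = 0.0
    for a in range(size):
        # The anchor itself is included with distance 0, matching the
        # maximum over p = 1..K in Eq. (5).
        positives = [dist[a][p] for p in range(size) if labels[p] == labels[a]]
        negatives = [dist[a][n] for n in range(size) if labels[n] != labels[a]]
        total += max(0.0, margin + max(positives) - min(negatives))
    return total
```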

However, as we can see in Figure 4, the quality of the hardest triplets cannot be guaranteed. The hardest pairs are sometimes dominated by annotation noise, undetected misalignment, occlusion, or extremely low-quality images, rather than truly informative examples. The blue bounding boxes in Figure 4c illustrate this. In the first triplet in Figure 4c, the anchor image is Scott Peterson, but the positive image is William Petersen, revealing a false annotation in this dataset. It is more obvious in the first, second and fifth triplet that the anchor images’ gender is different from the positive images’ gender. As a result, these hardest triplets tend to violate the margin severely and produce excessively large gradients, making optimization unstable and slowing or even preventing convergence.

Figure 4a shows the easy triplets we randomly selected. The majority of them are too easy for the model and can hardly provide useful samples for AIFR tasks. For example, the red bounding box highlights uninformative triplets whose anchor image and negative image differ in gender. Such triplets do not sufficiently challenge the model to bridge large age gaps or to distinguish between visually similar but different identities.

We therefore place emphasis on semi-hard triplets, illustrated in Figure 4b. The valuable triplets are highlighted in green. For example, the positive image of the fourth triplet is dark, and the anchor image of the second triplet is bright, which can enhance the illumination invariance of the model. The positive image of the first triplet has a red tint, which can teach our model to adapt to different image colors. Optimizing on semi-hard triplets yields non-trivial gradients that effectively tighten the decision boundary.

In order to maximize the advantage of both hard and semi-hard triplets, we propose TAL. This new formulation of sampling a batch selects hard triplets and semi-hard triplets from all possible $PK(PK - K)(K - 1)$ combinations of triplets, which corresponds to the strategy chosen in [25] and which we call the Mixed Online Sampling Strategy:

$$L_{hard}(\theta; X) = \overbrace{\sum_{i=1}^{P} \sum_{a=1}^{K}}^{\text{all anchors}} \overbrace{\sum_{\substack{p = 1 \\ p \neq a}}^{K}}^{\text{all pos.}} \overbrace{\sum_{\substack{j = 1 \\ j \neq i}}^{P} \sum_{n=1}^{K}}^{\text{all negatives}} \overbrace{\Big[ m + D\big(f_{\theta}(x_a^i), f_{\theta}(x_p^i)\big) - D\big(f_{\theta}(x_a^i), f_{\theta}(x_n^j)\big) \Big]_{+}}^{\text{hard triplets}}, \tag{6}$$

$$L_{semi\text{-}hard}(\theta; X) = \overbrace{\sum_{i=1}^{P} \sum_{a=1}^{K}}^{\text{all anchors}} \overbrace{\sum_{\substack{p = 1 \\ p \neq a}}^{K}}^{\text{all pos.}} \overbrace{\sum_{\substack{j = 1 \\ j \neq i}}^{P} \sum_{n=1}^{K}}^{\text{all negatives}} \overbrace{\Big[ m + D\big(f_{\theta}(x_a^i), f_{\theta}(x_p^i)\big) - D\big(f_{\theta}(x_a^i), f_{\theta}(x_n^j)\big) \Big]_{+}}^{\text{semi-hard triplets}}, \tag{7}$$

$$L_{tri}(\theta; X) = \lambda_1 L_{hard}(\theta; X) + (1 - \lambda_1) L_{semi\text{-}hard}(\theta; X), \tag{8}$$

where the scalar hyper-parameter $\lambda_1$ is used to balance hard triplets and semi-hard triplets. Equations (6) and (7) share the same hinge term and differ only in the set of triplets over which it is accumulated: hard triplets in (6) and semi-hard triplets in (7). Overall, this is an online sampling strategy, since the feature embeddings $z$ in (4) change dynamically as training progresses.
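The combination in (6)–(8) can be sketched as a single pass over all valid triplets of a batch. The text does not spell out how triplets are assigned to the two groups, so the split used below is an assumption based on the usual convention: a triplet counts as hard when $D(a, n) < D(a, p)$ and as semi-hard when $D(a, p) \le D(a, n) < D(a, p) + m$. The function name and the margin default are likewise illustrative; `lam1=0.3` follows the value of $\lambda_1$ reported in Section 4.1.3.

```python
def mixed_triplet_loss(dist, labels, margin=0.3, lam1=0.3):
    """Eqs. (6)-(8): weighted sum of hard and semi-hard triplet hinge terms.

    Assumed split (not stated in the text): "hard" means D(a, n) < D(a, p),
    "semi-hard" means D(a, p) <= D(a, n) < D(a, p) + margin.
    """
    size = len(labels)
    loss_hard, loss_semi = 0.0, 0.0
    for a in range(size):
        for p in range(size):
            if p == a or labels[p] != labels[a]:
                continue
            for n in range(size):
                if labels[n] == labels[a]:
                    continue
                hinge = margin + dist[a][p] - dist[a][n]
                if hinge <= 0.0:
                    continue  # easy triplet: contributes nothing
                if dist[a][n] < dist[a][p]:
                    loss_hard += hinge
                else:
                    loss_semi += hinge
    return lam1 * loss_hard + (1.0 - lam1) * loss_semi
```

Because `lam1` is below 0.5, semi-hard triplets receive the larger weight, which matches the emphasis the text places on them.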

The total loss of TAL can be summarized as:

$$L_{TAL} = \lambda_2 L_{arc} + (1 - \lambda_2) L_{tri}, \tag{9}$$

where $\lambda_2$ is an adaptive learnable parameter, which will be discussed in Section 3.3.3. The entire procedure of TAL is shown in Algorithm 1.

Algorithm 1: Adaptive Triplet–ArcFace hybrid loss for face recognition
Input: Mini-batch $B$ with $PK$ samples, network $f_{\theta}$, margins $m_t, m_a$, fixed weight $\lambda_1$, learnable weight $\lambda_2$
Output: Total loss $L_{TAL}$
1: Extract and normalize embeddings: $e_i = f_{\theta}(x_i) / \|f_{\theta}(x_i)\|$
2: Compute cosine distances: $D_{ij} = 1 - e_i^{T} e_j$
3: Initialize $L_{hard} \leftarrow 0$, $L_{semi\text{-}hard} \leftarrow 0$
4: Accumulate $L_{hard}$ and $L_{semi\text{-}hard}$ over the hard and semi-hard triplets of the batch, as in (6) and (7), with margin $m_t$
5: $L_{tri} \leftarrow \lambda_1 \cdot L_{hard} + (1 - \lambda_1) \cdot L_{semi\text{-}hard}$
6: Compute ArcFace Loss with angular margin $m_a$: $L_{arc}$
7: $L_{TAL} \leftarrow \lambda_2 \cdot L_{arc} + (1 - \lambda_2) \cdot L_{tri}$
8: Update $\theta, \lambda_2$ via gradient descent on $L_{TAL}$
9: return $L_{TAL}$

3.3. IAL

3.3.1. InfoNCE Loss

The core of contrastive learning (CL) is to maximize the intra-class similarity and minimize the inter-class similarity. Specifically, we employ the commonly used cosine similarity $s_{ij} = z_i^{\top} z_j$. Our major objective is to pull features representing the same person at different ages closer together and push features extracted from different identities farther apart. To achieve this, the InfoNCE Loss has been employed in previous works [44] as the contrastive loss. Different from their data augmentation method, we harness the identity labels to achieve supervised learning. Our supervised InfoNCE Loss is written as

$$L_{infoNCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sum_{p \in P(i)} \exp\left(\frac{s_{ip}}{\tau}\right)}{\sum_{p \in P(i)} \exp\left(\frac{s_{ip}}{\tau}\right) + \sum_{n \in N(i)} \exp\left(\frac{s_{in}}{\tau}\right)}, \tag{10}$$

where $\tau > 0$ is a temperature parameter, $P(i)$ is the index set of positives for image $i$, and $N(i)$ is the index set of negatives for image $i$. This loss pulls together all positive embeddings of the same identity while pushing them away from negatives, and can be viewed as a supervised variant of contrastive learning originally proposed for self-supervised representation learning. When trained on cross-age datasets, many positive pairs within $P(i)$ naturally exhibit large age gaps, so minimizing $L_{infoNCE}$ explicitly encourages age-invariant similarity while preserving strong discrimination against impostors.
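The supervised InfoNCE objective in (10) can be written out directly on a small batch of $\ell_2$-normalized embeddings. In the sketch below, the function name and the temperature `tau=0.07` are assumptions, and anchors that have no positive in the batch are skipped, which is a choice of this sketch rather than something stated in the text.

```python
import math

def supervised_info_nce(embeddings, labels, tau=0.07):
    """Eq. (10) on l2-normalized embeddings.

    tau is a hypothetical default. Anchors without any positive in the
    batch are skipped (an assumption), so the mean runs over the rest.
    """
    losses = []
    for i, z_i in enumerate(embeddings):
        pos_sum, neg_sum = 0.0, 0.0
        for j, z_j in enumerate(embeddings):
            if j == i:
                continue
            weight = math.exp(sum(a * b for a, b in zip(z_i, z_j)) / tau)
            if labels[j] == labels[i]:
                pos_sum += weight
            else:
                neg_sum += weight
        if pos_sum > 0.0:
            losses.append(-math.log(pos_sum / (pos_sum + neg_sum)))
    return sum(losses) / len(losses)
```

Lowering `tau` sharpens the softmax, so well-separated batches yield a smaller loss while confusable negatives are penalized more strongly.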

3.3.2. Memory-Augmented InfoNCE Loss

The standard batch-wise InfoNCE Loss is limited by the number and diversity of negatives that can be extracted from a single mini-batch. To better approximate the large-scale retrieval setting, we extend the contrastive objective with a memory bank of embeddings, which is shown in Figure 5. Concretely, we maintain a first-in–first-out (FIFO) queue $\mathcal{M} = \{(z_k, y_k)\}_{k=1}^{M}$ that stores feature vectors and identity labels from recent mini-batches. For each anchor $i$, we still construct the positive set $P(i)$ from samples of the same identity within the current batch, while extracting a much larger pool of negatives $N_{mem}(i)$ from $\mathcal{M}$:

$$L_{inf} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sum_{p \in P(i)} \exp\left(\frac{s_{ip}}{\tau}\right)}{\sum_{p \in P(i)} \exp\left(\frac{s_{ip}}{\tau}\right) + \sum_{n \in N_{mem}(i)} \exp\left(\frac{s_{in}}{\tau}\right)}, \tag{11}$$

where $s_{ij} = z_i^{\top} z_j$ is the cosine similarity and $\tau > 0$ is a temperature parameter. After each iteration, the embeddings of the current mini-batch are enqueued into $\mathcal{M}$, and the oldest entries are dequeued to keep a fixed size $M$. Compared with the purely batch-based formulation, the memory-augmented InfoNCE exposes each anchor to a much larger and more diverse set of negatives at almost no additional computational cost, which is particularly beneficial for learning discriminative and well-calibrated embeddings for large-scale cross-age face retrieval.
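The FIFO behaviour of the memory bank and its use in (11) can be sketched with a bounded deque. The class and function names are hypothetical and the temperature default is an assumption; the capacity default follows the queue size reported in Section 4.1.3. In an actual training loop the stored embeddings would typically be detached from the computation graph, which plain Python lists do not model.

```python
import math
from collections import deque

def exp_similarity(u, v, tau):
    """exp(u.v / tau) for l2-normalized vectors u and v."""
    return math.exp(sum(a * b for a, b in zip(u, v)) / tau)

class MemoryBank:
    """FIFO queue of (embedding, label) pairs with a fixed capacity M."""

    def __init__(self, capacity=16384):
        # A bounded deque drops the oldest entries automatically.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, embeddings, labels):
        self.queue.extend(zip(embeddings, labels))

    def negatives_for(self, label):
        return [z for z, y in self.queue if y != label]

def memory_info_nce(embeddings, labels, bank, tau=0.07):
    """Eq. (11): positives from the current batch, negatives from the bank.

    tau is a hypothetical default; anchors without a positive are skipped.
    """
    losses = []
    for i, z_i in enumerate(embeddings):
        pos_sum = sum(exp_similarity(z_i, z, tau)
                      for j, z in enumerate(embeddings)
                      if j != i and labels[j] == labels[i])
        neg_sum = sum(exp_similarity(z_i, z, tau)
                      for z in bank.negatives_for(labels[i]))
        if pos_sum > 0.0:
            losses.append(-math.log(pos_sum / (pos_sum + neg_sum)))
    return sum(losses) / len(losses)
```

A typical iteration under these assumptions would compute `memory_info_nce` first and only then call `bank.enqueue` with the current batch, so that an anchor is never contrasted against its own stored copy.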

Combined with ArcFace Loss, the total loss can be represented as:

$$L_{IAL} = \lambda_3 L_{arc} + (1 - \lambda_3) L_{inf}. \tag{12}$$

3.3.3. Adaptive Parameters in TAL and IAL

To automatically balance the contributions of contrastive loss and ArcFace Loss during training, we employ an uncertainty-based adaptive weighting strategy inspired by multi-task learning [54]. Instead of manually tuning fixed weights, we introduce two learnable parameters for each hybrid loss, which represent the task-dependent homoscedastic uncertainty of its two loss components.

Let $s_1, \dots, s_4$ denote the log-variance parameters ($s_1, s_2$ for TAL and $s_3, s_4$ for IAL), defined as:

$$s_i = \log(\sigma_i^{2}), \quad i \in \{1, 2, 3, 4\}, \tag{13}$$

where $\sigma_i^{2}$ represents the variance for the $i$th loss term. The combined loss functions (TAL and IAL) are formulated as:

$$L_{TAL\text{-}adaptive} = \frac{1}{2} e^{-s_1} L_{tri} + \frac{1}{2} e^{-s_2} L_{arc} + \frac{1}{2}(s_1 + s_2), \tag{14}$$

$$L_{IAL\text{-}adaptive} = \frac{1}{2} e^{-s_3} L_{inf} + \frac{1}{2} e^{-s_4} L_{arc} + \frac{1}{2}(s_3 + s_4). \tag{15}$$

In each equation, the first two terms represent the weighted losses, where the weights $w_i = \frac{1}{2} e^{-s_i}$ are automatically learned. The last term acts as a regularizer that prevents the model from trivially minimizing the loss by increasing the uncertainties to infinity.

During training, both the network parameters $\theta$ and the uncertainty parameters $\{s_i\}$ are jointly optimized via gradient descent. A lower uncertainty $\sigma_i^{2} = e^{s_i}$ (i.e., smaller $s_i$) results in a higher weight $w_i$, indicating that the corresponding loss is more reliable and should contribute more to the total loss. This adaptive mechanism eliminates the need for manual hyper-parameter tuning and allows the model to dynamically balance multiple objectives throughout training.
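The weighting in (14) and (15) has the same form for both hybrid losses, so it can be illustrated with one small function. The names below are hypothetical. The second helper records a fact that follows from setting the derivative of $\frac{1}{2} e^{-s} L + \frac{1}{2} s$ with respect to $s$ to zero: for a fixed loss value $L > 0$, the minimizing log-variance is $s = \log L$.

```python
import math

def adaptive_hybrid_loss(loss_metric, loss_arc, s_metric, s_arc):
    """Eq. (14)/(15): uncertainty-weighted sum of a metric loss and ArcFace loss.

    loss_metric stands for L_tri (TAL) or L_inf (IAL); s_metric and s_arc
    are the corresponding log-variance parameters.
    """
    weighted = (0.5 * math.exp(-s_metric) * loss_metric
                + 0.5 * math.exp(-s_arc) * loss_arc)
    regularizer = 0.5 * (s_metric + s_arc)
    return weighted + regularizer

def optimal_log_variance(loss_value):
    """Minimizer of 0.5 * exp(-s) * L + 0.5 * s over s, i.e. s = log(L), L > 0."""
    return math.log(loss_value)
```

Under this reading, a component with a persistently large loss value drifts toward a larger $s_i$ and therefore a smaller weight, while the regularizer keeps $s_i$ from growing without bound.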

The network parameters themselves are optimized with stochastic gradient descent (SGD) with momentum. To account for the different learning dynamics between the backbone network and the ArcFace classification head, we adopt a two-tier learning rate strategy, where the backbone parameters $\theta_b$ and head parameters $\theta_h$ are updated with different learning rates:

$$\theta_b^{t+1} = \theta_b^{t} - \eta_b \frac{\partial}{\partial \theta_b^{t}} L_{total}, \qquad \theta_h^{t+1} = \theta_h^{t} - \eta_h \frac{\partial}{\partial \theta_h^{t}} L_{total}, \tag{16}$$

where $\eta_b = 0.001$ and $\eta_h = 0.005$ represent the learning rates for the backbone and classification head, respectively. By using a higher learning rate for the ArcFace head ($\eta_h = 5 \eta_b$), the model can adapt decision boundaries more rapidly while maintaining stable feature learning in the backbone network.

4. Experiments

4.1. Implementation Details

4.1.1. Network Architecture

Our network consists of the following components: (1) Backbone: We mainly adopt IResnet-50 (without the SE module) [22] as the feature extractor. It has four residual blocks and outputs a 512-dimensional feature vector through a fully connected (FC) layer. For comparison, we also evaluate FaceNet [25], MobileFaceNet [26], and Swin-T [27], whose parameter counts range from about 1 M to 50 M. (2) ArcFace Loss: The extracted features x are fed into the ArcFace classification head with an additive angular margin for identity recognition. (3) Contrastive loss: We employ TAL or IAL to enhance intra-class compactness and inter-class separability. (4) Loss combination: The two losses are combined using uncertainty-based adaptive weighting, where learnable log-variance parameters s_{1}, s_{2} automatically balance their contributions during training.
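To make the loss combination in item (4) concrete, the sketch below assumes the widely used uncertainty-weighting form L_total = Σ_i (exp(-s_i) L_i + s_i). The exact constants in Equations (14) and (15) may differ, and the function names and toy loss values are hypothetical.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine losses as sum_i exp(-s_i) * L_i + s_i (assumed form)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def grad_wrt_log_var(loss, s):
    """Derivative of exp(-s) * L + s with respect to s: 1 - exp(-s) * L."""
    return 1.0 - math.exp(-s) * loss


# Toy example: ArcFace loss 2.0, contrastive loss 0.5, both s_i initialised at 0.
total = uncertainty_weighted_loss([2.0, 0.5], [0.0, 0.0])
```

Under this assumed form, a smaller s_i gives a larger weight exp(-s_i), and the additive s_i term keeps the weights from collapsing to zero; for a fixed loss value L_i the gradient with respect to s_i vanishes at s_i = log L_i.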

4.1.2. Data Preprocessing

We detect faces in all training and testing images with MTCNN [55] and apply a similarity transformation based on five facial landmarks (two eyes, nose, and two mouth corners). After alignment, all faces are cropped to 112 \times 112 RGB images. Finally, each pixel value is normalized by subtracting 127.5 and dividing by 128.
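The normalization step can be illustrated with a short sketch, assuming 8-bit pixel values and a nested-list image layout `image[row][col][channel]` chosen only for this example; detection and alignment are not shown.

```python
def normalize_pixel(value):
    """Map an 8-bit intensity in [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0


def normalize_image(image):
    """Normalize a nested-list image indexed as image[row][col][channel]."""
    return [[[normalize_pixel(c) for c in pixel] for pixel in row] for row in image]


tiny = [[[0, 127, 255]]]  # a 1x1 RGB "image" used only for illustration
out = normalize_image(tiny)
```

The extreme intensities 0 and 255 map to about -0.996 and 0.996, so the network input is approximately zero-centered.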

4.1.3. Training Details

We conducted experiments on several widely used AIFR datasets: MORPH Album 2, CACD, FG-NET, AgeDB and IMDB-clean. We first train the deep model on in-the-wild datasets to learn general knowledge about human faces. The training data include MS-Celeb-1M [49] and CASIA-Webface [56], which we refer to as general face datasets (GFDs) in the following text. MS-Celeb-1M contains about 1 M images from 100 K individuals, while CASIA-Webface contains nearly 0.5 M images from 10 K individuals. We remove noisy samples from these datasets before training. Then we fine-tune the proposed model on each experimental dataset.

The entire training process is jointly supervised by Equation (14) or (15). The hyper-parameters are set to λ_{1} = 0.3 and M = 16,384. Training is performed on a Tesla P100 GPU, and the batch size is set to 64 due to GPU memory limitations. The model is trained for 40 epochs with stochastic gradient descent (SGD). We employ the two-tier learning rate strategy: the ArcFace classification head starts with a learning rate of 0.005, while the backbone network uses 0.001. Both learning rates are multiplied by 0.1 at the 5th, 10th, 15th, and 20th epochs.
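The step schedule described above can be written as a small function. The sketch assumes 1-indexed epochs and that each decay takes effect at the start of the milestone epoch; neither convention is specified in the text, and the names `lr_at_epoch` and `schedule` are hypothetical.

```python
MILESTONES = (5, 10, 15, 20)  # epochs at which both rates are decayed (from the text)
GAMMA = 0.1                   # multiplicative decay factor (from the text)
BASE_LR = {"backbone": 0.001, "head": 0.005}


def lr_at_epoch(group, epoch):
    """Learning rate of `group` during the given 1-indexed epoch (assumed convention)."""
    n_decays = sum(1 for m in MILESTONES if epoch >= m)
    return BASE_LR[group] * GAMMA ** n_decays


# Full 40-epoch schedule as (backbone_lr, head_lr) pairs.
schedule = {e: (lr_at_epoch("backbone", e), lr_at_epoch("head", e)) for e in range(1, 41)}
```

Under these assumptions the rates stay constant from epoch 20 onward at roughly 1e-7 for the backbone and 5e-7 for the head, and the 5:1 ratio between head and backbone is preserved throughout training.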

4.1.4. Evaluation Protocol

We take mean average precision (mAP) as an evaluation metric on the CACD dataset. For the retrieval results of each query image, precision is computed at every recall level (that is, at the rank of each positive image) and averaged to obtain the average precision (AP). mAP is then calculated over the whole query set Q:

mAP(Q) = \frac{1}{|Q|} \sum_{i = 1}^{|Q|} \frac{1}{m_{i}} \sum_{k = 1}^{m_{i}} Precision(R_{i k}),

(17)

where m_{i} is the number of positive images for query q_{i} \in Q, R_{i k} is the list of retrieval results for q_{i}, ranked in descending order of similarity, from the first image down to the kth positive image, and Precision(R_{i k}) is the ratio of positive images in R_{i k}.
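A minimal sketch of Equation (17) follows. It assumes that m_i is the number of positive gallery images for query q_i, that the whole gallery is ranked so every positive appears in the list, and that precision is evaluated at the rank of each positive (the standard reading of mAP). The function names and toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """AP for one query; ranked_labels[j] is True if the j-th result is a positive."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at the rank of this positive
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_ranked_labels):
    """Mean of the per-query AP values over the query set."""
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)


queries = [
    [True, False, True, False],   # positives at ranks 1 and 3
    [False, True, False, False],  # single positive at rank 2
]
map_value = mean_average_precision(queries)
```

For the first toy query the AP is (1/1 + 2/3) / 2 = 5/6, for the second it is 1/2, so the mAP over both queries is 2/3.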

Additionally, we employ Rank-k accuracy to evaluate retrieval performance in more detail. Rank-k accuracy measures the percentage of queries for which at least one correct match appears among the top-k retrieved results:

Rank\text{-}k(Q) = \frac{1}{|Q|} \sum_{i = 1}^{|Q|} I(\exists j \in \{1, \dots, k\} : R_{i j} \in P_{i}),

(18)

where I(\cdot) is the indicator function, R_{i j} is the jth retrieved image for query q_{i}, and P_{i} is the set of positive (relevant) images for query q_{i}. In particular, Rank-1 accuracy is the fraction of queries whose top-1 retrieved result is correct.
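Equation (18) translates directly into code. The sketch below represents each query by a ranked list of booleans (True where the retrieved image belongs to P_i); this representation and the toy data are assumptions made for illustration.

```python
def rank_k_accuracy(all_ranked_labels, k):
    """Fraction of queries with at least one positive among the top-k results."""
    hits = sum(1 for labels in all_ranked_labels if any(labels[:k]))
    return hits / len(all_ranked_labels)


queries = [
    [True, False, True, False],   # first positive at rank 1
    [False, True, False, False],  # first positive at rank 2
    [False, False, False, True],  # first positive at rank 4
]
rank1 = rank_k_accuracy(queries, 1)
rank2 = rank_k_accuracy(queries, 2)
rank4 = rank_k_accuracy(queries, 4)
```

By construction, Rank-k accuracy as defined here can never decrease as k grows, because a query that is a hit at rank k remains a hit for every larger k.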

4.2. Experiments on AIFR Datasets

We evaluate our HMLF on a series of benchmark face aging datasets, including FG-NET, AgeDB, MORPH Album 2, CACD and IMDB-clean, and compare it with state-of-the-art methods.

4.2.1. Result on CACD

CACD is a large-scale dataset for face recognition and retrieval across ages, collected in the wild with diverse variations. It contains 163,336 face images from 2000 celebrities ranging from 16 to 62 years old. Following the experimental setting in [57], 1200 celebrities are used to fine-tune the HMLF and 120 celebrities are used for testing. For the test identities, images taken in 2013 are used as probe images, and the images taken in 2004–2006, 2007–2009 and 2010–2012 are partitioned into three groups as gallery images.

Table 1 compares the retrieval results on CACD with other state-of-the-art methods. Our method achieves a clear improvement, with an average mAP exceeding 97% and consistent gains across the different gallery years.

4.2.2. Result on MORPH Album 2

MORPH is a large-scale public longitudinal face database. Album 2 has two versions, for commercial and non-commercial use, which have almost identical data distributions and have been used interchangeably in previous works. The non-commercial version contains 55,134 images of 13,618 individuals with ages between 17 and 77, while the version for commercial use contains 78,207 face images of 20,569 individuals. There are two benchmark settings, in which the testing set consists of 10,000 and 3000 subjects, respectively. Following [57], in setting-1 we build the testing set by randomly sampling 10,000 identities and choosing the youngest and the oldest image of each identity. This yields 20,000 images: the 10,000 youngest images form the gallery set and the 10,000 oldest images form the probe set. Setting-2 follows the same protocol but is restricted to 3000 identities to simulate small-scale or few-shot retrieval scenarios.

The recognition result is evaluated with the Rank-1 identification rate. As shown in Table 2, the HMLF improves the Rank-1 identification rate on MORPH Album 2. Notably, our method achieves a 0.4% improvement over MT-MIM [57]. Given the high baseline of about 98%, the remaining error rate is only about 2%, so this gain removes roughly one fifth of the remaining errors, highlighting the efficacy of our mixed sampling and adaptive weighting.

4.2.3. Result on FG-NET Dataset

FG-NET is a popular public dataset for cross-age face recognition, collected in the wild with large age variability, from children to the elderly. It contains 1002 face images from 82 individuals, with ages ranging from 0 to 69. We follow the leave-one-out setting of [57] for fair comparison with previous methods. Specifically, we use the images of 81 people to fine-tune the model and the images of the remaining person for evaluation. This procedure is repeated 82 times so that each person is used for evaluation once.

The face retrieval performance of the proposed HMLF and other state-of-the-art methods on FG-NET is reported in Table 3. Our method surpasses the previous best result by 1%, which suggests that the model is robust even on small-scale datasets with large age variation.

4.2.4. Result on AgeDB

AgeDB is an in-the-wild database containing 16,488 face images of 568 individuals with manually annotated age labels. It provides only four protocols for age-invariant face verification under different age gaps between face pairs. Thus, similar to CACD, we create a new retrieval protocol for fair comparison: 508 people form the training set and the remaining 60 people are used for testing. In the test set, all images with ages under 40 form the gallery set, while those aged above 55 form the probe set, to simulate a real-life scenario.
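The age-based gallery/probe split can be sketched as follows. The `Face` record, the file names, and the function name are hypothetical; only the two age thresholds come from the text, and images aged between the thresholds are assumed to be left out of both sets.

```python
from typing import NamedTuple


class Face(NamedTuple):
    identity: str
    age: int
    path: str


def split_gallery_probe(faces, gallery_max_age=40, probe_min_age=55):
    """Gallery: age < gallery_max_age; probe: age > probe_min_age; others dropped."""
    gallery = [f for f in faces if f.age < gallery_max_age]
    probe = [f for f in faces if f.age > probe_min_age]
    return gallery, probe


faces = [
    Face("id_a", 25, "a_25.jpg"),
    Face("id_a", 47, "a_47.jpg"),  # falls between the thresholds, so it is unused
    Face("id_a", 60, "a_60.jpg"),
    Face("id_b", 39, "b_39.jpg"),
    Face("id_b", 56, "b_56.jpg"),
]
gallery, probe = split_gallery_probe(faces)
```

Leaving a gap between the two thresholds guarantees a minimum age difference of more than 15 years between any probe image and any gallery image of the same person.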

Different backbones (IResnet, FaceNet, Swin-T, MobileFaceNet) are compared as baseline architectures. As shown in Table 4, IResnet-50 with our HMLF achieves the best results, with an average of 89% mAP and 93% Rank-1.

To visualize the overall performance of our retrieval methods, we also plot the Rank-k curve. Figure 6 shows the average accuracy of the top-k images retrieved by our model, that is, the fraction of correct images among the top-k results; unlike the Rank-k accuracy of Equation (18), which cannot decrease as k grows, this quantity can fall as more images are retrieved. For every model, the curve declines steadily as k increases. All four models decrease by about 15% from top-1 to top-10, which indicates that the top retrieval results remain reasonably stable. IResnet consistently leads in precision, staying above 90% from top-1 to top-2 in both subfigures.

4.2.5. Result on IMDB-Clean

IMDB-clean is the cleaned version of the IMDB-WIKI dataset [68], which is a large-scale in-the-wild age database collected from celebrity images. The cleaned version contains 285,946 face images from 7041 individuals, with ages ranging from 1 to 95 years old.

In our retrieval protocol, the dataset is split into training and testing sets. The testing set consists of 200 individuals with 5952 images in total, divided into two subsets: the gallery set contains 5901 images of subjects younger than 40 years old, while the probe set includes 51 images of subjects older than 55 years old, creating a significant age gap for cross-age face recognition evaluation. The training set is composed of the remaining 5000 individuals (different from the testing set), with a total of 224,497 images.

As with AgeDB, comparative experiments are conducted to assess our HMLF. As shown in Table 5, IResnet with TAL or IAL significantly outperforms the other baselines, with the second-best model achieving only 62% mAP and 65% Rank-1.

4.2.6. Comparison of Different Backbones

To comprehensively evaluate the effectiveness of different network architectures and loss functions for age-invariant face recognition, we conduct extensive experiments across multiple backbone networks and loss functions on five benchmark datasets: MORPH Album 2, CACD, FG-NET, AgeDB, and IMDB-clean.

We compare four backbone architectures with parameter counts between roughly 1 M and 50 M: IResnet-50, FaceNet, Swin-T and MobileFaceNet. Each backbone is trained with three different loss functions (ArcFace, TAL and IAL) to test the effectiveness of our HMLF.

To ensure fair comparison, all input images are resized according to the standard configuration of each backbone: FaceNet uses 160 \times 160 resolution, Swin-T uses 224 \times 224 resolution, and the others use 112 \times 112 resolution.

The results in Table 6 and Table 7 show that the HMLF (TAL or IAL) brings limited, and sometimes even negative, improvement on cross-age datasets whose age gap is small, such as CACD and MORPH. In contrast, it is clearly beneficial on datasets with large age variation, such as AgeDB and IMDB-clean. For example, for FaceNet the mAP and Rank-1 both rise by nearly 5%, suggesting that our method improves the ability to capture and decouple age-invariant features. The consistent improvement of IAL over fixed-weight TAL across all backbones supports our hypothesis that adaptive weighting can better balance multiple objectives during training.

It is noteworthy that the Swin-T backbone suffers a severe performance drop, particularly on the IMDB-clean dataset. We attribute this collapse primarily to the much larger gallery set of IMDB-clean (5901 images) compared with AgeDB (771 images). A plausible explanation is that the sheer volume of distractors in the larger gallery overwhelms the Transformer’s patch-based attention mechanism: lacking the strong local inductive biases inherent to CNNs like IResnet, Swin-T struggles to distinguish invariant structural features from transient aging textures when faced with highly complex, large-scale retrieval interference.

4.3. Ablation Study

To verify the contribution of each component, we design a systematic ablation study on the HMLF, as shown in Table 8, Table 9 and Table 10. Specifically, we conduct two sets of experiments to verify TAL and IAL, respectively. The baseline setting is the backbone with ArcFace Loss only. The first set (experiments 1, 2, and 3) explores whether the mix mining strategy and the adaptive weight enhance the performance of Triplet Loss. The second set (experiments 1, 4, and 5) examines whether the memory bank and the adaptive weight contribute to InfoNCE Loss.

As shown in Table 8, introducing the mix mining strategy improves performance over the baseline to 66.72%/80.27%, indicating that balancing semi-hard and hard samples makes training more targeted and efficient. Incorporating the adaptive weight brings a marginal improvement to 66.78%/80.34%, while adding the memory bank to InfoNCE Loss yields a moderate gain to 68.21%/81.98%, suggesting that memory bank sampling can enhance robustness against occlusion and pose variations.
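For readers who want a concrete picture of the memory-bank InfoNCE component examined in this ablation, the sketch below uses the standard InfoNCE form with a first-in-first-out queue of past embeddings. It is not the authors' implementation: the temperature τ = 0.07, the class and function names, and the assumption of L2-normalized embeddings are all illustrative choices.

```python
import math
from collections import deque


class MemoryBank:
    """FIFO queue holding up to `size` past embeddings with their identity labels."""

    def __init__(self, size=16384):
        self.items = deque(maxlen=size)  # oldest entries are evicted automatically

    def enqueue(self, embeddings, labels):
        self.items.extend(zip(embeddings, labels))

    def negatives(self, label):
        """Stored embeddings whose identity differs from `label`."""
        return [e for e, stored in self.items if stored != label]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, tau=0.07):
    """-log( exp(q.k+/tau) / (exp(q.k+/tau) + sum_j exp(q.k_j/tau)) )."""
    logits = [dot(query, positive) / tau] + [dot(query, n) / tau for n in negatives]
    m = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


bank = MemoryBank(size=4)
bank.enqueue([[0.0, 1.0], [1.0, 0.0]], ["id_b", "id_a"])
loss = info_nce([1.0, 0.0], [1.0, 0.0], bank.negatives("id_a"))
```

The queue lets each query be contrasted against many more negatives than fit in a single batch of 64, which is the motivation for a memory bank of size M.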

From Table 10, it can be seen that the HMLF has saturated on IResnet-50, with only a 1% increase in mAP and Rank-1. Nevertheless, applying the HMLF to lightweight models such as MobileFaceNet yields a substantial improvement: Table 10 shows a gain of nearly 6% mAP, driven directly by TAL and IAL. This underlines the necessity of all three modules: mix mining, memory bank and adaptive weight.

The adaptive weight contributes a marginal 1% increase in mAP and Rank-1. Because this small improvement is consistent across nearly all backbones, the weighting strategy is worth optimizing further in future work so that it can be applied more broadly.

4.4. Visualization Analysis

4.4.1. t-SNE Feature Distribution Visualization Analysis

Figure 7 presents the t-SNE visualization of face feature embedding distributions for our HMLF. In both subfigures, samples from the same identity consistently form compact clusters, while different identities are well separated without noticeable overlap. This indicates that our method learns strongly identity-preserving and age-robust representations, even when the query and gallery faces differ significantly in age. However, some isolated points remain in Figure 7b, which indicates that our model still needs improvement in extreme-age-gap cases.

4.4.2. Retrieval Visualization

Figure 8 illustrates the top-10 retrieval results of our HMLF in the cross-age retrieval task. As shown, almost every box is green, demonstrating that our method consistently retrieves correct matches within the top ranks, even under variations in pose and expression. Notably, although the apparent skin tone of the second person changes over time, our model still retrieves him correctly. These observations further verify the effectiveness of our approach in disentangling identity-related and age-related features.

4.4.3. Visualization of Intra-Class and Inter-Class Distance Distributions

As shown in Figure 9, we compare the intra-class and inter-class distances of the baseline and our HMLF by randomly sampling 5000 pairs of images. Remarkably, the gap between the two center lines is 0.518 in Figure 9a and widens to 0.619 in Figure 9b. In addition, the red and blue distributions originally overlap in a small region; after the introduction of the HMLF, that overlap vanishes, and both distributions gather closer to their centers. These observations indicate more compact samples from the same person and a larger separating margin between different identities, reflecting the merit of our HMLF in disentangling identity-related and age-related features.
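The gap between the two center lines can be estimated from sampled pairs roughly as follows. The sketch assumes cosine distance and takes each center line to be the mean of its distribution; the text specifies neither choice, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_distance_gap(pairs):
    """pairs: iterable of (embedding_a, embedding_b, same_identity)."""
    intra = [cosine_distance(a, b) for a, b, same in pairs if same]
    inter = [cosine_distance(a, b) for a, b, same in pairs if not same]
    return sum(inter) / len(inter) - sum(intra) / len(intra)


pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # same direction: distance 0
    ([1.0, 0.0], [0.0, 1.0], False),  # orthogonal: distance 1
]
gap = mean_distance_gap(pairs)
```

A larger gap means that same-identity pairs sit closer together relative to different-identity pairs, which is the trend reported between Figure 9a and Figure 9b.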

4.4.4. Attention Map Comparison

Using the GradCAM method [69], we plot normalized attention maps for each method in Figure 10. The baseline model tends to focus heavily on the lower jaw and cheek regions. These areas are highly susceptible to age-related variations, such as skin sagging, wrinkles, and changes in facial hair, which degrade the model’s ability to match images of the same identity across large age gaps.

In contrast, our proposed methods, particularly the IAL module, significantly shift the network’s attention towards age-invariant structural features. As shown in the IAL heatmap, the network exhibits a strong, symmetrical focus on the periocular region (eyes and eyebrows) and the upper nasal bridge. These regions correspond to the rigid underlying skeletal structure of the face, which remains remarkably stable throughout a person’s lifespan.

4.5. Hyper-Parameter Analysis

Figure 11 presents the visualization results for the two hyper-parameters λ_{1} and M in our loss function. First, we fix M = 16,384 and increase λ_{1} from 0 to 1 with a step size of 0.1. Both Rank-1 and mAP first decrease slightly, then rebound to a peak, and finally descend steadily. We find that λ_{1} = 0.3 is the optimal point, which provides a better balance between moderate semi-hard samples and valuable hard samples. Similarly, when fixing λ_{1} = 0.5 and sweeping M from 512 to 262,144 by doubling its value at each step, we find that the curve peaks at about M = 16,384. Based on these observations, we adopt λ_{1} = 0.3 and M = 16,384 as the final hyper-parameter settings in our experiments to achieve the best performance.
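The two one-dimensional sweeps described above correspond to the following grids; the variable names are hypothetical, and the training and evaluation loop itself is not shown.

```python
# lambda_1 sweep: 0.0, 0.1, ..., 1.0 (11 values, step 0.1)
lambda_grid = [round(0.1 * i, 1) for i in range(11)]

# M sweep: 512, 1024, ..., 262144 (doubling at each step, 10 values)
bank_sizes = [512 * 2 ** i for i in range(10)]

# Final setting reported in the text.
chosen = {"lambda_1": 0.3, "M": 16384}
```

Each sweep varies one hyper-parameter while the other is held fixed, so the two curves together require 21 training runs rather than the 110 a full grid would need.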

4.6. Model Comparison and Analysis

In this section, we present a comprehensive comparison of four backbone networks used in our age-invariant face recognition experiments. Table 11 summarizes the model parameters, computational complexity (FLOPs), inference latency, and peak GPU memory.

Table 11 shows that IResnet-50 provides a strong baseline with 43.6 M parameters. Swin Transformer Tiny introduces a shifted window attention mechanism with 28.3 M parameters, enabling efficient long-range dependency modeling. FaceNet (Inception-ResNet-v1) balances capacity and efficiency with 23.0 M parameters. MobileFaceNet is the most lightweight design with only 0.99 M parameters, making it suitable for resource-constrained scenarios.

5. Conclusions

In this paper, we address large-scale cross-age face retrieval by introducing a Hybrid Metric Learning Framework that combines ArcFace with supervised contrastive losses, equipped with online mixed mining, a memory bank-based InfoNCE formulation, and uncertainty-based adaptive loss weighting. By circumventing the need for explicit age labels, our framework learns identity-discriminative yet age-robust representations that are well suited for ranking-based retrieval. We further treat cross-age face recognition as a 1:N retrieval problem and extend this evaluation protocol to more datasets. Extensive experiments on five public cross-age datasets and multiple representative backbones demonstrate that our method achieves new state-of-the-art retrieval performance and offers comprehensive insights into how backbone–loss combinations and optimization strategies influence the cross-age generalization of face retrieval.

Author Contributions

Conceptualization, J.C. and B.L.; methodology, Z.W. and T.Z.; validation, J.C. and T.Z.; formal analysis, B.L.; writing—original draft, J.C. and T.Z.; writing—review and editing, B.L. and Z.W.; supervision, B.L. and Z.W.; project administration, T.Z.; funding acquisition, J.C. All authors have read and agreed to the published version of this manuscript.

Funding

This work was supported by the Undergraduate Training Program for Innovation and Entrepreneurship, Soochow University (Project No. 202510285035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is contained within this article.

Acknowledgments

The authors would like to thank Huanfei Ma and Minxin Chen for their valuable guidance and support throughout this research. Moreover, the authors would like to thank the anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jalal, A.S.; Sharma, D.K.; Sikander, B. Suspect face retrieval system using multicriteria decision process and deep learning. Multimed. Tools Appl. 2023, 82, 38189–38216. [Google Scholar] [CrossRef]
Zhang, Z.; Yin, S.; Cao, L. Age-invariant face recognition based on identity-age shared features. Vis. Comput. 2024, 40, 5465–5474. [Google Scholar] [CrossRef]
Wang, M.; Deng, W. Deep face recognition: A survey. Neurocomputing 2021, 429, 215–244. [Google Scholar] [CrossRef]
Ouyang, Y.; Shao, Y.; Shi, B. Research on Open Set Recognition Method for Aerial Infrared Targets. In Proceedings of the 2024 2nd International Conference on Artificial Intelligence and Automation Control (AIAC); IEEE: New York, NY, USA, 2024; pp. 517–521. [Google Scholar] [CrossRef]
Chandaliya, P.K.; Nain, N. ChildGAN: Face aging and rejuvenation to find missing children. Pattern Recognit. 2022, 129, 108761. [Google Scholar] [CrossRef]
Shen, Y.; Yang, C.; Tang, X.; Zhou, B. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2004–2018. [Google Scholar] [CrossRef] [PubMed]
Ritharson P, I.; Vidhya, K.; G, M.; D, B.; Sathish Kumar, K. GAN-Based Facial Feature Reconstruction for Improved Masked Face Recognition During Covid. In Proceedings of the 2023 International Conference on Circuit Power and Computing Technologies (ICCPCT); IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
Park, U.; Tong, Y.; Jain, A.K. Age-Invariant Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 947–954. [Google Scholar] [CrossRef]
Liao, J.; Sanchez, V.; Guha, T. Self-Supervised Frontalization and Rotation Gan with Random Swap for Pose-Invariant Face Recognition. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2022; pp. 911–915. [Google Scholar] [CrossRef]
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
Wang, H.; Gong, D.; Li, Z.; Liu, W. Decorrelated Adversarial Learning for Age-Invariant Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 3522–3531. [Google Scholar] [CrossRef]
Wang, Y.; Gong, D.; Zhou, Z.; Ji, X.; Wang, H.; Li, Z.; Liu, W.; Zhang, T. Orthogonal Deep Features Decomposition for Age-Invariant Face Recognition. In Proceedings of the Computer Vision–ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 764–779. [Google Scholar] [CrossRef]
Han, Y.; Zhou, Y.; Li, M.; Zhang, X. Cross-Age Face Recognition Based on Feature Disentanglement and Age Regression. In Proceedings of the 2025 7th International Academic Exchange Conference on Science and Technology Innovation (IAECST); IEEE: New York, NY, USA, 2025; pp. 229–232. [Google Scholar] [CrossRef]
Dong, Z.; Zhu, H.; Li, Y. An Innovative Approach to Cross-Age Face Recognition: Combining Deformable Convolutional Networks with VGG-16 Network. In Proceedings of the 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT); IEEE: New York, NY, USA, 2024; pp. 100–104. [Google Scholar] [CrossRef]
Ermao, L.; Min, Z. Review of Cross-Age Face Recognition in Discriminative Models. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC); IEEE: New York, NY, USA, 2023; pp. 124–130. [Google Scholar] [CrossRef]
Huang, Z.; Zhang, J.; Shan, H. When Age-Invariant Face Recognition Meets Face Age Synthesis: A Multi-Task Learning Framework and a New Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7917–7932. [Google Scholar] [CrossRef]
Yan, C.; Meng, L.; Li, L.; Zhang, J.; Zhan, W.; Yin, J.; Zhang, J.; Sun, Y.; Zheng, B. Age-Invariant Face Recognition by Multi-Feature Fusion and Decomposition with Self-attention. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–18. [Google Scholar] [CrossRef]
Elboushaki, A.; Hannane, R.; Afdel, K. Similarity-based face image retrieval using sparsely embedded deep features and binary code learning. Int. J. Multimed. Inf. Retr. 2024, 13, 28. [Google Scholar] [CrossRef]
Jang, Y.K.; Cho, N.I. Similarity Guided Deep Face Image Retrieval. arXiv 2021, arXiv:2107.05025. [Google Scholar]
Yang, Y.; Tian, X.; Ng, W.W.Y.; Wang, R.; Gao, Y.; Kwong, S. Generative face inpainting hashing for occluded face retrieval. Int. J. Mach. Learn. Cybern. 2023, 14, 1725–1738. [Google Scholar] [CrossRef]
Prathiba, T.; Selva Kumari, R.; Murugesan, C.S. ALMEGA-VIR: Face video retrieval system. Imaging Sci. J. 2023, 72, 766–776. [Google Scholar] [CrossRef]
Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 4685–4694. [Google Scholar] [CrossRef]
Hermans, A.; Beyer, L.; Leibe, B. In Defense of the Triplet Loss for Person Re-Identification. arXiv 2017, arXiv:1703.07737. [Google Scholar] [CrossRef]
Fang, L. Lightweight face recognition neural network based on MobileNetV4. In Proceedings of the Sixth International Conference on Signal Processing and Computer Science (SPCS 2025); Mathiopoulos, P.T., Feng, Y., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2025; Volume 13978, p. 1397817. [Google Scholar] [CrossRef]
Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2015; pp. 815–823. [Google Scholar] [CrossRef]
Chen, S.; Liu, Y.; Gao, X.; Han, Z. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. In Biometric Recognition; Zhou, J., Wang, Y., Sun, Z., Jia, Z., Feng, J., Shan, S., Ubul, K., Guo, Z., Eds.; Springer: Cham, Switzerland, 2018; pp. 428–438. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
Chen, B.C.; Chen, C.S.; Hsu, W.H. Face Recognition and Retrieval Using Cross-Age Reference Coding With Cross-Age Celebrity Dataset. IEEE Trans. Multimed. 2015, 17, 804–815. [Google Scholar] [CrossRef]
Ricanek, K.; Tesafaye, T. MORPH: A longitudinal image database of normal adult age-progression. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06); IEEE: New York, NY, USA, 2006; pp. 341–345. [Google Scholar] [CrossRef]
Lanitis, A.; Taylor, C.; Cootes, T. Toward automatic simulation of aging effects on face images. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 442–455. [Google Scholar] [CrossRef]
Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. AgeDB: The First Manually Collected, In-the-Wild Age Database. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2017; pp. 1997–2005. [Google Scholar] [CrossRef]
Lin, Y.; Shen, J.; Wang, Y.; Pantic, M. FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild. IEEE Trans. Image Process. 2025, 34, 4767–4777. [Google Scholar] [CrossRef]
Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
Wen, Y.; Li, Z.; Qiao, Y. Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 4893–4901. [Google Scholar] [CrossRef]
Agarwal, A.; Susan, S. Attention-augmented squeeze-and-excitation enhanced mobile network for occluded facial expression recognition in resource-constrained environments. Signal Image Video Process. 2025, 19, 687. [Google Scholar] [CrossRef]
Chandhana, G.; Hemalatha, P.; Reddy, P.; Sesadri, U.; Madhavi, K. CNN-Powered Facial Analysis for Gender, Age and Facial shape Classification. In Proceedings of the 2025 2nd Asia Pacific Conference on Innovation in Technology (APCIT); IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; J’egou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar] [CrossRef]
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 538–547. [Google Scholar] [CrossRef]
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 22–31. [Google Scholar] [CrossRef]
Darcet, T.; Oquab, M.; Mairal, J.; Bojanowski, P. Vision Transformers Need Registers. arXiv 2024, arXiv:2309.16588. [Google Scholar] [CrossRef]
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020. [Google Scholar]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. arXiv 2020, arXiv:2106.09215v5. [Google Scholar] [CrossRef]
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv 2021, arXiv.2004.11362. [Google Scholar] [CrossRef]
Li, H.; Zhu, J.; Wen, G.; Zhong, H. Structural self-contrast learning based on adaptive weighted negative samples for facial expression recognition. Vis. Comput. 2025, 41, 579–590.
Liu, B.; Wang, B.; Li, T. Bayesian Self-Supervised Contrastive Learning. arXiv 2024, arXiv:2301.11673.
Dong, H.; Long, X.; Li, Y. Synthetic Hard Negative Samples for Contrastive Learning. Neural Process. Lett. 2024, 56, 33.
Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In Proceedings of the Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 87–102.
Kemelmacher-Shlizerman, I.; Seitz, S.M.; Miller, D.; Brossard, E. The MegaFace Benchmark: 1 Million Faces for Recognition at Scale. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 4873–4882.
Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74.
Li, J.; Zhou, L.; Chen, J. Age-invariant face network (AFN): A discriminative model towards age-invariant face recognition. Neural Comput. Appl. 2024, 36, 13689–13702.
Zhu, J.; Cai, Z.; Lin, S. Multi-scale Feature Fusion and Multi-task Learning for Cross-age Face Recognition. In Proceedings of the 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT); IEEE: New York, NY, USA, 2024; pp. 76–81.
Yuan, H.; He, Y.; Du, P.; Song, L. Multi-Task Learning Using Uncertainty to Weigh Losses for Heterogeneous Face Attribute Estimation. IEEE Trans. Affect. Comput. 2024, 14, 2033–2047.
Ali, H.; Ijaz, A. Machine and Deep Learning Based CCTV Surveillance Using FaceNet, MTCNN, and Haar Cascade for Enhanced Security. In Proceedings of the 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL); IEEE: New York, NY, USA, 2025; pp. 672–677.
Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning Face Representation from Scratch. arXiv 2014, arXiv:1411.7923.
Hou, X.; Li, Y.; Wang, S. Disentangled Representation for Age-Invariant Face Recognition: A Mutual Information Minimization Perspective. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 3672–3681.
Gong, D.; Li, Z.; Lin, D.; Liu, J.; Tang, X. Hidden Factor Analysis for Age Invariant Face Recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2013; pp. 2872–2879.
Lin, L.; Wang, G.; Zuo, W.; Feng, X.; Zhang, L. Cross-Domain Visual Matching via Generalized Similarity Measure and Feature Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1089–1102.
Xu, C.; Liu, Q.; Ye, M. Age invariant face recognition and retrieval by coupled auto-encoder networks. Neurocomputing 2017, 222, 62–71.
Zheng, T.; Deng, W.; Hu, J. Age Estimation Guided Convolutional Neural Network for Age-Invariant Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2017; pp. 503–511.
Yu, J.; Jing, L. A Joint Multi-Task CNN for Cross-Age Face Recognition. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2018; pp. 2411–2415.
Gong, D.; Li, Z.; Tao, D.; Liu, J.; Li, X. A maximum entropy feature descriptor for age invariant face recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2015; pp. 5289–5297.
Zhao, J.; Cheng, Y.; Cheng, Y.; Yang, Y.; Zhao, F.; Li, J.; Liu, H.; Yan, S.; Feng, J. Look across elapse: Disentangled representation learning and photorealistic cross-age face synthesis for age-invariant face recognition. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19, 2019; AAAI Press: Washington, DC, USA, 2019.
Wang, H.; Sanchez, V.; Li, C.T. Cross-Age Contrastive Learning for Age-Invariant Face Recognition. In Proceedings of the ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2024; pp. 4600–4604.
Li, Z.; Park, U.; Jain, A.K. A Discriminative Model for Age Invariant Face Recognition. IEEE Trans. Inf. Forensics Secur. 2011, 6, 1028–1037.
Truong, T.D.; Duong, C.N.; Quach, K.G.; Le, N.; Bui, T.D.; Luu, K. LIAAD: Lightweight attentive angular distillation for large-scale age-invariant face recognition. Neurocomputing 2023, 543, 126198.
Kuprashevich, M.; Tolstykh, I. MiVOLO: Multi-input Transformer for Age and Gender Estimation. In Proceedings of the Analysis of Images, Social Networks and Texts: 11th International Conference, AIST 2023, Yerevan, Armenia, 28–30 September 2023; Revised Selected Papers; Springer: Berlin/Heidelberg, Germany, 2023; pp. 212–226.
Balamurugan, R.; Sekar, K.T.A.; Nandalal, V.; Wu, C.P. Face-Based Kinship Verification Using Deep Embeddings for Low-Cost Health Record Linkage. In Proceedings of the 2025 IEEE First International Conference on Innovations in Engineering and Next-Generation Technologies for Sustainability (ICINVENTS); IEEE: New York, NY, USA, 2025; Volume 1, pp. 1–6.

Figure 1. A practical AIFR task in a real-life scenario, divided into three stages: photo capture, face retrieval, and face verification.

Figure 2. Overview of the proposed Hybrid Metric Learning Framework (HMLF). The architecture comprises a feature extraction backbone (e.g., IResnet-50) and two complementary supervision heads balanced by an uncertainty-based adaptive weighting scheme.

Figure 3. The entire procedure of ArcFace Loss, which maps facial feature embeddings onto a hypersphere manifold, where semantic similarity is directly represented by geodesic distance.
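
As a rough illustration of the procedure this caption describes, the sketch below computes additive angular margin logits for one sample in plain Python. The function names, the scale `s = 64.0`, and the margin `m = 0.5` are assumptions chosen for illustration (they are common defaults, not values stated here), and the sketch omits the usual safeguard for the case where the angle plus the margin exceeds π.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(embedding, class_weights, label, s=64.0, m=0.5):
    """Return scaled cosine logits with an angular margin on the true class.

    s and m are assumed defaults, not values taken from the paper.
    """
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        w = l2_normalize(w)
        # Cosine of the angle between the embedding and the class centre.
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(e, w))))
        if j == label:
            # Add the margin in angle space for the ground-truth class only.
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    return logits

logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], label=0)
```

The resulting logits would then be passed to a standard softmax cross-entropy loss; the margin makes the true-class logit smaller, so the network must pull the embedding closer to its class centre to compensate.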

Figure 4. Examples of easy, semi-hard and hard face image triplets. (a) Easy triplets. (b) Semi-hard triplets. (c) Hard triplets. Columns from left to right: anchor, positive, and negative images.
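
The three triplet categories in this caption can be sketched with the conventional distance-based definitions from the triplet-loss literature. This is an assumption: the exact thresholds used by the authors are defined in the method section, and the helper name `triplet_type` and the margin of 0.3 are hypothetical.

```python
def triplet_type(d_ap, d_an, margin=0.3):
    """Classify a triplet from its anchor-positive and anchor-negative distances.

    Assumed (conventional) definitions:
      hard:      the negative is closer to the anchor than the positive
      semi-hard: the negative is farther than the positive, but within the margin
      easy:      the negative is farther than the positive by at least the margin
    """
    if d_an < d_ap:
        return "hard"
    if d_an < d_ap + margin:
        return "semi-hard"
    return "easy"

examples = [triplet_type(0.5, 0.4), triplet_type(0.5, 0.6), triplet_type(0.5, 1.0)]
```

Easy triplets contribute zero triplet loss, which is why mining strategies concentrate on the semi-hard and hard cases.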

Figure 5. The memory bank is a dynamic queue storing a number of feature vectors.
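
A minimal sketch of such a dynamic queue, assuming first-in-first-out replacement and label-based filtering of negatives, is given below. The class name `MemoryBank` and its methods are hypothetical and are not the authors' implementation.

```python
from collections import deque

class MemoryBank:
    """Fixed-size FIFO queue of (feature, label) pairs (illustrative only)."""

    def __init__(self, size):
        # A deque with maxlen silently drops the oldest entry when full.
        self.queue = deque(maxlen=size)

    def enqueue(self, features, labels):
        """Push the current batch; the oldest entries are evicted first."""
        for feature, label in zip(features, labels):
            self.queue.append((list(feature), label))

    def negatives(self, label):
        """Return stored features whose identity differs from `label`."""
        return [f for f, y in self.queue if y != label]

    def __len__(self):
        return len(self.queue)

bank = MemoryBank(size=3)
bank.enqueue([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]], [0, 1, 2, 0])
negs = bank.negatives(0)
```

In a training loop, the features stored in the queue would typically be detached from the computation graph so that they serve only as extra negatives for the InfoNCE term.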

Figure 6. Comparison of Rank-k curves on AgeDB for k from 1 to 10. (a) The TAL method with different backbones. (b) The IAL method with different backbones.

Figure 7. t-SNE visualization of 10 randomly sampled people, with 10 images per person, from the CACD test set. (a) Samples drawn from the gallery set (2010–2012) and the probe set (2013). (b) Samples drawn from the gallery set (2004–2006) and the probe set (2013). Different colors represent different people.

Figure 8. Top-10 cross-age retrieval results for 4 randomly selected people. A green box marks a correctly retrieved image; a red box marks an incorrectly retrieved image.

Figure 9. Histogram of inter-class and intra-class distances on the CACD dataset. We randomly sample 30 people and randomly select 5000 pairs from them to compute distances. Red denotes inter-class distances and blue denotes intra-class distances. HMLF widens the gap between the two distributions, reducing their overlap and improving feature separability.
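
The pairwise statistic behind such a histogram can be sketched as follows. The choice of cosine distance and the helper names are assumptions, and for brevity the sketch enumerates all pairs instead of randomly sampling a fixed number of them.

```python
import itertools
import math

def cosine_distance(u, v):
    """1 minus cosine similarity (assumed metric; the paper may use another)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def split_pair_distances(features, labels):
    """Split all pairwise distances into intra-class and inter-class lists."""
    intra, inter = [], []
    for i, j in itertools.combinations(range(len(features)), 2):
        d = cosine_distance(features[i], features[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter

intra, inter = split_pair_distances(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1]
)
```

Plotting the two lists as overlaid histograms shows how much the intra-class and inter-class distributions overlap; less overlap indicates more separable features.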

Figure 10. Attention maps for the baseline method, the TAL method, and the IAL method, shown alongside the original image. The original face image is from CACD. The maps indicate which facial regions the model attends to most.

Figure 11. Sensitivity to hyper-parameters. (a) The semi-hard and hard mixing coefficient λ_{1}, ranging from 0 to 1. (b) The memory bank size M, ranging from 512 to 262,144.

Table 1. Comparison of retrieval results on CACD dataset across different time periods (mAP %). The text in bold highlights our proposed methods and the best results.

Method	2004–2006	2007–2009	2010–2012
HFA [58]	50.58	53.01	56.12
CARC [28]	52.72	55.48	61.38
GSM-1 [59]	53.79	57.83	63.92
GSM-2 [59]	55.45	58.74	64.58
CAN [60]	62.33	67.69	73.24
AE-CNN [61]	70.01	72.87	78.25
JM-CNN [62]	82.53	85.26	88.28
MT-MIM [57]	92.63	93.95	96.09
MFD [17]	92.41	95.44	97.51
IResnet + TAL	96.43	97.38	98.23
IResnet + IAL	96.23	97.44	98.34
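
For readers unfamiliar with the retrieval metrics reported in these tables, the sketch below computes mAP and Rank-1 from ranked gallery labels using their standard definitions. The function names and the toy data are hypothetical, and the paper's own evaluation protocol may differ in details such as how the gallery is constructed.

```python
def average_precision(ranked_labels, query_label):
    """Average of precision values at each rank where a correct match occurs."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def evaluate(ranked_lists, query_labels):
    """Return (mAP %, Rank-1 %) over all queries."""
    n = len(query_labels)
    aps = [average_precision(r, q) for r, q in zip(ranked_lists, query_labels)]
    # Rank-1 counts a query as correct when the top result shares its identity.
    top1 = [1.0 if r and r[0] == q else 0.0
            for r, q in zip(ranked_lists, query_labels)]
    return 100.0 * sum(aps) / n, 100.0 * sum(top1) / n

# Toy example: gallery labels sorted by descending similarity for two queries.
map_score, rank1 = evaluate([["a", "b", "a"], ["a", "b", "b"]], ["a", "b"])
```

Rank-1 only looks at the top result, whereas mAP rewards placing every image of the queried identity near the top, which is why the two metrics can diverge on datasets with many gallery images per person.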

Table 2. Rank-1 accuracy (%) comparisons on MORPH Album 2. The symbol - indicates that data are not available.

Method	Setting-1	Setting-2
HFA [58]	91.14	-
CARC [28]	92.80	-
MEFA [63]	93.80	-
MEFA + SIFT + MLBP [63]	94.59	-
LF-CNN [34]	97.51	-
GSM [59]	-	94.40
AE-CNN [61]	-	98.13
OE-CNN [12]	98.55	98.67
DAL [11]	98.93	98.97
AIM [64]	99.13	98.81
AIM + CAFR [64]	99.65	99.26
MT-MIM [57]	-	99.43
CACon [65]	99.57	99.52
IResnet + TAL	99.73	99.78
IResnet + IAL	99.83	99.80

Table 3. Retrieval Rank-1 identification results on FG-NET strictly under the leave-one-out protocol.

Method	Rank-1 (%)
Park et al. [8]	37.40
Li et al. [66]	47.50
HFA [58]	69.00
MEFA [63]	76.20
LF-CNN [34]	88.10
CAN [60]	86.50
DAL [11]	94.50
AIM [64]	93.20
MT-MIM [57]	94.21
ISF [2]	94.67
VGG16-DCN [14]	80.50
CACon [65]	94.61
MFNR-LIAAD [67]	95.11
IResnet + TAL	95.43
IResnet + IAL	95.30

Table 4. mAP and Rank-1 accuracy (%) evaluation on AgeDB.

Method	mAP (%)	Rank-1 (%)
FaceNet	56.12	72.30
Swin-T	86.60	92.62
MobileFaceNet	71.19	86.62
IResnet + TAL	89.43	93.21
IResnet + IAL	88.78	93.43

Table 5. mAP and Rank-1 accuracy (%) evaluation on IMDB-clean.

Method	mAP (%)	Rank-1 (%)
FaceNet	16.15	23.53
Swin-T	61.87	64.70
MobileFaceNet	21.25	29.41
IResnet + TAL	91.23	96.01
IResnet + IAL	91.33	95.87

Table 6. Comparison of different backbones equipped with TAL or IAL (Rank-1 %).

Method	CACD (2011)	CACD (2008)	CACD (2005)	MORPH	FG-NET	AgeDB	IMDB-Clean
IResnet + ArcFace	99.88	99.56	99.54	99.75	93.44	91.40	94.12
IResnet + TAL	99.87	99.54	99.43	99.78	95.43	93.21	96.01
IResnet + IAL	99.77	99.56	99.34	99.80	95.30	93.43	95.87
FaceNet + ArcFace	99.23	99.05	99.01	87.23	79.57	72.90	23.53
FaceNet + TAL	99.30	99.12	99.23	88.43	84.23	80.34	27.37
FaceNet + IAL	99.43	99.22	99.03	89.23	85.01	82.23	27.89
Swin-T + ArcFace	99.08	99.08	99.15	99.86	93.81	92.62	64.70
Swin-T + TAL	99.06	99.06	99.14	99.76	95.45	92.78	67.19
Swin-T + IAL	99.00	99.08	99.15	99.87	96.01	92.97	66.56
MobileFaceNet + ArcFace	99.15	99.00	98.92	98.93	91.69	86.62	29.41
MobileFaceNet + TAL	99.03	98.88	98.78	98.89	93.23	89.14	35.43
MobileFaceNet + IAL	99.04	98.78	98.77	98.78	94.21	88.68	36.78

Table 7. Comparison of different backbones equipped with TAL or IAL (mAP %).

Method	CACD (2011)	CACD (2008)	CACD (2005)	MORPH	FG-NET	AgeDB	IMDB-Clean
IResnet + ArcFace	98.33	98.01	96.54	99.70	81.28	88.01	88.87
IResnet + TAL	98.23	97.38	96.43	99.75	87.23	89.43	91.23
IResnet + IAL	98.34	97.44	96.23	99.77	85.34	88.78	91.33
FaceNet + ArcFace	96.44	93.12	92.32	88.92	44.00	56.23	16.15
FaceNet + TAL	96.33	93.11	92.12	89.55	50.12	66.78	21.23
FaceNet + IAL	95.99	93.01	92.33	90.12	52.23	68.34	20.89
Swin-T + ArcFace	98.62	97.60	97.24	99.88	76.42	86.60	61.87
Swin-T + TAL	98.44	97.54	97.15	99.78	79.32	86.88	65.32
Swin-T + IAL	98.58	97.50	97.17	99.67	80.12	86.89	66.09
MobileFaceNet + ArcFace	97.87	96.26	95.66	99.07	65.62	71.19	21.25
MobileFaceNet + TAL	97.45	96.02	95.44	98.90	72.23	77.01	33.87
MobileFaceNet + IAL	97.54	96.11	95.43	99.05	74.10	76.19	34.09

Table 8. Ablation study of the HMLF framework using FaceNet, evaluated on AgeDB. The symbol - indicates that the corresponding component is not included in that setting. The symbol ✓ indicates that the corresponding component is included in that setting.

#	Baseline	Mix Mining	Memory Bank	Adaptive Weight	mAP (%)	Rank-1 (%)
1	✓	-	-	-	56.23	72.90
2	✓	✓	-	-	66.72	80.27
3	✓	✓	-	✓	66.78	80.34
4	✓	-	✓	-	68.21	81.98
5	✓	-	✓	✓	68.34	82.23

Table 9. Ablation study of the HMLF framework using MobileFaceNet, evaluated on AgeDB. The symbol - indicates that the corresponding component is not included in that setting. The symbol ✓ indicates that the corresponding component is included in that setting.

#	Baseline	Mix Mining	Memory Bank	Adaptive Weight	mAP (%)	Rank-1 (%)
1	✓	-	-	-	71.19	86.62
2	✓	✓	-	-	75.14	88.27
3	✓	✓	-	✓	77.01	89.14
4	✓	-	✓	-	74.58	87.98
5	✓	-	✓	✓	76.19	88.68

Table 10. Ablation study of the HMLF framework using IResnet-50, evaluated on AgeDB. The symbol - indicates that the corresponding component is not included in that setting. The symbol ✓ indicates that the corresponding component is included in that setting.

#	Baseline	Mix Mining	Memory Bank	Adaptive Weight	mAP (%)	Rank-1 (%)
1	✓	-	-	-	88.01	91.40
2	✓	✓	-	-	89.01	92.88
3	✓	✓	-	✓	89.43	93.21
4	✓	-	✓	-	88.60	92.78
5	✓	-	✓	✓	88.78	93.43

Table 11. Comparison of different backbone networks on age-invariant face recognition.

Backbone	Type	Params (M)	FLOPs (G)	Latency (ms)	Peak GPU Mem (MB)
IResnet-50	CNN	43.6	6.36	8.5	1250.3
Swin-T	Transformer	28.3	4.51	12.3	1580.7
FaceNet	CNN	23.0	3.78	7.2	980.2
MobileFaceNet	Efficient CNN	0.99	0.45	3.8	420.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cao, J.; Zhang, T.; Wang, Z.; Lian, B. Age-Invariant Face Retrieval Based on Hybrid Metric Learning Framework (HMLF). Electronics 2026, 15, 1851. https://doi.org/10.3390/electronics15091851
