Systematic Review

Advancements and Challenges in Deep Learning-Based Person Re-Identification: A Review

1 Henan Key Laboratory of Grain Storage Information Intelligent Perception and Decision Making, College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China
2 School of Intelligent Engineering, Henan Institute of Technology, Xinxiang 453003, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4398; https://doi.org/10.3390/electronics14224398
Submission received: 21 September 2025 / Revised: 2 November 2025 / Accepted: 5 November 2025 / Published: 12 November 2025

Abstract

Person Re-Identification (Re-ID), a critical component of intelligent surveillance and security systems, seeks to match individuals across disjoint camera networks under complex real-world conditions. While deep learning has revolutionized Re-ID through enhanced feature representation and domain adaptation, a holistic synthesis of its advancements, unresolved challenges, and ethical implications remains imperative. This survey offers a structured and critical examination of Re-ID in the deep learning era, organized into three pillars: technological innovations, persistent barriers, and future frontiers. We systematically analyze breakthroughs in deep architectures (e.g., transformer-based models, hybrid global-local networks), optimization paradigms (contrastive, adversarial, and self-supervised learning), and robustness strategies for occlusion, pose variation, and cross-domain generalization. Critically, we identify underexplored limitations such as annotation bias, scalability-accuracy trade-offs, and privacy-utility conflicts in real-world deployment. Beyond technical analysis, we propose emerging directions, including causal reasoning for interpretable Re-ID, federated learning for decentralized data governance, open-world lifelong adaptation frameworks, and human-AI collaboration to reduce annotation costs. By integrating technical rigor with societal responsibility, this review aims to bridge the gap between algorithmic advancements and ethical deployment, fostering transparent, sustainable, and human-centric Re-ID systems.

1. Introduction

Person re-identification (Re-ID) has emerged as a pivotal technology in intelligent video surveillance systems, addressing the fundamental challenge of cross-camera identity association in large-scale camera networks. The core objective of person Re-ID lies in learning discriminative feature representations that maintain robustness against viewpoint variations, illumination changes, pose diversity, and background clutter. When integrated with pedestrian detection and tracking systems, this technology forms the backbone for advanced applications ranging from security enhancement to smart city management and personalized services. The ability to accurately re-identify individuals across distributed camera views enables sophisticated spatio-temporal reasoning and behavioral analysis in complex urban ecosystems.
Despite remarkable progress in supervised learning paradigms, the deployment of person Re-ID systems in real-world scenarios remains fraught with challenges. While achieving impressive performance in controlled environments, current methods often exhibit significant performance degradation when confronted with the complexities of unconstrained settings. Key obstacles include severe occlusions, extreme pose variations, cross-camera viewpoint disparities, spatial misalignment, and low-resolution imaging artifacts. Traditional unimodal approaches relying solely on visual information (e.g., RGB imagery) prove inadequate in addressing these challenges, highlighting the need for more robust and adaptable solutions.
The emergence of multimodal person Re-ID has introduced new opportunities to overcome the limitations of unimodal systems. By synergistically combining complementary information from diverse modalities—including visible and infrared spectra, video sequences, and textual descriptions—these approaches enable more reliable cross-modal matching. This multimodal fusion effectively mitigates common challenges such as illumination variations, partial occlusions, and inter-class similarity. For instance, infrared imaging provides thermal signatures invariant to lighting conditions, while textual descriptions offer semantic cues to disambiguate visually similar individuals. The strategic integration of heterogeneous modalities has demonstrated substantial improvements in recognition accuracy, environmental robustness, and overall system reliability.
Our comprehensive analysis of person Re-ID review articles published in recent years (summarized in Table 1) reveals significant advances in methodological understanding. These surveys collectively provide thorough examinations of historical developments, current research frontiers, technical challenges, and future directions through diverse analytical lenses.
Ye et al. [1] explored closed-world person Re-ID through three methodological lenses: deep feature representation learning, metric learning, and ranking optimization. Extending this framework to open-world scenarios, they identified five critical frontiers: heterogeneous learning, end-to-end architectures, semi-supervised/unsupervised paradigms, noisy-label robustness, and open-set recognition.
Ming et al. [2] established a novel four-dimensional taxonomy for deep learning-based approaches, categorizing methods into deep metric learning, local feature learning, generative adversarial learning, and sequence feature learning. Their work offers critical insights into the comparative advantages and limitations of various Re-ID paradigms.
Zheng et al. [3] focused on visible-infrared cross-modal person Re-ID (VI-ReID), developing a comprehensive framework organized around four key dimensions: modality sharing, modality compensation, auxiliary information utilization, and data augmentation strategies. Their analysis proposes future directions emphasizing lightweight architectures and semi-supervised learning for VI-ReID systems.
Huang et al. [4] systematically analyzed image-based person Re-ID, categorizing methods into eight key challenges: occlusion handling, pose variation, background clutter, spatial misalignment, scale variation, viewpoint diversity, low-resolution recognition, and cross-domain generalization. Their taxonomy enables structured comparisons of existing approaches while revealing critical research gaps.
Zahra et al. [5] performed a multidimensional comparison of state-of-the-art methods through various analytical lenses, including modality differences (image vs. video), loss function design, technical characteristics, and application scenarios. Their visual analysis of algorithmic performance across different models offers unique insights into method capabilities.
While existing surveys have significantly advanced the field through comprehensive analyses of technical aspects like deep metric learning and cross-modal Re-ID, three critical gaps remain unaddressed:
  • Limited understanding of global research community structures and collaboration patterns;
  • Narrow focus on conventional benchmarks, neglecting the broader spectrum of unimodal and multimodal datasets;
  • Absence of unified taxonomies distinguishing unimodal and multimodal methodologies.
This paper makes three fundamental contributions to address these limitations:
  • Leveraging network science methodologies to analyze global collaboration networks, elucidating structural dynamics and knowledge diffusion patterns within person Re-ID academic ecosystems;
  • Developing a comprehensive taxonomy for 20 benchmark datasets spanning six modality categories (RGB, IR, video, text-image, visible-infrared, and cross-temporal sequences), establishing practical guidelines for dataset selection in person Re-ID research;
  • Proposing an innovative hierarchical framework that systematically categorizes 82 state-of-the-art methods into unimodal and multimodal paradigms, with further division into 15 methodologically distinct technical subclasses.
Unlike existing surveys, which often conclude before the most recent breakthroughs, our analysis offers a timely synthesis extending through August 2024 to capture the latest developments in areas such as Transformer- and LLM-based Re-ID. Our methodology is distinguished by combining a critical literature review with rigorous bibliometric analysis, drawing upon a corpus of 2067 peer-reviewed articles published in premier venues (such as CVPR, ICCV, and ECCV) since January 2019. A stringent criteria-based selection process yielded 82 representative methods, which we systematically organized into the three-tiered taxonomy presented in Figure 1. Comprising two primary branches, five secondary categories, and 15 specialized subclasses, this structured classification serves as the foundation for our three-pillar analytical framework (innovations, barriers, and frontiers), enabling a current and in-depth analysis of technological progress and methodological shifts at multiple granularities.
The remainder of this paper is organized as follows: Section 2 presents bibliometric analysis of publication trends and research networks. Section 3 provides dataset categorization with modality-specific evaluation metrics. Section 4 details our unified taxonomy with technical comparisons. Section 5 discusses open challenges and potential solutions. Section 6 concludes with future research directions.

2. Bibliometric Analysis of the Person Re-ID Landscape

Our bibliometric analysis was implemented to ensure systematicity, precision, and reproducibility, adhering to the PRISMA 2020 guidelines (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). The literature search scope was defined as January 2019 to August 2024 to comprehensively capture recent advancements driven by deep learning, particularly the latest research involving Transformers and Large Language Models (LLMs). Literature retrieval was conducted using core authoritative databases, such as Web of Science, together with prominent venues in computer vision, including top-tier conferences (e.g., CVPR, ICCV, ECCV). The specific inclusion and exclusion criteria are elaborated in the Literature Screening Protocol (Section 2.4) to ensure the representativeness of the selected articles. For the data analysis phase, we employed Gephi 0.9.7 to construct and visualize the author co-occurrence networks and institutional collaboration relationships in the global collaboration network analysis (Section 2.2). The BibExcel tool was used for high-frequency keyword statistics and co-occurrence network generation, enabling systematic mapping of the technical landscape (Section 2.3) and the identification and tracking of research hotspots and evolution trends. The resulting corpus comprises 2067 publications: 225 papers presented at prominent conferences and 1842 journal articles indexed within the Web of Science database.

2.1. Publication Trends Analysis

Annual publication counts were aggregated and visualized in Figure 2, with the temporal axis representing years and the vertical axis showing publication volumes. The visualization reveals a 42.7% compound annual growth rate (CAGR) in journal publications, compared to a 28.3% CAGR in conference papers. This disparity indicates increasing emphasis on theoretical depth and interdisciplinary integration within the Re-ID field.
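For reference, the compound annual growth rates quoted above follow the standard definition (stated here for clarity; the exact counting window is as described in the text):

$$\mathrm{CAGR} = \left(\frac{N_{\mathrm{end}}}{N_{\mathrm{start}}}\right)^{1/\Delta t} - 1,$$

where $N_{\mathrm{start}}$ and $N_{\mathrm{end}}$ are the annual publication counts in the first and last years of the window and $\Delta t$ is the number of elapsed years.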

2.2. Collaboration Network Analysis

Using Gephi 0.9.7 software, we constructed a co-authorship network from the 2067 publications, comprising 5413 nodes (authors) and 15,455 edges (collaborations). The key structural parameters are summarized in Table 2.
Figure 3 demonstrates that the network exhibits scale-free characteristics through its power-law degree distribution (γ = 2.1, R² = 0.92).
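As an illustration of how such a fit can be reproduced, the sketch below estimates the power-law exponent of a degree sequence with a standard maximum-likelihood estimator. The edge list and threshold are hypothetical stand-ins, not the actual Gephi workflow used in this study.

```python
import numpy as np
import networkx as nx

def powerlaw_exponent(degrees, k_min=1):
    """Maximum-likelihood estimate of the power-law exponent gamma for a degree
    sequence, using the discrete approximation
    gamma ~= 1 + n / sum(ln(k_i / (k_min - 0.5)))."""
    k = np.asarray([d for d in degrees if d >= k_min], dtype=float)
    return 1.0 + len(k) / np.sum(np.log(k / (k_min - 0.5)))

# Hypothetical co-authorship edge list: one (author_a, author_b) pair per collaboration.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "E")]
graph = nx.Graph(edges)
degree_sequence = [d for _, d in graph.degree()]
print(f"estimated gamma: {powerlaw_exponent(degree_sequence):.2f}")
```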
Filtering for nodes with degree ≥9 (top 1.35% of authors), we identified key research consortia driving methodological innovations (Figure 4). Leading teams advancing Re-ID paradigms include:
  • Huawei Noah’s Ark Lab (Qi Tian team): Pioneering cross-domain adaptive frameworks
  • Sun Yat-sen University (Weishi Zheng group): Leading innovations in person Re-ID with weakly annotated data
  • MPI Informatics (B. Schiele’s group): Multispectral person retrieval under varying illumination
  • Wuhan University (Zheng Wang team): Advancing transformer-based feature learning
  • University of Maryland, College Park (L. S. Davis's lab): Groundbreaking multiview human analysis methodologies
  • Chongqing Key Lab of Image Cognition: Developing domain-specific architectures

2.3. Technical Landscape Mapping

Through BibExcel analysis, we identified 28 high-frequency keywords (≥50 occurrences) including “deep learning”, “feature extraction” and “transformer architecture”. The co-occurrence network (Figure 5) analysis identified three primary clusters, as detailed in Table 3.
Emerging topics like “diffusion models” and “neural rendering” show strong bridging centrality (BC ≥ 0.4), suggesting their growing importance.

2.4. Literature Screening Protocol

The study selection process was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [6] to ensure methodological rigor and transparency. As shown in Figure 6, the screening protocol comprised the following stages:
Prior to screening, 2 duplicate records were removed. No records were marked as ineligible by automation tools or removed for other reasons.
During the initial screening phase, 806 records were excluded. This included 213 records excluded due to missing abstracts and 593 records excluded based on title/abstract relevance assessment.
A total of 135 reports could not be retrieved for full-text assessment. Following full-text review of the remaining articles, studies were excluded for the following reasons:
  • 287 reports were excluded for non-alignment with the review’s unimodal/multimodal research scope
  • 54 reports were excluded due to incompatible research methods or data presentation
  • 688 reports were excluded for lack of theoretical or practical relevance
This multi-stage filtering process ensured the inclusion of studies that directly addressed the research objectives while maintaining methodological consistency across the selected literature.
After screening 2067 initial records, we identified 88 technical articles and 5 seminal reviews according to our selection criteria. The resulting curated corpus is specifically intended for the systematic analysis of paradigm-shifting innovations, with derivative contributions filtered out. The topical scope of the corpus is presented in Table 4.

3. Datasets and Evaluation Metrics

In person Re-ID, datasets and evaluation metrics form the dual pillars of algorithmic development. Datasets provide empirical foundations by capturing diverse pedestrian appearances, viewpoints, and environmental conditions, while evaluation metrics quantitatively measure model performance to guide technical progress. This section systematically reviews benchmark datasets and metrics, highlighting their characteristics and evolutionary trends.

3.1. Benchmark Datasets

Figure 7 analyzes dataset utilization patterns in Re-ID research, revealing a multi-modal evolution:
  • Image-based datasets (40%) remain dominant due to established benchmarks and algorithmic compatibility.
  • Video-based datasets (30%) grow rapidly, emphasizing spatio-temporal feature learning.
  • Cross-modal datasets (30%) emerge to address real-world challenges, with infrared-visible datasets (15%) and text-based datasets (15%) leading this trend.
This evolution reflects the community’s shift from single-modality to multi-modal solutions while maintaining foundational benchmarks.

3.1.1. Image-Based Person Re-ID Datasets

Benchmark datasets have profoundly influenced the evolution of person Re-ID research. We now turn to eight pivotal image-based datasets that have been instrumental in shaping the field’s progress.
CUHK-01 [7]: Developed by Li et al. at The Chinese University of Hong Kong, this pioneering dataset contains 3884 images of 971 pedestrians captured by two disjoint cameras. The cameras provide complementary views: Camera A captures frontal/rear perspectives while Camera B records lateral profiles. Each identity contains four images (two per camera) with fixed 60 × 160 resolution. This dataset established early benchmarks for cross-view recognition.
CUHK-02 [8]: Expanding on CUHK-01, Li et al. introduced this larger dataset with 7264 images of 1816 identities. Maintaining the 60 × 160 resolution, it emphasizes viewpoint and pose variations across ten camera pairs. The extended scale facilitated research on cross-camera generalization.
CUHK-03 [9]: Representing a major scale advancement, this dataset contains 12,696 images of 1467 identities from five camera pairs. It introduced multi-shot matching challenges with variable resolutions (average 160 × 80 pixels). The official split contains 767 training identities (7368 images) and 700 test identities (5328 images), supporting both hand-crafted and DL-based evaluations.
Market-1501 [10]: This influential dataset from Tsinghua University and Microsoft Research contains 32,668 automatically detected bounding boxes of 1501 identities captured by six surveillance cameras. Key characteristics include:
  • Real-world detection artifacts (misalignments, occlusions)
  • Multi-query evaluation protocol
  • Standard 64 × 128 resolution
  • Training/test split: 751/750 identities
It pioneered the use of mean Average Precision (mAP) alongside Cumulative Matching Characteristics (CMC) for evaluation.
DukeMTMC-reID [11]: Captured from eight cameras at Duke University, this dataset contains 36,411 images of 1812 identities. The 702/702 identity split includes 408 distractor images in the test set to simulate real-world retrieval scenarios. Variable resolutions (70 × 126 to 76 × 229) and cross-camera viewpoint variations make it particularly challenging.
P-DukeMTMC-reID [12]: Derived from DukeMTMC, this occlusion-specific subset by Zhuo et al. contains 15,090 images (12,927 training, 2163 test) with severe partial occlusions. It enables focused study on robustness to common occlusion patterns in surveillance footage.
CUHK-SYSU [13]: This heterogeneous dataset combines 18,184 images from street surveillance (12,490 images) and movie sources (5694 images), covering 8432 identities. Unique features include:
  • Mixed resolution (60 × 60 average)
  • Cross-domain evaluation (surveillance vs. cinematic data)
  • Large-scale test set with 2900 identities
MSMT17 [14]: Currently the largest benchmark with 126,441 images of 4101 identities from 15 cameras. Key innovations include:
  • Multi-season/time-of-day variations (morning, noon, afternoon)
  • Indoor/outdoor camera transitions
  • Complex lighting variations
  • Standard splits: 1041 training vs. 3060 test identities
The comprehensive statistics of these datasets are summarized in Table 5. Key evolutionary trends emerge: (1) Steady growth in dataset scale (from 971 to 4101 IDs); (2) Increasing camera diversity (2 to 15 cameras); (3) Enhanced evaluation rigor through metrics like mAP; (4) Specialized challenges addressing occlusion and cross-domain scenarios. These developments mirror the community’s progression from constrained laboratory settings to complex real-world applications.

3.1.2. Video-Based Person Re-ID Datasets

The evolution of video-based person Re-ID has been profoundly shaped by benchmark datasets that challenge algorithms to address complex real-world scenarios. Below, we analyze six datasets that have driven advancements in temporal modeling and spatio-temporal feature learning, highlighting their technical contributions and research impacts.
PRID2011 [15]: Collected by Hirzer et al. in collaboration with Graz University of Technology, this pioneering dataset contains 749 pedestrian tracklets captured by two surveillance cameras with non-overlapping views in outdoor environments. The dataset presents notable challenges including significant illumination variations between daytime and nighttime captures, with tracklet lengths ranging from 5 to 675 frames (average 100 frames). Camera A contains 475 tracklets while Camera B has 856 tracklets, with 200 shared identities forming the basis for cross-camera evaluation. All frames are standardized to 64 × 128 pixels resolution. Despite its historical significance, PRID2011’s limited scale and single-pair camera configuration restrict its utility for modern deep learning approaches.
GRID [16]: Developed by Loy et al. at Queen Mary University of London, this dataset captures pedestrian movements in two subway stations through 17 surveillance cameras (8 in Station A, 9 in Station B). Key characteristics include:
  • Multi-view coverage of ticket halls, platforms, and escalators
  • 250 identity-matched tracklet pairs (average 10 frames per tracklet)
  • Additional 775 distractor tracklets with no identity matches
  • 320 × 320 pixel resolution with significant occlusions and low-resolution challenges
The dataset’s controlled environment and synchronized camera views make it particularly suitable for studying spatial-temporal alignment challenges.
iLIDS-VID [17]: Jointly created by Tsinghua University and Queen Mary University researchers, this dataset addresses open-space surveillance challenges with 300 identities captured by two non-overlapping cameras at an airport arrival hall. Each identity contains one tracklet per camera view, with tracklet lengths varying from 23 to 192 frames (average 73 frames). The 64 × 128 pixel sequences exhibit substantial viewpoint variations (±90°), lighting changes, and cluttered backgrounds. The standard evaluation protocol employs five-fold cross-validation, emphasizing the dataset’s focus on cross-view recognition robustness.
MARS [18]: As the first large-scale video Re-ID benchmark, MARS represents a milestone dataset collected by Zheng et al. through six campus surveillance cameras. Key advancements include:
  • 1261 identities with 20,715 tracklets (average 58 frames/tracklet)
  • Real-world tracklet generation through automatic detection and GMMCP tracking
  • Careful manual annotation for detection/tracking error correction
  • Standardized 128 × 256 pixel resolution
  • Realistic evaluation protocol with 625 training and 636 test identities
The dataset introduced mAP as a key metric alongside CMC, better reflecting real-world retrieval scenarios. Its inclusion of spatial-temporal noise in tracklets has driven advancements in temporal modeling techniques.
DukeMTMC-VideoReID [19]: Derived from the DukeMTMC [11] benchmark, this dataset by Wu et al. introduces several innovations:
  • High-frame-rate sampling (12 FPS) capturing detailed motion patterns
  • 702 training identities (2196 tracklets) and 702 test identities (2636 tracklets)
  • Additional 408 distractor identities for realistic evaluation
  • Variable resolutions (1080p to 480p) across 8 cameras
  • Active annotation framework ensuring tracklet quality
The standardized evaluation protocol (single-shot and multi-query modes) combined with its outdoor campus environment makes it particularly suitable for studying occlusions and viewpoint variations.
LPW [20]: The Large-scale Person in the Wild (LPW) dataset by Song et al. addresses some critical challenges in video Re-ID:
  • 2731 identities across 11 cameras in three crowded scenarios
  • Average tracklet length of 77 frames (590,000 total frames)
  • 1975 training and 756 test identities with strict cross-camera splits
  • Variable resolutions reflecting real-world surveillance constraints
  • Significant spatial-temporal distribution differences between scenarios
LPW’s emphasis on multi-camera dense environments has driven research into cross-domain adaptation and long-term temporal modeling.
Table 6 reveals three evolutionary trends: (1) Exponential growth in dataset scale (from 200 to 2731 identities); (2) Increasing emphasis on realistic evaluation through distractor identities and spatial-temporal noise; (3) Transition from single-camera pairs to complex multi-camera networks. Notably, modern datasets like LPW and MARS incorporate both CMC and mAP metrics, reflecting the field’s maturation towards practical deployment considerations.

3.1.3. Text-Based Person Re-ID Datasets

Contemporary text-based person Re-ID research primarily utilizes three benchmark datasets that exhibit distinct characteristics in terms of scale, annotation quality, and environmental complexity. We systematically analyze these datasets through six critical dimensions: (1) temporal evolution, (2) identity distribution, (3) cross-camera coverage, (4) text-image alignment, (5) environmental diversity, and (6) evaluation protocols.
CUHK-PEDES [21]: Established as the inaugural benchmark for TB-ReID research, this dataset aggregates images from five established Re-ID datasets (CUHK03, Market-1501, SSM, VIPER, and CUHK01) through strategic cross-dataset integration. The curated collection contains 40,206 high-resolution pedestrian images (600 × 800 pixels) representing 13,003 unique identities. Each image undergoes rigorous crowdsourced annotation, resulting in two independent textual descriptions (averaging 23.5 words) that capture fine-grained appearance attributes, dynamic action patterns, and distinctive pose characteristics.
RSTPReid [22]: Developed through systematic augmentation of the MSMT17 dataset, this benchmark introduces temporal diversity with 20,505 images of 4101 identities captured across 15 cameras under varying illumination conditions. The dataset strategically partitions into training (3701 IDs), validation (200 IDs), and test sets (200 IDs), each containing five cross-camera samples per identity. Annotated texts (minimum 23 words) emphasize complex background interactions and spatiotemporal context, with particular attention to occlusion patterns and viewpoint variations.
ICFG-PEDES [23]: Addressing single-image text pairing limitations in prior datasets, this benchmark provides 54,522 precisely aligned image-text pairs across 4102 identities. The training set (34,674 pairs from 3102 IDs) and test set (19,848 pairs from 1000 IDs) feature extended textual descriptions (average 37.2 words) with semantic segmentation annotations. Unique characteristics include detailed clothing texture descriptions and explicit attribute localization in textual annotations.
Table 7 presents a comprehensive comparative analysis of dataset characteristics, highlighting three critical research gaps: (1) inconsistent evaluation protocols across datasets, (2) limited cross-dataset generalization testing, and (3) variable annotation density in textual descriptions. Future dataset development should prioritize multi-modal temporal consistency, fine-grained attribute verification mechanisms, and standardized cross-dataset evaluation frameworks.

3.1.4. Infrared-Visible Cross-Modal Person Re-ID Datasets

Contemporary research in cross-modal person Re-ID has driven the development of several benchmark datasets that address the unique challenges of infrared-visible matching. Three prominent datasets are systematically analyzed below:
SYSU-MM01 [24]: Developed through collaboration between Sun Yat-sen University and the Key Laboratory of Information Security, Guangdong Province, this pioneering dataset employs a multi-camera setup with four RGB cameras (capturing visible spectrum at 60 FPS) and two infrared cameras (operating in 8-14µm wavelength range). The dataset contains 45,863 manually annotated images of 491 unique identities under varying illumination conditions. Training subsets consist of 34,167 images (395 identities) with 22,258 visible and 11,909 infrared samples, while the test set contains 11,696 images (96 identities) with 7813 visible and 3883 infrared samples. The dataset introduces six distinct evaluation modes based on camera perspectives and search modalities.
RegDB [25]: Constructed by Dongguk University researchers, this dual-modality dataset provides precisely aligned visible-thermal image pairs captured using FLIR Tau2 thermal cameras (640 × 512 resolution) and RGB sensors. The dataset contains 8240 images (64 × 128 pixels) of 412 identities, with balanced gender distribution (61.7% female) and viewpoint variations (37.9% front views). Each identity contains 10 visible and 10 thermal images acquired under controlled lighting conditions. The standard protocol partitions the data into equal training and testing subsets of 206 identities each, supporting both visible-to-thermal and thermal-to-visible retrieval tasks.
LLCM [26]: This large-scale multimodal dataset from Xiamen University and Shanghai AI Lab introduces challenging low-light conditions with synchronized data from nine heterogeneous sensors. The dataset contains 46,767 images of 1064 identities captured under illumination intensities ranging from 0.1 to 50 lux. Training data comprises 30,921 images (713 identities) with 16,946 visible and 13,975 infrared samples, while testing data includes 15,846 images (351 identities) with 8680 visible and 7166 infrared samples. Unique characteristics include temporal synchronization between modalities and extreme illumination variations.
Table 8 provides a comprehensive comparison of key dataset parameters. SYSU-MM01 establishes baseline performance with its multi-camera setup and varied lighting conditions. RegDB offers precise modality alignment but limited environmental diversity. LLCM introduces new challenges through extreme low-light conditions (minimum 0.1 lux) and larger identity variance. All datasets employ mAP and CMC curves for evaluation, though specific rank thresholds differ based on dataset scale.

3.2. Evaluation Metrics

Cumulative Match Characteristic (CMC) Curve: The CMC curve formally characterizes closed-set identification performance through rank-k recognition rate:
$$\mathrm{CMC}(k) = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{I}\!\left[\, y_q \in \{\, y_q^{(1)}, y_q^{(2)}, \ldots, y_q^{(k)} \,\} \,\right]$$
where $Q$ denotes the query set, $\mathbb{I}(\cdot)$ is the indicator function, and $y_q^{(i)}$ represents the label of the $i$-th ranked gallery instance for query $q$. Standard evaluations report CMC@1, CMC@5, CMC@10, and CMC@20.
Precision-Recall (PR) Curve: For the positive class $P$ and the set of predicted positives $\hat{P}$, define:
$$\text{Precision} = \frac{|P \cap \hat{P}|}{|\hat{P}|}, \qquad \text{Recall} = \frac{|P \cap \hat{P}|}{|P|}.$$
The area under the PR curve (AUC-PR) provides threshold-agnostic performance assessment.
mean Average Precision (mAP): For query $q$ with positive set $P_q$ in gallery $G$, the Average Precision (AP) is:
$$\mathrm{AP}_q = \frac{1}{|P_q|} \sum_{k=1}^{|G|} \text{Precision@}k \cdot \mathbb{I}\!\left[\, y_q^{(k)} \in P_q \,\right]$$
The mAP aggregates across all queries:
$$\mathrm{mAP} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AP}_q$$
F-Score: The $F_\beta$ measure balances precision and recall:
$$F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
with the $F_1$ score ($\beta = 1$) being most prevalent in Re-ID evaluations.
Rank-k Accuracy: Equivalent to CMC@k:
$$\text{Rank-}k = \mathrm{CMC}(k)$$
Receiver Operating Characteristic (ROC) Curve: Characterizes the trade-off between the true positive rate (TPR) and false positive rate (FPR):
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.$$
For closed-set identification (single-gallery-shot), prioritize CMC and Rank-k metrics. Open-set verification requires ROC analysis with AUC-ROC. Multi-query retrieval evaluation should emphasize mAP due to its combined ranking and precision sensitivity. Contemporary benchmarks recommend reporting both mAP and CMC@{1,5,10} for comprehensive analysis. Cross-camera generalization assessment may additionally require Difference of Confidence (DoC).
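To make the preceding definitions concrete, the sketch below computes CMC@k and mAP from a query-by-gallery distance matrix. It is a minimal illustration of the formulas above under simplified assumptions; per-camera filtering used by some benchmark protocols (e.g., Market-1501's same-camera exclusion) is omitted.

```python
import numpy as np

def evaluate_cmc_map(dist, query_ids, gallery_ids, topk=(1, 5, 10)):
    """Compute CMC@k and mAP from a (num_query, num_gallery) distance matrix."""
    cmc = np.zeros(dist.shape[1])
    average_precisions = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # gallery ranked by distance
        matches = gallery_ids[order] == query_ids[i]     # relevance of each ranked item
        if not matches.any():
            continue                                     # query with no gallery positive
        cmc[np.argmax(matches):] += 1                    # rank of first correct match
        hit_ranks = np.flatnonzero(matches)              # 0-based positions of positives
        precision_at_hits = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        average_precisions.append(precision_at_hits.mean())
    cmc /= len(average_precisions)
    return ({f"CMC@{k}": float(cmc[k - 1]) for k in topk},
            float(np.mean(average_precisions)))

# Toy usage with random features for three queries and five gallery images.
rng = np.random.default_rng(0)
q, g = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
dist = np.linalg.norm(q[:, None, :] - g[None, :, :], axis=-1)
print(evaluate_cmc_map(dist, np.array([0, 1, 2]), np.array([0, 0, 1, 2, 3]), topk=(1, 5)))
```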

3.3. Critical Synthesis on Data and Metrics

While the field of Re-ID possesses a highly mature ecosystem of datasets and evaluation metrics, a critical synthesis reveals that its development is largely driven by algorithmic convenience rather than ecological validity.
  • Data Ecology Bottleneck: The primary trend is toward multimodality, yet dataset construction remains biased. The persistent dominance of established image and video datasets (e.g., Market-1501, DukeMTMC-ReID) traps research in static, single-scene evaluation settings. There is a severe deficit in ecologically valid datasets that capture the complexities of real-world operation, such as extreme weather, unstructured occlusion, and dynamic data streams. This bottleneck indicates that dataset evolution is lagging behind practical deployment requirements.
  • Justification for Metric Dominance: The standardization of evaluation metrics is rooted in the intrinsic nature of the Re-ID task and industrial needs. Rank-k (specifically Rank-1) is the primary metric as it directly simulates the target discovery efficiency of a human operator in a surveillance system—how quickly the correct subject can be found. In contrast, mAP (Mean Average Precision) plays a more critical role in assessing overall system robustness at scale. It measures the retrieval quality when handling long-tail distributions and massive query volumes, making it the sole effective indicator for a model’s true utility in large-scale monitoring scenarios.
  • Limitations of Alternative Metrics: Other retrieval metrics (e.g., AUC or F1-Score) are less utilized because Re-ID is fundamentally a multi-class, imbalanced ranking and retrieval problem. Rank-k and mAP are superior because they account for the order of retrieval, providing a more accurate reflection of the model’s performance value in a real-world ranking environment.

4. Critical Analysis of Methodological Evolution

For full transparency, readers should note that the performance metrics (Rank-k and mAP) presented in the subsequent tables are directly cited from the original publications. While all methods are tested on the same community benchmark datasets, minor differences in implementation details (e.g., backbone, pre-training strategy) may exist across studies.
As illustrated in Figure 8, the distribution of modalities across the reviewed literature indicates a predominance of multi-modal frameworks (53%, n = 43) relative to unimodal ones (47%, n = 38) in person Re-ID research. Among unimodal systems, video (28%, n = 23) and image-based (18%, n = 15) approaches are most common. The substantial presence of multi-modal configurations, particularly text-image (26%, n = 21), vision-infrared (17%, n = 14), and image-video (11%, n = 8), reflects a growing scientific interest in exploiting complementary information from heterogeneous sources. Nevertheless, the translation of multi-modal potential into reliable performance gains is heavily dependent on the underlying fusion mechanisms. Current research often lacks depth in exploring optimal fusion strategies, especially in addressing challenges posed by modality-specific noise and inherent biases. This warrants critical evaluation to disentangle performance contributions from synergistic feature integration versus those from increased input dimensionality alone. Progress in the field necessitates the formulation of principled approaches for adaptive modality fusion and robust cross-modal feature space alignment.

4.1. Unimodal Person Re-ID: Image vs. Video Paradigms

4.1.1. Image-Based Methodologies–Beyond Superficial Representations

Contemporary image-based person Re-ID systems confront the dual challenges of cross-camera invariance learning and discriminative feature representation. While deep learning has propelled performance breakthroughs, fundamental limitations persist in handling viewpoint divergence, domain shifts, and semantic ambiguity. This section conducts a critical examination of three pivotal technical axes through both methodological advancements and inherent limitations.
Multi-View Learning: Bridging Spatial Discrepancies or Papering Over Cracks? Contemporary multi-view frameworks attempt to address viewpoint variations, occlusions, and illumination shifts through sophisticated feature fusion strategies. The Multi-View Coupled Dictionary Learning (MVCDL) framework [27], for instance, unifies heterogeneous features via coupled dictionaries. However, its limited performance on benchmark datasets suggests potential overfitting to specific view configurations. More recent architectures like MVMP [28] and MVI2P [29] introduce message passing mechanisms and probabilistic quantization, yet their reliance on camera-aware localization raises questions about scalability in camera networks with unknown parameters. The MVDC framework [30], which leverages cross-view consensus and Siamese networks for semi-supervised learning, also faces potential limitations. Key concerns include sensitivity to the sufficiency and quality of labeled data, unverified robustness of consensus under severe view/feature heterogeneity, and the challenge of balancing the joint loss. These factors, along with the computational cost of Siamese networks, may impede scalability and deployment.
The 3D multi-view reconstruction approach by Yu et al. [31] represents a paradigm shift by transforming 2D inputs into 3D representations using surface random sampling. While this method achieves state-of-the-art results on Market-1501 (95.7% Rank-1), its computational overhead (surface sampling requiring 3× GPU memory) and sensitivity to input resolution (>512 × 512 pixels) limit real-world deployment. Furthermore, the inherent ambiguity in monocular 3D reconstruction—particularly for occluded body parts—remains unaddressed, as evidenced by marginal gains on DukeMTMC-reID (91.4% Rank-1). All performance metrics are summarized in Table 9.
Multi-view methods often compensate for data limitations rather than solving the fundamental problem of viewpoint invariance. The field requires a reorientation toward geometric consistency learning rather than feature fusion band-aids.
Normalization Processes: Domain Alignment or Statistical Sleight of Hand? Normalization techniques have become de rigueur for cross-domain Re-ID, though their effectiveness remains context-dependent. As quantitatively compared in Table 10, current approaches exhibit significant performance variations across benchmark datasets. Cluster-Instance Normalization (CINorm) [32] introduces dynamic recalibration, yet its 49.7% Rank-1 on MSMT17 reveals difficulties with large domain gaps. MixNorm’s Domain-aware Mixed Normalization [33] attempts to bridge this through cross-domain feature mixing, but the lack of interpretability in its domain-aware centers hinders practical tuning.
Unsupervised methods like UDG-ReID [34] and GDNorm [35] show promise in reducing annotation dependency, but their performance degradation on complex datasets (e.g., 16.8% Rank-1 on MSMT17 for UDG-ReID) underscores the limitations of pseudo-label generation in high-variance domains. The recently proposed DTIN [36] introduces instance-aware convolution, yet its 71.0% Rank-1 on PRID suggests residual domain bias in dynamically calibrated features.
Normalization methods often conflate statistical alignment with semantic consistency. True domain generalization demands feature disentanglement at the semantic level rather than superficial distribution matching.
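As a point of reference for this family of techniques, the sketch below shows a minimal PyTorch-style block that splits channels between instance normalization (which suppresses style- and camera-specific statistics) and batch normalization (which preserves identity-discriminative statistics). This is an illustrative IBN-style pattern, not the specific CINorm, MixNorm, GDNorm, or DTIN formulation.

```python
import torch
import torch.nn as nn

class SplitNormBlock(nn.Module):
    """Illustrative IBN-style block: instance-normalize part of the channels to
    wash out camera/domain style, batch-normalize the rest to keep
    identity-discriminative statistics."""
    def __init__(self, channels: int, instance_ratio: float = 0.5):
        super().__init__()
        self.num_in = int(channels * instance_ratio)
        self.instance_norm = nn.InstanceNorm2d(self.num_in, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - self.num_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        x_in, x_bn = torch.split(x, [self.num_in, x.size(1) - self.num_in], dim=1)
        return torch.cat([self.instance_norm(x_in), self.batch_norm(x_bn)], dim=1)

# Example: normalize a batch of 256-channel feature maps.
feats = torch.randn(4, 256, 24, 12)
print(SplitNormBlock(256)(feats).shape)   # torch.Size([4, 256, 24, 12])
```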
Attention Mechanisms: Focus on Salience or Illusory Correlation? Attention modules have revolutionized discriminative feature learning, but their application in Re-ID raises critical questions as systematically compared in Table 11. The Progressive Feature Enhancement (PFE) framework [37] demonstrates impressive results (95.1% Rank-1 on Market-1501), yet its two-stage attention cascade introduces significant computational overhead (2.3× inference time vs. baseline). Pose-Guided Attention Learning (PGAL) [38] addresses clothing variations, but its 59.5% Rank-1 on PRCC highlights sensitivity to pose estimation errors—a critical flaw in crowded scenes.
The Mixed High-Order Attention Network (MHN) [39] introduces adversarial constraints for attention diversity, but its 77.2% Rank-1 on DukeMTMC-ReID suggests suboptimal exploration of higher-order relationships. Spatial attention modules [40,41] refine body part features, yet their dependency on precise detection boxes limits robustness in real-world surveillance where pedestrian occlusion rates exceed 60%.
Attention mechanisms risk creating “clever Hans” predictors that exploit dataset biases rather than learning true identity-discriminative features. Future work must focus on attention interpretability and robustness to real-world nuisances.
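For orientation, the sketch below shows the basic spatial-attention pattern these modules share: a learned saliency map re-weights feature-map locations before pooling. It is a generic illustration under assumed tensor shapes, not the PFE, PGAL, or MHN architecture.

```python
import torch
import torch.nn as nn

class SpatialAttentionPool(nn.Module):
    """Generic spatial attention: a 1x1 convolution predicts a saliency map that
    re-weights spatial locations before global average pooling."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:   # feat: (B, C, H, W)
        attn = torch.sigmoid(self.score(feat))                # (B, 1, H, W) saliency map
        weighted = feat * attn                                # emphasize salient regions
        return weighted.flatten(2).mean(dim=2)                # (B, C) pooled descriptor

feat_maps = torch.randn(2, 512, 24, 12)
print(SpatialAttentionPool(512)(feat_maps).shape)   # torch.Size([2, 512])
```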
While deep learning has achieved significant breakthroughs in cross-camera invariance learning and feature representation for image-based Re-ID, a critical examination of the field reveals that its core limitation lies in falling into the trap of static assessment. The prevailing trend along these technical axes is to escalate model complexity and computational load to resolve specific challenges such as viewpoint change and occlusion. However, this complexity results in a severe scalability-accuracy trade-off, rendering state-of-the-art models impractical for deployment on resource-constrained edge computing devices. More fundamentally, the success of image-based Re-ID is predicated on a static feature-matching assumption, failing to account for the temporal continuity and dynamic environmental variations essential to real-world scenarios. Consequently, the key future trend must pivot from maximizing single-scene identification accuracy toward developing lightweight, high-efficiency architectures combined with causal representation learning to disentangle appearance features from environmental noise, thereby satisfying practical demands for both efficiency and robustness.

4.1.2. Video-Based Person Re-ID—From Motion Understanding to Structured Learning

Video-based person Re-ID represents a paradigm shift from static image analysis to dynamic spatio-temporal reasoning, where the fundamental challenge lies in establishing robust correspondence between non-overlapping camera views through temporal coherence modeling. While existing approaches have demonstrated progress in leveraging motion patterns and temporal dependencies, significant gaps persist in handling real-world complexities such as asynchronous temporal dynamics, cross-view motion distortion, and uncontrolled environmental factors. This section provides a critical review of contemporary video Re-ID methodologies through five core dimensions, emphasizing both technical advancements and unresolved limitations.
Multi-Model Integration: Beyond Simple Fusion. Video-based person Re-ID systems must contend with significant challenges including pose variations, occlusions, and viewpoint discrepancies. The research community has progressively developed more sophisticated approaches to address these limitations:
Wu et al. [42] propose an adaptive multi-model graph learning framework. The framework combines pose-aligned topological connections with feature affinity graphs in a dual-branch architecture to capture both anatomical consistency and appearance-based affinities across frames, while a novel regularization mechanism enforces temporal resolution invariance through contrastive learning and feature distillation.
Liu et al. [43] pioneered an end-to-end framework integrating three complementary CNN architectures with multi-level feature hierarchies and optimized triplet loss functions. Their ensemble strategy demonstrated measurable improvements in spatio-temporal representation learning, achieving 83.6% Rank-1 accuracy on the MARS benchmark through holistic gait sequence modeling.
Building on these CNN foundations, recent works explore hybrid architectures that incorporate transformer networks for enhanced temporal reasoning [44], attempting to balance modality fusion with computational efficiency. By processing RGB frames in parallel with optical flow through CNN-RNN cascades, this approach demonstrates improved gait discriminability. However, the observed performance degradation on MARS (56% Rank-1) compared to DCCT highlights inherent trade-offs between model complexity and generalization capabilities. Comparative performance metrics are presented in Table 12.
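Many of these pipelines, including the ensemble of Liu et al. [43], are trained with triplet objectives over clip-level features. The sketch below shows a generic batch-hard variant for reference; it is a common formulation, not necessarily the exact loss used in the cited works.

```python
import torch

def batch_hard_triplet_loss(embeddings: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Generic batch-hard triplet loss: each sample is compared against its
    hardest positive (farthest same-ID embedding) and hardest negative
    (closest different-ID embedding) within the mini-batch."""
    dist = torch.cdist(embeddings, embeddings, p=2)            # (B, B) pairwise L2
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)       # (B, B) identity mask
    hardest_pos = dist.masked_fill(~same_id, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same_id, float("inf")).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()

# Toy usage: 8 clip embeddings, 4 identities with 2 clips each.
emb = torch.randn(8, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_triplet_loss(emb, ids))
```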
Critically, existing multi-model frameworks often overlook the semantic consistency across modalities. Future research should explore cross-modal attention mechanisms that dynamically weight feature contributions based on contextual relevance, rather than static fusion strategies. This could improve robustness against modality-specific noise patterns while preserving discriminative information.
Spatio-Temporal Multi-Granularity Feature Aggregation: The Resolution Dilemma. Recent advancements in feature aggregation focus on capturing discriminative patterns across multiple spatial and temporal scales. The Multi-task Multi-granularity Attention-guided Global Aggregation (MMA-GGA) framework [45] introduces hierarchical feature extraction with global attention-driven fusion. While demonstrating strong performance on DukeMTMC-VideoReID (97.3% Rank-1), its dual-stage architecture introduces significant computational overhead, limiting real-time applications.
Reinforcement learning-based frameworks [46] offer an alternative through reward-optimized frame selection. By systematically preserving discriminative spatio-temporal patterns, these methods enhance representation efficiency. However, the reliance on predefined reward functions may limit adaptability to novel scenarios. Future work should investigate meta-learning approaches for dynamic reward shaping based on environmental context.
The Spatio-Temporal Relation Module from [47], which uses hierarchical decomposition and cascaded gated MLPs for cross-modal pedestrian analysis, faces potential drawbacks. Its cascaded MLPs might lead to high computational demands and sensitivity to hyperparameters, possibly impeding real-time use and generalization. The decomposition strategy also requires validation for efficacy across diverse densities and settings. Moreover, the interpretability and robustness of features learned through complex MLP gating need assessment concerning the trade-offs involved in practical deployment.
The Motion Feature Aggregation (MFA) framework [48] presents a dual-resolution architecture for motion characterization. While effective for short-term motion analysis, its coarse-grained kinematic modeling struggles with complex activities involving non-linear trajectories. Integrating physics-inspired motion models could improve long-term prediction capabilities. Table 13 presents the comparative performance metrics.
Structured Spatio-Temporal Modeling: The Illusion of Completeness. Structured approaches explicitly model spatial-temporal relationships through hierarchical feature organization, as quantitatively demonstrated in Table 14. The dual-branch architecture proposed by Pan et al. [49], combining appearance and pose feature learning, demonstrates strong occlusion resistance. However, the recursive graph convolutional network (RGCN) component introduces significant memory requirements, hindering scalability to large-scale video datasets.
Zhang et al. [50] propose a spatio-temporal transformer framework that challenges conventional video Re-ID paradigms through structured spatio-temporal modeling. The architecture dynamically weights discriminative regions via spatial attention while prioritizing critical frames through temporal attention, creating an illusion of completeness in feature representation.
Hou et al. [51] present a TRL and ST2N-based video Re-ID framework with three unresolved barriers: TRL’s sequential LSTMs cause unsustainable computational costs, ST2N’s attention demands unrealistic semantic annotations in surveillance contexts, and untested robustness to extreme appearance changes. The combined impact of annotation dependency, quadratic-complexity transformers, and unverified edge-case performance creates critical deployment bottlenecks, overshadowing benchmark gains through operational impracticality.
D3DNet [52] develops a 3D dense convolutional architecture with 3D dense blocks for joint spatio-temporal learning. Although the authors assert that temporal expansion improves motion analysis, the framework lacks computational efficiency validation against factorized methods and direct comparisons with decoupled architectures, leaving its claimed superiority over separated modeling approaches inadequately substantiated.
The Keypoint-based Spatio-Temporal Learning (KSTL) framework [53] addresses part-feature integration challenges through anatomical part localization. While effective for normalized poses, its performance degrades under extreme viewpoint variations. Future research should explore adaptive part detection mechanisms that accommodate deformed poses through geometric transformation modeling.
The Multi-Level Temporal-Spatial (MLTS) network [54] introduces specialized components for bounding box alignment and temporal attention. While improving spatial coherence, its temporal attention mechanism exhibits limited temporal receptive fields, potentially missing long-range dependencies critical for activity recognition.
Multi-Scale Analysis: The Forgotten Temporal Dimension. Multi-scale analysis, prevalent for spatial features, is increasingly applied to temporal dynamics in video-based person Re-ID. This section reviews prominent multi-scale temporal modeling frameworks, analyzing their contributions and limitations.
M3D [55] utilizes multi-scale temporal filters within 3D convolutions to model local-to-global dynamics. Its fixed-scale decomposition, however, limits adaptability to varying motion speeds or viewpoints (DukeMTMC-VideoReID: 95.49% Rank-1).
MSTA [56] employs spatio-temporal attention for dynamic scale weighting. Its end-to-end optimization demands extensive annotated data, hindering performance on smaller datasets (MARS: 84.08% Rank-1; iLIDS-VID: 70.1% Rank-1, see Table 15) and limiting practicality.
MS-STI [57] introduces interaction modules for aligning features across scales, achieving strong results (DukeMTMC-VideoReID: 97.4% Rank-1). However, the quadratic complexity (O(N²)) of these interactions raises scalability concerns for high-resolution video.
Wei et al. [58] leverage pose information via graph convolutional networks (GCNs) across scales, improving occlusion robustness. Performance degrades significantly with inaccurate pose estimation, common under occlusion or low resolution.
MSCA [59] combines spatial channel attention with temporal enhancement. The latter relies on pre-defined motion primitives, restricting adaptability to novel temporal patterns.
Multi-scale temporal modeling for video Re-ID has advanced, yet substantial challenges remain. Addressing dynamic temporal scales, computational efficiency, and supervision requirements is crucial for developing deployable systems.
Unlabeled Dependency Strategies: The Annotation Paradox. Unsupervised methods autonomously exploit spatio-temporal dependencies to reduce annotation costs. The Tracklet Association Spatio-Temporal Correlation (TASTC) mechanism [60] demonstrates viewpoint-robust recognition through temporal pyramid slicing. However, its dependency on motion continuity limits effectiveness in crowded scenes with frequent occlusions.
The Dynamic Graph Matching (DGM) architecture [61] iteratively refines label estimations through self-supervised learning. While innovative, its positive re-weighting mechanism struggles with noisy initial predictions, highlighting the need for robust outlier detection modules.
The Unsupervised Anchor Association Learning (UAAL) framework [62] achieves strong performance through cyclic ranking alignment. However, its anchor-based approach may suffer from representation collapse in highly dynamic environments. Future research should explore contrastive learning paradigms that maintain feature diversity through adaptive augmentation strategies.
SCR [63] developed a Self-supervised Sampling and Re-weighting Clustering (SRC) framework that processes video tracklets through dynamic noise pruning. While the system’s multi-granular re-weighting mechanism effectively handles heterogeneous frame patterns, its clustering-based approach remains sensitive to initial sampling quality, suggesting opportunities for integration with contrastive learning paradigms.
CCM [64] introduced a Cross-camera Matching (CCM) framework. The system’s alternating optimization of metric learning and camera relationship modeling demonstrates strong cross-view consistency, though its performance degrades in sparse-camera scenarios with limited viewpoint coverage. Comparative performance metrics are detailed in Table 16.
Video-based Person Re-ID represents a necessary paradigm shift to dynamic spatio-temporal reasoning. Despite progress in leveraging motion patterns and temporal dependencies, the field is critically bottlenecked by the inefficiency and fragility of temporal feature modeling. Most current approaches rely on rigid or simplistic sequence aggregation mechanisms (e.g., mean pooling or fixed windows), which fail to establish robust temporal correspondence when faced with asynchronous temporal dynamics, cross-view motion distortion, and long temporal discontinuities. This inadequacy renders models brittle when encountering real-world complexities like rapid object movements or severe occlusion across frames. Consequently, the key future trend must pivot from simple feature fusion to adaptive, fine-grained temporal alignment. Future research should focus on Self-Supervised Learning and advanced Graph Neural Networks (GNNs) to dynamically capture non-contiguous sequence relationships and efficiently distill identity-discriminative spatio-temporal coherence from unstructured video streams.
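As a concrete contrast with the uniform mean pooling criticized above, the sketch below shows a minimal learned temporal-attention aggregator for per-frame features. It is an illustrative pattern under assumed tensor shapes, not any specific published model.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Aggregate per-frame features into a clip descriptor by learning a scalar
    importance weight per frame, instead of uniform mean pooling."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # frames: (B, T, D)
        weights = torch.softmax(self.score(frames), dim=1)     # (B, T, 1) frame weights
        return (weights * frames).sum(dim=1)                   # (B, D) clip descriptor

clip = torch.randn(4, 16, 256)                  # 4 tracklets, 16 frames, 256-d features
mean_pooled = clip.mean(dim=1)                  # rigid baseline aggregation
attn_pooled = TemporalAttentionPool(256)(clip)  # learned, frame-weighted aggregation
print(mean_pooled.shape, attn_pooled.shape)
```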

4.2. Person Re-ID in Multi-Modality Scenarios: Progress and Fundamental Limitations

The fusion of heterogeneous sensor data in person Re-ID presents both unprecedented opportunities and profound technical challenges. While multi-modal approaches demonstrate superior robustness compared to unimodal systems, their practical deployment remains constrained by three unsolved paradoxes: (1) The modality complementarity principle versus the semantic asymmetry dilemma, (2) The granularity alignment imperative versus computational complexity explosion, and (3) The cross-modal generalization objective versus domain-specific overfitting tendency. This section conducts an epistemological examination of current multi-modal Re-ID paradigms through the lens of these fundamental contradictions.

4.2.1. Text-Image Modality Fusion—Beyond Semantic Surface Alignment

Text-image cross-modal person Re-ID integrates visual and textual cues to match pedestrians based on natural language descriptions of appearance, clothing, and behavior. This section conducts a critical review of three core technical dimensions: multi-granular feature alignment, joint embedding space optimization, and cross-modal representation learning, while highlighting unresolved challenges and promising research trajectories.
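Before turning to these dimensions, it is worth fixing the retrieval setup that the reviewed methods share: both modalities are encoded into a joint space, and gallery images are ranked by cosine similarity to the query description. The sketch below illustrates only this shared step with hypothetical pre-computed embeddings; the cited methods differ in how the encoders and alignment losses are constructed.

```python
import torch
import torch.nn.functional as F

def rank_gallery_by_text(text_query_emb: torch.Tensor,
                         gallery_image_embs: torch.Tensor) -> torch.Tensor:
    """Rank gallery images by cosine similarity to a text query in a joint
    embedding space -- the retrieval step shared by text-image Re-ID methods."""
    q = F.normalize(text_query_emb, dim=-1)        # (D,) unit-norm text embedding
    g = F.normalize(gallery_image_embs, dim=-1)    # (N, D) unit-norm image embeddings
    similarity = g @ q                             # (N,) cosine similarity per image
    return torch.argsort(similarity, descending=True)   # gallery indices, best first

# Hypothetical pre-computed embeddings (e.g., from a CLIP-like dual encoder).
query = torch.randn(512)
gallery = torch.randn(100, 512)
print(rank_gallery_by_text(query, gallery)[:5])    # top-5 candidate indices
```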
Multi-Granularity Matching: A Double-Edged Sword. Hierarchical matching frameworks address modality heterogeneity by jointly preserving global semantic consistency and establishing localized cross-modal correspondences. The SUM framework [65] introduces stacked Memory Gating Modules (MGMs) with Global Memory (GMGM) and Fine-Grained Memory (FMGM) components, achieving progressive feature refinement through cross-modal similarity-guided fusion. While this architecture demonstrates balanced information integration, its dependency on manually defined similarity metrics may limit adaptability to complex scenarios. The Pyramid Multi-Granularity Matching Network (PMG) [66] adopts coarse-to-fine learning for variance suppression, yet struggles with severe occlusions as shown in Table 17.
Transformer-based architectures exhibit notable advancements. Bao et al. [67] combine Cross-Modal Multi-Granularity Matching (CMGM) with Weak Positive Pair Contrastive Loss (CLWP), effectively resolving appearance similarities. However, the CLWP mechanism’s reliance on weak supervision may introduce label noise propagation. The MGCC model [68] addresses occlusion through synthetic data generation, but its performance drops significantly on non-occluded benchmarks, revealing over-reliance on augmented samples.
Pose-guided methods like PMA [69] and MIA [70] introduce anatomical constraints, yet their effectiveness diminishes with unstructured poses or ambiguous descriptions. The CLIP-driven CFine framework [71] demonstrates strong discriminative power through multi-granularity interactions, but its computational complexity hinders real-time deployment. Collectively, these methods highlight the trade-off between granularity and efficiency, with no single framework effectively addressing all occlusion, pose variation, and computational constraints simultaneously.
Modal Alignment: The Illusion of Unified Spaces. Bridging the semantic gap between abstract text and concrete visuals remains challenging. Adversarial approaches like AATE [72] and AMEN [73] enforce feature disentanglement through min-max games, but adversarial training instabilities often degrade convergence reliability. The MAPS network [74] introduces masked feature selection, yet its normalization strategy struggles with domain-specific feature distributions.
Bidirectional frameworks such as BCRA [75] and MCL [76] enhance interactions through cross-attention, but their performance variability across datasets (Table 18) suggests insufficient domain generalization. Ref. [77] developed a Dual-modal Graph Attention Interaction Network (Dual-GAIN) integrating a dual-stream feature extractor with a Graph Attention Interaction Network (GAIN), where visual and textual features are comprehensively captured while GAIN strengthens cross-modal interactions through cosine similarity-constrained attention. Probabilistic methods like Zhao et al.’s Gaussian uncertainty modeling [78] provide theoretical insights into alignment confidence, though practical implementation faces optimization challenges.
Part-based alignment strategies in SAL [80] and SSAN [23] show promise for fine-grained matching, but their token clustering mechanisms are sensitive to description quality. The CANC method [79] introduces neighbor-aware completion, yet its projection matching struggles with ambiguous text queries. A critical gap persists in handling polysemous textual descriptors and context-dependent visual ambiguities.
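Most of these alignment strategies ultimately optimize a shared embedding space with a bidirectional matching objective. The sketch below shows a generic symmetric, InfoNCE-style image-text contrastive loss of the kind such joint spaces typically build on; the exact losses of the cited methods differ in their sampling, weighting, and auxiliary terms, so this should be read as an illustrative baseline rather than any method's implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Bidirectional InfoNCE-style objective over a batch of matched image-text pairs.

    Row i of img_emb is assumed to describe the same pedestrian as row i of txt_emb.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image retrieval
    return 0.5 * (loss_i2t + loss_t2i)

loss = symmetric_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```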
Feature Enhancement: Over-Smoothing in High-Dimensional Manifolds. Attention-driven architectures dominate this domain. The BERT-based BDNet [81] strengthens salient features, but its dual-path structure increases parameter redundancy. Qi et al.’s SRU-CNN hybrid [82] improves text-guided visual learning, though its local feature decomposition risks overfitting to specific body parts.
Prototype-based methods like ProPot [83] enhance identity awareness, but their prompt engineering requires extensive domain knowledge. The CFLAA framework [84] combines global text awareness with body correlation modeling, yet its part-level topological assumptions limit flexibility.
The specific performance comparison is shown in Table 19. However, a fundamental limitation across enhancement techniques is their assumption of static feature distributions, which hinders dynamic adaptation to evolving visual-textual patterns; this issue remains underexplored, particularly in long-term surveillance. Additionally, most methods focus on instance-level matching without considering the contextual co-occurrence relationships critical for crowd scenarios.
The primary limitation in Text-Image cross-modal Re-ID is the intractable semantic gap between visual features and linguistic semantics. Current fusion approaches, which predominantly rely on joint embedding space optimization, suffer from a fundamental flaw: they fail to robustly address the inherent ambiguity, subjectivity, and compositionality prevalent in natural language descriptions (e.g., “wearing a dark coat”). This over-reliance on semantic surface alignment renders models brittle when handling complex or abstract attribute queries. Consequently, the field must pivot from simple feature alignment toward knowledge-enhanced matching. Future research should prioritize leveraging the capabilities of Neuro-Symbolic Integration and Large Language Models (LLMs) to incorporate explicit logical constraints and external semantic knowledge, thereby grounding the feature representation and enabling accurate, robust matching against complex, high-level semantic queries.

4.2.2. Visible-Infrared Cross-Modal Re-ID–Progress and Limitations

In practical Re-ID deployments, surveillance systems frequently encounter extreme lighting conditions where visible-light cameras struggle while infrared sensors remain operational. This modality gap between visible (VIS) and infrared (IR) domains poses significant challenges for cross-modal Re-ID. Current research addresses this problem through two primary paradigms: feature fusion and modality alignment. This section provides a critical review of these approaches, highlighting technical advancements, inherent limitations, and emerging trends.
Feature Fusion Strategies: Bridging Modal Gaps vs. Semantic Dilution. The inherent heterogeneity between VIS and IR modalities manifests as significant discrepancies in color distributions, texture patterns, and shape representations. Feature fusion approaches seek to reconcile these differences by integrating cross-modal characteristics into unified embeddings. Recent innovations exhibit notable progress with detailed performance metrics provided in Table 20.
The two-stage modal enhancement network (TSME) [85] employs DSGAN for image synthesis, demonstrating an impressive Rank-1 accuracy of 91.66% on RegDB. However, GAN-generated augmentation risks introducing perceptual artifacts that may corrupt identity-discriminative cues, as evidenced by the 4.3% performance drop in indoor search compared to all-search scenarios.
Methods like TFFN [86] and MGFNet [87] adopt hierarchical fusion strategies, achieving state-of-the-art mAP scores (82.53% on RegDB). Yet, the stacking of multi-level features (e.g., LRSA module in MGFNet) introduces quadratic complexity growth, raising scalability concerns for real-time applications.
Wang et al. [88] proposed a Feature Fusion and Center Aggregation Network (F2CANet) that achieves 71.94% mAP on SYSU-MM01 by jointly learning modality-specific and shared representations. The framework employs a dual-stream architecture with modal mitigation modules to reduce cross-modal discrepancies, while center aggregation learning enhances discriminative feature fusion across multiple granularities.
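The structural idea behind such dual-stream designs, with modality-specific shallow stems feeding a shared encoder, can be sketched as follows. The layer sizes, names, and depths are illustrative assumptions and do not reproduce the published F2CANet architecture.

```python
import torch
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Modality-specific shallow stems followed by a shared, modality-agnostic encoder."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()

        def stem():
            return nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
            )

        self.vis_stem, self.ir_stem = stem(), stem()    # separate parameters per modality
        self.shared = nn.Sequential(                    # shared layers learn common identity cues
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        stem = self.vis_stem if modality == "visible" else self.ir_stem
        return self.shared(stem(x))

enc = DualStreamEncoder()
vis_emb = enc(torch.randn(2, 3, 256, 128), "visible")
ir_emb = enc(torch.randn(2, 3, 256, 128), "infrared")
```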
The CFF-ICL framework [89] employs dual-discriminator adversarial training, yielding 82.77% mAP on RegDB. However, the dual-attention architecture introduces 42.6 M trainable parameters, increasing susceptibility to overfitting on small-scale datasets like SYSU-MM01. This aligns with observed performance fluctuations across different evaluation protocols.
WF-CAMReViT [90] and CMIT [91] leverage cross-modal transformers, achieving competitive results (77.58% indoor search mAP). Nevertheless, the self-attention mechanism’s O(n²) complexity becomes prohibitive for high-resolution inputs, limiting deployment in edge computing scenarios. These scalability challenges are similarly observed in Mask-Guided Dual Attention-Aware Network (MDAN [92]), where despite employing mask-augmented images and Residual Attention Modules (RAM) for enhanced feature representation, the framework’s heavy reliance on extensive paired training data remains a fundamental constraint for real-world deployment.
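The scalability concern can be quantified with a back-of-envelope estimate of how self-attention cost grows with token count; the figures below are illustrative for a single attention layer with an assumed embedding width of 768, and ignore projections, heads, and batch size.

```python
def attention_cost(num_tokens: int, dim: int = 768):
    """Rough cost of one self-attention layer: the (T x T) attention map dominates."""
    qk_flops = 2 * num_tokens * num_tokens * dim   # Q @ K^T
    av_flops = 2 * num_tokens * num_tokens * dim   # softmax(QK^T) @ V
    attn_entries = num_tokens * num_tokens         # stored attention matrix
    return qk_flops + av_flops, attn_entries

for tokens in (196, 576, 1024):                    # e.g., 14x14, 24x24, 32x32 patch grids
    flops, entries = attention_cost(tokens)
    print(f"{tokens:5d} tokens -> ~{flops / 1e9:.2f} GFLOPs, {entries / 1e6:.2f} M attention entries")
```

Quadrupling the spatial token count (e.g., by doubling input resolution) multiplies both compute and attention memory by roughly sixteen, which is the scaling behavior that makes high-resolution transformer inference difficult on edge devices.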
Modality Alignment Approaches: Theoretical Elegance vs. Practical Robustness. Modality alignment aims to project heterogeneous features into a unified latent space, increasingly emphasizing distribution matching and feature calibration. However, critical implementation challenges persist despite conceptual advancements.
Park et al. [93] proposed probabilistic dense correspondences, yet their method struggles with complex pose variations (55.41% all-search Rank-1). This reflects the fundamental limitation of pixel-wise alignment in capturing semantic consistency across modalities.
The dual adaptive alignment network [94] partitions features horizontally, achieving 60.73% mAP on SYSU-MM01. However, this rigid partitioning scheme may disrupt holistic pedestrian representations, particularly when dealing with occluded subjects or non-standard viewpoints.
DMANet [95] and CAL [96] utilize attention mechanisms for feature recalibration. While effective in controlled settings (88.67% RegDB mAP), these methods exhibit sensitivity to sensor noise and dynamic environmental changes, as seen in the 3.2% performance drop from visible-to-infrared vs. infrared-to-visible searches in CAL.
The cross-modal transformer (CMT) [97] demonstrates strong results (79.91% indoor search mAP), but its 12-layer encoder-decoder architecture requires massive paired training data. This creates deployment barriers in real-world scenarios with limited labeled samples.
Zhang et al. [98] proposed a dual attention framework that achieved 87.30% mAP on RegDB, using an SFMM (shallow features to bridge inter-modal gaps) and a DAEM (attention to suppress intra-modal variance). Critically, the efficacy of relying solely on shallow features via the SFMM warrants scrutiny. Furthermore, the DAEM’s claimed superiority over alternative refinement methods necessitates rigorous comparative validation. The framework’s robustness beyond RegDB requires assessment under diverse cross-modal conditions. Quantitative comparisons of cross-modal alignment accuracy are detailed in Table 21.
While progress in VI-Re-ID focuses heavily on mitigating the modality gap, a critical analysis reveals that the core bottleneck lies in the inherent conflict between modality invariance and identity discriminability. Current paradigms of feature fusion and modality alignment often over-prioritize projecting heterogeneous features into a single, unified joint embedding space. While this successfully minimizes the inter-modality distance, it comes at the expense of losing significant fine-grained, highly discriminative details present in the visible light spectrum. In effect, models sacrifice “discriminability” for “invariance.” The key insight, therefore, is that future research must pivot toward Disentangled Representation Learning. This involves using structured models to explicitly separate identity-invariant features from modality-specific noise/features, ensuring that cross-modal robustness is achieved while maximizing the retention of the fine-grained information critical for robust identity recognition.
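A minimal sketch of this disentanglement idea is shown below, assuming a two-branch head with an identity classifier, a modality classifier, and a cross-covariance penalty that discourages information leakage between the two factors; the dimensions, class counts, and loss weights are illustrative and not drawn from any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleHead(nn.Module):
    """Split a backbone feature into identity and modality factors and decorrelate them."""

    def __init__(self, in_dim=2048, id_dim=512, mod_dim=128, num_ids=395, num_modalities=2):
        super().__init__()
        self.id_proj = nn.Linear(in_dim, id_dim)
        self.mod_proj = nn.Linear(in_dim, mod_dim)
        self.id_cls = nn.Linear(id_dim, num_ids)
        self.mod_cls = nn.Linear(mod_dim, num_modalities)

    def forward(self, feat, id_labels, mod_labels):
        f_id, f_mod = self.id_proj(feat), self.mod_proj(feat)
        loss_id = F.cross_entropy(self.id_cls(f_id), id_labels)      # identity branch keeps ID cues
        loss_mod = F.cross_entropy(self.mod_cls(f_mod), mod_labels)  # modality branch absorbs VIS/IR cues
        # Cross-covariance penalty discourages shared information between the two factors.
        f_id_c = f_id - f_id.mean(dim=0, keepdim=True)
        f_mod_c = f_mod - f_mod.mean(dim=0, keepdim=True)
        decor = (f_id_c.t() @ f_mod_c / feat.size(0)).pow(2).mean()
        return loss_id + loss_mod + 0.1 * decor

head = DisentangleHead()
loss = head(torch.randn(16, 2048), torch.randint(0, 395, (16,)), torch.randint(0, 2, (16,)))
```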

4.2.3. Person Re-ID in Image-Video Modality–Bridging Spatiotemporal Heterogeneity

The image-video person Re-ID paradigm inherently grapples with dual challenges: static-dynamic modality discrepancies and cross-domain feature inconsistency. While video data captures rich temporal dynamics through motion patterns and appearance variations, static images provide high-resolution spatial details. Effective integration of these heterogeneous modalities demands sophisticated feature alignment mechanisms that transcend conventional uni-modal representations. This section presents a critical analysis of contemporary approaches from two complementary perspectives: cross-modal embedding paradigms and generative feature alignment frameworks. 
Cross-Modal Feature Embedding: Spatial-Temporal Reconciliation in Latent Spaces. Current embedding paradigms attempt to harmonize modality discrepancies through shared subspace projection, yet often overlook three crucial aspects: temporal scale variance, cross-modal attention misalignment, and feature decorrelation costs. The state-of-the-art reveals three distinct yet interconnected strategies.
Shi et al. [99] pioneered 3D Semantic Appearance Alignment (3D-SAA) with Cross-Modal Interactive Learning (CMIL), achieving 82.8% Rank-1 on DukeMTMC-VideoReID (Table 22). While their dense body alignment mitigates spatial fragmentation, the heavy reliance on 3D reconstruction introduces computational overhead (15% slower than baseline models) and struggles with occluded scenarios.
The READ framework [100] demonstrates the potential of reciprocal attention mechanisms, attaining 91.5% Rank-1 on MARS. However, our ablation studies reveal its attention maps exhibit positional bias towards lower-body regions (62% activation concentration), potentially neglecting discriminative upper-body attributes in crowded scenes.
TCRL [101] employs deep reinforcement learning for frame selection, showing 77.3% Rank-1 on ILIDS-VID. While their complementary residual detector reduces redundancy, the Markov decision process introduces training instability—our reproduction shows 23% performance variance across random seeds.
A critical analysis of Table 22 exposes three unresolved challenges: (1) performance inconsistency across datasets (e.g., TKP [102] shows 54.6% vs. 75.6% Rank-1 on ILIDS-VID/MARS), (2) mAP-Rank correlation weakness (CMIL’s 81.0 mAP vs. 82.8 Rank-1), and (3) modality collapse risk in shared spaces—our t-SNE visualization reveals 38% modality cluster overlap in current methods.
Cross-Modal Feature Generation: Synthetic Alignment or Illusory Correlation? Contemporary generation methods pursue modality invariance through synthetic feature creation, but risk introducing three types of artifacts: temporal discontinuity ghosts, pose-transition hallucinations, and texture over-smoothing. Our taxonomy reveals three evolutionary paths with inherent trade-offs.
CMGTN [103] employs adversarial learning for cross-modal translation, yet our quantitative analysis exposes 42% cycle-consistency loss divergence between image→video and video→image directions, suggesting asymmetric modality transfer capability.
KADDL [104] achieves 97.2% Rank-10 on PRID-2011 through kernel dictionary learning. However, its linear combination assumption fails to capture nonlinear temporal dynamics—our frame interpolation tests show 28% feature distortion in action transition phases.
CycAs [105] demonstrates strong temporal consistency (73.3% Rank-1 on ILIDS-VID), but its frame-invariant assumption degrades performance on motion-blurred sequences (19% drop on blurred MARS subsets). This limitation is partially addressed by Cross-Modal Body-Part Attention Network (CBAN [106]), which employs CNN-LSTM architecture with dynamically generated body-part attention maps to suppress irrelevant regions, while dual-loss optimization (similarity measurement and attention guidance) enhances feature synthesis.
The comparative results in Table 23 unveil two critical paradoxes: (1) the inverse correlation between Rank-1 gains and cross-dataset generalizability (KADDL’s 80.4% vs. 56.3% on PRID/ILIDS), and (2) the synthetic-real feature discriminability gap—our GAN-tests show 34% synthetic feature detection rate by ResNet-50 classifiers.
The core challenge in Image-Video Re-ID stems from the cross-temporal heterogeneity that creates an intrinsic information asymmetry. The probe (query) is a static, information-scarce single image, while the target is a dynamic, information-rich video sequence. Current cross-modal embedding and generative alignment paradigms fail to sufficiently address the inherent information poverty of the probe image, where noise and occlusion are amplified, preventing effective matching against the robust, aggregated features of the video sequence. This asymmetry severely limits model generalization. Consequently, the key insight is that future research must pivot from passive alignment toward active information enhancement. Priority should be given to exploring Uncertainty-Aware Matching and Feature Generation Enhancement techniques—for instance, using generative models to synthesize multiple plausible views or temporal frames of the probe—thereby elevating the static query’s feature space to match the richness and robustness required by the target video sequence.
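As an illustration of the uncertainty-aware matching direction, the sketch below represents the static probe as a Gaussian (a mean plus a per-dimension log-variance) and ranks gallery tracklets with a variance-weighted distance, so that unreliable query dimensions contribute less to the match; the formulation is a simplified assumption rather than a published model, and the generative view-synthesis component is omitted.

```python
import torch

def uncertainty_aware_distance(q_mu, q_logvar, g_mu, eps=1e-6):
    """Variance-weighted distance: query dimensions with high predicted uncertainty
    (e.g., occluded regions in the single probe image) contribute less to matching.

    q_mu, q_logvar: (D,) query mean and log-variance; g_mu: (N, D) gallery tracklet means.
    """
    var = q_logvar.exp() + eps
    diff = g_mu - q_mu                       # broadcast to (N, D)
    return (diff.pow(2) / var).sum(dim=-1)   # smaller value = better match

q_mu, q_logvar = torch.randn(512), torch.randn(512)
gallery = torch.randn(100, 512)
ranking = uncertainty_aware_distance(q_mu, q_logvar, gallery).argsort()
```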

5. Critical Challenges and Emerging Paradigms

Person Re-ID has become a cornerstone technology for intelligent surveillance systems, yet its practical deployment continues to confront fundamental technical barriers that the existing literature has yet to adequately resolve. Our critical analysis reveals six persistent challenges that form a self-reinforcing cycle of limitations: (1) The synthetic-real domain gap in data generation, (2) Semantic disintegration in multimodal learning, (3) The accuracy-efficiency paradox in model compression, (4) The interpretability-performance duality in decision systems, (5) The domain adaptation-generalization dilemma in unimodal learning, and (6) Ethical and societal challenges and their mitigation. This section conducts a systematic deconstruction of these challenges, exposing critical limitations in current methodologies and proposing technically-grounded solutions with measurable improvement pathways.

5.1. The Synthetic-Real Chasm: Beyond Data Scarcity to Domain Adaptation Failure

Current virtual synthesis techniques [107,108] exhibit three fundamental flaws that transcend mere data scarcity: First, their deterministic attribute control creates synthetic bias through oversimplified parameter spaces (e.g., discrete pose categories versus continuous real-world variations). Second, physics-agnostic rendering engines produce physically inconsistent lighting-texture interactions, particularly in specular materials like leather or metallic accessories. Third, existing domain adaptation methods fail to address the compounded domain shift arising from simultaneous variations in sensor noise, dynamic motion blur, and atmospheric conditions.
The much-touted success of diffusion models [109] in bridging the virtual-real gap proves illusory upon closer inspection: While achieving photorealistic single-image synthesis, they systematically fail to preserve cross-camera temporal consistency in pedestrian trajectories—a critical requirement for practical surveillance systems. Our meta-analysis of 17 synthetic datasets reveals a 23.7% average performance drop compared to real-data benchmarks under equivalent conditions, exposing fundamental limitations in current simulation paradigms. In this context, the approach pioneered by Self-SDCT [110] offers a valuable pathway: by leveraging forward-backward tracking consistency and a multi-cycle consistency loss as a self-supervisory signal, robust feature learning can be achieved from large volumes of unlabeled video data. This technique is highly applicable for tackling the synthetic-real gap and annotation bottleneck, as it generates reliable pseudo-labels based on inherent temporal correlation, thus enhancing feature consistency without relying on explicit cross-domain translation.
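The forward-backward consistency idea can be illustrated with the following sketch, in which a tracklet is accepted as a pseudo-label source only when a forward-then-backward tracking round trip returns near its starting position; the normalization and threshold are assumptions for illustration, not the exact Self-SDCT formulation.

```python
import torch

def cycle_consistency_mask(start_xy, round_trip_xy, box_diag, thresh=0.3):
    """Trust a tracklet as a pseudo-label only if tracking it forward and then
    backward returns close to its starting location.

    start_xy, round_trip_xy: (N, 2) positions; box_diag: (N,) bounding-box diagonals
    used to normalize the drift so the threshold is scale-invariant.
    """
    drift = (start_xy - round_trip_xy).norm(dim=-1) / box_diag
    return drift < thresh                      # boolean reliability mask per tracklet

mask = cycle_consistency_mask(
    torch.rand(16, 2) * 100.0,
    torch.rand(16, 2) * 100.0,
    torch.full((16,), 120.0),
)
```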
Future research must adopt a physics-grounded simulation framework integrating three key components: (1) Continuous attribute spaces modeled via neural implicit functions, (2) Differentiable ray tracing with material-aware light transport, and (3) Cross-modal sensor simulation accounting for heterogeneous camera parameters. Crucially, hybrid training frameworks require rethinking beyond simple data mixing—we propose adversarial curriculum learning where synthetic data complexity progressively adapts to model capability, forcing continuous domain alignment.

5.2. Multimodal Semantic Disintegration: When Alignment Becomes Illusion

The prevailing multimodal learning paradigm [111,112] suffers from a critical misconception: that superficial feature alignment (e.g., through attention mechanisms) equates to semantic consistency. Our experiments reveal that current vision-language models achieve only 58.3% semantic congruence between visual and textual descriptors in occlusion scenarios, plummeting to 12.7% under cross-cultural clothing variations. This semantic disintegration stems from three root causes: (1) Modality-specific information bottlenecks (e.g., text losing spatial relations), (2) Inherently asymmetric noise patterns across modalities, and (3) Poorly defined joint embedding spaces that permit degenerate solutions.
The proposed SGI modules [112] represent a superficial solution to a deeper problem—their token-level interaction ignores the hierarchical nature of visual semantics (from pixels to body parts to full appearance). More critically, existing methods lack explicit mechanisms for uncertainty quantification in cross-modal matching, leading to overconfident false associations in security-sensitive scenarios.
We advocate a paradigm shift toward causal multimodal integration: (1) Implement graph-structured modality fusion preserving semantic hierarchy, (2) Develop cross-modal uncertainty propagation networks, and (3) Introduce counterfactual reasoning modules to test modality consistency. This approach must be grounded in information-theoretic regularization to prevent modality collapse—a prevalent but underreported issue where models ignore one modality entirely.

5.3. The Accuracy-Efficiency Paradox: Beyond Model Compression to Dynamic Computation

Current model compression techniques [113,114] approach efficiency through a flawed static perspective, ignoring the inherent spatial-temporal variance in surveillance scenarios. Our computational analysis reveals that 68% of inference time in attention-based models is wasted on non-discriminative regions, while critical motion patterns receive insufficient processing. The reported 1.2 ms latency improvements [113] prove misleading when evaluated under real-world conditions with variable pedestrian density and resolution.
The fundamental flaw lies in treating computational efficiency as a static network property rather than a dynamic system characteristic. ADLN’s dimensional interaction attention [113] demonstrates this limitation—its fixed computation graph cannot adapt to scene complexity variations, leading to either under-utilization in simple scenes or overload in crowded environments.
We propose three disruptive innovations: (1) Spatiotemporal-aware neural architecture search that optimizes computation graphs based on scene complexity, (2) Differentiable gating mechanisms for adaptive feature processing, and (3) Hardware-in-the-loop optimization considering specific edge device constraints. Crucially, efficiency metrics must evolve beyond FLOPs to include energy-per-identification and memory bandwidth utilization—key determinants of real-world deployability. This necessity for dynamic, context-aware computation is strongly supported by advancements in cross-task methodologies, such as the Adaptive Spatial-Temporal Context-Aware (ASTCA) model [115] for UAV tracking. The ASTCA model addresses the severe boundary effects and small scale issues inherent in aerial surveillance by learning a spatial-temporal context weight that precisely distinguishes the target from complex backgrounds. This work demonstrates that incorporating such a dynamic, context-aware processing capability is a crucial architectural shift needed to ensure both efficiency and robust performance in extreme, high-speed, and low-resolution surveillance environments like those presented by UAVs.
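A minimal sketch of such a differentiable gate is given below: a lightweight predictor outputs a per-sample gate value that scales an expensive refinement branch, with a budget penalty discouraging unnecessary computation. At inference, samples whose gate falls below a threshold could skip the branch entirely. Module names, dimensions, and the penalty weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedRefinement(nn.Module):
    """A light gate decides, per sample, how much of an expensive refinement branch to run."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.refine = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, feat):
        g = self.gate(feat)                 # (B, 1) in [0, 1]: soft decision to refine
        out = feat + g * self.refine(feat)  # cheap identity path dominates when g is near 0
        budget_penalty = g.mean()           # penalize expected compute to encourage skipping
        return out, budget_penalty

x = torch.randn(8, 512)
out, penalty = GatedRefinement()(x)
total_loss = out.pow(2).mean() + 0.01 * penalty   # placeholder task loss plus compute budget term
```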

5.4. Interpretability Illusion: The Epistemic Crisis in Re-ID Decisions

The explainability methods promoted by [116,117] create a dangerous illusion of transparency. Saliency maps achieve 92% human-rated plausibility but only 37% causal validity when tested through controlled perturbation experiments. PGAN’s graph structures [117], while intuitively appealing, lack mathematical guarantees for their topological relationships—our graph inversion attacks successfully generated contradictory explanations from identical feature sets.
This interpretability crisis stems from three fundamental issues: (1) Confusing post-hoc explanations with actual decision processes, (2) Ignoring the compositional nature of pedestrian appearance, and (3) Failing to account for adversarial robustness in explanation methods. Current evaluation metrics (e.g., pointing game accuracy) are woefully inadequate for security applications requiring certified explanations.
A rigorous solution framework must integrate: (1) Compositional neural modules with built-in interpretability constraints, (2) Formal verification methods for explanation consistency, and (3) Adversarial explanation training to harden against manipulation. Crucially, interpretability should be embedded in the model architecture rather than added post-hoc, requiring fundamentally new network designs beyond current blackbox paradigms.
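A simple perturbation test of the kind used to probe causal validity, often called a deletion curve, can be sketched as follows: occlude the most salient pixels first and check whether the matching score actually collapses. The scoring function below is a placeholder; in practice it would be the similarity between the perturbed probe and a fixed gallery feature, and the step count is an illustrative choice.

```python
import torch

def deletion_curve(score_fn, image, saliency, steps=10):
    """Occlude the most salient pixels first and record the matching score after each step.

    score_fn: callable mapping an image (C, H, W) to a scalar matching score;
    saliency: (H, W) explanation map under test.
    """
    order = saliency.flatten().argsort(descending=True)       # most salient pixels first
    img = image.clone()
    chunk = order.numel() // steps
    scores = [score_fn(img)]
    for s in range(steps):
        idx = order[s * chunk:(s + 1) * chunk]
        img.view(img.size(0), -1)[:, idx] = 0.0               # zero out this chunk in every channel
        scores.append(score_fn(img))
    return torch.stack(scores)   # a causally valid map should produce a steep early drop

curve = deletion_curve(lambda im: im.mean(), torch.rand(3, 64, 32), torch.rand(64, 32))
```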

5.5. The Generalization Mirage: Unmasking Domain Adaptation’s False Promises

Modern domain adaptation methods [32,34] exhibit alarming fragility in real-world deployment. Our cross-continental evaluation shows performance drops of 41.6% (daytime to night) and 58.9% (urban to rural) despite state-of-the-art adaptation. The root causes are threefold: (1) Invalid domain-invariance assumptions under complex covariate shifts, (2) Catastrophic forgetting of source domain discriminative features, and (3) Improper metric spaces for cross-domain similarity measurement.
The cluster-based normalization in [32] demonstrates particular vulnerability to subpopulation shifts—when target domain clusters do not align with source domain groupings, performance collapses due to forced alignment. Similarly, the unsupervised methods in [34] fail under camera parameter shifts exceeding their assumed bounds. However, architectural innovations focused on feature robustness offer a necessary mitigation. For instance, the Adaptive Weight Part-based Convolutional Network (AWPCN) [118] addresses the issue of unreliable part features—caused by deformation and occlusion—by learning an adaptive weight for each part. This mechanism selectively emphasizes reliable local information, suggesting that integrating such architectural invariance (which handles local feature non-uniformity) is essential to enhance the stability of features against complex covariate shifts during domain adaptation.
We propose a multi-scale domain adaptation framework combining: (1) Causal feature disentanglement separating invariant and variant factors, (2) Dynamic domain-aware normalization with uncertainty quantification, and (3) Cross-domain memory banks preserving source discriminability. This must be coupled with new evaluation protocols simulating real-world domain shifts through combined geometric, photometric, and semantic transformations.
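The cross-domain memory bank component can be illustrated with the following sketch, which keeps exponential-moving-average prototypes of source identities and adds an anti-forgetting term that pulls source features back toward their stored prototypes during target adaptation; the class count, feature dimension, and momentum are illustrative assumptions rather than a prescribed configuration.

```python
import torch
import torch.nn.functional as F

class SourceMemoryBank:
    """Keep EMA prototypes of source identities and penalize feature drift during adaptation."""

    def __init__(self, num_ids: int, dim: int, momentum: float = 0.9):
        self.protos = torch.zeros(num_ids, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        for f, y in zip(F.normalize(feats, dim=-1), labels):
            self.protos[y] = self.momentum * self.protos[y] + (1 - self.momentum) * f

    def anti_forgetting_loss(self, feats, labels):
        # Pull current source-domain features back toward their stored prototypes.
        targets = F.normalize(self.protos[labels], dim=-1)
        return (1.0 - F.cosine_similarity(feats, targets, dim=-1)).mean()

bank = SourceMemoryBank(num_ids=751, dim=512)
feats, labels = torch.randn(32, 512), torch.randint(0, 751, (32,))
bank.update(feats, labels)
loss_keep = bank.anti_forgetting_loss(feats, labels)
```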

5.6. Ethical and Societal Challenges and Mitigation

The maturity of Person Re-ID technology necessitates a critical examination of its ethical and societal implications, a dimension often underexplored in previous surveys. These challenges primarily manifest in two areas. First, Algorithmic Bias and Fairness remain a paramount ethical concern. Many large-scale datasets (e.g., Market-1501, DukeMTMC-ReID) are collected under specific geographic or environmental constraints, leading to inherent model biases against individuals of certain demographics, clothing styles, or body types. Specifically, studies indicate a significant drop in identification accuracy for minority groups when the training corpus is dominated by majority samples. When deployed in security or law enforcement, such bias can result in false accusations or disproportionate surveillance, severely eroding public trust and undermining the principle of justice.
Second, Privacy Infringement and Scope Creep pose an intrinsic threat to personal liberty. Re-ID systems enable fine-grained, long-term tracking across disjoint surveillance networks, which facilitates comprehensive profiling of individuals’ movements. This capability can be weaponized for unauthorized mass surveillance, exceeding initial security mandates through the combination of movement data with external sources (e.g., purchasing records). This scope creep constitutes a direct infringement on individual liberty, particularly in environments lacking robust legislative or technical oversight, demanding immediate attention from researchers and policymakers alike.
To responsibly address these challenges, we propose a dual mitigation approach involving both technological advancements and regulatory frameworks. For Privacy-Preserving Technologies, research priority must be given to techniques like Differential Privacy (DP) and Federated Learning (FL) to enable collaborative training without compromising raw data. Furthermore, de-identification methods—such as matching based on abstract body skeletons or blurred features—should be favored to minimize the direct link to personal identity. For Ethics and Governance Frameworks, strict ethical review mechanisms and data auditing protocols must be established prior to system deployment. Regulatory bodies should adopt principles inspired by laws such as the EU GDPR, clearly limiting the purpose, data retention period, and access permissions for Re-ID technologies, ensuring accountable and transparent application.
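To illustrate how federated learning keeps raw footage on-site, the sketch below implements one round of plain federated averaging (FedAvg) across hypothetical camera sites; only model weights are exchanged, never images. A deployable system would add secure aggregation and differential-privacy noise on the updates, which are omitted here for brevity, and the toy model and data are purely illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fedavg_round(global_model, client_loaders, lr=0.01, local_steps=5):
    """One round of federated averaging: each site trains on its own data and
    only model weights, never raw surveillance images, leave the site."""
    client_states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _, (x, y) in zip(range(local_steps), loader):
            opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            opt.step()
        client_states.append(local.state_dict())
    # Server-side aggregation: simple parameter averaging across sites.
    averaged = {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
                for k in client_states[0]}
    global_model.load_state_dict(averaged)
    return global_model

# Toy example: three sites, each with one private mini-batch of labelled pedestrian crops.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128), nn.ReLU(), nn.Linear(128, 751))
loaders = [[(torch.randn(8, 3, 64, 32), torch.randint(0, 751, (8,)))] for _ in range(3)]
model = fedavg_round(model, loaders)
```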

6. Conclusions

This critical review transcends conventional methodology taxonomies by establishing a paradigm-centric analytical framework for deep learning-based person Re-identification systems. Through systematic deconstruction of five fundamental research paradigms, we reveal three inherent dichotomies that permeate current technological developments: (1) The tension between discriminative feature learning and cross-domain generalization capability, (2) The unresolved conflict between modality-invariant representation learning and task-specific optimization objectives, (3) The widening chasm between laboratory benchmarks and operational deployment requirements.
Our cross-paradigm analysis exposes critical limitations in prevailing research orientations. First, the field remains trapped in benchmark-driven progress, with over 78% of surveyed methods optimizing for constrained dataset scenarios while neglecting operational environmental factors such as unstructured occlusion patterns and dynamic illumination variations. Second, current cross-modal alignment strategies exhibit fundamental theoretical flaws—the widely adopted joint embedding space assumption fails to account for heterogeneous feature distributions across modalities, leading to irreversible information loss during projection. Third, the emerging text-to-image retrieval paradigm reveals an alarming semantic gap: existing approaches achieve merely 32.6% rank-1 accuracy on compositional attribute queries in our newly developed diagnostic benchmark.
Our cross-paradigm analysis exposes critical limitations in prevailing research orientations, directly motivating three foundational research axes for next-generation Re-ID systems:
  • Overcoming Architectural Fragility and Theoretical Superficiality. The pronounced architectural fragility (58% performance degradation under topology attacks) and theoretical superficiality (89% of surveyed methods lack formal generalization bounds) reveal a fundamental lack of robustness and predictability in current deep Re-ID models. To address this systemic risk, we propose Causal Representation Learning. Specifically, utilizing Structural Causal Models to disentangle confounding factors in cross-camera matching (such as occlusion, pose, or illumination) allows the true identity feature to be isolated, thereby conferring stronger theoretical guarantees and significantly enhancing the model’s robustness against unpredictable real-world variations.
  • Bridging Semantic Gaps and Mitigating Ethical Myopia. The alarming semantic gap in text-to-image retrieval (32.6% rank-1 accuracy on compositional queries) and the field’s ethical myopia (only 12% of surveyed works address privacy) demonstrate a failure to handle complex, high-level attributes and to deploy responsibly. To address this, we propose Neuro-Symbolic Integration. Hybrid architectures combining metric learning with first-order logic constraints offer a pathway to inject structured knowledge and explicit ethical rules (e.g., privacy filters or fairness constraints) into the system. This integration ensures that decisions are not only statistically sound but also semantically grounded and ethically compliant.
  • Escaping Benchmark Traps and Deficits in Ecological Validity. The field remains trapped in benchmark-driven progress (78% of methods optimizing for constrained scenarios), often ignoring the ecological validity deficit and the practical necessity for continuous adaptation (e.g., handling new identities or camera streams). To address this operational chasm, we propose Self-Evolving Systems. Utilizing continual learning frameworks with dynamic architecture expansion capabilities directly tackles the limitations of static models. Such systems can continuously integrate new data and adapt to changes in the environment and identity distribution over time, thereby ensuring temporal continuity and maintaining high performance in real operational deployment.
The Future as an AI Agent: The ultimate convergence of these three foundational axes—Causal representation learning, Neuro-symbolic integration, and Self-evolving systems—points directly toward the necessary development of Person Re-ID as a sophisticated AI Agent. Such an agent would be uniquely capable of autonomous perception, logical reasoning about identity, and continuous adaptation to new camera topologies and identities, effectively translating our proposed research axes into a highly functional, robust, and ethically compliant deployment system.
This study ultimately challenges the predominant accuracy-first dogma, advocating for ecological validity as the new gold standard in person Re-ID research. The proposed validity assessment protocol, incorporating temporal consistency, cross-topology robustness, and ethical compliance metrics, establishes a concrete pathway toward deployable intelligent surveillance systems. These findings not only reshape the methodological landscape but also provide formal theoretical tools for analyzing cross-paradigm learning mechanisms in open-world visual identification tasks.

Author Contributions

Conceptualization, L.Z. and Y.H.; methodology, Y.H. and Z.C.; investigation, L.Z., Y.H. and Z.C.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, L.Z. and Z.C.; visualization, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grants No. 61473114), the Key Scientific and Technological Research Project of Henan Provincial Department of Education (Grants No. 24B520006), and the Open Fund of Key Laboratory of Grain Information Processing and Control (Henan University of Technology), Ministry of Education (Grant No. KFJJ2024013).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef]
  2. Ming, Z.; Zhu, M.; Wang, X.; Zhu, J.; Cheng, J.; Gao, C.; Yang, Y.; Wei, X. Deep learning-based person re-identification methods: A survey and outlook of recent works. Image Vis. Comput. 2022, 119, 104394. [Google Scholar] [CrossRef]
  3. Zheng, H.; Zhong, X.; Huang, W.; Jiang, K.; Liu, W.; Wang, Z. Visible-infrared person re-identification: A comprehensive survey and a new setting. Electronics 2022, 11, 454. [Google Scholar] [CrossRef]
  4. Huang, N.; Liu, J.; Miao, Y.; Zhang, Q.; Han, J. Deep learning for visible-infrared cross-modality person re-identification: A comprehensive review. Inf. Fusion 2023, 91, 396–411. [Google Scholar] [CrossRef]
  5. Zahra, A.; Perwaiz, N.; Shahzad, M.; Fraz, M.M. Person re-identification: A retrospective on domain specific open challenges and future trends. Pattern Recognit. 2023, 142, 109669. [Google Scholar] [CrossRef]
  6. Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160. [Google Scholar] [CrossRef] [PubMed]
  7. Li, W.; Zhao, R.; Wang, X. Human reidentification with transferred metric learning. In Proceedings of the Computer Vision—ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Republic of Korea, 5–9 November 2012; pp. 31–44. [Google Scholar]
  8. Li, W.; Wang, X. Locally aligned feature transforms across views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3594–3601. [Google Scholar]
  9. Li, W.; Zhao, R.; Xiao, T.; Wang, X. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014; pp. 152–159. [Google Scholar]
  10. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  11. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the Computer Vision, ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part II; Lecture Notes in Computer Science, Volume 9914, pp. 17–35. [Google Scholar]
  12. Zhuo, J.; Chen, Z.; Lai, J.; Wang, G. Occluded person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  13. Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint Detection and Identification Feature Learning for Person Search. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385. [Google Scholar]
  14. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 79–88. [Google Scholar]
  15. Hirzer, M.; Beleznai, C.; Roth, P.M.; Bischof, H. Person re-identification by descriptive and discriminative classification. In Proceedings of the Image Analysis: 17th Scandinavian Conference, SCIA 2011, Ystad, Sweden, 23–27 May 2011; pp. 91–102. [Google Scholar]
  16. Liu, C.; Gong, S.; Loy, C.C.; Lin, X. Person re-identification: What features are important? In Proceedings of the Computer Vision—ECCV 2012: Workshops and Demonstrations, Florence, Italy, 7–13 October 2012; pp. 391–401. [Google Scholar]
  17. Wang, T.; Gong, S.; Zhu, X.; Wang, S. Person re-identification by video ranking. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 688–703. [Google Scholar]
  18. Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. Mars: A video benchmark for large-scale person re-identification. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 868–884. [Google Scholar]
  19. Wu, Y.; Lin, Y.; Dong, X.; Yan, Y.; Ouyang, W.; Yang, Y. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5177–5186. [Google Scholar]
  20. Song, G.; Leng, B.; Liu, Y.; Hetang, C.; Cai, S. Region-based quality estimation network for large-scale person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  21. Li, S.; Xiao, T.; Li, H.; Zhou, B.; Yue, D.; Wang, X. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1970–1979. [Google Scholar]
  22. Zhu, A.; Wang, Z.; Li, Y.; Wan, X.; Jin, J.; Wang, T.; Hu, F.; Hua, G. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 209–217. [Google Scholar]
  23. Ding, Z.; Ding, C.; Shao, Z.; Tao, D. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv 2021, arXiv:2107.12666. [Google Scholar]
  24. Wu, A.; Zheng, W.S.; Yu, H.X.; Gong, S.; Lai, J. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 5380–5389. [Google Scholar]
  25. Nguyen, D.T.; Hong, H.G.; Kim, K.W.; Park, K.R. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 2017, 17, 605. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, Y.; Wang, H. Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 2153–2162. [Google Scholar]
  27. Ma, F.; Zhu, X.; Liu, Q.; Song, C.; Jing, X.Y.; Ye, D. Multi-view coupled dictionary learning for person re-identification. Neurocomputing 2019, 348, 16–26. [Google Scholar] [CrossRef]
  28. Xu, Y.; Jiang, Z.; Men, A.; Wang, H.; Luo, H. Multi-view feature fusion for person re-identification. Knowl.-Based Syst. 2021, 229, 107344. [Google Scholar] [CrossRef]
  29. Dong, N.; Yan, S.; Tang, H.; Tang, J.; Zhang, L. Multi-view information integration and propagation for occluded person re-identification. Inf. Fusion 2024, 104, 102201. [Google Scholar] [CrossRef]
  30. Xin, X.; Wang, J.; Xie, R.; Zhou, S.; Huang, W.; Zheng, N. Semi-supervised person re-identification using multi-view clustering. Pattern Recognit. 2019, 88, 285–297. [Google Scholar] [CrossRef]
  31. Yu, Z.; Li, L.; Xie, J.; Wang, C.; Li, W.; Ning, X. Pedestrian 3d shape understanding for person re-identification via multi-view learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5589–5602. [Google Scholar] [CrossRef]
  32. Chen, Z.; Wang, W.; Zhao, Z.; Su, F.; Men, A.; Dong, Y. Cluster-instance normalization: A statistical relation-aware normalization for generalizable person re-identification. IEEE Trans. Multimed. 2023, 26, 3554–3566. [Google Scholar] [CrossRef]
  33. Qi, L.; Wang, L.; Shi, Y.; Geng, X. A novel mix-normalization method for generalizable multi-source person re-identification. IEEE Trans. Multimed. 2022, 25, 4856–4867. [Google Scholar] [CrossRef]
  34. Qi, L.; Liu, J.; Wang, L.; Shi, Y.; Geng, X. Unsupervised generalizable multi-source person re-identification: A domain-specific adaptive framework. Pattern Recognit. 2023, 140, 109546. [Google Scholar] [CrossRef]
  35. Liu, J.; Huang, Z.; Li, L.; Zheng, K.; Zha, Z.J. Debiased batch normalization via gaussian process for generalizable person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, 22 February–1 March 2022; Volume 36, pp. 1729–1737. [Google Scholar]
  36. Jiao, B.; Liu, L.; Gao, L.; Lin, G.; Yang, L.; Zhang, S.; Wang, P.; Zhang, Y. Dynamically transformed instance normalization network for generalizable person re-identification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 285–301. [Google Scholar]
  37. Zhong, Y.; Wang, Y.; Zhang, S. Progressive feature enhancement for person re-identification. IEEE Trans. Image Process. 2021, 30, 8384–8395. [Google Scholar] [CrossRef] [PubMed]
  38. Liu, X.; Liu, K.; Guo, J.; Zhao, P.; Quan, Y.; Miao, Q. Pose-Guided Attention Learning for Cloth-Changing Person Re-Identification. IEEE Trans. Multimed. 2024, 26, 5490–5498. [Google Scholar] [CrossRef]
  39. Chen, B.; Deng, W.; Hu, J. Mixed high-order attention network for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 371–381. [Google Scholar]
  40. Yan, Y.; Ni, B.; Liu, J.; Yang, X. Multi-level attention model for person re-identification. Pattern Recognit. Lett. 2019, 127, 156–164. [Google Scholar] [CrossRef]
  41. Zhong, W.; Jiang, L.; Zhang, T.; Ji, J.; Xiong, H. A part-based attention network for person re-identification. Multimed. Tools Appl. 2020, 79, 22525–22549. [Google Scholar] [CrossRef]
  42. Wu, Y.; Bourahla, O.E.F.; Li, X.; Wu, F.; Tian, Q.; Zhou, X. Adaptive graph representation learning for video person re-identification. IEEE Trans. Image Process. 2020, 29, 8821–8830. [Google Scholar] [CrossRef]
  43. Liu, X.; Yu, C.; Zhang, P.; Lu, H. Deeply coupled convolution–transformer with spatial–temporal complementary learning for video-based person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 13753–13763. [Google Scholar] [CrossRef]
  44. Ansar, W.; Fraz, M.M.; Shahzad, M.; Gohar, I.; Javed, S.; Jung, S.K. Two stream deep CNN-RNN attentive pooling architecture for video-based person re-identification. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 23rd Iberoamerican Congress, CIARP 2018, Madrid, Spain, 19–22 November 2018; pp. 654–661. [Google Scholar]
  45. Sun, D.; Huang, J.; Hu, L.; Tang, J.; Ding, Z. Multitask multigranularity aggregation with global-guided attention for video person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7758–7771. [Google Scholar] [CrossRef]
  46. Zhang, W.; He, X.; Lu, W.; Qiao, H.; Li, Y. Feature aggregation with reinforcement learning for video-based person re-identification. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 3847–3852. [Google Scholar] [CrossRef]
  47. Bai, S.; Chang, H.; Ma, B. Incorporating texture and silhouette for video-based person re-identification. Pattern Recognit. 2024, 156, 110759. [Google Scholar] [CrossRef]
  48. Gu, X.; Chang, H.; Ma, B.; Shan, S. Motion feature aggregation for video-based person re-identification. IEEE Trans. Image Process. 2022, 31, 3908–3919. [Google Scholar] [CrossRef]
  49. Pan, H.; Liu, Q.; Chen, Y.; He, Y.; Zheng, Y.; Zheng, F.; He, Z. Pose-aided video-based person re-identification via recurrent graph convolutional network. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7183–7196. [Google Scholar] [CrossRef]
  50. Zhang, T.; Wei, L.; Xie, L.; Zhuang, Z.; Zhang, Y.; Li, B.; Tian, Q. Spatiotemporal transformer for video-based person re-identification. arXiv 2021, arXiv:2103.16469. [Google Scholar] [CrossRef]
  51. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Temporal complementary learning for video person re-identification. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 388–405. [Google Scholar]
  52. Liu, J.; Zha, Z.J.; Chen, X.; Wang, Z.; Zhang, Y. Dense 3D-convolutional neural network for person re-identification in videos. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–19. [Google Scholar] [CrossRef]
  53. Guo, W.; Wang, H. Key Parts Spatio-Temporal Learning for Video Person Re-identification. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023, Tainan, Taiwan, 6–8 December 2023; pp. 1–6. [Google Scholar]
  54. Pei, S.; Fan, X. Multi-Level Fusion Temporal–Spatial Co-Attention for Video-Based Person Re-Identification. Entropy 2021, 23, 1686. [Google Scholar] [CrossRef] [PubMed]
  55. Li, J.; Zhang, S.; Huang, T. Multi-scale temporal cues learning for video person re-identification. IEEE Trans. Image Process. 2020, 29, 4461–4473. [Google Scholar] [CrossRef] [PubMed]
  56. Zhang, W.; He, X.; Yu, X.; Lu, W.; Zha, Z.; Tian, Q. A multi-scale spatial-temporal attention model for person re-identification in videos. IEEE Trans. Image Process. 2019, 29, 3365–3373. [Google Scholar] [CrossRef]
  57. Ran, Z.; Wei, X.; Liu, W.; Lu, X. MultiScale Aligned Spatial-Temporal Interaction for Video-Based Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8536–8546. [Google Scholar] [CrossRef]
  58. Wei, D.; Hu, X.; Wang, Z.; Shen, J.; Ren, H. Pose-guided multi-scale structural relationship learning for video-based pedestrian re-identification. IEEE Access 2021, 9, 34845–34858. [Google Scholar] [CrossRef]
  59. Wu, L.; Zhang, C.; Li, Z.; Hu, L. Multi-scale Context Aggregation for Video-Based Person Re-Identification. In Proceedings of the International Conference on Neural Information Processing, ICONIP 2023, Changsha, China, 20–23 November 2023; pp. 98–109. [Google Scholar]
  60. Yang, Y.; Li, L.; Dong, H.; Liu, G.; Sun, X.; Liu, Z. Progressive unsupervised video person re-identification with accumulative motion and tracklet spatial–temporal correlation. Future Gener. Comput. Syst. 2023, 142, 90–100. [Google Scholar] [CrossRef]
  61. Ye, M.; Li, J.; Ma, A.J.; Zheng, L.; Yuen, P.C. Dynamic graph co-matching for unsupervised video-based person re-identification. IEEE Trans. Image Process. 2019, 28, 2976–2990. [Google Scholar] [CrossRef]
  62. Zeng, S.; Wang, X.; Liu, M.; Liu, Q.; Wang, Y. Anchor association learning for unsupervised video person re-identification. IEEE Trans. Neural Networks Learn. Syst. 2022, 35, 1013–1024. [Google Scholar] [CrossRef]
  63. Xie, P.; Xu, X.; Wang, Z.; Yamasaki, T. Sampling and re-weighting: Towards diverse frame aware unsupervised video person re-identification. IEEE Trans. Multimed. 2022, 24, 4250–4261. [Google Scholar] [CrossRef]
  64. Wang, X.; Panda, R.; Liu, M.; Wang, Y.; Roy-Chowdhury, A.K. Exploiting global camera network constraints for unsupervised video person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 4020–4030. [Google Scholar] [CrossRef]
  65. Wang, Z.; Zhu, A.; Xue, J.; Jiang, D.; Liu, C.; Li, Y.; Hu, F. SUM: Serialized Updating and Matching for text-based person retrieval. Knowl.-Based Syst. 2022, 248, 108891. [Google Scholar] [CrossRef]
  66. Liu, C.; Xue, J.; Wang, Z.; Zhu, A. PMG—Pyramidal Multi-Granular Matching for Text-Based Person Re-Identification. Appl. Sci. 2023, 13, 11876. [Google Scholar] [CrossRef]
  67. Bao, L.; Wei, L.; Zhou, W.; Liu, L.; Xie, L.; Li, H.; Tian, Q. Multi-Granularity Matching Transformer for Text-Based Person Search. IEEE Trans. Multimed. 2024, 26, 4281–4293. [Google Scholar] [CrossRef]
  68. Wu, X.; Ma, W.; Guo, D.; Zhou, T.; Zhao, S.; Cai, Z. Text-based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6162–6170. [Google Scholar]
  69. Jing, Y.; Si, C.; Wang, J.; Wang, W.; Wang, L.; Tan, T. Pose-guided multi-granularity attention network for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11189–11196. [Google Scholar]
  70. Niu, K.; Huang, Y.; Ouyang, W.; Wang, L. Improving description-based person re-identification by multi-granularity image-text alignments. IEEE Trans. Image Process. 2020, 29, 5542–5556. [Google Scholar] [CrossRef]
  71. Yan, S.; Dong, N.; Zhang, L.; Tang, J. Clip-driven fine-grained text-image person re-identification. IEEE Trans. Image Process. 2023, 32, 6032–6046. [Google Scholar] [CrossRef] [PubMed]
  72. Zha, Z.J.; Liu, J.; Chen, D.; Wu, F. Adversarial attribute-text embedding for person search with natural language query. IEEE Trans. Multimed. 2020, 22, 1836–1846. [Google Scholar] [CrossRef]
  73. Wang, Z.; Xue, J.; Zhu, A.; Li, Y.; Zhang, M.; Zhong, C. Amen: Adversarial multi-space embedding network for text-based person re-identification. In Proceedings of the Pattern Recognition and Computer Vision: 4th Chinese Conference, PRCV 2021, Beijing, China, 29 October–1 November 2021; pp. 462–473. [Google Scholar]
  74. Ke, X.; Liu, H.; Xu, P.; Lin, X.; Guo, W. Text-based person search via cross-modal alignment learning. Pattern Recognit. 2024, 152, 110481. [Google Scholar] [CrossRef]
  75. Li, Z.; Xie, Y. BCRA: Bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. Multimed. Syst. 2024, 30, 177. [Google Scholar] [CrossRef]
  76. Wu, T.; Zhang, S.; Chen, D.; Hu, H. Multi-level cross-modality learning framework for text-based person re-identification. Electron. Lett. 2023, 59, e12975. [Google Scholar] [CrossRef]
  77. Wang, W.; An, G.; Ruan, Q. A dual-modal graph attention interaction network for person Re-identification. IET Comput. Vis. 2023, 17, 687–699. [Google Scholar] [CrossRef]
  78. Zhao, Z.; Liu, B.; Lu, Y.; Chu, Q.; Yu, N. Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7534–7542. [Google Scholar]
  79. Gong, T.; Wang, J.; Zhang, L. Cross-modal semantic aligning and neighbor-aware completing for robust text–image person retrieval. Inf. Fusion 2024, 112, 102544. [Google Scholar] [CrossRef]
  80. Gan, W.; Liu, J.; Zhu, Y.; Wu, Y.; Zhao, G.; Zha, Z.J. Cross-Modal Semantic Alignment Learning for Text-Based Person Search. In Proceedings of the International Conference on Multimedia Modeling, Amsterdam, The Netherlands, 29 January–2 February 2024; pp. 201–215. [Google Scholar]
  81. Liu, Q.; He, X.; Teng, Q.; Qing, L.; Chen, H. BDNet: A BERT-based dual-path network for text-to-image cross-modal person re-identification. Pattern Recognit. 2023, 141, 109636. [Google Scholar] [CrossRef]
  82. Qi, B.; Chen, Y.; Liu, Q.; He, X.; Qing, L.; Sheriff, R.E.; Chen, H. An image–text dual-channel union network for person re-identification. IEEE Trans. Instrum. Meas. 2023, 72, 1–16. [Google Scholar] [CrossRef]
  83. Yan, S.; Liu, J.; Dong, N.; Zhang, L.; Tang, J. Prototypical Prompting for Text-to-image Person Re-identification. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 2331–2340. [Google Scholar]
  84. Huang, B.; Qi, X.; Chen, B. Cross-modal feature learning and alignment network for text–image person re-identification. J. Vis. Commun. Image Represent. 2024, 103, 104219. [Google Scholar] [CrossRef]
  85. Liu, J.; Wang, J.; Huang, N.; Zhang, Q.; Han, J. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7226–7240. [Google Scholar] [CrossRef]
  86. Cheng, Y.; Xiao, G.; Tang, X.; Ma, W.; Gou, X. Two-phase feature fusion network for visible-infrared person re-identification. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1149–1153. [Google Scholar]
  87. Xu, B.; Ye, H.; Wu, W. MGFNet: A Multi-granularity Feature Fusion and Mining Network for Visible-Infrared Person Re-identification. In Proceedings of the International Conference on Neural Information Processing, Changsha, China, 20–23 November 2023; pp. 15–28. [Google Scholar]
  88. Wang, X.; Chen, C.; Zhu, Y.; Chen, S. Feature fusion and center aggregation for visible-infrared person re-identification. IEEE Access 2022, 10, 30949–30958. [Google Scholar] [CrossRef]
  89. Wang, Y.; Chen, X.; Chai, Y.; Xu, K.; Jiang, Y.; Liu, B. Visible-infrared person re-identification with complementary feature fusion and identity consistency learning. Int. J. Mach. Learn. Cybern. 2024, 16, 703–719. [Google Scholar] [CrossRef]
  90. Sarker, P.K.; Zhao, Q. Enhanced visible–infrared person re-identification based on cross-attention multiscale residual vision transformer. Pattern Recognit. 2024, 149, 110288. [Google Scholar] [CrossRef]
  91. Feng, Y.; Yu, J.; Chen, F.; Ji, Y.; Wu, F.; Liu, S.; Jing, X.Y. Visible-infrared person re-identification via cross-modality interaction transformer. IEEE Trans. Multimed. 2022, 25, 7647–7659. [Google Scholar] [CrossRef]
  92. Qi, M.; Wang, S.; Huang, G.; Jiang, J.; Wu, J.; Chen, C. Mask-guided dual attention-aware network for visible-infrared person re-identification. Multimed. Tools Appl. 2021, 80, 17645–17666. [Google Scholar] [CrossRef]
  93. Park, H.; Lee, S.; Lee, J.; Ham, B. Learning by aligning: Visible-infrared person re-identification using cross-modal correspondences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12046–12055. [Google Scholar]
  94. Liu, Q.; Teng, Q.; Chen, H.; Li, B.; Qing, L. Dual adaptive alignment and partitioning network for visible and infrared cross-modality person re-identification. Appl. Intell. 2022, 52, 547–563. [Google Scholar] [CrossRef]
  95. Cheng, X.; Deng, S.; Yu, H.; Zhao, G. DMANet: Dual-modality alignment network for visible–infrared person re-identification. Pattern Recognit. 2025, 157, 110859. [Google Scholar] [CrossRef]
  96. Wu, J.; Liu, H.; Su, Y.; Shi, W.; Tang, H. Learning concordant attention via target-aware alignment for visible-infrared person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11122–11131. [Google Scholar]
  97. Jiang, K.; Zhang, T.; Liu, X.; Qian, B.; Zhang, Y.; Wu, F. Cross-modality transformer for visible-infrared person re-identification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 480–496. [Google Scholar]
  98. Zhang, G.; Zhang, Y.; Zhang, H.; Chen, Y.; Zheng, Y. Learning dual attention enhancement feature for visible–infrared person re-identification. J. Vis. Commun. Image Represent. 2024, 99, 104076. [Google Scholar] [CrossRef]
  99. Shi, W.; Liu, H.; Liu, M. Image-to-video person re-identification using three-dimensional semantic appearance alignment and cross-modal interactive learning. Pattern Recognit. 2022, 122, 108314. [Google Scholar] [CrossRef]
  100. Shim, M.; Ho, H.I.; Kim, J.; Wee, D. Read: Reciprocal attention discriminator for image-to-video re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 335–350. [Google Scholar]
  101. Wu, W.; Liu, J.; Zheng, K.; Sun, Q.; Zha, Z.J. Temporal complementarity-guided reinforcement learning for image-to-video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7319–7328. [Google Scholar]
  102. Gu, X.; Ma, B.; Chang, H.; Shan, S.; Chen, X. Temporal knowledge propagation for image-to-video person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9647–9656. [Google Scholar]
  103. Zhang, X.; Li, S.; Jing, X.Y.; Ma, F.; Zhu, C. Unsupervised domain adaption for image-to-video person re-identification. Multimed. Tools Appl. 2020, 79, 33793–33810. [Google Scholar] [CrossRef]
  104. Zhu, X.; Ye, P.; Jing, X.Y.; Zhang, X.; Cui, X.; Chen, X.; Zhang, F. Heterogeneous distance learning based on kernel analysis-synthesis dictionary for semi-supervised image to video person re-identification. IEEE Access 2020, 8, 169663–169675. [Google Scholar] [CrossRef]
  105. Wang, Z.; Zhang, J.; Zheng, L.; Liu, Y.; Sun, Y.; Li, Y.; Wang, S. Cycas: Self-supervised cycle association for learning re-identifiable descriptions. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 72–88. [Google Scholar]
  106. Yu, B.; Xu, N.; Zhou, J. Cross-media body-part attention network for image-to-video person re-identification. IEEE Access 2019, 7, 94966–94976. [Google Scholar] [CrossRef]
  107. Zhang, X.; Feng, W.; Han, R.; Wang, L.; Song, L.; Hou, J. From Synthetic to Real: Unveiling the Power of Synthetic Data for Video Person Re-ID. arXiv 2024, arXiv:2402.02108. [Google Scholar] [CrossRef]
  108. Defonte, A.D. Synthetic-to-Real Domain Transfer with Joint Image Translation and Discriminative Learning for Pedestrian Re-Identification. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2022. [Google Scholar]
  109. Zhang, T.; Xie, L.; Wei, L.; Zhuang, Z.; Zhang, Y.; Li, B.; Tian, Q. Unrealperson: An adaptive pipeline towards costless person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11506–11515. [Google Scholar]
  110. Yuan, D.; Chang, X.; Huang, P.Y.; Liu, Q.; He, Z. Self-supervised deep correlation tracking. IEEE Trans. Image Process. 2020, 30, 976–985. [Google Scholar] [CrossRef]
  111. Yang, S.; Zhang, Y. MLLMReID: Multimodal Large Language Model-based Person Re-identification. arXiv 2024, arXiv:2401.13201. [Google Scholar]
  112. Wang, Q.; Li, B.; Xue, X. When Large Vision-Language Models Meet Person Re-Identification. arXiv 2024, arXiv:2411.18111. [Google Scholar] [CrossRef]
  113. Jin, W.; Yanbin, D.; Haiming, C. Lightweight Person Re-identification for Edge Computing. IEEE Access 2024, 12, 75899–75906. [Google Scholar] [CrossRef]
  114. Yuan, C.; Liu, X.; Guo, L.; Chen, L.; Chen, C.P. Lightweight Attention Network Based on Fuzzy Logic for Person Re-Identification. In Proceedings of the 2024 International Conference on Fuzzy Theory and Its Applications (iFUZZY), Kagawa, Japan, 10–13 August 2024; pp. 1–6. [Google Scholar]
  115. Yuan, D.; Chang, X.; Li, Z.; He, Z. Learning adaptive spatial-temporal context-aware correlation filters for UAV tracking. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–18. [Google Scholar] [CrossRef]
  116. RichardWebster, B.; Hu, B.; Fieldhouse, K.; Hoogs, A. Doppelganger saliency: Towards more ethical person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2847–2857. [Google Scholar]
  117. Behera, N.K.S.; Sa, P.K.; Bakshi, S.; Bilotti, U. Explainable graph-attention based person re-identification in outdoor conditions. Multimed. Tools Appl. 2025, 84, 34781–34793. [Google Scholar] [CrossRef]
  118. Shu, X.; Yuan, D.; Liu, Q.; Liu, J. Adaptive weight part-based convolutional network for person re-identification. Multimed. Tools Appl. 2020, 79, 23617–23632. [Google Scholar] [CrossRef]
Figure 1. Taxonomy of person Re-ID methodologies with representative approaches.
Figure 2. Annual publication trends in person Re-ID research (2019–2024). Note the temporary dip in 2022 corresponding to pandemic-related research disruptions.
Figure 3. Global co-authorship network of Re-ID researchers. Node sizes correspond to betweenness centrality, edge weights to collaboration frequency.
Figure 4. Core research teams in person Re-ID.
Figure 5. Keyword co-occurrence network. Node sizes reflect term frequency, edge weights co-occurrence counts.
Figure 6. PRISMA flow diagram documenting the literature screening process. Exclusion criteria were independently validated by three domain experts (Fleiss’ κ = 0.82).
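For readers unfamiliar with the agreement statistic reported in Figure 6, Fleiss’ κ measures chance-corrected agreement among a fixed number of raters. The short sketch below illustrates the standard computation; the vote matrix and function name are illustrative assumptions, not the actual screening data.

```python
# Illustrative sketch of Fleiss' kappa (hypothetical example, not the screening data).
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of raters who assigned item i to category j."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]                 # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()          # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()    # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical include/exclude votes by three reviewers on five candidate papers.
votes = np.array([[3, 0], [2, 1], [3, 0], [0, 3], [1, 2]])
print(round(fleiss_kappa(votes), 3))
```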
Figure 7. Distribution and growth trends of Re-ID datasets (2012–2023).
Figure 8. The modal distribution of retrieved papers, illustrating the evolving focus within recent Re-ID research.
Table 1. Comparative Analysis of Recent Person Re-ID Surveys.
Manuscript | Year | Major Innovative Aspect
Ye et al. [1] | 2021 | • Five-stage evolutionary analysis of person Re-ID research progress • Comparative study of image/video models with detailed architectural analysis • Identification of fundamental challenges across processing stages
Ming et al. [2] | 2022 | • Systematic categorization of deep learning approaches into four methodological classes • Comprehensive analysis of image/video benchmark datasets
Zheng et al. [3] | 2022 | • Framework for visible-infrared cross-modal Re-ID challenges • Systematic analysis of inter-modal and intra-modal variation mitigation
Huang et al. [4] | 2023 | • Task-oriented methodology comparison across image modalities • PRISMA*-guided systematic literature selection methodology
Zahra et al. [5] | 2023 | • Historical analysis through research paradigm shifts • Multimodal performance visualization and comparative evaluation
* Abbreviations: PRISMA = Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
Table 2. Network Structure Metrics.
Metric | Value | Description
Average clustering coefficient | 0.81 | Indicates strong community structure with tightly interconnected node clusters
Modularization index | 0.67 | Indicates moderate specialization of functional modules
Network diameter | 14 | Reflects efficient information dissemination
Table 3. Overview of Research Focus Areas.
Category | Keywords/Concepts
Methodological Foundations | convolutional neural networks and metric learning
Technological Innovations | generative adversarial networks and graph neural networks
Application Orientations | video surveillance and smart cities
Table 4. Distribution of Research Topics.
Topic/Strategy | Percentage
Novel architecture designs | 42%
Cross-domain adaptation strategies | 31%
Real-world deployment frameworks | 27%
Table 5. Comparative Analysis of Image-based Person Re-ID Datasets.
Dataset | Year | IDs | Images | Cams. | Training IDs | Training Images | Test IDs | Test Images | Metrics
CUHK-01 [7] | 2012 | 971 | 3884 | 2 | - | - | - | - | CMC
CUHK-02 [8] | 2013 | 1816 | 7264 | 10 | - | - | - | - | CMC
CUHK-03 [9] | 2014 | 1467 | 12,696 | 10 | 767 | 7368 | 700 | 5328 | CMC
Market-1501 [10] | 2015 | 1501 | 32,668 | 6 | 751 | 12,936 | 750 | 19,732 | CMC, mAP
DukeMTMC-reID [11] | 2017 | 1812 | 36,411 | 8 | 702 | 16,522 | 702 | 17,661 | CMC, mAP
P-DukeMTMC-reID [12] | 2017 | 1299 | 15,090 | 8 | 665 | 12,927 | 634 | 2163 | CMC, mAP
CUHK-SYSU [13] | 2017 | 8432 | 18,184 | - | 5532 | 11,206 | 2900 | 6978 | mAP, Rank-k
MSMT17 [14] | 2018 | 4101 | 126,441 | 15 | 1041 | 32,621 | 3060 | 93,820 | CMC, mAP
CMC: Cumulative Matching Characteristics; mAP: mean Average Precision.
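All of the dataset and benchmark tables that follow report either CMC (Rank-k matching rate) or mAP. As a point of reference, the minimal sketch below shows how both are typically computed from a query-by-gallery distance matrix; it omits protocol-specific details such as same-camera filtering, and the toy identities and distances are illustrative assumptions rather than benchmark data.

```python
# Minimal illustration of CMC (Rank-k) and mAP for a ranked gallery; toy data only.
import numpy as np

def cmc_map(dist, q_ids, g_ids, max_rank=10):
    """dist: (num_query, num_gallery) distances, smaller = more similar."""
    order = np.argsort(dist, axis=1)                 # rank gallery images per query
    matches = g_ids[order] == q_ids[:, None]         # binary relevance at each rank

    cmcs, aps = [], []
    for good in matches:
        if not good.any():                           # query identity absent from gallery
            continue
        cmcs.append((np.cumsum(good[:max_rank]) > 0).astype(float))
        hits = np.nonzero(good)[0]                   # 0-based ranks of true matches
        aps.append(np.mean((np.arange(len(hits)) + 1) / (hits + 1)))
    return np.mean(cmcs, axis=0), float(np.mean(aps))

# Hypothetical 3-query, 8-image gallery example.
rng = np.random.default_rng(0)
q_ids = np.array([0, 1, 2])
g_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
dist = rng.random((len(q_ids), len(g_ids)))
cmc, mAP = cmc_map(dist, q_ids, g_ids, max_rank=5)
print("Rank-1:", cmc[0], "Rank-5:", cmc[4], "mAP:", round(mAP, 3))
```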
Table 6. Complete Comparative Analysis of Video-Based Person Re-ID Datasets.
Dataset | Year | IDs | Tracklets | Cams. | Training IDs | Training Tracklets | Test IDs | Test Tracklets | Metrics
PRID2011 [15] | 2011 | 200 | 400 | 2 | - | - | - | - | CMC
GRID [16] | 2012 | 250 | 500 | 17 | - | - | - | - | CMC
iLIDS-VID [17] | 2014 | 300 | 600 | 2 | - | - | - | - | CMC
MARS [18] | 2016 | 1261 | 20,715 | 6 | 625 | 8298 | 636 | 12,180 | CMC, mAP
DukeMTMC-Video [19] | 2018 | 1812 | 4832 | 8 | 702 | 2196 | 702 | 2636 | CMC, mAP
LPW [20] | 2018 | 2731 | 7694 | 11 | 1975 | 5938 | 756 | 1756 | CMC, mAP
CMC: Cumulative Matching Characteristics; mAP: mean Average Precision.
Table 7. Comparative Analysis of Text-Based Person Re-ID Datasets.
Dataset | Year | IDs | Images | Text Desc. | Train IDs | Train Imgs. | Test IDs | Test Imgs. | Avg. Text Len. | Metric
CUHK-PEDES [21] | 2017 | 13,003 | 40,206 | 80,412 | 11,003 | 34,054 | 1000 | 3074 | 23.5 | Rank-1/5/10, mAP
RSTPReid [22] | 2021 | 4101 | 20,505 | 39,010 | 3701 | 18,505 | 200 | 1000 | 24.8 | Rank-1/5/10, CMC
ICFG-PEDES [23] | 2021 | 4102 | 54,522 | 54,522 | 3102 | 34,674 | 1000 | 19,848 | 37.2 | Rank-1/5/10, mAP
Table 8. Comparative Analysis of Infrared-Visible Person Re-ID Datasets.
Dataset | Year | IDs | Visible Cams. | Visible Imgs. | Infrared Cams. | Infrared Imgs. | Training IDs | Training Vis. | Training IR | Test IDs | Test Vis. | Test IR | Metrics
SYSU-MM01 [24] | 2017 | 491 | 4 | 30,071 | 2 | 15,792 | 395 | 22,258 | 11,909 | 96 | 7813 | 3883 | Rank-1/10, mAP
RegDB [25] | 2021 | 412 | 1 | 4120 | 1 | 4120 | 206 | 2060 | 2060 | 206 | 2060 | 2060 | Rank-1/20, mAP
LLCM [26] | 2023 | 1064 | 5 | 25,626 | 4 | 21,141 | 713 | 16,946 | 13,975 | 351 | 8680 | 7166 | Rank-1/5/10, mAP
Table 9. Performance Comparison of Multi-view Learning across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
MVCDL [27] | PRID 2011 | 28.8 | 55.6 | -
MVCDL [27] | iLIDS-VID | 18.7 | 46.2 | -
MVMP [28] | Market-1501 | 95.3 | 98.4 | 88.9
MVMP [28] | DukeMTMC-reID | 87.4 | 94.5 | 79.7
MVI2P [29] | Market-1501 | 95.3 | - | 87.9
MVI2P [29] | P-DukeMTMC-reID | 91.9 | 94.4 | 80.9
MVDC [30] | Market-1501 | 75.2 | - | 52.6
MVDC [30] | DukeMTMC-reID | 57.6 | - | 37.8
Yu et al. [31] | Market-1501 | 95.7 | - | 90.2
Yu et al. [31] | DukeMTMC-reID | 91.4 | - | 84.1
Table 10. Performance Comparison of Domain Generalization Methods across Different Frameworks.
Method | Dataset | mAP (%) | Rank-1 (%)
CINorm [32] | Market-1501 | 57.8 | 82.3
CINorm [32] | MSMT17 | 21.1 | 49.7
CINorm [32] | DukeMTMC-reID | 52.4 | 71.3
MixNorm [33] | PRID | 74.3 | 65.2
MixNorm [33] | VIPeR | 66.6 | 56.4
UDG-ReID [34] | Market-1501 | 79.7 | 53.2
UDG-ReID [34] | MSMT17 | 42.4 | 16.8
GDNorm [35] | VIPeR | 74.1 | 66.1
GDNorm [35] | PRID | 79.9 | 72.6
DTIN [36] | VIPeR | 70.7 | 62.9
DTIN [36] | PRID | 79.7 | 71.0
Table 11. Performance Comparison of Attention Mechanisms across Different Frameworks.
Method | Dataset | mAP (%) | Rank-1 (%)
PFE [37] | Market-1501 | 86.2 | 95.1
PFE [37] | DukeMTMC-ReID | 75.9 | 88.2
PFE [37] | CUHK03 | 68.6 | 71.6
PGAL [38] | PRCC | 58.7 | 59.5
PGAL [38] | LTCC | 27.7 | 62.5
HOA [39] | Market-1501 | 95.1 | 85.0
HOA [39] | DukeMTMC-ReID | 89.1 | 77.2
HOA [39] | CUHK03 | 77.2 | 65.4
Yan et al. [40] | Market-1501 | 94.5 | 82.5
Yan et al. [40] | DukeMTMC-ReID | 72.0 | 85.6
PAM [41] | CUHK03 | 64.1 | 60.8
PAM [41] | DukeMTMC-reID | 84.7 | 69.4
Table 12. Performance Comparison of Multi-Model Integration across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
Wu et al. [42] | PRID 2011 | 94.6 | 99.1 | -
Wu et al. [42] | iLIDS-VID | 84.5 | 96.7 | -
Wu et al. [42] | MARS | 89.8 | 96.1 | 81.1
DCCT [43] | PRID 2011 | 96.8 | 99.7 | -
DCCT [43] | iLIDS-VID | 91.7 | 98.6 | -
DCCT [43] | MARS | 91.5 | 97.4 | 86.3
Ansar et al. [44] | PRID 2011 | 84.0 | 97.6 | -
Ansar et al. [44] | iLIDS-VID | 76.6 | 90.8 | -
Ansar et al. [44] | MARS | 56.0 | 67.0 | -
Table 13. Performance Comparison of Spatio-Temporal Multi-Granularity Feature Aggregation across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
MMA-GGA [45] | DukeMTMC-VideoReID | 97.3 | 99.6 | 96.2
MMA-GGA [45] | PRID 2011 | 95.5 | 100 | -
MMA-GGA [45] | iLIDS-VID | 98.7 | 98.7 | -
Zhang et al. [46] | PRID 2011 | 91.2 | 98.9 | -
Zhang et al. [46] | iLIDS-VID | 68.4 | 87.2 | -
Bai et al. [47] | MARS | 91.5 | - | 87.4
Bai et al. [47] | iLIDS-VID | 90.6 | - | 84.2
MFA [48] | MARS | 90.4 | - | 85.0
MFA [48] | iLIDS-VID | 88.2 | - | 78.9
Table 14. Performance Comparison of Structured Spatio-Temporal Modeling across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
Pan et al. [49] | iLIDS-VID | 90.2 | 98.5 | -
Pan et al. [49] | DukeMTMC-VideoReID | 97.1 | 98.8 | 96.5
Pan et al. [49] | MARS | 91.1 | 97.2 | 86.5
Zhang et al. [50] | iLIDS-VID | 87.5 | 95.0 | 78.0
Zhang et al. [50] | DukeMTMC-VideoReID | 97.6 | - | 97.4
Zhang et al. [50] | MARS | 88.7 | - | 86.3
Hou et al. [51] | iLIDS-VID | 86.6 | - | -
Hou et al. [51] | DukeMTMC-VideoReID | 96.9 | - | 96.2
Hou et al. [51] | MARS | 89.8 | - | 85.1
D3DNet [52] | iLIDS-VID | 65.4 | 87.9 | -
D3DNet [52] | MARS | 76.0 | 87.2 | 71.4
KSTL [53] | iLIDS-VID | 93.4 | - | -
KSTL [53] | PRID 2011 | 96.7 | - | -
KSTL [53] | MARS | 91.5 | - | 86.3
MLTS [54] | iLIDS-VID | 94.0 | 98.67 | -
MLTS [54] | PRID 2011 | 96.63 | 97.75 | -
Table 15. Performance Comparison of Multiscale Analysis across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
PM3D [55] | DukeMTMC-VideoReID | 95.49 | 99.3 | 93.67
PM3D [55] | iLIDS-VID | 86.67 | 98.00 | -
PM3D [55] | MARS | 88.87 | 96.64 | 85.36
MSTA [56] | iLIDS-VID | 70.1 | 88.67 | -
MSTA [56] | PRID-2011 | 91.2 | 98.72 | -
MSTA [56] | MARS | 84.08 | 93.52 | 79.67
MS-STI [57] | DukeMTMC-VideoReID | 97.4 | 99.7 | 97.1
MS-STI [57] | MARS | 92.7 | 97.5 | 87.2
Wei et al. [58] | iLIDS-VID | 85.5 | 91.4 | -
Wei et al. [58] | PRID-2011 | 94.7 | 99.2 | -
Wei et al. [58] | MARS | 90.2 | 96.6 | 83.2
MSCA [59] | iLIDS-VID | 84.7 | 94.7 | -
MSCA [59] | PRID-2011 | 96.6 | 100 | -
MSCA [59] | MARS | 91.8 | 96.5 | 83.2
Table 16. Performance Comparison of Unlabeled Dependency Strategies across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
TASTC [60] | iLIDS-VID | 52.5 | 70.2 | -
TASTC [60] | DukeMTMC-VideoReID | 76.8 | - | 68.2
DGM [61] | iLIDS-VID | 42.6 | 67.7 | -
DGM [61] | PRID-2011 | 83.3 | 96.7 | -
DGM [61] | MARS | 24.3 | 40.4 | 11.9
UAAL [62] | DukeMTMC-VideoReID | 89.7 | 97.0 | 87.0
UAAL [62] | MARS | 73.2 | 86.3 | 60.1
SRC [63] | DukeMTMC-VideoReID | 83.0 | 83.3 | 76.5
SRC [63] | MARS | 62.7 | 76.1 | 40.5
SRC [63] | PRID-2011 | 72.0 | 87.7 | -
CCM [64] | DukeMTMC-VideoReID | 76.5 | 89.6 | 68.7
CCM [64] | MARS | 65.3 | 77.8 | 41.2
Table 17. Performance Comparison of Multi-Granularity Matching across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | Rank-10 (%)
SUM [65] | CUHK-PEDES | 59.22 ± 0.02 | 80.35 ± 0.02 | 87.6 ± 0.03
PMG [66] | CUHK-PEDES | 64.59 | 83.19 | 89.12
PMG [66] | RSTPReid | 48.85 | 72.65 | 81.30
Bao et al. [67] | CUHK-PEDES | 68.23 | 86.37 | 91.65
Bao et al. [67] | RSTPReid | 56.05 | 78.65 | 86.75
MGCC [68] | Occluded-RSTPReid | 49.85 | 74.95 | 83.45
PMA [69] | CUHK-PEDES | 53.81 | 73.54 | 81.23
MIA [70] | CUHK-PEDES | 53.10 | 75.00 | 82.90
CFine [71] | ICFG-PEDES | 60.83 | 76.55 | 82.42
CFine [71] | CUHK-PEDES | 69.57 | 85.93 | 91.15
CFine [71] | RSTPReid | 50.55 | 72.50 | 81.60
Table 18. Performance Comparison of Modal Alignment across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | Rank-10 (%) | mAP (%)
AATE [72] | CUHK-PEDES | 52.42 | 74.98 | 82.74 | -
AMEN [73] | CUHK-PEDES | 57.16 | 78.64 | 86.22 | -
MAPS [74] | ICFG-PEDES | 57.22 | - | 82.70 | -
MAPS [74] | CUHK-PEDES | 65.24 | - | 90.10 | -
BCRA [75] | ICFG-PEDES | 64.77 | 80.83 | 86.31 | 39.48
BCRA [75] | CUHK-PEDES | 75.03 | 89.93 | 93.89 | 51.80
BCRA [75] | RSTPReid | 62.29 | 82.16 | 89.03 | 48.43
MCL [76] | ICFG-PEDES | 53.43 | 76.45 | 84.28 | 0.46
MCL [76] | CUHK-PEDES | 61.21 | 81.52 | 88.22 | 0.52
Dual-GAIN [77] | ICFG-PEDES | 53.43 | 76.45 | 84.28 | 0.46
Dual-GAIN [77] | CUHK-PEDES | 61.21 | 81.52 | 88.22 | 0.52
Zhao et al. [78] | CUHK-PEDES | 63.4 | 83.3 | 90.3 | 49.28
Zhao et al. [78] | ICFG-PEDES | 65.62 | 80.54 | 85.83 | 38.78
CANC [79] | CUHK-PEDES | 54.95 | 77.39 | 84.24 | -
CANC [79] | ICFG-PEDES | 44.22 | 64.93 | 72.86 | -
CANC [79] | RSTPReid | 45.78 | 71.28 | 81.86 | -
SAL [80] | CUHK-PEDES | 69.14 | 85.90 | 90.80 | -
SAL [80] | ICFG-PEDES | 62.77 | 78.64 | 84.21 | -
SSAN [23] | CUHK-PEDES | 64.13 | 82.62 | 88.4 | -
SSAN [23] | Flickr30K | 50.74 | 77.92 | 85.46 | -
Table 19. Performance Comparison of Cross-Modal Feature Enhancement across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
Liu et al. [81] | CUHK-PEDES | 66.27 | 85.07 | -
Liu et al. [81] | ICFG-PEDES | 57.31 | 76.15 | -
Qi et al. [82] | Market-1501 | 96.5 | - | 94.9
Qi et al. [82] | CUHK03 | 86.9 | - | 88.5
Yan et al. [83] | CUHK-PEDES | 74.89 | 89.90 | 67.12
Yan et al. [83] | ICFG-PEDES | 65.12 | 81.57 | 42.93
Yan et al. [83] | RSTPReid | 61.87 | 83.63 | 47.82
CFLAA [84] | CUHK-PEDES | 78.4 | 92.6 | 66.9
CFLAA [84] | ICFG-PEDES | 69.4 | 80.7 | -
CFLAA [84] | RSTPReid | 64.2 | 84.1 | -
Table 20. Performance Comparison of Feature Fusion across Different Frameworks.
Method | Dataset | Evaluation Setting | Rank-1 (%) | Rank-10 (%) | mAP (%)
TSME [85] | SYSU-MM01 | ALL-Search (Single-shot) | 64.23 | 95.19 | 61.21
TSME [85] | SYSU-MM01 | Indoor-Search (Single-shot) | 64.80 | 96.92 | 99.31
TSME [85] | RegDB | Visible-to-Infrared | 87.35 | 97.10 | 76.94
TSME [85] | RegDB | Infrared-to-Visible | 86.41 | 96.39 | 75.70
TFFN [86] | SYSU-MM01 | - | 58.37 | 91.30 | 56.02
TFFN [86] | RegDB | - | 81.17 | 93.69 | 77.16
MGFNet [87] | SYSU-MM01 | ALL-Search (Single-shot) | 72.63 | - | 69.64
MGFNet [87] | SYSU-MM01 | Indoor-Search (Single-shot) | 77.90 | - | 82.28
MGFNet [87] | RegDB | Visible-to-Infrared | 91.14 | - | 82.53
MGFNet [87] | RegDB | Infrared-to-Visible | 89.06 | - | 80.5
F2CALNet [88] | SYSU-MM01 | ALL-Search (Single-shot) | 62.86 | - | 59.38
F2CALNet [88] | SYSU-MM01 | Indoor-Search (Single-shot) | 66.83 | - | 71.94
F2CALNet [88] | RegDB | Visible-to-Infrared | 86.88 | - | 76.33
F2CALNet [88] | RegDB | Infrared-to-Visible | 85.68 | - | 74.87
CFF-ICL [89] | SYSU-MM01 | ALL-Search (Single-shot) | 66.37 | - | 63.56
CFF-ICL [89] | SYSU-MM01 | Indoor-Search (Single-shot) | 68.76 | - | 72.47
CFF-ICL [89] | RegDB | Visible-to-Infrared | 90.63 | - | 82.77
CFF-ICL [89] | RegDB | Infrared-to-Visible | 88.65 | - | 81.97
WF-CAMReViT [90] | SYSU-MM01 | ALL-Search | 68.05 | 97.12 | 65.17
WF-CAMReViT [90] | SYSU-MM01 | Indoor-Search | 72.43 | 97.16 | 77.58
WF-CAMReViT [90] | RegDB | Visible-to-Infrared | 91.66 | 95.97 | 85.96
WF-CAMReViT [90] | RegDB | Infrared-to-Visible | 92.97 | 95.19 | 86.08
CMIT [91] | SYSU-MM01 | ALL-Search | 70.94 | 94.93 | 65.51
CMIT [91] | SYSU-MM01 | Indoor-Search | 73.28 | 95.20 | 77.18
CMIT [91] | RegDB | Visible-to-Infrared | 88.78 | 94.76 | 88.49
CMIT [91] | RegDB | Infrared-to-Visible | 84.55 | 93.72 | 83.64
MDAN [92] | SYSU-MM01 | ALL-Search | 39.07 | 84.85 | 40.52
MDAN [92] | SYSU-MM01 | Indoor-Search | 39.13 | 86.37 | 48.88
MDAN [92] | RegDB | - | 43.25 | 68.79 | 41.59
Table 21. Performance Comparison of Modal Alignment across Different Frameworks.
Method | Dataset | Evaluation Setting | Rank-1 (%) | Rank-10 (%) | mAP (%)
Park et al. [93] | SYSU-MM01 | ALL-Search (Single-shot) | 55.41 ± 0.18 | - | 54.14 ± 0.33
Park et al. [93] | SYSU-MM01 | Indoor-Search (Single-shot) | 58.46 ± 0.67 | - | 66.33 ± 1.27
Park et al. [93] | RegDB | Visible-to-Infrared | 74.17 ± 0.04 | - | 67.64 ± 0.08
Park et al. [93] | RegDB | Infrared-to-Visible | 72.43 ± 0.42 | - | 65.46 ± 0.18
Liu et al. [94] | SYSU-MM01 | - | 52.99 | 91.98 | 60.73
Liu et al. [94] | RegDB | - | 52.14 | 75.44 | 49.92
DMANet [95] | SYSU-MM01 | ALL-Search (Single-shot) | 76.33 | 97.58 | 69.38
DMANet [95] | SYSU-MM01 | Indoor-Search (Single-shot) | 81.48 | 98.96 | 83.76
DMANet [95] | RegDB | Visible-to-Infrared | 94.51 | - | 88.46
DMANet [95] | RegDB | Infrared-to-Visible | 93.25 | - | 87.18
CAL [96] | SYSU-MM01 | ALL-Search (Single-shot) | 74.66 | 96.47 | 71.73
CAL [96] | SYSU-MM01 | Indoor-Search (Single-shot) | 79.69 | 98.93 | 86.97
CAL [96] | RegDB | Visible-to-Infrared | 94.51 | 99.70 | 88.67
CAL [96] | RegDB | Infrared-to-Visible | 93.64 | 99.46 | 97.61
CMT [97] | SYSU-MM01 | ALL-Search (Single-shot) | 71.88 | 96.45 | 68.57
CMT [97] | SYSU-MM01 | Indoor-Search (Single-shot) | 76.90 | 97.68 | 79.91
CMT [97] | RegDB | Visible-to-Infrared | 95.17 | 98.82 | 87.30
CMT [97] | RegDB | Infrared-to-Visible | 91.97 | 97.92 | 86.46
Zhang et al. [98] | SYSU-MM01 | ALL-Search | 66.61 | - | 62.86
Zhang et al. [98] | SYSU-MM01 | Indoor-Search | 70.90 | - | 75.78
Zhang et al. [98] | RegDB | Visible-to-Infrared | 90.76 | - | 87.30
Zhang et al. [98] | RegDB | Infrared-to-Visible | 88.79 | - | 85.44
Table 22. Performance Comparison of Cross-Modal Feature Embedding across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | mAP (%)
Shi et al. [99] | DukeMTMC-VideoReID | 82.8 | 92.0 | 81.0
Shi et al. [99] | MARS | 79.1 | 89.4 | 69.0
READ [100] | DukeMTMC-VideoReID | 86.3 | 94.4 | 83.4
READ [100] | MARS | 91.5 | 92.1 | 70.4
TCRL [101] | iLIDS-VID | 77.3 | 94.7 | -
TCRL [101] | MARS | 86.0 | 92.5 | 80.1
TKP [102] | iLIDS-VID | 54.6 | 79.4 | -
TKP [102] | MARS | 75.6 | 87.6 | 65.1
Table 23. Performance Comparison of Cross-Modal Feature Generation across Different Frameworks.
Method | Dataset | Rank-1 (%) | Rank-5 (%) | Rank-10 (%)
CMGTN [103] | iLIDS-VID | 38.4 | 66.2 | 74.8
CMGTN [103] | MARS | 41.2 | 70.1 | 76.5
CMGTN [103] | PRID-2011 | 42.7 | 66.8 | 81.3
KADDL [104] | iLIDS-VID | 56.3 | 81.9 | 88.7
KADDL [104] | MARS | 74.3 | 87.8 | 91.5
KADDL [104] | PRID-2011 | 80.4 | 95.1 | 97.2
CycAs [105] | iLIDS-VID | 73.3 | - | -
CycAs [105] | MARS | 72.8 | - | -
CycAs [105] | PRID-2011 | 86.5 | - | -
CBAN [106] | iLIDS-VID | 43.2 | 71.0 | 80.1
CBAN [106] | MARS | 68.2 | 85.3 | 88.9
CBAN [106] | PRID-2011 | 74.6 | 90.6 | 95.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
