Sensors
  • Article
  • Open Access

29 August 2025

VIPS: Learning-View-Invariant Feature for Person Search

1 Xi’an Key Laboratory of Human–Machine Integration and Control Technology for Intelligent Rehabilitation, Xijing University, Xi’an 710123, China
2 School of Information Science and Technology, Northwest University, Xi’an 710100, China
3 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
4 Academy of Advanced Interdisciplinary Research, Xidian University, Xi’an 710071, China
This article belongs to the Section Remote Sensors

Abstract

Unmanned aerial vehicles (UAVs) have become indispensable tools for surveillance, enabled by their ability to capture multi-perspective imagery in dynamic environments. Among critical UAV-based tasks, cross-platform person search—detecting and identifying individuals across distributed camera networks—presents unique challenges. Severe viewpoint variations, occlusions, and cluttered backgrounds in UAV-captured data degrade the performance of conventional discriminative models, which struggle to maintain robustness under such geometric and semantic disparities. To address this, we propose view-invariant person search (VIPS), a novel two-stage framework combining Faster R-CNN with a view-invariant re-Identification (VIReID) module. Unlike conventional discriminative models, VIPS leverages the semantic flexibility of large vision–language models (VLMs) and adopts a two-stage training strategy to decouple and align text-based ID descriptors and visual features, enabling robust cross-view matching through shared semantic embeddings. To mitigate noise from occlusions and cluttered UAV-captured backgrounds, we introduce a learnable mask generator for feature purification. Furthermore, drawing from vision–language models, we design view prompts to explicitly encode perspective shifts into feature representations, enhancing adaptability to UAV-induced viewpoint changes. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the efficacy of each component. Beyond technical advancements, this work highlights the potential of VLM-derived semantic alignment for UAV applications, offering insights for future research in real-time UAV-based surveillance systems.

1. Introduction

Unmanned aerial vehicles (UAVs), or drones, have undergone transformative advancements in autonomy, sensing, and AI integration, enabling their deployment across diverse sectors from precision agriculture and disaster response to smart city surveillance. The global drone market is expected to continue its growth trajectory, driven by demand for scalable, real-time data acquisition and analysis. A critical enabler of this growth is the fusion of artificial intelligence (AI) with UAV platforms, particularly in cross-platform perception systems where drones collaborate with ground-based sensors (e.g., CCTV, IoT devices) to achieve comprehensive environmental understanding.
Such systems are increasingly vital for large-scale security monitoring, exemplified by applications like search-and-rescue operations, crowd behavior analysis, and cross-border surveillance. However, a fundamental challenge in these scenarios is cross-platform person search, seamlessly detecting and re-identifying individuals across heterogeneous camera networks comprising UAV-mounted and fixed ground cameras. As shown in Figure 1, a sharp contrast emerges between traditional same-platform datasets such as PRW, where camera views are relatively homogeneous and mostly ground-based, and cross-platform datasets like G2APS, which couple UAV-mounted aerial views with fixed ground cameras. The drastic viewpoint disparity in G2APS (e.g., top-down vs. frontal perspectives) results in pronounced geometric distortions, scale variations, and appearance shifts, thereby causing much higher false-negative rates compared to conventional ground-only benchmarks.
Figure 1. Persons captured by different cameras in the PRW [] and G2APS [] datasets. The images in PRW are all from ground cameras, while the images in G2APS are from a ground camera and a UAV. A more detailed description of dataset characteristics is available in Section 5.
Existing person search methods mainly focus on the cooperation of the two subtasks, the misalignment of scales, occlusion, and detector optimization; few works consider the view difference when matching persons across cameras. One likely reason is that CUHK-SYSU [], one of the most commonly used datasets, provides no camera annotations. However, its images are derived from hand-held cameras and movie snapshots and can still be regarded as coming from two different views. In addition, the PRW [] dataset was captured by six different cameras, and this rich view information also deserves to be exploited more fully. Recently, Zhang et al. [] constructed G2APS, a person search dataset captured jointly by a UAV and ground cameras. The huge difference between the high-altitude and ground views makes accurate person matching even more difficult.
The vision–language pre-training model CLIP [] unifies the image and text modalities. It can learn semantic information from the prompt “A photo of a {xx} person” and match it with the corresponding image. Thanks to its zero-shot transfer ability, it is widely used in many downstream tasks such as image classification, object detection, and semantic segmentation. CLIP-ReID [] applied CLIP to person ReID for the first time, learning the descriptor “A photo of a {xx} person” for each ID and guiding the image encoder to learn the semantic information in the image through the descriptor. Inspired by this, we use CLIP to learn the semantic information shared across images of different views and then embed this semantic information into image features to address the problem of view difference.
In this work, we present view-invariant person search (VIPS), a novel two-stage framework that synergizes UAV-optimized detection with vision–language model (VLM) advancements to address cross-view person search challenges. The model consists of Faster R-CNN [] and view-invariant person ReID (VIReID). VIReID uses CLIP-ReID as the baseline model. Like CLIP-ReID, VIReID has two training stages: one to learn ID descriptor text features and one for visual features. We observe that the presence of background and occlusion in the image introduces noise to the text feature, so we design a mask generator to eliminate the noise, allowing the model to learn more accurate text features. Furthermore, to reduce the difference between image features of the same ID under different views, we design the view prompt to embed different view information through a set of learnable embeddings. By inputting the view prompt and image patch embeddings into each encoder layer, the encoder learns view-invariant image features. To the best of our knowledge, VIPS provides pioneering contributions to AI-driven UAV applications. VIPS integrates CLIP’s semantic alignment for cross-platform person search, providing a dedicated solution for UAV-induced viewpoint discrepancies.
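To make the design concrete, the following sketch illustrates one plausible form of the mask generator: a lightweight head that predicts a soft per-patch foreground score and re-weights the visual tokens before pooling. The module structure and names here are illustrative simplifications; the exact architecture is defined in Section 4.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Illustrative feature-purification head: predicts a soft foreground mask over ViT patch tokens
    and suppresses background/occlusion tokens before they are used to supervise the text features."""

    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))  # per-patch foreground logit

    def forward(self, patch_tokens):
        # patch_tokens: (B, N_p, d) patch embeddings from the image encoder
        mask = torch.sigmoid(self.score(patch_tokens))                    # (B, N_p, 1) soft foreground probability
        purified = patch_tokens * mask                                    # down-weight cluttered or occluded patches
        pooled = purified.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)    # masked average pooling
        return pooled, mask.squeeze(-1)
```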
We conduct comprehensive evaluations on both person search benchmarks and traditional person re-identification benchmarks. The experimental results demonstrate that our proposed methods can significantly outperform existing state-of-the-art methods. The visualization of the feature mask further illustrates that our method can guide the model to focus on more discriminative regions. The contributions of this paper are summarized as follows:
  • We propose a novel viewpoint-invariant person search (VIPS) framework leveraging CLIP’s semantic alignment for UAV and cross-camera scenarios;
  • We propose a mask generator to suppress noise in UAV-captured images, enhancing text-guided feature learning and view prompts to encode camera perspectives into visual features, reducing viewpoint discrepancy;
  • Extensive experiments on five benchmark datasets demonstrate the superiority of VIPS, establishing a new state-of-the-art method in UAV-based person search tasks.
The rest of this article is organized as follows: We review the related works in Section 2 and overview preliminary works in Section 3. The proposed framework and the optimization procedure are detailed in Section 4. Section 5 highlights the experimental evaluations. Section 6 draws the conclusion.

3. Preliminaries

3.1. Overview of CLIP-ReID

CLIP-ReID [] applies CLIP [] to the person ReID task. It consists of a text encoder and an image encoder (the latter implemented by ViT-B/16 []) and learns the descriptor “A photo of a $[X]_1\,[X]_2\,[X]_3\cdots[X]_L$ person” for each person ID, where $[X]$ represents a learnable text embedding.
CLIP-ReID adopts a two-stage training strategy. For the $k$-th image with ID $y_k$, the model learns the text feature $T_k$ of the ID descriptor in the first stage. Then the model learns the visual feature $V_k$ under the supervision of $T_k$ and $y_k$.
In the first stage, two contrastive losses are used:
$$\mathcal{L}_{i2t}(k) = -\log \frac{\exp\!\big(S(V_k, T_k)\big)}{\sum_{x=1}^{B} \exp\!\big(S(V_k, T_x)\big)},$$
$$\mathcal{L}_{t2i}(y_k) = -\frac{1}{|Q(y_k)|} \sum_{q \in Q(y_k)} \log \frac{\exp\!\big(S(V_q, T_{y_k})\big)}{\sum_{b=1}^{B} \exp\!\big(S(V_b, T_{y_k})\big)},$$
where $S(\cdot,\cdot)$ denotes the cosine similarity, $B$ is the batch size, $Q(y_k)$ is the set of indices of the images in the batch that share the ID $y_k$, and $|\cdot|$ denotes the number of elements in a set.
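For illustration, these two stage-1 losses can be sketched in PyTorch as follows; the function name, the folding of CLIP’s temperature into $S(\cdot,\cdot)$, and the batching details are simplifying assumptions rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def stage1_contrastive_losses(V, T_batch, labels):
    """Stage-1 image-to-text and text-to-image contrastive losses.

    V:        (B, D) visual features V_k of the batch.
    T_batch:  (B, D) text features T_{y_k} of each image's ID descriptor.
    labels:   (B,)   person ID y_k of each image.
    The CLIP temperature is assumed to be folded into the similarity for brevity.
    """
    V = F.normalize(V, dim=-1)
    T = F.normalize(T_batch, dim=-1)
    sim = V @ T.t()                                       # sim[k, x] = S(V_k, T_x)

    # L_i2t(k): each image is pulled toward its own ID descriptor against the other descriptors in the batch.
    loss_i2t = -F.log_softmax(sim, dim=1).diagonal()

    # L_t2i(y_k): each descriptor is matched against images, averaged over the set Q(y_k) of same-ID images.
    pos = labels[:, None].eq(labels[None, :]).float()     # pos[k, q] = 1 if image q shares its ID with image k
    log_p = F.log_softmax(sim.t(), dim=1)                 # softmax over images b for each descriptor
    loss_t2i = -(log_p * pos).sum(dim=1) / pos.sum(dim=1)

    return loss_i2t.mean(), loss_t2i.mean()
```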
In the second stage, a cross-entropy loss is designed:
$$\mathcal{L}_{i2tce}(k) = \sum_{j=1}^{N} -q_j \log \frac{\exp\!\big(S(V_k, T_{y_j})\big)}{\sum_{y_b=1}^{N} \exp\!\big(S(V_k, T_{y_b})\big)},$$
where $q_j = (1-\epsilon)\,\delta_{j,y} + \epsilon/N$ is the label-smoothed target probability of class $j$ ($\delta_{j,y}$ is the Kronecker delta and $\epsilon$ the smoothing coefficient), and $N$ represents the number of IDs.
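Correspondingly, the stage-2 loss can be sketched as follows, assuming the text features of all $N$ ID descriptors have been learned in stage one and are kept frozen; names and details are illustrative.

```python
import torch
import torch.nn.functional as F

def stage2_i2tce_loss(V, T_all, labels, epsilon=0.1):
    """Stage-2 image-to-text cross-entropy with label smoothing, q_j = (1 - eps) * delta_{j,y} + eps / N.

    V:      (B, D) visual features.
    T_all:  (N, D) text features of all N ID descriptors (learned in stage one and frozen here).
    labels: (B,)   ground-truth ID index y_k of each image.
    """
    N = T_all.size(0)
    sim = F.normalize(V, dim=-1) @ F.normalize(T_all, dim=-1).t()   # S(V_k, T_{y_j}) for every ID j
    log_prob = F.log_softmax(sim, dim=1)

    # Smoothed target distribution over the N identities.
    q = torch.full_like(log_prob, epsilon / N)
    q.scatter_(1, labels.view(-1, 1), 1.0 - epsilon + epsilon / N)

    return -(q * log_prob).sum(dim=1).mean()
```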
CLIP-ReID successfully applies a vision–language pre-training model to the person ReID task through a two-stage training scheme. The visual features extracted by CLIP contain rich semantic information, so we build our two-step person search framework, VIPS, on top of CLIP-ReID. We use the text features shared across different views to guide the model to learn view-invariant visual features and thereby align person features under different views.

3.2. Overview of Faster R-CNN

Faster R-CNN is an end-to-end framework composed of three key modules: a feature extractor $M_f$, a Region Proposal Network (RPN) $M_{rpn}$, and an RoI head $M_{roi}$. Given an image $I$, the feature extractor $M_f$ produces a dense convolutional feature map $F$. The RPN $M_{rpn}$ then slides a small network over $F$ to generate, for each of $k$ predefined anchors at every location, an objectness score $p_i$ and a set of box offsets $l_i = (l_x, l_y, l_w, l_h)$. After non-maximum suppression, the top proposals are reshaped by RoIAlign into fixed-size tensors and passed to the RoI head $M_{roi}$, which yields class probabilities $c_i \in \mathbb{R}^{|C|+1}$ ($|C|$ foreground classes plus one background) and refined box coordinates $l_i$.
Training relies on a multi-task loss. The RPN loss can be formulated as
$$\mathcal{L}_{RPN}(p, p^*, l, l^*) = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(l_i, l_i^*),$$
where $\mathcal{L}_{cls}$ denotes the cross-entropy classification loss, $\mathcal{L}_{reg}$ is the smooth-$L_1$ regression loss, $p_i^* \in \{0, 1\}$ with $p_i^* = 1$ indicating that the corresponding proposal region is positive (contains a foreground object), and $c_i^*$, $l_i^*$ are the ground-truth class labels and box regression targets, respectively. The RoI head loss can be formulated as
$$\mathcal{L}_{ROI}(c, c^*, l, l^*) = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(c_i, c_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(l_i, l_i^*).$$
The overall objective combines these two terms:
$$\mathcal{L} = \mathcal{L}_{RPN} + \mathcal{L}_{ROI}.$$
Our system uses a trained Faster R-CNN to generate person bounding boxes, which are then cropped and resized for the ReID stage.
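For illustration, the detection stage of this pipeline can be sketched with torchvision’s Faster R-CNN as a stand-in detector; the score threshold and crop size below are illustrative defaults rather than the exact settings used in our experiments.

```python
import torch
import torchvision
from torchvision.transforms.functional import resized_crop

# Stand-in for the trained detector; weights="DEFAULT" loads a COCO-pretrained Faster R-CNN.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_and_crop(image, score_thresh=0.5, crop_size=(256, 128)):
    """Run Faster R-CNN on a scene image (3, H, W) in [0, 1] and return resized person crops for the ReID stage."""
    out = detector([image])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)   # COCO class 1 = person
    crops = []
    for x1, y1, x2, y2 in out["boxes"][keep].tolist():
        crops.append(resized_crop(image, int(y1), int(x1),
                                  int(y2 - y1), int(x2 - x1), list(crop_size)))
    return torch.stack(crops) if crops else torch.empty(0, 3, *crop_size)

# During training, torchvision's detector returns a loss dictionary whose sum corresponds to L = L_RPN + L_ROI:
#   losses = detector(images, targets)
#   total_loss = sum(losses.values())   # loss_objectness + loss_rpn_box_reg + loss_classifier + loss_box_reg
```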

3.3. Vision Transformer

We adopt a Vision Transformer (ViT) as the visual feature encoder of VIPS. For an image of size $H \times W \times 3$, ViT [] divides it into $N_p$ image patches of size $P \times P$ and encodes each patch, together with its position embedding, into a $d$-dimensional vector $e_i \in \mathbb{R}^d$. There are 12 encoder layers in ViT-B/16. We use $E_i = \{ e_i^j \in \mathbb{R}^d \mid 1 \le j \le N_p \}$ to denote the patch embeddings fed into the $i$-th layer $L_i$. ViT can then be formulated as
$$[x_1, E_1] = L_1([\mathrm{CLS}, E_0]),$$
$$[x_i, E_i] = L_i([x_{i-1}, E_{i-1}]), \quad i = 2, 3, \ldots, 12,$$
where $x_i \in \mathbb{R}^d$ represents the embedding of the CLS token, and $[\cdot,\cdot]$ represents stacking the two token sequences.
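A simplified PyTorch sketch of this forward pass is given below; it uses standard post-norm transformer layers and omits details of the actual ViT-B/16 (pre-norm blocks, GELU, dropout), so it is illustrative only.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style encoder mirroring the equations above: patch embedding, CLS token, 12 layers."""

    def __init__(self, img_size=(256, 128), patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)          # N_p
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True) for _ in range(depth)])

    def forward(self, x):
        E0 = self.patch_embed(x).flatten(2).transpose(1, 2)                  # (B, N_p, d) patch embeddings
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), E0], dim=1) + self.pos
        for layer in self.layers:                                            # [x_i, E_i] = L_i([x_{i-1}, E_{i-1}])
            tokens = layer(tokens)
        return tokens[:, 0]                                                  # x_12: the final CLS embedding
```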

5. Experiments

In this section, we will introduce the datasets and metrics used in the experiments, as well as details in training. Finally, our experimental results are analyzed.
Datasets To comprehensively evaluate our approach, we conducted experiments on three benchmark person search datasets (CUHK-SYSU [], PRW [], and G2APS []) and three person ReID datasets, covering both conventional and UAV-specific scenarios. CUHK-SYSU is a large-scale benchmark containing 18,184 images with 8432 identities, notable for its realistic search scenario, where targets must be identified from whole gallery images rather than pre-cropped boxes. PRW comprises 11,816 frames from 6 synchronized cameras with 932 identities, emphasizing real-world challenges in pedestrian retrieval, with comprehensive annotations for both bounding boxes and identities. G2APS is the first cross-platform person search dataset specifically designed for UAV–ground camera scenarios, containing 31,770 images of 2077 identities. Each identity appears in both ground and aerial views, making it particularly valuable for evaluating view-invariant methods. In addition, to further verify the effectiveness of VIReID, we also conducted experiments on the person ReID datasets Market-1501 [], MSMT17 [], and Occluded-Duke []. The detailed information for each dataset is summarized in Table 1.
Table 1. Statistics of the datasets used in our experiments. “#image” denotes the number of images, and “#ID” denotes the number of unique person identities. For person search datasets (CUHK-SYSU, PRW, and G2APS), the images are uncropped scene-level images containing multiple pedestrians, while for person ReID datasets (Market-1501, MSMT17, and Occluded-Duke), each image is a cropped pedestrian bounding box. In CUHK-SYSU, camera and movie images are treated as two distinct views. MSMT17 is a large-scale, multi-view dataset collected across 15 cameras under diverse indoor and outdoor conditions, featuring pronounced viewpoint differences that closely resemble cross-platform person search scenarios.
Evaluation Protocols We adopt Mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) as the evaluation metrics. mAP measures the overall retrieval quality by averaging the precision over all query identities, while CMC evaluates the retrieval accuracy at different ranks. Specifically, Rank-$k$ denotes the proportion of query images whose correct match appears within the top-$k$ retrieved gallery results, i.e., $\text{Rank-}k = \frac{\#\,\text{queries correctly matched within top-}k}{\#\,\text{total queries}}$. mAP is calculated as the mean of the average precision scores across all queries, where each average precision is obtained by integrating the corresponding precision–recall curve. Unless otherwise stated, we report mAP/CMC in percentage (%) and all “improvements” as absolute percentage points. For all datasets, we follow the standard train/test splits provided in their official protocols.
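For reference, both metrics can be computed from a query–gallery distance matrix as in the following NumPy sketch; the per-camera filtering applied by some official protocols is omitted for brevity.

```python
import numpy as np

def rank_k_and_map(dist, q_ids, g_ids, topk=(1, 5, 10)):
    """Compute CMC Rank-k and mAP from a (n_q, n_g) query-gallery distance matrix."""
    order = np.argsort(dist, axis=1)                        # gallery indices sorted by increasing distance
    matches = (g_ids[order] == q_ids[:, None]).astype(np.float64)

    # CMC: fraction of queries whose first correct match appears within the top-k results.
    has_match = matches.any(axis=1)
    first_hit = matches.argmax(axis=1)                      # rank (0-based) of the first correct gallery item
    cmc = {k: float(((first_hit < k) & has_match).mean()) for k in topk}

    # mAP: mean of the average precision of each query's ranked list.
    aps = []
    for row in matches:
        hits = row.nonzero()[0]
        if hits.size == 0:
            continue                                        # queries without any correct match are skipped
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return cmc, float(np.mean(aps))
```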
Implementation Details The model was implemented in PyTorch, and all experiments were conducted on an NVIDIA RTX 3090 GPU. The person ReID model uses CLIP-ReID [] as the baseline model. All weights were initialized from CLIP-ReID pre-trained on Market-1501 []. The initial weights of the mask generator and the image encoder were identical. The batch size was $B = 64$; the first stage was trained for 120 epochs and the second stage for 240 epochs.
The number of learnable text embeddings in the ID descriptor was set to $L = 6$, the image patch size to $P = 16$, and the feature dimension to $D = 512$. The view prompt was configured with $h = 6$ and $N_v = 11$, and $N_c$ was set directly to the number of cameras in the dataset. Note that CUHK-SYSU [] has no camera annotations, but its camera and movie images can be regarded as two different view styles, so we set $N_c = 2$ for CUHK-SYSU.
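The sketch below shows one plausible realization of a camera-indexed view prompt with these dimensions, i.e., $N_v$ learnable tokens per camera injected into the first $h$ encoder layers; it is an illustrative assumption rather than the exact module defined in Section 4.

```python
import torch
import torch.nn as nn

class ViewPrompt(nn.Module):
    """Illustrative view prompt: N_v learnable tokens per camera, injected into the first h encoder layers."""

    def __init__(self, n_cameras, depth_h=6, n_tokens=11, dim=768):
        super().__init__()
        # One prompt table per camera view and per prompted layer: (N_c, h, N_v, d).
        self.prompts = nn.Parameter(torch.randn(n_cameras, depth_h, n_tokens, dim) * 0.02)
        self.depth_h = depth_h

    def forward(self, tokens, cam_ids, layer_idx):
        """Prepend the camera-specific prompt of this layer to a (B, 1 + N_p, d) sequence of CLS + patch tokens.
        Prompt tokens output by the previous layer are assumed to have been discarded before this call."""
        if layer_idx >= self.depth_h:
            return tokens                                   # deeper layers are left unprompted
        p = self.prompts[cam_ids, layer_idx]                # (B, N_v, d) selected by each sample's camera ID
        return torch.cat([tokens[:, :1], p, tokens[:, 1:]], dim=1)   # CLS first, then prompts, then patches
```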

5.1. Comparison with State-of-the-Art Methods

In this section, we compare the proposed VIPS with other state-of-the-art (SOTA) person search methods in Table 2. In addition, we also compare VIReID with SOTA person ReID methods, and the results are shown in Table 3.
Table 2. Comparison of VIPS and all state-of-the-art person search methods on PRW, CUHK-SYSU, and G2APS datasets. The end-to-end methods are marked with *, and the best results are in bold. Param. (M) denotes the number of learnable parameters.
Table 3. Comparison of VIReID and all state-of-the-art person ReID methods on three datasets. The best results are highlighted in bold.
Person search As can be seen from Table 2, the best methods among the end-to-end methods and the two-step methods are HKD [] and TCTS []. We attribute this to the former alleviating the conflict between detection and identification subtasks through the head knowledge distillation strategy, while the latter makes the detection and recognition tasks more consistent by generating query-like proposals. However, VIPS significantly outperforms both of them. VIPS outperforms HKD by 2.8%, 2.9%, and 15.6% in mAP on the three datasets, respectively. Compared with TCTS, which is also a two-step method, VIPS obtains 9.5% and 4.3% mAP advantages on PRW and CUHK-SYSU, respectively. These results collectively demonstrate that viewpoint variation represents a fundamental challenge in person search that has been largely overlooked in previous works. Our approach successfully addresses this limitation through its novel integration of foreground-aware feature purification and view-conditioned adaptation, establishing a new paradigm for view-invariant person search that is particularly suited for aerial surveillance scenarios.
Person ReID Performance As demonstrated in Table 3, our proposed VIReID establishes new state-of-the-art performance across all three benchmark datasets. The method achieves remarkable mAP scores of 90.3%, 74.9%, and 60.2% on Market-1501, MSMT17, and Occluded-Duke, respectively, surpassing all existing CNN-based and ViT-based approaches. Notably, VIReID shows particularly strong performance on MSMT17 (74.9% mAP). We attribute these results to the inherent characteristics of MSMT17. MSMT17 is characterized by a complex multi-camera setup, wide-ranging viewpoints, and diverse environmental conditions, which closely mirror the cross-platform scenario. This comprehensive evaluation not only validates our technical innovations but also highlights the importance of viewpoint invariance as a critical research direction for UAV-based person recognition systems.

5.2. Ablation Study

In this section, we conduct thorough ablation experiments on PRW to explore the effectiveness of each proposed module.
Effectiveness of mask generator and view prompt We incrementally added the mask generator and view prompt modules to the baseline model to assess their individual and combined contributions. The experimental results are summarized in Table 4. The baseline model achieves an mAP of 55.5% and a top-1 accuracy of 81.0%. With the addition of the mask generator, the performance improves slightly to 55.8% mAP and 82.1% top-1 accuracy. Similarly, introducing the view prompt yields 56.0% mAP and 81.6% top-1. When both modules are integrated, the model reaches its best performance at 56.3% mAP and 82.6% top-1 accuracy, indicating gains of +0.8% mAP and +1.6% top-1 over the baseline. These results validate the complementary benefits of the two modules. The mask generator helps the text encoder focus on foreground semantics by filtering out background noise and occlusions, while the view prompt enhances the image encoder’s ability to learn robust visual representations across varying viewpoints.
Table 4. Performance comparison before and after adding the mask generator and view prompt to the baseline model. ✓ means adding it, and × means not adding it. The best results are highlighted in bold.
Analysis of Performance Heterogeneity Across Benchmarks As shown in Table 2 and Table 3, the performance gains vary substantially across datasets. We attribute these results to the distinct characteristics of different datasets. Datasets with greater view diversity and geometric shifts (e.g., G2APS with UAV–ground pairs and MSMT17 with 15 cameras) exhibit the most significant improvements, as our proposed camera-conditioned prompts effectively bridge cross-view gaps. In contrast, relatively smaller gains are achieved on PRW and CUHK-SYSU, with fewer or more homogeneous cameras. Furthermore, the severe occlusion and clutter in G2APS and MSMT17 highlight the benefits of VIPS’s mask generator, which suppresses irrelevant regions. To further validate this, we conducted an ablation on MSMT17 (Table 5), showing that improvements remain stable across different time-of-day splits, while being slightly larger in complex indoor scenes, demonstrating VIPS’s effectiveness under challenging environments. The performance heterogeneity further confirms the strength of our proposed method under large cross-view and cluttered conditions.
Table 5. Ablation study on MSMT17 under different query splits. The query set is divided by time of day into morning, noon, and afternoon subsets and by scene type into indoor and outdoor subsets, while the gallery set remains unchanged.
Mask visualization Figure 3 presents visualizations of the generated foreground masks under different scenarios. In the first two examples, the scenes are relatively clean, with minimal background clutter and occlusion. In contrast, the latter two examples contain significant occlusions—primarily from the background and a bicycle, respectively. These visualizations demonstrate that the proposed foreground mask is capable of effectively suppressing irrelevant background regions and occluding objects, thereby enhancing the focus on salient targets. This is particularly valuable for UAV-based perception tasks, where dynamic environments and occlusions are common challenges.
Figure 3. Masks of several persons in PRW, where (a,d,g,j) show the original images, (b,e,h,k) show the masks, and (c,f,i,l) show the masks overlaid on the original images.
Hyper-parameter Sensitivity As illustrated in Figure 4, we studied the impact of the two view-prompt hyper-parameters, prompt depth $h$ and token count $N_v$. First, with $N_v$ fixed at 1, we varied $h$ over $\{0, 2, 4, 6, 8, 12\}$ and observed that both mAP and top-1 accuracy peaked at $h = 6$. Next, holding $h = 6$ constant, we swept $N_v$ through $\{1, 3, 5, 7, 9, 11, 13\}$, finding the best retrieval performance at $N_v = 11$. Consequently, we adopted $h = 6$ and $N_v = 11$ for all subsequent experiments.
Figure 4. Effect of view-prompt depth $h$ and token count $N_v$ on PRW performance. (a) Retrieval mAP and top-1 accuracy as $h$ increases (with $N_v = 1$). (b) Retrieval mAP and top-1 accuracy as $N_v$ increases (with $h = 6$).
Effectiveness of mask Figure 5 shows the distance distribution between the visual features of the images in the training set and the corresponding text features. After applying the mask, the distance between textual features and visual features is significantly reduced. This shows that the mask can remove the noise in text features. In the second training stage, the image features contain more accurate semantic information, which helps the model solve the problem of view differences through the semantic information shared between different view images.
Figure 5. Euclidean distance distribution of image features and corresponding text features in the training set of PRW.
Visualization of view prompt effects The G2APS dataset contains images in the ground camera and UAV views, with a very significant view difference. Therefore, we use t-SNE to visualize the image feature distribution under two different views before and after fusing the view prompt, and the results are shown in Figure 6. Before fusing the view prompt with the image feature, although the samples with the same ID are closely clustered, there is an obvious gap between the feature distributions of the two different views. After fusing, the image features of different views for each ID are brought closer. This demonstrates that the view prompt helps the model learn view-invariant visual features.
Figure 6. Distribution of image features in two views before (a) and after (b) fused with the view prompt. “o” and “+” represent the ground camera and UAV views, respectively, and the numbers represent the sample index rather than the ID.
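The projection in Figure 6 can be reproduced with an off-the-shelf t-SNE implementation; the sketch below assumes pre-extracted feature and label arrays and follows the marker convention of the figure.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_view_tsne(features, view_labels, id_labels):
    """Project ReID features (N, D) to 2-D with t-SNE and mark ground vs. UAV views as in Figure 6."""
    emb = TSNE(n_components=2, init="pca", perplexity=30, random_state=0).fit_transform(features)
    for view, marker in [(0, "o"), (1, "+")]:               # 0 = ground camera view, 1 = UAV view
        sel = view_labels == view
        plt.scatter(emb[sel, 0], emb[sel, 1], c=id_labels[sel], cmap="tab20", marker=marker, s=30)
    plt.axis("off")
    plt.show()
```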

6. Conclusions

This paper addresses the critical yet understudied challenge of viewpoint differences in person search, a problem exacerbated in UAV-based surveillance where cross-camera perspective shifts severely degrade matching accuracy. We propose VIPS, a novel two-stage framework that combines Faster R-CNN with a view-invariant ReID module, leveraging CLIP’s vision-language paradigm to align features across viewpoints through shared semantic descriptors. To overcome noise from UAV-captured backgrounds and occlusions, we introduce a mask generator to purify text-guided feature learning. Furthermore, our view prompts explicitly encode camera perspectives into visual features, mitigating viewpoint divergence—an innovation particularly relevant for UAV multimodal applications where perspective robustness is essential. Extensive experiments demonstrate the effectiveness of our proposed method in resolving view differences. This work not only advances person search technology but also highlights the potential of vision–language models (VLMs) for UAV-centric tasks, advancing the application of new AI technologies in UAVs.

Author Contributions

Conceptualization, J.L. (Jindong Liu), J.L. (Jing Li) and S.Z.; methodology, H.W. and W.L.; software, W.L.; validation, W.W. and F.X.; formal analysis, H.W.; investigation, W.W.; resources, J.L. (Jing Li); writing—original draft preparation, H.W.; writing—review and editing, S.Z.; visualization, H.W.; project administration, J.L. (Jindong Liu); funding acquisition, F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D program of China under grant 2022YFB4300700, The Key R&D programs of Shaanxi Province under grants 2021ZDLGY02-06, 2023-YBGY-132, 2024GX-YBXM-134, in part by Youth New Star Project of Shaanxi Province under grant 2023KJXX-136, in part by the Shaanxi Association for Science and Technology Young Talent Lifting Program under grant XXJS202242, Qin Chuangyuan project (No. 2021QCYRC4-49), Qinchuangyuan Scientist+Engineer (No. HYGJZN202331), National Defense Science and Technology Key Laboratory Fund Project (No. 6142101210202), The Basic Research Program of Natural Science in Shaanxi Province (grant No. 2024JC-YBMS-558), Xi’an Social Science Planning Fund Project (No. 24JX201).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, L.; Zhang, H.; Sun, S.; Chandraker, M.; Yang, Y.; Tian, Q. Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1367–1376. [Google Scholar]
  2. Zhang, S.; Yang, Q.; Cheng, D.; Xing, Y.; Liang, G.; Wang, P.; Zhang, Y. Ground-to-Aerial Person Search: Benchmark Dataset and Approach. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 789–799. [Google Scholar]
  3. Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3415–3424. [Google Scholar]
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  5. Li, S.; Sun, L.; Li, Q. Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1405–1413. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  7. Wang, C.; Ma, B.; Chang, H.; Shan, S.; Chen, X. Tcts: A task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11952–11961. [Google Scholar]
  8. Dong, W.; Zhang, Z.; Song, C.; Tan, T. Instance guided proposal network for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2585–2594. [Google Scholar]
  9. Han, C.; Ye, J.; Zhong, Y.; Tan, X.; Zhang, C.; Gao, C.; Sang, N. Re-id driven localization refinement for person search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9814–9823. [Google Scholar]
  10. Han, C.; Zheng, Z.; Gao, C.; Sang, N.; Yang, Y. Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1505–1512. [Google Scholar] [CrossRef]
  11. Yan, Y.; Li, J.; Qin, J.; Bai, S.; Liao, S.; Liu, L.; Zhu, F.; Shao, L. Anchor-free person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7690–7699. [Google Scholar]
  12. Nguyen, H.; Nguyen, K.; Pemasiri, A.; Liu, F.; Sridharan, S.; Fookes, C. AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition Conference, Chongqing, China, 25–26 October 2025; pp. 1241–1251. [Google Scholar]
  13. Deng, Z.; Ge, Y.; Qi, X.; Sun, K.; Wan, R.; Zhang, B.; Zhang, S.; Zhang, X.; Meng, Y. SPL-PlaneTR: Lightweight and Generalizable Indoor Plane Segmentation Based on Prompt Learning. Sensors 2025, 25, 2797. [Google Scholar] [CrossRef] [PubMed]
  14. Jiang, Y.; Chen, J.; Lu, J. Leveraging Vision Foundation Model via PConv-Based Fine-Tuning with Automated Prompter for Defect Segmentation. Sensors 2025, 25, 2417. [Google Scholar] [CrossRef] [PubMed]
  15. Zhou, Y.; Yan, H.; Ding, K.; Cai, T.; Zhang, Y. Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models. Sensors 2024, 24, 6109. [Google Scholar] [CrossRef] [PubMed]
  16. Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. Maple: Multi-modal prompt learning. In Proceedings of the the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19113–19122. [Google Scholar]
  17. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 139–149. [Google Scholar]
  18. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  19. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825. [Google Scholar]
  20. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 709–727. [Google Scholar]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  22. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1116–1124. [Google Scholar]
  23. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 79–88. [Google Scholar]
  24. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551. [Google Scholar]
  25. Chen, D.; Zhang, S.; Ouyang, W.; Yang, J.; Tai, Y. Person search via a mask-guided two-stream cnn model. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  26. Lan, X.; Zhu, X.; Gong, S. Person search by multi-scale matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 536–552. [Google Scholar]
  27. Li, Z.; Miao, D. Sequential end-to-end network for efficient person search. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2011–2019. [Google Scholar]
  28. Lee, S.; Oh, Y.; Baek, D.; Lee, J.; Ham, B. OIMNet++: Prototypical Normalization and Localization-Aware Learning for Person Search. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part X. pp. 621–637. [Google Scholar]
  29. Cao, J.; Pang, Y.; Anwer, R.M.; Cholakkal, H.; Xie, J.; Shah, M.; Khan, F.S. PSTR: End-to-end one-step person search with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9458–9467. [Google Scholar]
  30. Yu, R.; Du, D.; LaLonde, R.; Davila, D.; Funk, C.; Hoogs, A.; Clipp, B. Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7267–7276. [Google Scholar]
  31. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6449–6458. [Google Scholar]
  32. Jaffe, L.; Zakhor, A. Swap Path Network for Robust Person Search Pre-training. arXiv 2024, arXiv:2412.05433. [Google Scholar] [CrossRef]
  33. Yan, L.; Li, K. Unknown Instance Learning for Person Search. In 2024 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  34. Jia, M.; Cheng, X.; Lu, S.; Zhang, J. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 1294–1305. [Google Scholar] [CrossRef]
  35. Wang, P.; Zhao, Z.; Su, F.; Meng, H. LTReID: Factorizable Feature Generation with Independent Components for Long-Tailed Person Re-Identification. IEEE Trans. Multimed. 2022, 25, 4610–4622. [Google Scholar] [CrossRef]
  36. Dong, N.; Zhang, L.; Yan, S.; Tang, H.; Tang, J. Erasing, transforming, and noising defense network for occluded person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4458–4472. [Google Scholar] [CrossRef]
  37. Xi, J.; Huang, J.; Zheng, S.; Zhou, Q.; Schiele, B.; Hua, X.S.; Sun, Q. Learning comprehensive global features in person re-identification: Ensuring discriminativeness of more local regions. Pattern Recognit. 2023, 134, 109068. [Google Scholar] [CrossRef]
  38. Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4692–4702. [Google Scholar]
  39. Wang, T.; Liu, H.; Song, P.; Guo, T.; Shi, W. Pose-guided feature disentangling for occluded person re-identification based on transformer. AAAI Conf. Artif. Intell. 2022, 36, 2540–2549. [Google Scholar] [CrossRef]
  40. Zhu, K.; Guo, H.; Zhang, S.; Wang, Y.; Liu, J.; Wang, J.; Tang, M. Aaformer: Auto-aligned transformer for person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 17307–17317. [Google Scholar] [CrossRef] [PubMed]
  41. He, S.; Chen, W.; Wang, K.; Luo, H.; Wang, F.; Jiang, W.; Ding, H. Region generation and assessment network for occluded person re-identification. IEEE Trans. Inf. Forensics Secur. 2023, 19, 120–132. [Google Scholar] [CrossRef]
  42. Zhang, G.; Zhang, Y.; Zhang, T.; Li, B.; Pu, S. PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14133–14142. [Google Scholar]
