Sensors
  • Article
  • Open Access

29 August 2025

VIPS: Learning-View-Invariant Feature for Person Search

1 Xi’an Key Laboratory of Human–Machine Integration and Control Technology for Intelligent Rehabilitation, Xijing University, Xi’an 710123, China
2 School of Information Science and Technology, Northwest University, Xi’an 710100, China
3 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
4 Academy of Advanced Interdisciplinary Research, Xidian University, Xi’an 710071, China
This article belongs to the Section Remote Sensors

Abstract

Unmanned aerial vehicles (UAVs) have become indispensable tools for surveillance, enabled by their ability to capture multi-perspective imagery in dynamic environments. Among critical UAV-based tasks, cross-platform person search—detecting and identifying individuals across distributed camera networks—presents unique challenges. Severe viewpoint variations, occlusions, and cluttered backgrounds in UAV-captured data degrade the performance of conventional discriminative models, which struggle to maintain robustness under such geometric and semantic disparities. To address this, we propose view-invariant person search (VIPS), a novel two-stage framework combining Faster R-CNN with a view-invariant re-Identification (VIReID) module. Unlike conventional discriminative models, VIPS leverages the semantic flexibility of large vision–language models (VLMs) and adopts a two-stage training strategy to decouple and align text-based ID descriptors and visual features, enabling robust cross-view matching through shared semantic embeddings. To mitigate noise from occlusions and cluttered UAV-captured backgrounds, we introduce a learnable mask generator for feature purification. Furthermore, drawing from vision–language models, we design view prompts to explicitly encode perspective shifts into feature representations, enhancing adaptability to UAV-induced viewpoint changes. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the efficacy of each component. Beyond technical advancements, this work highlights the potential of VLM-derived semantic alignment for UAV applications, offering insights for future research in real-time UAV-based surveillance systems.

1. Introduction

Unmanned aerial vehicles (UAVs), or drones, have undergone transformative advancements in autonomy, sensing, and AI integration, enabling their deployment across diverse sectors from precision agriculture and disaster response to smart city surveillance. The global drone market is expected to continue its growth trajectory, driven by demand for scalable, real-time data acquisition and analysis. A critical enabler of this growth is the fusion of artificial intelligence (AI) with UAV platforms, particularly in cross-platform perception systems where drones collaborate with ground-based sensors (e.g., CCTV, IoT devices) to achieve comprehensive environmental understanding.
Such systems are increasingly vital for large-scale security monitoring, exemplified by applications like search-and-rescue operations, crowd behavior analysis, and cross-border surveillance. However, a fundamental challenge in these scenarios is cross-platform person search, seamlessly detecting and re-identifying individuals across heterogeneous camera networks comprising UAV-mounted and fixed ground cameras. As shown in Figure 1, a sharp contrast emerges between traditional same-platform datasets such as PRW, where camera views are relatively homogeneous and mostly ground-based, and cross-platform datasets like G2APS, which couple UAV-mounted aerial views with fixed ground cameras. The drastic viewpoint disparity in G2APS (e.g., top-down vs. frontal perspectives) results in pronounced geometric distortions, scale variations, and appearance shifts, thereby causing much higher false-negative rates compared to conventional ground-only benchmarks.
Figure 1. Persons captured by different cameras in the PRW [] and G2APS [] datasets. The images in PRW are all from ground cameras, while the images in G2APS are from a ground camera and a UAV. A more detailed description of dataset characteristics is available in Section 5.
Existing person search methods mainly focus on the cooperation of the two subtasks, the misalignment of scales, occlusion, and detector optimization; few works consider the view difference when matching persons across cameras. One likely reason is that CUHK-SYSU [], one of the most commonly used datasets, provides no camera annotations. However, its images are derived from hand-held cameras and movie snapshots and can still be regarded as coming from two different views. In addition, the PRW [] dataset was captured by six different cameras, and this rich view information also deserves to be exploited more fully. Recently, Zhang et al. [] constructed G2APS, a person search dataset captured jointly by a UAV and ground cameras. The huge difference between the high-altitude and ground views makes accurate person matching even more difficult.
The vision–language pre-training model CLIP [] unifies the image and text modalities. It can learn semantic information from the prompt “A photo of a {xx} person” and match it with the corresponding image. Thanks to its zero-shot transfer ability, it is widely used in many downstream tasks such as image classification, object detection, and semantic segmentation. CLIP-ReID [] applied CLIP to person ReID for the first time, learning the descriptor “A photo of a {xx} person” for each ID and guiding the image encoder to learn the semantic information in the image through the descriptor. Inspired by this, we use CLIP to learn the semantic information shared across images of different views and then embed this semantic information into image features to address the problem of view difference.
In this work, we present view-invariant person search (VIPS), a novel two-stage framework that synergizes UAV-optimized detection with vision–language model (VLM) advancements to address cross-view person search challenges. The model consists of Faster R-CNN [] and view-invariant person ReID (VIReID). VIReID uses CLIP-ReID as the baseline model. Like CLIP-ReID, VIReID has two training stages: one to learn ID descriptor text features and one for visual features. We observe that the presence of background and occlusion in the image introduces noise to the text feature, so we design a mask generator to eliminate the noise, allowing the model to learn more accurate text features. Furthermore, to reduce the difference between image features of the same ID under different views, we design the view prompt to embed different view information through a set of learnable embeddings. By inputting the view prompt and image patch embeddings into each encoder layer, the encoder learns view-invariant image features. To the best of our knowledge, VIPS provides pioneering contributions to AI-driven UAV applications. VIPS integrates CLIP’s semantic alignment for cross-platform person search, providing a dedicated solution for UAV-induced viewpoint discrepancies.
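To make the design concrete, the following sketch illustrates one plausible form of the mask generator: a lightweight head that predicts a soft per-patch foreground score and re-weights the visual tokens before pooling. The module structure and names here are illustrative simplifications; the exact architecture is defined in Section 4.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Illustrative feature-purification head: predicts a soft foreground mask over ViT patch tokens
    and suppresses background/occlusion tokens before they are used to supervise the text features."""

    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))  # per-patch foreground logit

    def forward(self, patch_tokens):
        # patch_tokens: (B, N_p, d) patch embeddings from the image encoder
        mask = torch.sigmoid(self.score(patch_tokens))                    # (B, N_p, 1) soft foreground probability
        purified = patch_tokens * mask                                    # down-weight cluttered or occluded patches
        pooled = purified.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)    # masked average pooling
        return pooled, mask.squeeze(-1)
```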
We conduct comprehensive evaluations on both person search benchmarks and traditional person re-identification benchmarks. The experimental results demonstrate that our proposed methods can significantly outperform existing state-of-the-art methods. The visualization of the feature mask further illustrates that our method can guide the model to focus on more discriminative regions. The contributions of this paper are summarized as follows:
  • We propose a novel viewpoint-invariant person search (VIPS) framework leveraging CLIP’s semantic alignment for UAV and cross-camera scenarios;
  • We propose a mask generator to suppress noise in UAV-captured images, enhancing text-guided feature learning and view prompts to encode camera perspectives into visual features, reducing viewpoint discrepancy;
  • Extensive experiments on five benchmark datasets demonstrate the superiority of VIPS, establishing a new state-of-the-art method in UAV-based person search tasks.
The rest of this article is organized as follows: We review the related works in Section 2 and overview preliminary works in Section 3. The proposed framework and the optimization procedure are detailed in Section 4. Section 5 highlights the experimental evaluations. Section 6 draws the conclusion.

3. Preliminaries

3.1. Overview of CLIP-ReID

CLIP-ReID [] applies CLIP [] to the person ReID task. It consists of a text encoder and an image encoder (the latter implemented by ViT-B/16 []) and learns the descriptor “A photo of a $[X]_1\,[X]_2\,[X]_3\cdots[X]_L$ person” for each person ID, where $[X]$ represents a learnable text embedding.
CLIP-ReID adopts a two-stage training strategy. For the $k$-th image with ID $y_k$, the model learns the text feature $T_k$ of the ID descriptor in the first stage. Then the model learns the visual feature $V_k$ under the supervision of $T_k$ and $y_k$.
In the first stage, two contrastive losses are used:
$$\mathcal{L}_{i2t}(k) = -\log \frac{\exp\!\big(S(V_k, T_k)\big)}{\sum_{x=1}^{B} \exp\!\big(S(V_k, T_x)\big)},$$
$$\mathcal{L}_{t2i}(y_k) = -\frac{1}{|Q(y_k)|} \sum_{q \in Q(y_k)} \log \frac{\exp\!\big(S(V_q, T_{y_k})\big)}{\sum_{b=1}^{B} \exp\!\big(S(V_b, T_{y_k})\big)},$$
where $S(\cdot,\cdot)$ denotes the cosine similarity, $B$ is the batch size, $Q(y_k)$ is the set of indices of the images in the batch that share the ID $y_k$, and $|\cdot|$ denotes the number of elements in a set.
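For illustration, these two stage-1 losses can be sketched in PyTorch as follows; the function name, the folding of CLIP’s temperature into $S(\cdot,\cdot)$, and the batching details are simplifying assumptions rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def stage1_contrastive_losses(V, T_batch, labels):
    """Stage-1 image-to-text and text-to-image contrastive losses.

    V:        (B, D) visual features V_k of the batch.
    T_batch:  (B, D) text features T_{y_k} of each image's ID descriptor.
    labels:   (B,)   person ID y_k of each image.
    The CLIP temperature is assumed to be folded into the similarity for brevity.
    """
    V = F.normalize(V, dim=-1)
    T = F.normalize(T_batch, dim=-1)
    sim = V @ T.t()                                       # sim[k, x] = S(V_k, T_x)

    # L_i2t(k): each image is pulled toward its own ID descriptor against the other descriptors in the batch.
    loss_i2t = -F.log_softmax(sim, dim=1).diagonal()

    # L_t2i(y_k): each descriptor is matched against images, averaged over the set Q(y_k) of same-ID images.
    pos = labels[:, None].eq(labels[None, :]).float()     # pos[k, q] = 1 if image q shares its ID with image k
    log_p = F.log_softmax(sim.t(), dim=1)                 # softmax over images b for each descriptor
    loss_t2i = -(log_p * pos).sum(dim=1) / pos.sum(dim=1)

    return loss_i2t.mean(), loss_t2i.mean()
```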
In the second stage, a cross-entropy loss is designed:
$$\mathcal{L}_{i2tce}(k) = \sum_{j=1}^{N} -q_j \log \frac{\exp\!\big(S(V_k, T_{y_j})\big)}{\sum_{y_b=1}^{N} \exp\!\big(S(V_k, T_{y_b})\big)},$$
where $q_j = (1-\epsilon)\,\delta_{j,y} + \epsilon/N$ is the label-smoothed target probability of class $j$ ($\delta_{j,y}$ is the Kronecker delta and $\epsilon$ the smoothing coefficient), and $N$ represents the number of IDs.
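Correspondingly, the stage-2 loss can be sketched as follows, assuming the text features of all $N$ ID descriptors have been learned in stage one and are kept frozen; names and details are illustrative.

```python
import torch
import torch.nn.functional as F

def stage2_i2tce_loss(V, T_all, labels, epsilon=0.1):
    """Stage-2 image-to-text cross-entropy with label smoothing, q_j = (1 - eps) * delta_{j,y} + eps / N.

    V:      (B, D) visual features.
    T_all:  (N, D) text features of all N ID descriptors (learned in stage one and frozen here).
    labels: (B,)   ground-truth ID index y_k of each image.
    """
    N = T_all.size(0)
    sim = F.normalize(V, dim=-1) @ F.normalize(T_all, dim=-1).t()   # S(V_k, T_{y_j}) for every ID j
    log_prob = F.log_softmax(sim, dim=1)

    # Smoothed target distribution over the N identities.
    q = torch.full_like(log_prob, epsilon / N)
    q.scatter_(1, labels.view(-1, 1), 1.0 - epsilon + epsilon / N)

    return -(q * log_prob).sum(dim=1).mean()
```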
CLIP-ReID successfully applies a vision–language pre-training model to the person ReID task through a two-stage training scheme. The visual features extracted by CLIP contain rich semantic information, so we build our two-step person search framework, VIPS, on top of CLIP-ReID. We use the text features shared across different views to guide the model to learn view-invariant visual features and thereby align person features under different views.

3.2. Overview of Faster R-CNN

Faster R-CNN is an end-to-end framework composed of three key modules: a feature extractor $M_f$, a Region Proposal Network (RPN) $M_{rpn}$, and an RoI head $M_{roi}$. Given an image $I$, the feature extractor $M_f$ produces a dense convolutional feature map $F$. The RPN $M_{rpn}$ then slides a small network over $F$ to generate, for each of $k$ predefined anchors at every location, an objectness score $p_i$ and a set of box offsets $l_i = (l_x, l_y, l_w, l_h)$. After non-maximum suppression, the top proposals are reshaped by RoIAlign into fixed-size tensors and passed to the RoI head $M_{roi}$, which yields class probabilities $c_i \in \mathbb{R}^{|C|+1}$ ($|C|$ foreground classes plus one background) and refined box coordinates $l_i$.
Training relies on a multi-task loss. The RPN loss can be formulated as
$$\mathcal{L}_{RPN}(p, p^*, l, l^*) = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(l_i, l_i^*),$$
where $\mathcal{L}_{cls}$ denotes the cross-entropy classification loss, $\mathcal{L}_{reg}$ is the smooth-$L_1$ regression loss, $p_i^* \in \{0, 1\}$ with $p_i^* = 1$ indicating that the corresponding proposal region is positive (contains a foreground object), and $c_i^*$, $l_i^*$ are the ground-truth class labels and box regression targets, respectively. The RoI head loss can be formulated as
$$\mathcal{L}_{ROI}(c, c^*, l, l^*) = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(c_i, c_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(l_i, l_i^*).$$
The overall objective combines these two terms:
$$\mathcal{L} = \mathcal{L}_{RPN} + \mathcal{L}_{ROI}.$$
Our system uses a trained Faster R-CNN to generate person bounding boxes, which are then cropped and resized for the ReID stage.
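For illustration, the detection stage of this pipeline can be sketched with torchvision’s Faster R-CNN as a stand-in detector; the score threshold and crop size below are illustrative defaults rather than the exact settings used in our experiments.

```python
import torch
import torchvision
from torchvision.transforms.functional import resized_crop

# Stand-in for the trained detector; weights="DEFAULT" loads a COCO-pretrained Faster R-CNN.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_and_crop(image, score_thresh=0.5, crop_size=(256, 128)):
    """Run Faster R-CNN on a scene image (3, H, W) in [0, 1] and return resized person crops for the ReID stage."""
    out = detector([image])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)   # COCO class 1 = person
    crops = []
    for x1, y1, x2, y2 in out["boxes"][keep].tolist():
        crops.append(resized_crop(image, int(y1), int(x1),
                                  int(y2 - y1), int(x2 - x1), list(crop_size)))
    return torch.stack(crops) if crops else torch.empty(0, 3, *crop_size)

# During training, torchvision's detector returns a loss dictionary whose sum corresponds to L = L_RPN + L_ROI:
#   losses = detector(images, targets)
#   total_loss = sum(losses.values())   # loss_objectness + loss_rpn_box_reg + loss_classifier + loss_box_reg
```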

3.3. Vision Transformer

We adopt a Vision Transformer (ViT) as the visual feature encoder of VIPS. For an image of size $H \times W \times 3$, ViT [] divides it into $N_p$ image patches of size $P \times P$ and encodes each patch, together with its position embedding, into a $d$-dimensional vector $e_i \in \mathbb{R}^d$. There are 12 encoder layers in ViT-B/16. We use $E_i = \{ e_i^j \in \mathbb{R}^d \mid 1 \le j \le N_p \}$ to denote the patch embeddings fed into the $i$-th layer $L_i$. ViT can then be formulated as
$$[x_1, E_1] = L_1([\mathrm{CLS}, E_0]),$$
$$[x_i, E_i] = L_i([x_{i-1}, E_{i-1}]), \quad i = 2, 3, \ldots, 12,$$
where $x_i \in \mathbb{R}^d$ represents the embedding of the CLS token, and $[\cdot,\cdot]$ represents stacking the two token sequences.
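A simplified PyTorch sketch of this forward pass is given below; it uses standard post-norm transformer layers and omits details of the actual ViT-B/16 (pre-norm blocks, GELU, dropout), so it is illustrative only.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style encoder mirroring the equations above: patch embedding, CLS token, 12 layers."""

    def __init__(self, img_size=(256, 128), patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)          # N_p
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True) for _ in range(depth)])

    def forward(self, x):
        E0 = self.patch_embed(x).flatten(2).transpose(1, 2)                  # (B, N_p, d) patch embeddings
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), E0], dim=1) + self.pos
        for layer in self.layers:                                            # [x_i, E_i] = L_i([x_{i-1}, E_{i-1}])
            tokens = layer(tokens)
        return tokens[:, 0]                                                  # x_12: the final CLS embedding
```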

5. Experiments

In this section, we will introduce the datasets and metrics used in the experiments, as well as details in training. Finally, our experimental results are analyzed.
Datasets To comprehensively evaluate our approach, we conducted experiments on three benchmark person search datasets (CUHK-SYSU [], PRW [], and G2APS []) and three person ReID datasets, covering both conventional and UAV-specific scenarios. CUHK-SYSU is a large-scale benchmark containing 18,184 images with 8432 identities, notable for its realistic search scenario, where targets must be identified from whole gallery images rather than pre-cropped boxes. PRW comprises 11,816 frames from 6 synchronized cameras with 932 identities, emphasizing real-world challenges in pedestrian retrieval, with comprehensive annotations for both bounding boxes and identities. G2APS is the first cross-platform person search dataset specifically designed for UAV–ground camera scenarios, containing 31,770 images of 2077 identities. Each identity appears in both ground and aerial views, making it particularly valuable for evaluating view-invariant methods. In addition, to further verify the effectiveness of VIReID, we also conducted experiments on the person ReID datasets Market-1501 [], MSMT17 [], and Occluded-Duke []. The detailed information for each dataset is summarized in Table 1.
Table 1. Statistics of the datasets used in our experiments. “#image” denotes the number of images, and “#ID” denotes the number of unique person identities. For person search datasets (CUHK-SYSU, PRW, and G2APS), the images are uncropped scene-level images containing multiple pedestrians, while for person ReID datasets (Market-1501, MSMT17, and Occluded-Duke), each image is a cropped pedestrian bounding box. In CUHK-SYSU, camera and movie images are treated as two distinct views. MSMT17 is a large-scale, multi-view dataset collected across 15 cameras under diverse indoor and outdoor conditions, featuring pronounced viewpoint differences that closely resemble cross-platform person search scenarios.
Evaluation Protocols We adopt Mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) as the evaluation metrics. mAP measures the overall retrieval quality by averaging the precision over all query identities, while CMC evaluates the retrieval accuracy at different ranks. Specifically, Rank-$k$ denotes the proportion of query images whose correct match appears within the top-$k$ retrieved gallery results, i.e., $\text{Rank-}k = \frac{\#\,\text{queries correctly matched within top-}k}{\#\,\text{total queries}}$. mAP is calculated as the mean of the average precision scores across all queries, where each average precision is obtained by integrating the corresponding precision–recall curve. Unless otherwise stated, we report mAP/CMC in percentage (%) and all “improvements” as absolute percentage points. For all datasets, we follow the standard train/test splits provided in their official protocols.
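For reference, both metrics can be computed from a query–gallery distance matrix as in the following NumPy sketch; the per-camera filtering applied by some official protocols is omitted for brevity.

```python
import numpy as np

def rank_k_and_map(dist, q_ids, g_ids, topk=(1, 5, 10)):
    """Compute CMC Rank-k and mAP from a (n_q, n_g) query-gallery distance matrix."""
    order = np.argsort(dist, axis=1)                        # gallery indices sorted by increasing distance
    matches = (g_ids[order] == q_ids[:, None]).astype(np.float64)

    # CMC: fraction of queries whose first correct match appears within the top-k results.
    has_match = matches.any(axis=1)
    first_hit = matches.argmax(axis=1)                      # rank (0-based) of the first correct gallery item
    cmc = {k: float(((first_hit < k) & has_match).mean()) for k in topk}

    # mAP: mean of the average precision of each query's ranked list.
    aps = []
    for row in matches:
        hits = row.nonzero()[0]
        if hits.size == 0:
            continue                                        # queries without any correct match are skipped
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return cmc, float(np.mean(aps))
```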
Implementation Details The model was implemented in PyTorch, and all experiments were conducted on an NVIDIA RTX 3090 GPU. The person ReID model uses CLIP-ReID [] as the baseline model. All weights were initialized from CLIP-ReID pre-trained on Market-1501 []. The initial weights of the mask generator and the image encoder were identical. The batch size was $B = 64$; the first stage was trained for 120 epochs and the second stage for 240 epochs.
The number of learnable text embeddings in the ID descriptor was set to $L = 6$, the image patch size to $P = 16$, and the feature dimension to $D = 512$. The view prompt was configured with $h = 6$ and $N_v = 11$, and $N_c$ was set directly to the number of cameras in the dataset. Note that CUHK-SYSU [] has no camera annotations, but its camera and movie images can be regarded as two different view styles, so we set $N_c = 2$ for CUHK-SYSU.
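The sketch below shows one plausible realization of a camera-indexed view prompt with these dimensions, i.e., $N_v$ learnable tokens per camera injected into the first $h$ encoder layers; it is an illustrative assumption rather than the exact module defined in Section 4.

```python
import torch
import torch.nn as nn

class ViewPrompt(nn.Module):
    """Illustrative view prompt: N_v learnable tokens per camera, injected into the first h encoder layers."""

    def __init__(self, n_cameras, depth_h=6, n_tokens=11, dim=768):
        super().__init__()
        # One prompt table per camera view and per prompted layer: (N_c, h, N_v, d).
        self.prompts = nn.Parameter(torch.randn(n_cameras, depth_h, n_tokens, dim) * 0.02)
        self.depth_h = depth_h

    def forward(self, tokens, cam_ids, layer_idx):
        """Prepend the camera-specific prompt of this layer to a (B, 1 + N_p, d) sequence of CLS + patch tokens.
        Prompt tokens output by the previous layer are assumed to have been discarded before this call."""
        if layer_idx >= self.depth_h:
            return tokens                                   # deeper layers are left unprompted
        p = self.prompts[cam_ids, layer_idx]                # (B, N_v, d) selected by each sample's camera ID
        return torch.cat([tokens[:, :1], p, tokens[:, 1:]], dim=1)   # CLS first, then prompts, then patches
```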

5.1. Comparison with State-of-the-Art Methods

In this section, we compare the proposed VIPS with other state-of-the-art (SOTA) person search methods in Table 2. In addition, we also compare VIReID with SOTA person ReID methods, and the results are shown in Table 3.
Table 2. Comparison of VIPS and all state-of-the-art person search methods on PRW, CUHK-SYSU, and G2APS datasets. The end-to-end methods are marked with *, and the best results are in bold. Param. (M) denotes the number of learnable parameters.
Table 3. Comparison of VIReID and all state-of-the-art person ReID methods on three datasets. The best results are highlighted in bold.
Person search As can be seen from Table 2, the best methods among the end-to-end methods and the two-step methods are HKD [] and TCTS []. We attribute this to the former alleviating the conflict between detection and identification subtasks through the head knowledge distillation strategy, while the latter makes the detection and recognition tasks more consistent by generating query-like proposals. However, VIPS significantly outperforms both of them. VIPS outperforms HKD by 2.8%, 2.9%, and 15.6% in mAP on the three datasets, respectively. Compared with TCTS, which is also a two-step method, VIPS obtains 9.5% and 4.3% mAP advantages on PRW and CUHK-SYSU, respectively. These results collectively demonstrate that viewpoint variation represents a fundamental challenge in person search that has been largely overlooked in previous works. Our approach successfully addresses this limitation through its novel integration of foreground-aware feature purification and view-conditioned adaptation, establishing a new paradigm for view-invariant person search that is particularly suited for aerial surveillance scenarios.
Person ReID Performance As demonstrated in Table 3, our proposed VIReID establishes new state-of-the-art performance across all three benchmark datasets. The method achieves remarkable mAP scores of 90.3%, 74.9%, and 60.2% on Market-1501, MSMT17, and Occluded-Duke, respectively, surpassing all existing CNN-based and ViT-based approaches. Notably, VIReID shows particularly strong performance on MSMT17 (74.9% mAP). We attribute these results to the inherent characteristics of MSMT17. MSMT17 is characterized by a complex multi-camera setup, wide-ranging viewpoints, and diverse environmental conditions, which closely mirror the cross-platform scenario. This comprehensive evaluation not only validates our technical innovations but also highlights the importance of viewpoint invariance as a critical research direction for UAV-based person recognition systems.

5.2. Ablation Study

In this section, we conduct thorough ablation experiments on PRW to explore the effectiveness of each proposed module.
Effectiveness of mask generator and view prompt We incrementally added the mask generator and view prompt modules to the baseline model to assess their individual and combined contributions. The experimental results are summarized in Table 4. The baseline model achieves an mAP of 55.5% and a top-1 accuracy of 81.0%. With the addition of the mask generator, the performance improves slightly to 55.8% mAP and 82.1% top-1 accuracy. Similarly, introducing the view prompt yields 56.0% mAP and 81.6% top-1. When both modules are integrated, the model reaches its best performance at 56.3% mAP and 82.6% top-1 accuracy, indicating gains of +0.8% mAP and +1.6% top-1 over the baseline. These results validate the complementary benefits of the two modules. The mask generator helps the text encoder focus on foreground semantics by filtering out background noise and occlusions, while the view prompt enhances the image encoder’s ability to learn robust visual representations across varying viewpoints.
Table 4. Performance comparison before and after adding the mask generator and view prompt to the baseline model. ✓ means adding it, and × means not adding it. The best results are highlighted in bold.
Analysis of Performance Heterogeneity Across Benchmarks As shown in Table 2 and Table 3, the performance gains vary substantially across datasets. We attribute these results to the distinct characteristics of different datasets. Datasets with greater view diversity and geometric shifts (e.g., G2APS with UAV–ground pairs and MSMT17 with 15 cameras) exhibit the most significant improvements, as our proposed camera-conditioned prompts effectively bridge cross-view gaps. In contrast, relatively smaller gains are achieved on PRW and CUHK-SYSU, with fewer or more homogeneous cameras. Furthermore, the severe occlusion and clutter in G2APS and MSMT17 highlight the benefits of VIPS’s mask generator, which suppresses irrelevant regions. To further validate this, we conducted an ablation on MSMT17 (Table 5), showing that improvements remain stable across different time-of-day splits, while being slightly larger in complex indoor scenes, demonstrating VIPS’s effectiveness under challenging environments. The performance heterogeneity further confirms the strength of our proposed method under large cross-view and cluttered conditions.
Table 5. Ablation study on MSMT17 under different query splits. The query set is divided by time of day into morning, noon, and afternoon subsets and by scene type into indoor and outdoor subsets, while the gallery set remains unchanged.
Mask visualization Figure 3 presents visualizations of the generated foreground masks under different scenarios. In the first two examples, the scenes are relatively clean, with minimal background clutter and occlusion. In contrast, the latter two examples contain significant occlusions—primarily from the background and a bicycle, respectively. These visualizations demonstrate that the proposed foreground mask is capable of effectively suppressing irrelevant background regions and occluding objects, thereby enhancing the focus on salient targets. This is particularly valuable for UAV-based perception tasks, where dynamic environments and occlusions are common challenges.
Figure 3. Masks of several persons in PRW, where (a,d,g,j) show the original images, (b,e,h,k) show the masks, and (c,f,i,l) show the masks overlaid on the original images.
Hyper-parameter Sensitivity As illustrated in Figure 4, we studied the impact of the two view-prompt hyper-parameters, prompt depth $h$ and token count $N_v$. First, with $N_v$ fixed at 1, we varied $h$ over $\{0, 2, 4, 6, 8, 12\}$ and observed that both mAP and top-1 accuracy peaked at $h = 6$. Next, holding $h = 6$ constant, we swept $N_v$ through $\{1, 3, 5, 7, 9, 11, 13\}$, finding the best retrieval performance at $N_v = 11$. Consequently, we adopted $h = 6$ and $N_v = 11$ for all subsequent experiments.
Figure 4. Effect of view-prompt depth $h$ and token count $N_v$ on PRW performance. (a) Retrieval mAP and top-1 accuracy as $h$ increases (with $N_v = 1$). (b) Retrieval mAP and top-1 accuracy as $N_v$ increases (with $h = 6$).
Effectiveness of mask Figure 5 shows the distance distribution between the visual features of the images in the training set and the corresponding text features. After applying the mask, the distance between textual features and visual features is significantly reduced. This shows that the mask can remove the noise in text features. In the second training stage, the image features contain more accurate semantic information, which helps the model solve the problem of view differences through the semantic information shared between different view images.
Figure 5. Euclidean distance distribution of image features and corresponding text features in the training set of PRW.
Visualization of view prompt effects The G2APS dataset contains images in the ground camera and UAV views, with a very significant view difference. Therefore, we use t-SNE to visualize the image feature distribution under two different views before and after fusing the view prompt, and the results are shown in Figure 6. Before fusing the view prompt with the image feature, although the samples with the same ID are closely clustered, there is an obvious gap between the feature distributions of the two different views. After fusing, the image features of different views for each ID are brought closer. This demonstrates that the view prompt helps the model learn view-invariant visual features.
Figure 6. Distribution of image features in two views before (a) and after (b) fused with the view prompt. “o” and “+” represent the ground camera and UAV views, respectively, and the numbers represent the sample index rather than the ID.
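The projection in Figure 6 can be reproduced with an off-the-shelf t-SNE implementation; the sketch below assumes pre-extracted feature and label arrays and follows the marker convention of the figure.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_view_tsne(features, view_labels, id_labels):
    """Project ReID features (N, D) to 2-D with t-SNE and mark ground vs. UAV views as in Figure 6."""
    emb = TSNE(n_components=2, init="pca", perplexity=30, random_state=0).fit_transform(features)
    for view, marker in [(0, "o"), (1, "+")]:               # 0 = ground camera view, 1 = UAV view
        sel = view_labels == view
        plt.scatter(emb[sel, 0], emb[sel, 1], c=id_labels[sel], cmap="tab20", marker=marker, s=30)
    plt.axis("off")
    plt.show()
```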

6. Conclusions

This paper addresses the critical yet understudied challenge of viewpoint differences in person search, a problem exacerbated in UAV-based surveillance where cross-camera perspective shifts severely degrade matching accuracy. We propose VIPS, a novel two-stage framework that combines Faster R-CNN with a view-invariant ReID module, leveraging CLIP’s vision-language paradigm to align features across viewpoints through shared semantic descriptors. To overcome noise from UAV-captured backgrounds and occlusions, we introduce a mask generator to purify text-guided feature learning. Furthermore, our view prompts explicitly encode camera perspectives into visual features, mitigating viewpoint divergence—an innovation particularly relevant for UAV multimodal applications where perspective robustness is essential. Extensive experiments demonstrate the effectiveness of our proposed method in resolving view differences. This work not only advances person search technology but also highlights the potential of vision–language models (VLMs) for UAV-centric tasks, advancing the application of new AI technologies in UAVs.

Author Contributions

Conceptualization, J.L. (Jindong Liu), J.L. (Jing Li) and S.Z.; methodology, H.W. and W.L.; software, W.L.; validation, W.W. and F.X.; formal analysis, H.W.; investigation, W.W.; resources, J.L. (Jing Li); writing—original draft preparation, H.W.; writing—review and editing, S.Z.; visualization, H.W.; project administration, J.L. (Jindong Liu); funding acquisition, F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D program of China under grant 2022YFB4300700, The Key R&D programs of Shaanxi Province under grants 2021ZDLGY02-06, 2023-YBGY-132, 2024GX-YBXM-134, in part by Youth New Star Project of Shaanxi Province under grant 2023KJXX-136, in part by the Shaanxi Association for Science and Technology Young Talent Lifting Program under grant XXJS202242, Qin Chuangyuan project (No. 2021QCYRC4-49), Qinchuangyuan Scientist+Engineer (No. HYGJZN202331), National Defense Science and Technology Key Laboratory Fund Project (No. 6142101210202), The Basic Research Program of Natural Science in Shaanxi Province (grant No. 2024JC-YBMS-558), Xi’an Social Science Planning Fund Project (No. 24JX201).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, L.; Zhang, H.; Sun, S.; Chandraker, M.; Yang, Y.; Tian, Q. Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1367–1376. [Google Scholar]
  2. Zhang, S.; Yang, Q.; Cheng, D.; Xing, Y.; Liang, G.; Wang, P.; Zhang, Y. Ground-to-Aerial Person Search: Benchmark Dataset and Approach. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 789–799. [Google Scholar]
  3. Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3415–3424. [Google Scholar]
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  5. Li, S.; Sun, L.; Li, Q. Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1405–1413. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  7. Wang, C.; Ma, B.; Chang, H.; Shan, S.; Chen, X. Tcts: A task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11952–11961. [Google Scholar]
  8. Dong, W.; Zhang, Z.; Song, C.; Tan, T. Instance guided proposal network for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2585–2594. [Google Scholar]
  9. Han, C.; Ye, J.; Zhong, Y.; Tan, X.; Zhang, C.; Gao, C.; Sang, N. Re-id driven localization refinement for person search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9814–9823. [Google Scholar]
  10. Han, C.; Zheng, Z.; Gao, C.; Sang, N.; Yang, Y. Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1505–1512. [Google Scholar] [CrossRef]
  11. Yan, Y.; Li, J.; Qin, J.; Bai, S.; Liao, S.; Liu, L.; Zhu, F.; Shao, L. Anchor-free person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7690–7699. [Google Scholar]
  12. Nguyen, H.; Nguyen, K.; Pemasiri, A.; Liu, F.; Sridharan, S.; Fookes, C. AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition Conference, Chongqing, China, 25–26 October 2025; pp. 1241–1251. [Google Scholar]
  13. Deng, Z.; Ge, Y.; Qi, X.; Sun, K.; Wan, R.; Zhang, B.; Zhang, S.; Zhang, X.; Meng, Y. SPL-PlaneTR: Lightweight and Generalizable Indoor Plane Segmentation Based on Prompt Learning. Sensors 2025, 25, 2797. [Google Scholar] [CrossRef] [PubMed]
  14. Jiang, Y.; Chen, J.; Lu, J. Leveraging Vision Foundation Model via PConv-Based Fine-Tuning with Automated Prompter for Defect Segmentation. Sensors 2025, 25, 2417. [Google Scholar] [CrossRef] [PubMed]
  15. Zhou, Y.; Yan, H.; Ding, K.; Cai, T.; Zhang, Y. Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models. Sensors 2024, 24, 6109. [Google Scholar] [CrossRef] [PubMed]
  16. Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. Maple: Multi-modal prompt learning. In Proceedings of the the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19113–19122. [Google Scholar]
  17. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 139–149. [Google Scholar]
  18. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  19. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825. [Google Scholar]
  20. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 709–727. [Google Scholar]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  22. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1116–1124. [Google Scholar]
  23. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 79–88. [Google Scholar]
  24. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551. [Google Scholar]
  25. Chen, D.; Zhang, S.; Ouyang, W.; Yang, J.; Tai, Y. Person search via a mask-guided two-stream cnn model. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  26. Lan, X.; Zhu, X.; Gong, S. Person search by multi-scale matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 536–552. [Google Scholar]
  27. Li, Z.; Miao, D. Sequential end-to-end network for efficient person search. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2011–2019. [Google Scholar]
  28. Lee, S.; Oh, Y.; Baek, D.; Lee, J.; Ham, B. OIMNet++: Prototypical Normalization and Localization-Aware Learning for Person Search. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part X. pp. 621–637. [Google Scholar]
  29. Cao, J.; Pang, Y.; Anwer, R.M.; Cholakkal, H.; Xie, J.; Shah, M.; Khan, F.S. PSTR: End-to-end one-step person search with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9458–9467. [Google Scholar]
  30. Yu, R.; Du, D.; LaLonde, R.; Davila, D.; Funk, C.; Hoogs, A.; Clipp, B. Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7267–7276. [Google Scholar]
  31. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6449–6458. [Google Scholar]
  32. Jaffe, L.; Zakhor, A. Swap Path Network for Robust Person Search Pre-training. arXiv 2024, arXiv:2412.05433. [Google Scholar] [CrossRef]
  33. Yan, L.; Li, K. Unknown Instance Learning for Person Search. In 2024 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  34. Jia, M.; Cheng, X.; Lu, S.; Zhang, J. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 1294–1305. [Google Scholar] [CrossRef]
  35. Wang, P.; Zhao, Z.; Su, F.; Meng, H. LTReID: Factorizable Feature Generation with Independent Components for Long-Tailed Person Re-Identification. IEEE Trans. Multimed. 2022, 25, 4610–4622. [Google Scholar] [CrossRef]
  36. Dong, N.; Zhang, L.; Yan, S.; Tang, H.; Tang, J. Erasing, transforming, and noising defense network for occluded person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4458–4472. [Google Scholar] [CrossRef]
  37. Xi, J.; Huang, J.; Zheng, S.; Zhou, Q.; Schiele, B.; Hua, X.S.; Sun, Q. Learning comprehensive global features in person re-identification: Ensuring discriminativeness of more local regions. Pattern Recognit. 2023, 134, 109068. [Google Scholar] [CrossRef]
  38. Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4692–4702. [Google Scholar]
  39. Wang, T.; Liu, H.; Song, P.; Guo, T.; Shi, W. Pose-guided feature disentangling for occluded person re-identification based on transformer. AAAI Conf. Artif. Intell. 2022, 36, 2540–2549. [Google Scholar] [CrossRef]
  40. Zhu, K.; Guo, H.; Zhang, S.; Wang, Y.; Liu, J.; Wang, J.; Tang, M. Aaformer: Auto-aligned transformer for person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 17307–17317. [Google Scholar] [CrossRef] [PubMed]
  41. He, S.; Chen, W.; Wang, K.; Luo, H.; Wang, F.; Jiang, W.; Ding, H. Region generation and assessment network for occluded person re-identification. IEEE Trans. Inf. Forensics Secur. 2023, 19, 120–132. [Google Scholar] [CrossRef]
  42. Zhang, G.; Zhang, Y.; Zhang, T.; Li, B.; Pu, S. PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14133–14142. [Google Scholar]
