Article

Unsupervised Aerial-Ground Re-Identification from Pedestrian to Group for UAV-Based Surveillance

1 School of Electronic Information, Wuhan University of Science and Technology, Wuhan 430081, China
2 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
3 School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(4), 244; https://doi.org/10.3390/drones9040244
Submission received: 3 March 2025 / Revised: 21 March 2025 / Accepted: 24 March 2025 / Published: 25 March 2025

Abstract

Person re-identification (ReID) plays a crucial role in advancing UAV-based surveillance applications, enabling robust tracking and event analysis. However, existing methods in UAV scenarios primarily focus on individual pedestrians, requiring cumbersome annotation efforts and lacking seamless integration with ground-based surveillance systems. These limitations hinder the broader development of UAV-based monitoring. To address these challenges, this paper proposes an Unsupervised Aerial-Ground Re-identification from Pedestrian to Group (UAGRPG) framework. Specifically, we introduce a neighbor-aware collaborative learning (NCL) and gradual graph correspondence (GGC) strategy to uncover the implicit associations between cross-modality groups in an unsupervised manner. Furthermore, we develop a collaborative cross-modality association learning (CCAL) module to bridge feature disparities and achieve soft alignment across modalities. To quantify the optimal group similarity between aerial and ground domains, we design a minimum pedestrian distance transformation strategy. Additionally, we introduce a new AG-GReID dataset, and extensive experiments demonstrate that our approach achieves state-of-the-art performance on both pedestrian and group re-identification tasks in aerial-ground scenarios, validating its effectiveness in integrating ground and UAV-based surveillance.

1. Introduction

Unmanned aerial vehicles (UAVs) have emerged as a powerful tool for surveillance, traffic monitoring, disaster response, and security applications [1,2,3,4]. UAVs are increasingly used in large-scale surveillance applications such as border patrol [5], crowd monitoring [6,7], and law enforcement [8]. One critical capability for UAV-based surveillance systems is person and group re-identification. Traditional re-identification methods primarily focus on fixed-ground surveillance cameras with consistent viewpoints, limiting their ability to capture a broader perspective for global pedestrian observation.
Aerial surveillance has garnered significant attention from researchers due to the potential of UAV-based cameras [9] in enhancing person re-identification. While existing pedestrian re-identification techniques in UAV-based scenarios primarily target individual pedestrians, recent advances have shifted towards cross-domain pedestrian re-identification, particularly with an emphasis on aerial-to-ground applications. Cross-modality re-identification enables UAVs to continuously track individuals and groups as they transition between aerial and ground surveillance systems, ensuring seamless monitoring across multiple perspectives. Some previous works [10,11,12] have explored aerial-ground person re-identification from different angles, yet they predominantly address single-person recognition tasks.
However, pedestrians often move in groups with several individuals (usually 2–6 persons [13,14]), and traditional aerial retrieval systems under UAV scenarios primarily focus on individual pedestrians, overlooking the need for re-identification of multiple pedestrians, i.e., group re-identification [15]. Group re-identification seeks to recognize a group of pedestrians across non-overlapping surveillance camera systems, playing a crucial role in societal safety and security [16,17,18], especially for the development of UAV-based surveillance monitoring and detection.
In traditional group-based research, most studies focus on ground-level surveillance scenarios [13,19,20,21], which lack the flexibility and comprehensiveness that aerial perspectives offer. Ground-based monitoring systems are limited by their constrained field of view, often failing to capture the larger three-dimensional group dynamics that are more easily observed from an aerial vantage point. Notably, while some work [15] has made strides in UAV-based object re-identification, their group experiments were restricted to traditional ground-level scenarios, without addressing the more complex dynamics that arise in aerial environments.
Figure 1 illustrates person and group re-identification of aerial-ground views; a critical gap in current research is the lack of UAV-based group re-identification systems that specifically consider the unique challenges of cross-modality group re-identification in real-world settings. The majority of pedestrians in practice tend to travel in groups [14], making it essential to move beyond individual re-identification models and focus on group-level analysis, especially in cross-modality aerial-ground scenarios. Existing methods [10,11] have not yet adequately explored the practical implications of group re-identification from cross-modality ground–aerial perspectives.
Moreover, existing aerial-ground person re-identification methods [10,11] predominantly rely on supervised learning, which is heavily dependent on label annotations. This dependency becomes even more pronounced in the context of group re-identification, posing a significant challenge for the practical implementation of real-world applications. To solve the above challenges, our research proposes a framework for cross-modality aerial-ground group re-identification that combines ground-level and aerial perspectives, addressing the practical needs of group-based UAV surveillance.
In this study, we propose an Unsupervised Aerial-Ground Re-identification from Pedestrian to Group (UAGRPG) framework to address the challenge of integrating retrieval tasks across UAV-based and ground-based surveillance systems. Our contributions are threefold:
We propose the Gradual Graph Correspondence (GGC) method for unsupervised group re-identification, which builds on neighbor-aware collaborative learning (NCL) to mine correspondences within groups and progressively matches graphs to find reliable cross-modality correlations.
We develop a collaborative cross-modality association learning (CCAL) module with a soft cross-modality alignment strategy to bridge feature disparities between aerial and ground modalities and to alleviate label noise in the established cross-modality correspondences.
We introduce a novel minimum pedestrian distance transformation strategy to enhance the accuracy of similarity measurement for group images across aerial and ground domains. Additionally, we present a new aerial-ground group re-identification dataset. Extensive experiments on both person and group re-identification tasks validate the effectiveness of our approach, demonstrating its superior performance and substantial improvements in unsupervised aerial-ground pedestrian retrieval scenarios.

2. Related Works

This section reviews methodologies for person and group re-identification from an aerial perspective in UAV scenarios, and synthesizes existing research on cross-view re-identification together with established approaches from ground-based perspectives.

2.1. Supervised View-Homogeneous Person Re-Identification

Person re-identification (ReID), which aims to retrieve images of the same individual across non-overlapping camera views, has been increasingly linked to UAV applications, since UAV imagery introduces unique challenges such as extreme viewpoint variations, scale discrepancies, and occlusions. These challenges have spurred the development of supervised techniques that address the complexities of aerial-based ReID, for example by adapting previous view-homogeneous ReID models to obtain robust feature extraction.
View-homogeneous ReID focuses on scenarios where images are captured by camera networks with consistent perspectives, such as ground-only or aerial-only systems. Among these, ground-based ReID has been the predominant research focus due to its widespread practical applications. Benchmark datasets like Market1501 [22] and MSMT17 [23] have driven substantial progress in this domain. Early approaches primarily depended on handcrafted features [24]. With technological advancements, CNN-based architectures emerged [25], paving the way for more recent transformer-based frameworks that further enhance performance and adaptability [26]. These approaches have achieved remarkable success in addressing challenges such as viewpoint variations and occlusions within ground-based camera networks. In contrast, aerial-only ReID remains comparatively understudied, with limited datasets [9,12] and methodologies [11,27] addressing the unique challenges posed by aerial imagery. The distinct perspective and resolution characteristics of aerial views (often characterized by top–down angles and lower pixel densities) create significant domain gaps compared with ground-based imagery. While recent work has begun to explore UAV-based human monitoring [10], the performance of existing methods remains constrained by the inherent complexity of aerial data. Notably, the methodologies developed for view-homogeneous ReID exhibit limited effectiveness in cross-view scenarios due to substantial perspective discrepancies between aerial and ground cameras. This limitation motivates the need for specialized approaches in view-heterogeneous settings.

2.2. Supervised View-Heterogeneous Person Re-Identification

View-heterogeneous person re-identification (ReID) aims to match individuals across cameras with disparate viewpoints, such as aerial and ground-based surveillance systems. While substantial progress has been made in view-homogeneous ReID, the problem of aerial-ground person ReID remains relatively underexplored. The primary challenge arises from the substantial geometric and photometric discrepancies between aerial and ground perspectives, including variations in scale, resolution, and occlusion patterns. These differences significantly hinder the development of robust matching techniques, posing unique challenges for UAV-based surveillance and security applications.
To tackle this challenge, Nguyen et al. [28] introduced the AG-ReID dataset, which incorporates both identity and attribute labels, and proposed an explainable model leveraging attribute-guided training. However, this method is highly dependent on attribute annotations, posing challenges to its scalability and generalizability in real-world applications. More recently, Zhang et al. [10] proposed the CARGO dataset, which supports multiple heterogeneous view-matching patterns. They introduced a view-decoupled transformer (VDT) that improves performance by decoupling view-specific features using hierarchical separation and orthogonal loss, thus reducing the dependency on attribute labeling. Despite these advancements, one limitation of the VDT is its failure to dynamically select key tokens related to the target object, which may lead to redundant computation and inefficiencies. In response, newer methods, such as the Dynamic Token Selective Transformer (DTST) [29], have been proposed to improve feature extraction efficiency. DTST dynamically selects the most informative tokens from the image using a top-k strategy, enhancing the identification of critical features. While these developments mark important progress in aerial-ground ReID, challenges such as efficient token selection and handling large-scale data remain areas for future research.
Although supervised methods have significantly advanced aerial-ground ReID, their reliance on labor-intensive, human-annotated cross-modality associations poses a major challenge to scalability. This limitation is particularly critical in UAV-based surveillance applications, where large-scale data collection is essential for real-world deployment. Consequently, there is a growing need for unsupervised approaches that mitigate dependence on costly labeled data, facilitating more efficient and scalable aerial-ground ReID solutions in dynamic and complex aerial-ground environments of UAV-based surveillance.

2.3. Unsupervised Person Re-Identification

For UAV-based surveillance and tracking, unsupervised ReID plays a crucial role in identifying and following individuals across non-overlapping camera views without requiring manually annotated datasets. Unsupervised ReID has gained increasing attention as it offers a solution to the challenge of limited annotated data. This is particularly relevant for cross-modality and viewpoint-heterogeneous settings, where the lack of ground-truth annotations further complicates the task.
Unsupervised methods for single-modality ReID have made significant strides. Approaches like self-training pipelines, pseudo-label refinement, and memory bank-based methods [30,31] have been adopted to generate pseudo-labels and train networks iteratively. These techniques aim to alleviate the impact of noisy labels and improve model robustness. For example, methods such as MMT [32] and PPLR [33] focus on refining pseudo-labels using clustering consensus and global-part agreement strategies to improve feature learning, respectively. Memory bank-based methods, such as SPCL [34] and DCMIP [31], utilize hybrid memory structures to facilitate contrastive learning during training and manage multi-instance information efficiently.
However, existing unsupervised Re-ID methods for UAVs primarily focus on aerial single-modality scenarios [35,36], lacking research on cross-modality approaches. In contrast, unsupervised cross-modality Re-ID methods have been extensively studied in ground-based surveillance, particularly in visible-infrared scenarios. These unsupervised techniques can provide valuable insights for developing aerial-ground UAV-based Re-ID methods.
The challenge for the cross-modality ReID issue is even more pronounced, as it involves aligning features from disparate sensor types, such as visible and infrared images, without relying on manual annotations. Liang et al. [37] introduced an ISML loss to enhance the alignment of reliable cross-modality instances, while Yang et al. [38] proposed dual-contrastive aggregation learning to improve unsupervised visible-infrared ReID. Cheng et al. [39] focused on efficient cluster matching and collaborative learning for unsupervised visible-infrared ReID, where label refinement is crucial in addressing the heterogeneity between modalities. Wu and Ye [40] presented a progressive graph-matching approach to address the complexities of heterogeneous data by iteratively refining pseudo-labels for each modality. Additionally, Wang et al. [41] used optimal transport to assign cross-modality pseudo-labels uniformly, alleviating the issue of biased label associations. Fan et al. [4] introduced an unsupervised moving object detection method for UAV scenarios by leveraging domain-specific knowledge. These unsupervised approaches offer promising directions for improving cross-modality ReID by minimizing the need for expensive manual annotations. They represent a key area of future research, as they aim to enhance the robustness of models in heterogeneous environments with minimal supervision.
In summary, in UAV surveillance data analysis, unsupervised methods primarily focus on single-modality aerial scenarios, with limited research addressing unsupervised solutions for the aerial-ground integration challenge and little investigation of the group setting.

2.4. Group Re-Identification

Group re-identification (GReID) aims to match groups of individuals across non-overlapping camera views, which is distinct from traditional person re-identification due to the complexity of group-level interactions and the arrangement of group members. Early approaches in GReID used hand-crafted features to represent group identities, such as the covariance descriptor [14] and salience channels [42], which attempted to match groups based on statistical representations and visual salience. However, these methods were limited by their reliance on manual feature engineering, which constrained their scalability and adaptability. As deep learning techniques have advanced, deep feature-based methods have become the mainstream approach in GReID. For instance, SVIGR [13] employed a Siamese network to learn robust person features, while MGR [43] focused on multi-granularity matching to address the variations in group composition. More recently, graph neural networks (GNNs) have been applied to model group contexts. DotGNN [44] and GCGNN [45] utilize relational information between individual group members and their surroundings, providing a richer representation of the whole group. These methods have shown promise in handling intra-group and inter-group relationships, particularly when applied to large-scale datasets such as CSG [17]. Despite these advancements, a key challenge remains the ability to handle dynamic group membership, which can change due to movement, occlusions, or environmental factors. Methods like PREP [46] aim to address this by dynamically removing individuals who do not belong to a particular group, improving matching accuracy. Additionally, some approaches, such as 3DT [47], leverage 3D layout information to account for the relative positioning of group members, offering a more comprehensive context for group identification. For the aerial aspect, Zhang et al. [15] proposed a dynamic strategy by using the group contextual information to handle the group member variation problem. However, they only consider generic objects in the UAV scenario rather than groups, and their GReID experiments were not conducted in the cross-modality aerial-ground scenario.
Despite significant advancements in GReID, research in UAV-based scenarios remains limited, particularly in aerial environments. Existing studies predominantly focus on ground-based settings, with few addressing the challenges posed by aerial perspectives, such as varying viewpoints, scale variations, and occlusions. Moreover, research on aerial-ground cross-modality GReID is scarce, further complicating identity matching across heterogeneous viewpoints. Additionally, unsupervised methods, which are crucial for reducing reliance on extensive labeled datasets, have received minimal attention in this domain. The lack of unsupervised approaches for cross-modality aerial-ground GReID highlights a critical research gap, necessitating novel methodologies to enhance robustness and generalization in UAV surveillance applications.

3. The Proposed Methodology

Unsupervised GReID in UAV surveillance plays a crucial role in enhancing situational awareness and multi-agent tracking across aerial and ground perspectives. Traditional person re-identification methods often struggle with the inherent domain gap between aerial and ground views, making it challenging to establish consistent associations without extensive labeled data. To address this limitation, we propose an unsupervised framework shown in Figure 2 that effectively bridges the aerial-ground domain discrepancy by leveraging collaborative learning of multi-grained features and heterogeneous associations to realize soft feature alignment across modalities.

3.1. Neighbor-Aware Collaborative Learning

We define an aerial-ground GReID dataset $H = \{(x_k, y_k, m_k)\}_{k=1}^{|H|}$, consisting of a training subset $H_{train}$ and a testing subset $H_{test}$, where $x_k$ denotes the $k$-th group image, and $y_k$ and $m_k$ represent the pedestrian's identity (person identity or group identity) and modality labels, respectively. $m_k \in \{m_a, m_g\}$ indicates that the image $x_k$ is collected from an aerial view $m_a$ or a ground view $m_g$. For an aerial-ground training subset $H_{train} = \{H_a, H_g\}$, $H_a$ represents the aerial dataset with $M$ aerial images, while $H_g$ denotes the ground dataset with $N$ ground images. A significant view discrepancy between the aerial and ground scenarios results in a view-dependent feature space characterized by high inter-class similarity and low intra-class similarity. Therefore, the objective of the aerial-ground GReID method is to learn a parameterized model $f(\cdot;\theta_f)$ for feature extraction to mitigate the modality gap as follows:
$$\min_{\theta} \sum_{k} \left\| f(x_k; \theta_f) - f(x_k^{+}; \theta_f) \right\|_2^2 - \left\| f(x_k; \theta_f) - f(x_k^{-}; \theta_f) \right\|_2^2$$
where $\|\cdot\|_2$ represents the $L_2$ distance, and $x_k^{+}/x_k^{-}$ indicates whether the instance image shares the same identity as $x_k$ or belongs to a different identity. As shown in Figure 2, group images are first partitioned into instance features using a patch embedding mechanism, ensuring that individual instances within the crowd are effectively represented as independent feature tokens. The embedded instance features are subsequently fed into the TransReID model [26]; the extracted feature embeddings serve as the input for the clustering module, which groups similar pedestrian representations based on their learned feature affinities.
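As an informal illustration (not the authors' released code), the following PyTorch sketch renders the contrastive objective above: the squared L2 distance of an anchor feature to a same-identity sample is minimized relative to its distance to a different-identity sample. The tensor names, feature dimension, and batch size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feature_objective(f_anchor, f_pos, f_neg):
    """Simplified rendering of the objective above: pull same-identity
    features together and push different-identity features apart (L2)."""
    d_pos = (f_anchor - f_pos).pow(2).sum(dim=1)  # ||f(x_k) - f(x_k^+)||_2^2
    d_neg = (f_anchor - f_neg).pow(2).sum(dim=1)  # ||f(x_k) - f(x_k^-)||_2^2
    return (d_pos - d_neg).mean()

# toy usage with random 256-d features for a batch of 8 samples
f_a = F.normalize(torch.randn(8, 256), dim=1)
f_p = F.normalize(torch.randn(8, 256), dim=1)
f_n = F.normalize(torch.randn(8, 256), dim=1)
loss = feature_objective(f_a, f_p, f_n)
```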
The basic rationale for learning the feature representations of aerial-ground groups is that the cross-modality instance features are consistent, and the collaborative neighbor learning among group members should be consistent across modalities. These intrinsic relational consistencies provide useful constraints for both inter-modality and intra-modality optimization, facilitating the collaborative learning of deep and shallow features under both homogeneous and heterogeneous modalities. This enables the effective identification of reliable neighbors for collaborative learning. Consequently, it is essential to establish a homogeneous and heterogeneous collaborative paradigm among group members. For a given query instance q, its similarity with other instances in the training set is defined as follows:
$$r(q_i, u_j) = \frac{f(q_i; \theta_f) \cdot f(u_j; \theta_f)}{\|f(q_i; \theta_f)\|_2 \cdot \|f(u_j; \theta_f)\|_2}$$
where $q_i$ and $u_j$ come from shallow or deep features of the aerial and ground modalities, yielding multiple types of homogeneous and heterogeneous shallow or deep query-instance similarities, i.e., $r(q_i^{b,c}, u_j^{b,c})\big|_{b=s,d}^{c=a,g}$. For example, the superscripts "$s,a$" and "$d,g$" stand for the shallow aerial and the deep ground features, respectively. With the above multiple kinds of similarities, we seek reliable shallow and deep intra-modality and inter-modality neighbors to formulate collaborative neighbor learning. The intra-modality neighbors can be defined as follows:
$$h^{a}(q_i^{a}) = \{ h^{a}(q_i^{a,s}) \cup h^{a}(q_i^{a,d}) \}$$
$$h^{g}(q_i^{g}) = \{ h^{g}(q_i^{g,s}) \cup h^{g}(q_i^{g,d}) \}$$
where $h^{a}(q_i^{a,*})$ and $h^{g}(q_i^{g,*})$ are neighborhood sets searched by aerial–aerial and ground–ground similarity, respectively. With the neighborhood set of the group representation $\{ h^{c_1}(q_i^{c_2}) \}_{c_2=a,g}^{c_1=a,g}$, the neighbor-aware collaborative learning can be performed. Given queries $q_i^{s}$ and $q_i^{d}$, we can obtain the expression of the aerial–aerial shallow and deep neighbor learning by the following:
$$\mathcal{L}_{b}^{a,a} = -\frac{1}{N_{BZ}} \sum_{i=1}^{N_{BZ}} \sum_{j \in h^{a}(q_i^{a})} \log \frac{\exp\left( r(q_i^{a,b}, u_j^{a,b}) / \sigma \right)}{\sum_{m=1}^{M} \exp\left( r(q_i^{a,b}, u_m^{a,b}) / \sigma \right)}$$
where $b = s, d$ specifies whether the loss term is computed for the shallow feature or the deep feature, $N_{BZ}$ is the batch size of query $q_i$, and the parameter $\sigma$ in the above equation functions as a temperature that controls the sharpness of the probability distribution in the softmax function. Then the neighbor-aware collaborative learning loss of aerial–aerial is computed by the following:
$$\mathcal{L}^{a,a} = \mathcal{L}_{d}^{a,a} + \alpha_s \mathcal{L}_{s}^{a,a}$$
where $\alpha_s$ serves as the weight to balance the relationship between $\mathcal{L}_d$ and $\mathcal{L}_s$. The other neighbor-aware collaborative learning losses of ground–ground $\mathcal{L}^{g,g}$, ground–aerial $\mathcal{L}^{g,a}$, and aerial–ground $\mathcal{L}^{a,g}$ can be obtained similarly. The overall optimization objective for the neighbor-aware collaborative learning loss is formulated as a combination of the following terms:
$$\mathcal{L}_{NCL} = \mathcal{L}^{a,a} + \mathcal{L}^{a,g} + \mathcal{L}^{g,a} + \mathcal{L}^{g,g}$$
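To make the neighbor-aware loss concrete, the PyTorch sketch below implements one modality-pair term (e.g., aerial–aerial, shallow or deep) as a temperature-scaled softmax over query-instance cosine similarities, summed over each query's neighbor set. It is a minimal sketch under our own assumptions about tensor layouts and the temperature value, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ncl_loss(q, gallery, neighbor_idx, sigma=0.05):
    """One term of the neighbor-aware collaborative loss (e.g., aerial-aerial,
    shallow or deep): negative log-softmax over query-gallery similarities,
    summed over each query's reliable neighbors.

    q:            (B, D) query features
    gallery:      (M, D) instance/memory features of the target modality
    neighbor_idx: list of B LongTensors with each query's neighbor indices
    """
    q = F.normalize(q, dim=1)
    gallery = F.normalize(gallery, dim=1)
    logits = q @ gallery.t() / sigma          # (B, M) similarities r(.,.)/sigma
    log_prob = F.log_softmax(logits, dim=1)
    loss = 0.0
    for i, idx in enumerate(neighbor_idx):
        loss = loss - log_prob[i, idx].sum()  # sum over the neighbor set
    return loss / q.size(0)

# The four modality combinations are then weighted and summed as in the text:
# L_NCL = L_aa + L_ag + L_ga + L_gg, each with L = L_deep + alpha_s * L_shallow.
```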
To explore global cross-modality feature extraction, a two-stream structure (i.e., a homogeneous and a heterogeneous stream) is employed to obtain global group features. After clustering the features, memories of the two modalities are built and pseudo-labels for all images are generated; we use the DBSCAN method [48] for clustering as follows:
$$Y^{c} = \mathrm{DBSCAN}(F^{c})$$
where $c = a, g$ indicates the aerial and ground modality, respectively. $Y$ and $F$ are the sets of pseudo-labels and features of all images. Different from previous approaches, we not only cluster homogeneous samples ($c = a$ or $c = g$) but also cluster heterogeneous samples ($c = a, g$) to build cross-modality correspondence indirectly. Cross-modality clustering typically optimizes the feature extractor using a single memory, but a single memory may not fully capture the nuances of the group, e.g., member variation and layout variation. Moreover, cross-modality clustering does not establish relations between the ground and the aerial modality directly, thus limiting its effectiveness in cases with significant modality discrepancies. To more effectively capture individual nuances and bridge the gap across the two modalities, a multi-memory learning and matching strategy [38] is used to mine a holistic representation and establish reliable cross-modality correspondences. Therefore, we further subdivide a single memory into multiple memories for a single identity, which can be formulated as a sub-cluster.
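The clustering step can be sketched with scikit-learn's DBSCAN on pairwise cosine distances, as below. The eps and min_samples values are illustrative (Section 4.1.3 reports eps = 0.4), the array names in the usage comments are placeholders, and the heterogeneous case simply clusters the concatenated aerial and ground feature sets.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.4, min_samples=4):
    """Cluster L2-normalized features with DBSCAN on a precomputed cosine
    distance matrix; -1 marks un-clustered outliers."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - feats @ feats.T                    # pairwise cosine distance
    np.clip(dist, 0.0, 2.0, out=dist)
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return labels

# Y_a  = generate_pseudo_labels(F_a)                      # homogeneous, aerial
# Y_g  = generate_pseudo_labels(F_g)                      # homogeneous, ground
# Y_ag = generate_pseudo_labels(np.vstack([F_a, F_g]))    # heterogeneous clustering
```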

3.2. Gradual Graph Correspondence (GGC)

By employing the multi-memory learning strategy, we obtain more diverse memories for a single identity. However, these memories still exhibit a strong implicit correlation with the modality, which hinders the establishment of cross-modality correspondences and the integration of the ground surveillance data with the aerial data from UAV surveillance.
To solve this problem, we present the GGC method to connect aerial and ground data and find reliable cross-modality correspondences. Notably, each modality is built into a bipartite graph. Suppose that there are $N_a$ nodes in the aerial graph $D_A$ and each node represents a cluster. These nodes are represented as $[A] = \{ w_i^{a} \mid i = 1, 2, \ldots, N_a \}$. Similarly, the ground graph $D_G$ includes $N_g$ ground nodes represented by $[G] = \{ w_j^{g} \mid j = 1, 2, \ldots, N_g \}$. We assign the dissimilarity of the aerial node $w_i^{a}$ and the ground node $w_j^{g}$ using the elements of a cost matrix. We aim to establish correspondences in $D_A$ ($D_G$) for all nodes within $D_G$ ($D_A$). Without loss of generality, we assume that $N_g \le N_a$, implying a different cluster size between the two modalities. The cost matrix functions as a penalty term that assesses feature-based discrepancies to impose constraints on inter-modality cluster alignments, and is defined as the reciprocal of the clustering similarity defined in Equation (2):
$$R(i,j) = \exp^{-1}\left( r(q_i^{a}, u_j^{g}) \right)$$
where $q_i^{a}$ and $u_j^{g}$ denote the average feature representations of clusters in the aerial and ground modalities, respectively. The average feature depends on the number of instances in the cluster.
To match each aerial cluster with the corresponding identity of the ground cluster while minimizing the cost, we define a gradual graph matching strategy according to [49] that can be formulated as binary linear programming with linear constraints, as follows:
$$\Gamma(e) = \arg\min_{e} \; R^{T}(i,j)\, e$$
where $e = \{ e_{ij} \} \in \mathbb{R}^{N_g \times N_a \times 1}$ serves as a sign for node matching between the nodes $w_i^{g}$ and $w_j^{a}$, specifying whether they come from the same identity ($e_{ij} = 1$) or not ($e_{ij} = 0$), s.t. $\forall i \in [A]: \sum_{j \in [G]} e_{ij} \le 1$ and $\forall j \in [G]: \sum_{i \in [A]} e_{ij} = 1$. Various algorithms can be applied to solve the matching problem, e.g., the Hungarian algorithm [50]. Given the cost matrix $R$, there are $N_g$ elements equal to 1 in the preliminary graph matching output matrix $e$, representing the corresponding positive matched pairs. Notably, a node $w_i^{a}$ in $D_A$ may not find a positive counterpart $w_j^{g}$ with $e_{ij} = 1$, implying that certain clusters within the aerial modality may lack corresponding matches. To address this imbalance caused by the disparity between UAV and ground-based surveillance during the matching process, the GGC method is introduced in this study.
The GGC method aims to establish correspondences for each node through multiple dynamic matchings. Specifically, we assume that the number of nodes in $D_A$ exceeds the number in $D_G$. After applying the basic graph matching strategy, there will be unmatched nodes remaining in $D_A$, while all nodes in $D_G$ will have found their correspondences. We progressively rebuild a new graph, denoted as $D_A'$, using the remaining nodes from $D_A$ and their corresponding connected edges. The graphs $D_A'$ and $D_G$ are then combined into a bipartite graph, restarting a new graph matching process. Notably, the existing correspondences of nodes in $D_G$ would not be updated, as they have already been established. During the new graph matching process, only the correspondences of nodes in $D_A'$ will be updated. This gradual graph matching strategy is iterated until all nodes progressively find their corresponding matches.
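A minimal sketch of the gradual matching idea, assuming a cost matrix with one row per aerial cluster and one column per ground cluster, is given below. It uses SciPy's Hungarian solver (linear_sum_assignment) and simply re-runs the assignment on the aerial clusters left unmatched in earlier rounds, keeping previously established matches fixed; the helper name and the cost construction in the comments are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gradual_graph_matching(cost):
    """Iterative one-to-one matching between aerial clusters (rows) and
    ground clusters (columns) when the two sides have different sizes.
    Aerial nodes left unmatched in a round are re-matched in later rounds
    against the full ground side, while earlier matches are kept fixed.
    Returns a dict mapping every aerial cluster index to a ground cluster."""
    n_a, n_g = cost.shape
    matches = {}
    remaining = np.arange(n_a)
    while remaining.size > 0:
        sub_cost = cost[remaining]                    # rows still unmatched
        rows, cols = linear_sum_assignment(sub_cost)  # Hungarian algorithm
        for r, c in zip(rows, cols):
            matches[int(remaining[r])] = int(c)
        matched = set(remaining[rows])
        remaining = np.array([i for i in remaining if i not in matched])
    return matches

# cost = np.exp(-similarity_matrix)   # penalty derived from cluster similarities
# a2g = gradual_graph_matching(cost)  # aerial-cluster -> ground-cluster mapping
```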

3.3. Collaborative Cross-Modality Association Learning (CCAL)

Pedestrian data captured by UAV-based aerial and ground-based surveillance exhibit significant cross-modality discrepancies due to variations in viewpoint, resolution, and illumination conditions. This challenge is amplified in group re-identification scenarios, where intra-group occlusions and spatial misalignments introduce additional complexity. To bridge this modality gap, it is essential to design a collaborative association mechanism that facilitates mutual adaptation and feature alignment between aerial and ground domains, enabling the learned features to be used in UAV scenarios.
Therefore, after obtaining the cross-modality correspondences by GGC, we introduce the CCAL to minimize modality discrepancies and alleviate the noise introduced by imperfect correspondences. The CCAL consists of two unidirectional learning branches, i.e., aerial-to-ground (A2G) learning and ground-to-aerial (G2A) learning. The G2A loss function can be defined as follows:
$$\mathcal{L}_{G2A} = -\sum_{i=1}^{N_{BZ}} \log \frac{\exp\left( \phi_a[\hat{y}_i^{g}]^{T} \cdot f(x_i^{g}) / \sigma \right)}{\sum_{j=1}^{N_a} \exp\left( \phi_a[j]^{T} \cdot f(x_i^{g}) / \sigma \right)}$$
where $\phi_a[j]$ stores the feature centroid generated by the DBSCAN method for the $j$-th class in the aerial modality, and $\hat{y}_i^{g} = G2A[\tilde{y}_i^{g}]$ denotes the ground-to-aerial correspondence of $\tilde{y}_i^{g}$, serving as a cross-modality label of the ground image $x_i^{g}$, while $\tilde{y}_i^{g}$ represents the pseudo-label of $x_i^{g}$. The association between different modalities can mitigate the disparity by aggregating the input instance to its respective cross-modality representation.
Likewise, aerial-to-ground (A2G) learning exhibits a similar form to G2A learning; the loss function is denoted as follows:
$$\mathcal{L}_{A2G} = -\sum_{i=1}^{N_{BZ}} \log \frac{\exp\left( \phi_g[\hat{y}_i^{a}]^{T} \cdot f(x_i^{a}) / \sigma \right)}{\sum_{j=1}^{N_g} \exp\left( \phi_g[j]^{T} \cdot f(x_i^{a}) / \sigma \right)}$$
where $\hat{y}_i^{a} = A2G[\tilde{y}_i^{a}]$ is the aerial-to-ground correspondence of $\tilde{y}_i^{a}$, serving as a cross-modality label for the aerial image $x_i^{a}$, while $\tilde{y}_i^{a}$ represents the pseudo-label of $x_i^{a}$, and $\sigma$ is a temperature parameter.
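Both CCAL branches reduce to a memory-based classification loss against the opposite modality's cluster centroids. The sketch below is our own simplified rendering (cross-entropy over temperature-scaled similarities); the tensor names and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_modality_association_loss(feats, cross_labels, memory, sigma=0.05):
    """One direction of CCAL (e.g., G2A): classify each feature of one
    modality against the cluster centroids of the other modality, using
    the matched cross-modality pseudo-label produced by GGC.

    feats:        (B, D) features of one modality
    cross_labels: (B,)   matched cluster indices in the other modality
    memory:       (C, D) cluster centroids of the other modality
    """
    feats = F.normalize(feats, dim=1)
    memory = F.normalize(memory, dim=1)
    logits = feats @ memory.t() / sigma        # (B, C) scaled similarities
    return F.cross_entropy(logits, cross_labels)

# L_G2A = cross_modality_association_loss(f_ground, g2a_labels, aerial_memory)
# L_A2G = cross_modality_association_loss(f_aerial, a2g_labels, ground_memory)
```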

3.4. Soft Cross-Modality Alignment (SCMA)

To further solve the label noise issue brought by clusters across different modalities, we design a soft cross-modality alignment strategy to facilitate label refinement. Since group re-identification is a multiple-pedestrian matching problem, we propose the heterogeneous alignment loss $\mathcal{L}_{HAL}$ and the homogeneous smoothing loss $\mathcal{L}_{HSL}$ to force the feature distributions to be consistent across modalities.
$$\mathcal{L}_{HAL} = \mathcal{L}_{inter}^{A} + \mathcal{L}_{inter}^{G} = \frac{1}{N_y} \sum_{y=1}^{N_y} \frac{1}{2} \left( D\left( F_y^{a}, SR(F_y^{g}) \right) + D\left( F_y^{g}, SR(F_y^{a}) \right) \right)$$
where $SR(\cdot)$ represents the stop-gradient operation, and $D(a, b)$ denotes the distance between two distributions $a$ and $b$. $N_y$ is the minimum of the image numbers of the two modalities, i.e., $N_a$ and $N_g$. $F_y^{a}$ and $F_y^{g}$ represent the aerial and ground feature sets of the pseudo-label $y$.
The heterogeneous alignment loss should associate the cross-modality labels with the intersection of the two label sets from different modalities; thus, the refined labels across the ground and aerial modalities should have the same ranking consistency in the retrieval list, which reflects the inter-modality ranking association.
On the other hand, the homogeneous smoothing loss is proposed to investigate the intra-modality ranking consistency of the refined labels as follows:
$$\mathcal{L}_{HSL} = \mathcal{L}_{intra}^{A} + \mathcal{L}_{intra}^{G} = \sum_{y=1}^{N_a} \sum_{f^{a} \in F_y^{a}} \left\| f^{a} - F_A^{y} \right\|_2^2 + \sum_{y=1}^{N_g} \sum_{f^{g} \in F_y^{g}} \left\| f^{g} - F_G^{y} \right\|_2^2$$
where $F_v^{p}$ denotes the ground and aerial feature sets of group ID $p$. The memory for each cluster is calculated at the beginning of each training iteration as follows:
$$F_A^{y} = \frac{1}{N_y} \sum_{i=1}^{N_y} f_i^{a}$$
$$F_G^{y} = \frac{1}{N_y} \sum_{i=1}^{N_y} f_i^{g}$$
In total, the overall objective loss for training the proposed model in this study is defined as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{NCL} + \mathcal{L}_{A2G} + \mathcal{L}_{G2A} + \mathcal{L}_{HAL} + \mathcal{L}_{HSL}$$
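A simplified sketch of the two alignment terms follows. Here the distribution distance $D(\cdot,\cdot)$ is approximated by the squared L2 distance between per-label mean features (the text leaves $D$ more general), the stop-gradient $SR(\cdot)$ is realized with detach(), and the per-label terms would then be averaged or summed over pseudo-labels as in the formulas above; the variable names are illustrative.

```python
import torch

def hal_loss(feat_a, feat_g):
    """Heterogeneous alignment for one pseudo-label y: align the aerial and
    ground feature sets, with a stop-gradient (detach) on the target side.
    D(.,.) is approximated here by the squared L2 distance of the means."""
    d_ag = (feat_a.mean(0) - feat_g.mean(0).detach()).pow(2).sum()
    d_ga = (feat_g.mean(0) - feat_a.mean(0).detach()).pow(2).sum()
    return 0.5 * (d_ag + d_ga)

def hsl_loss(feats, centroid):
    """Homogeneous smoothing for one cluster: keep every feature of the
    cluster close to its modality-specific memory centroid."""
    return (feats - centroid.unsqueeze(0)).pow(2).sum(dim=1).sum()

# toy usage for one pseudo-label with 64-d features
F_a, F_g = torch.randn(5, 64), torch.randn(7, 64)
loss_hal = hal_loss(F_a, F_g)
loss_hsl = hsl_loss(F_a, F_a.mean(0)) + hsl_loss(F_g, F_g.mean(0))
```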

3.5. Transformation from Pedestrian to Group Distance

To measure the similarity among group images, we propose a minimum pedestrian distance transformation strategy. The pedestrian features between the query group and the gallery group are measured using cosine distance to quantify the pedestrian distance. The minimum pedestrian distance between the two groups is selected as the group distance, which is subsequently utilized for cross-modality identity matching. This process yields a cross-modality group identity retrieval list. Specifically, the cosine distance between two feature vectors $f_i$ and $f_j$ is defined as follows:
$$d(f_i, f_j) = 1 - \frac{f_i \cdot f_j}{\|f_i\| \, \|f_j\|}$$
where $f_i \cdot f_j$ denotes the dot product of the vectors, and $\|f_i\|$ and $\|f_j\|$ represent their Euclidean norms. The group distance $D_{group}$ between the query group $G_q$ and the gallery group $G_g$ is computed as follows:
$$D_{group}(G_q, G_g) = \min_{f_i \in G_q, \, f_j \in G_g} d(f_i, f_j)$$
As shown in Figure 2, the person distance is transformed into the group distance, which can be used to rank the group retrieval list.
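A minimal NumPy sketch of the minimum pedestrian distance transformation follows; each feature matrix is assumed to hold one row per group member, and the ranking lines in the comments are illustrative.

```python
import numpy as np

def group_distance(query_feats, gallery_feats):
    """Minimum pedestrian distance transformation: the distance between two
    groups is the smallest cosine distance over all cross-group member pairs."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    pairwise = 1.0 - q @ g.T        # (|G_q|, |G_g|) cosine distances d(f_i, f_j)
    return pairwise.min()

# ranking a gallery of groups for one query group:
# dists = [group_distance(Q, G) for G in gallery_groups]
# order = np.argsort(dists)        # cross-modality group retrieval list
```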

4. Experiments

In this section, to evaluate the effectiveness of integrating UAV-based and ground-based surveillance data, we conduct diverse experiments to compare the proposed method with state-of-the-art baselines on both person and group re-identification tasks under three aerial-ground settings, i.e., unsupervised cross-modality person re-identification, aerial-to-aerial person re-identification, and unsupervised cross-modality group re-identification. All experiments are conducted on the AG-ReID dataset [28] and our constructed AG-GReID dataset.

4.1. Experimental Setting

4.1.1. Evaluating Metrics

In the experiment of pedestrian re-identification, we use three metrics to evaluate different methods; they are the mean Average Precision (mAP), the cumulative matching characteristic (CMC) at Rank-n, and the mean Inverse Negative Penalty (mINP).
The mAP metric evaluates the overall retrieval performance by considering both precision and recall across all ranks. For each query, the Average Precision (AP) is calculated as the area under the precision-recall curve. The mAP is then obtained by averaging the AP values over all queries. Mathematically, the mAP is defined as follows:
$$\mathrm{mAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)$$
where Q is the set of queries, and AP ( q ) for a query q is computed as follows:
$$\mathrm{AP}(q) = \frac{1}{|P_q|} \sum_{k=1}^{N} P_q(k) \cdot \mathrm{rel}_q(k)$$
where $P_q(k)$ is the precision at rank $k$, $\mathrm{rel}_q(k)$ is an indicator function that equals 1 if the item at rank $k$ is a true positive, and $P_q$ is the set of true positives for query $q$.
The CMC metric measures the probability of finding the correct match within the top-k ranked results. It is particularly useful for evaluating the ranking quality of ReID systems. The CMC at Rank-n is defined as follows:
$$\mathrm{CMC}[n] = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{I}\left( \mathrm{rank}(q) \le n \right)$$
where $\mathbb{I}(\cdot)$ is an indicator function that equals 1 if the condition is satisfied, and $\mathrm{rank}(q)$ is the rank of the first correct match for query $q$.
The mINP metric evaluates the difficulty of retrieving all true positives by considering the penalty incurred due to negative samples before the last true positive is found. For each query, the Inverse Negative Penalty (INP) is calculated as follows:
$$\mathrm{INP}(q) = \frac{1}{\mathrm{rank}_{last}(q) - \mathrm{rank}_{first}(q) + 1}$$
where $\mathrm{rank}_{last}(q)$ and $\mathrm{rank}_{first}(q)$ are the ranks of the last and first true positives, respectively. The mINP is then obtained by averaging the INP values over all queries, as follows:
$$\mathrm{mINP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{INP}(q)$$
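For reference, the sketch below computes AP, the first-match rank used by CMC, and INP for a single query from a boolean relevance vector over the ranked gallery, following the formulas given above (including the text's form of INP). It is an illustrative implementation under those definitions, not the official evaluation code.

```python
import numpy as np

def evaluate_query(ranked_match_flags):
    """For one query, given a boolean array over the ranked gallery (True where
    the gallery item shares the query identity), return AP, the 1-based rank of
    the first correct match, and INP as defined in the text above."""
    flags = np.asarray(ranked_match_flags, dtype=bool)
    pos = np.flatnonzero(flags)                   # 0-based ranks of true positives
    if pos.size == 0:
        return 0.0, np.inf, 0.0
    cum_hits = np.cumsum(flags)
    ap = (cum_hits[pos] / (pos + 1)).mean()       # precision averaged at true positives
    first_rank = pos[0] + 1
    inp = 1.0 / (pos[-1] - pos[0] + 1)            # text's INP definition
    return ap, first_rank, inp

# results = [evaluate_query(f) for f in all_query_flags]
# mAP   = np.mean([r[0] for r in results])
# CMC_n = np.mean([r[1] <= n for r in results])   # for a chosen rank n
# mINP  = np.mean([r[2] for r in results])
```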

4.1.2. Dataset Description

AG-ReID is a large aerial-ground person re-identification dataset collected by UAVs; there are 11,554 images with 199 person-IDs for training and 12,464 images with 189 person-IDs for testing. For the testing set, there are 2033 and 10,429 images in the query set and the gallery set, respectively. In addition, AG-ReID also provides detailed person attribute labeling as optional information.
To comprehensively investigate pedestrian retrieval performance in UAV scenarios, three testing protocols were established: aerial-to-ground, ground-to-aerial, and aerial-to-aerial. In the aerial-to-ground protocol, 1701 aerial images served as queries, while 3331 ground images constituted the gallery. For the ground-to-aerial protocol, 962 ground images were used as queries, and 7204 aerial images formed the gallery. Both cross-modality scenarios involved 189 unique Person-IDs. In the single-modality aerial-to-aerial scenario, 962 query images and 7204 gallery images were utilized.
Existing aerial-ground pedestrian re-identification (Re-ID) datasets primarily focus on single individuals, lacking datasets that address group re-identification. To bridge this gap and facilitate research on group re-identification, we construct a new group-based dataset under UAV surveillance by aggregating the pedestrian identities from the AG-ReID dataset. Specifically, we follow the pedestrian aggregating approach proposed in [20] to synthesize the group identity under UAV scenarios.
In this process, individuals from the AG-ReID dataset are stitched together in sequential order based on their unique pedestrian IDs to form groups. This approach allows us to simulate realistic scenarios where groups need to be re-identified across aerial and ground views. The resulting dataset preserves the original challenges of cross-view ReID, such as viewpoint variations, illumination changes [51,52], and resolution differences, while introducing the additional complexity of group-level retrieval. The newly constructed group dataset is called AG-GReID.
In total, the AG-GReID dataset has 7980 group images with 54 group identities (Group-IDs); there are 3851 and 4129 images in the training set and the testing set, respectively. As the group data are derived from the concatenation of pedestrian data, both ground and aerial scenes possess only a single viewpoint, making them suitable for cross-modality identity matching. The testing set comprises 678 query images and 3451 gallery images, with an average group size of 3 individuals. To further explore group retrieval performance in UAV scenarios, two cross-modality testing protocols were designed based on the characteristics of the AG-GReID dataset: aerial-to-ground and ground-to-aerial. In the aerial-to-ground protocol, 633 aerial images were used as queries, and 1110 ground images comprised the gallery. For the ground-to-aerial protocol, 321 ground images served as queries, while 2401 aerial images formed the gallery. Both cross-modality scenarios included 54 unique Group-IDs.
Some representative examples of the dataset are illustrated in Figure 3; it is evident that pedestrian images captured by UAVs predominantly exhibit an aerial perspective, in contrast to the typically horizontal viewpoint of ground-level fixed surveillance. The UAV's perspective is notably more flexible, allowing for dynamic angle adjustments based on the positions of pedestrians. Specifically, drone imagery often provides a broader field of view, capturing more contextual information about the environment, and can easily adjust to focus on specific areas of interest.
The AG-GReID dataset enables the study of group re-identification under aerial-ground cross-modality scenarios, involving feature representation and group matching under changes in viewpoint and modality. By providing a benchmark for group re-identification, the AG-GReID dataset aims to advance research in this emerging area and foster the development of robust algorithms for real-world applications. For example, by monitoring group movement patterns and interactions, UAVs assist in detecting abnormal behaviors [53], such as unauthorized gatherings or potential security threats. UAVs can detect and track suspicious individuals and groups by matching their aerial and ground appearances, improving real-time threat detection and event security.

4.1.3. Implementation Details

The proposed UAGRPG method was implemented in the PyTorch framework. Pedestrian feature extraction was performed using TransReID [26] as the backbone network. The DBSCAN algorithm [48] is used to generate the clustering pseudo-labels for the unsupervised method. All synthesized pedestrian images were resized to 288 × 144 pixels, with a batch size of 128. The training was carried out for 50 epochs, and the epsilon parameter of DBSCAN was set to 0.4. We adopt the SGD optimizer with an initial learning rate of $3.5 \times 10^{-4}$ to train the proposed model. The $\alpha_s$ in Equation (6) is set to 0.5, and the training phase takes about 2 h.
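For readers reimplementing the setup, the snippet below simply collects the stated hyperparameters in one place; the variable names and the commented optimizer call are illustrative, and the momentum and weight-decay values are assumptions rather than values reported in the paper.

```python
# Hyperparameters reported in the text; names are illustrative only.
IMG_SIZE   = (288, 144)   # resized pedestrian image (height, width)
BATCH_SIZE = 128
EPOCHS     = 50
DBSCAN_EPS = 0.4          # epsilon of the DBSCAN clustering step
ALPHA_S    = 0.5          # shallow/deep balance weight alpha_s in Eq. (6)
LR         = 3.5e-4       # initial learning rate for the SGD optimizer

# Example optimizer setup (momentum and weight decay are assumed values):
# optimizer = torch.optim.SGD(model.parameters(), lr=LR,
#                             momentum=0.9, weight_decay=5e-4)
```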

4.2. Evaluation of Person Re-Identification

4.2.1. Aerial-Ground Person Re-Identification

In the re-identification experiments, we first focused on individual pedestrians. To validate the unsupervised performance of our proposed method in cross-modality aerial-ground scenarios, we compared against recent state-of-the-art unsupervised cross-modality person re-identification models originally designed for the visible-to-infrared setting, transferring their cross-modality adaptation to the aerial-ground scenario, which enables a comprehensive evaluation of their generalization ability. Table 1 presents the experimental results on the AG-ReID dataset, demonstrating that our method significantly outperforms the comparative methods. In both A–G and G–A cross-modality scenarios, the R1/mAP/mINP metrics of our method surpass the best USL methods in Table 1 by 27.38%/20.98%/10.58% and 8.32%/8.04%/4.67%, respectively.
Some representative visualization results are shown in Figure 4. These results confirm the effectiveness and superiority of our proposed unsupervised cross-modality method.

4.2.2. Person Re-Identification of Aerial Scenario

In addition to the heterogeneous cross-modality experiments, we further evaluated the retrieval performance of UAGRPG in the aerial-to-aerial scenario from a UAV perspective and compared it with mainstream baselines on the AG-ReID benchmark. As shown in Table 2, recent unsupervised models have significantly improved the R1 performance compared with traditional supervised methods, but the improvement in mAP is limited. This indicates that unsupervised methods have certain advantages in the R1 metric for top-ranked retrieval but still exhibit shortcomings and limitations in the global mAP metric compared with supervised methods. Additionally, the mINP is not particularly high. However, the proposed UAGRPG method significantly outperforms all comparative supervised methods across all three metrics, thereby validating the effectiveness of the homogeneous smoothing strategy proposed in UAGRPG for retrieval in the aerial domain.

4.3. Evaluation of Cross-Modality Group Re-Identification

To explore the cross-modality group re-identification performance, we conducted experiments on the constructed AG-GReID dataset. While keeping the hyperparameters consistent with those used in person re-identification, the minimum inter-group pedestrian feature distance in Equation (19) is employed as the group feature distance for cross-modality group matching retrieval.
To make fair comparisons, we use the proposed pedestrian distance transformation strategy to extend the unsupervised ReID models to group representations. Table 3 shows the results of the proposed method compared with these unsupervised models. The results are consistent with the person re-identification results shown in Table 1. The proposed method again achieves the best performance on the group dataset, significantly outperforming comparative methods in both A–G and G–A scenarios, with slightly better performance in A–G. This aligns with the aerial person re-identification results in Table 2, indicating that unsupervised methods are more effective for retrieving ground targets based on aerial queries. This phenomenon holds true for both individual pedestrians and groups, as unsupervised methods rely less on label information. Traditional ground target retrieval often relies on extensive annotated data, where supervised methods perform better; however, in aerial scenarios with broader fields of view and less accessible label information, unsupervised methods demonstrate superior performance.
Furthermore, the cross-modality results of groups outperform those of individual pedestrian re-identification, which is attributed to the fact that groups consist of multiple pedestrians, and prior works [17,61] have validated that group retrieval enhances the accuracy of pedestrian re-identification. Additionally, the proposed group retrieval method employs a minimum pedestrian distance transformation strategy (refer to Equation (19)) when generating group distances, which effectively narrows and focuses the retrieval scope to the most similar groups, thereby significantly improving matching accuracy.

4.4. Ablation Study

To evaluate the performance of each module in the proposed UAGRPG method, Table 4 and Table 5 present the ablation study results for person re-identification and group re-identification under aerial-ground scenarios, respectively. The results indicate that the proposed four modules, when used together, achieve the best performance for both person and group retrieval, with more significant improvements in group retrieval, particularly in the global retrieval metrics mAP and mINP. Overall, the ablation experiments for cross-modality person and group retrieval validate the effectiveness of each module in the proposed method.

4.5. Visualization

To qualitatively evaluate the performance of the proposed method, we employ t-SNE [62] to visualize the feature space and further analyze the similarity distribution. Figure 5 compares the proposed method with the baseline, which only uses the TransReID [26] backbone without any modules of the proposed UAGRPG method.
As illustrated in Figure 5, the proposed approach effectively brings aerial and ground positive sample points closer while enhancing the separation between positive and negative distributions in the cross-modality scenario. These visualization results demonstrate the effectiveness of our method in mitigating the modality disparity between aerial and ground perspectives. To show the retrieval superiority, Figure 6 illustrates some visual retrieval results of the aerial-ground group re-identification. The favorable visualization results underscore the efficacy of the proposed method in seamlessly integrating aerial surveillance data acquired from UAVs with ground-based surveillance data.

5. Conclusions

In this paper, we propose an unsupervised aerial-ground group re-identification framework tailored for UAV surveillance applications. It achieves promising results in aerial-ground scenarios for retrieving both individuals and groups, as well as in the homogeneous aerial-to-aerial setting under a pure UAV scenario, demonstrating its potential for cross-view and cross-modality identity retrieval of pedestrians in UAV applications. Our approach can facilitate more reliable pedestrian tracking in UAV-based monitoring scenarios.
This method holds promise for future applications in aerial-ground target retrieval, multi-target tracking, and trajectory prediction using UAVs [63,64], contributing to advancements in computer vision and pattern recognition for UAVs. Future work should focus on expanding aerial-ground group re-identification datasets by increasing the number of group identities and viewpoints to enhance the training of generalizable models, and developing real aerial-ground group datasets under the same UAV surveillance scenario, thereby facilitating the practical deployment of this research further.
In the future, this study will be particularly significant for the advancement of UAV-based surveillance, crowd monitoring, and autonomous navigation, where robust and adaptive pedestrian association across viewpoints is crucial. By eliminating the reliance on labeled data, our approach enhances the scalability and practicality of UAV vision systems in dynamic and large-scale environments. For example, we plan to explore adaptations of our approach to scene-based retrieval tasks by incorporating additional semantic priors and multi-scale feature fusion strategies. Specifically, integrating landmark detection and region-based representations may bridge the gap between aerial-ground person ReID and building retrieval or geo-localization, making cross-domain generalization more feasible.
In summary, the proposed method contributes to the broader development of intelligent UAV perception, facilitating more reliable human tracking and situational awareness in real-world scenarios.

Author Contributions

Conceptualization, L.M.; methodology, L.M. and Y.C.; software, L.M., Y.C., H.C. and L.J.; validation, L.M. and Y.C.; formal analysis, Y.Y.; investigation, L.M., H.C. and L.J.; resources, L.M., H.C. and Y.Y.; data curation, L.M., Y.C., H.C. and L.J.; writing—original draft preparation, L.M., Y.C. and L.J.; writing—review and editing, L.M., Y.Y. and H.C.; visualization, L.M. and Y.C.; supervision, L.M. and Y.Y.; project administration, L.M. and Y.Y.; funding acquisition, L.M. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62306218 and the Nature Science Foundation of Hubei Province under Grant 2023AFB070.

Data Availability Statement

The original data used in this study are openly available at the following reference: [28] (https://github.com/huynguyen792/AG-ReID, accessed on 18 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yasmeen, A.; Daescu, O. Recent Research Progress on Ground-to-Air Vision-Based Anti-UAV Detection and Tracking Methodologies: A Review. Drones 2025, 9, 58. [Google Scholar] [CrossRef]
  2. Norbelt, M.; Luo, X.; Sun, J.; Claude, U. UAV Localization in Urban Area Mobility Environment Based on Monocular VSLAM with Deep Learning. Drones 2025, 9, 171. [Google Scholar] [CrossRef]
  3. Grando, L.; Jaramillo, J.F.G.; Leite, J.R.E.; Ursini, E.L. Systematic Literature Review Methodology for Drone Recharging Processes in Agriculture and Disaster Management. Drones 2025, 9, 40. [Google Scholar] [CrossRef]
  4. Fan, X.; Wen, G.; Gao, Z.; Chen, J.; Jian, H. An Unsupervised Moving Object Detection Network for UAV Videos. Drones 2025, 9, 150. [Google Scholar] [CrossRef]
  5. Lei, X.; Hu, X.; Wang, G.; Luo, H. A multi-UAV deployment method for border patrolling based on Stackelberg game. J. Syst. Eng. Electron. 2023, 34, 99–116. [Google Scholar] [CrossRef]
  6. Mei, L.; Yu, M.; Jia, L.; Fu, M. Crowd Density Estimation via Global Crowd Collectiveness Metric. Drones 2024, 8, 616. [Google Scholar] [CrossRef]
  7. Mei, L.; Lai, J.; Chen, Z.; Xie, X. Measuring crowd collectiveness via global motion correlation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1222–1231. [Google Scholar]
  8. Stöcker, C.; Bennett, R.; Nex, F.; Gerke, M.; Zevenbergen, J. Review of the current state of UAV regulations. Remote Sens. 2017, 9, 459. [Google Scholar] [CrossRef]
  9. Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16266–16275. [Google Scholar]
  10. Zhang, Q.; Wang, L.; Patel, V.M.; Xie, X.; Lai, J. View-decoupled transformer for person re-identification under aerial-ground camera network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 22000–22009. [Google Scholar]
  11. Wang, L.; Zhang, Q.; Qiu, J.; Lai, J. Rotation exploration transformer for aerial person re-identification. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  12. Nguyen, H.; Nguyen, K.; Sridharan, S.; Fookes, C. AG-ReID. v2: Bridging aerial and ground views for person re-identification. IEEE Trans. Inf. Forensics Secur. 2024, 19, 2896–2908. [Google Scholar] [CrossRef]
  13. Mei, L.; Lai, J.; Feng, Z.; Xie, X. From pedestrian to group retrieval via siamese network and correlation. Neurocomputing 2020, 412, 447–460. [Google Scholar] [CrossRef]
  14. Cai, Y.; Takala, V.; Pietikainen, M. Matching groups of people by covariance descriptor. In Proceedings of the IEEE 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2744–2747. [Google Scholar]
  15. Zhang, G.; Liu, T.; Ye, Z. Dynamic Screening Strategy Based on Feature Graphs for UAV Object and Group Re-Identification. Remote Sens. 2024, 16, 775. [Google Scholar] [CrossRef]
  16. Mei, L.; Lai, J.; Feng, Z.; Xie, X. Open-world group retrieval with ambiguity removal: A benchmark. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 584–591. [Google Scholar]
  17. Yan, Y.; Qin, J.; Ni, B.; Chen, J.; Liu, L.; Zhu, F.; Zheng, W.S.; Yang, X.; Shao, L. Learning multi-attention context graph for group-based re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 7001–7018. [Google Scholar] [CrossRef]
  18. Yang, D.; Lai, J.; Mei, L. Deep representations based on sparse auto-encoder networks for face spoofing detection. In Proceedings of the 11th Chinese Conference on Biometric Recognition (CCBR), Chengdu, China, 14–16 October 2016; pp. 620–627. [Google Scholar]
  19. Xiong, J.; Lai, J. Similarity metric learning for RGB-infrared group re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13662–13671. [Google Scholar]
  20. Zhang, Q.; Lai, J.; Xie, X.; Jin, X.; Huang, S. Separable Spatial-Temporal Residual Graph for Cloth-Changing Group Re-Identification. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5791–5805. [Google Scholar] [CrossRef]
  21. Zhang, Q.; Lai, J.; Feng, Z.; Xie, X. Uncertainty modeling for group re-identification. Int. J. Comput. Vis. 2024, 132, 3046–3066. [Google Scholar] [CrossRef]
  22. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  23. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 79–88. [Google Scholar]
  24. Xiong, F.; Gou, M.; Camps, O.; Sznaier, M. Person re-identification using kernel-based metric learning methods. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VII 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 1–16. [Google Scholar]
  25. Zhang, Q.; Lai, J.; Feng, Z.; Xie, X. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification. IEEE Trans. Image Process. 2021, 31, 352–365. [Google Scholar] [CrossRef]
  26. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15013–15022. [Google Scholar]
  27. Chen, S.; Ye, M.; Du, B. Rotation invariant transformer for recognizing object in uavs. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2565–2574. [Google Scholar]
  28. Nguyen, H.; Nguyen, K.; Sridharan, S.; Fookes, C. Aerial-ground person re-id. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2585–2590. [Google Scholar]
  29. Sun, X.; Zheng, L. Dissecting person re-identification from the viewpoint of viewpoint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 608–617. [Google Scholar]
  30. Lin, Y.; Xie, L.; Wu, Y.; Yan, C.; Tian, Q. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3390–3399. [Google Scholar]
  31. Zou, C.; Chen, Z.; Cui, Z.; Liu, Y.; Zhang, C. Discrepant and multi-instance proxies for unsupervised person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 11058–11068. [Google Scholar]
  32. Ge, Y.; Chen, D.; Li, H. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv 2020, arXiv:2001.01526. [Google Scholar] [CrossRef]
  33. Cho, Y.; Kim, W.J.; Hong, S.; Yoon, S.E. Part-based pseudo label refinement for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7308–7318. [Google Scholar]
  34. Ge, Y.; Zhu, F.; Chen, D.; Zhao, R. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Adv. Neural Inf. Process. Syst. 2020, 33, 11309–11321. [Google Scholar]
  35. Layne, R.; Hospedales, T.M.; Gong, S. Investigating open-world person re-identification using a drone. In Proceedings of the Computer Vision—ECCV 2014 Workshops, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part III 13. Springer: Berlin/Heidelberg, Germany, 2015; pp. 225–240. [Google Scholar]
  36. Albaluchi, Y.; Fu, B.; Damer, N.; Ramachandra, R.; Raja, K. UAV-based person re-identification: A survey of UAV datasets, approaches, and challenges. Comput. Vis. Image Underst. 2025, 251, 104261. [Google Scholar] [CrossRef]
  37. Liang, W.; Wang, G.; Lai, J.; Xie, X. Homogeneous-to-heterogeneous: Unsupervised learning for RGB-infrared person re-identification. IEEE Trans. Image Process. 2021, 30, 6392–6407. [Google Scholar] [CrossRef] [PubMed]
  38. Yang, B.; Ye, M.; Chen, J.; Wu, Z. Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2843–2851. [Google Scholar]
  39. Cheng, D.; He, L.; Wang, N.; Zhang, S.; Wang, Z.; Gao, X. Efficient bilateral cross-modality cluster matching for unsupervised visible-infrared person reid. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1325–1333. [Google Scholar]
  40. Wu, Z.; Ye, M. Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9548–9558. [Google Scholar]
  41. Wang, J.; Zhang, Z.; Chen, M.; Zhang, Y.; Wang, C.; Sheng, B.; Qu, Y.; Xie, Y. Optimal transport for label-efficient visible-infrared person re-identification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 93–109. [Google Scholar]
  42. Zhu, F.; Chu, Q.; Yu, N. Consistent matching based on boosted salience channels for group re-identification. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 4279–4283. [Google Scholar]
  43. Lin, W.; Li, Y.; Xiao, H.; See, J.; Zou, J.; Xiong, H.; Wang, J.; Mei, T. Group reidentification with multigrained matching and integration. IEEE Trans. Cybern. 2019, 51, 1478–1492. [Google Scholar] [CrossRef]
  44. Huang, Z.; Wang, Z.; Hu, W.; Lin, C.W.; Satoh, S. DoT-GNN: Domain-transferred graph neural network for group re-identification. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1888–1896. [Google Scholar]
  45. Zhu, J.; Yang, H.; Lin, W.; Liu, N.; Wang, J.; Zhang, W. Group re-identification with group context graph neural networks. IEEE Trans. Multimed. 2020, 23, 2614–2626. [Google Scholar] [CrossRef]
  46. Lisanti, G.; Martinel, N.; Del Bimbo, A.; Luca Foresti, G. Group re-identification via unsupervised transfer of sparse features encoding. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2449–2458. [Google Scholar]
  47. Zhang, Q.; Dang, K.; Lai, J.H.; Feng, Z.; Xie, X. Modeling 3d layout for group re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7512–7520. [Google Scholar]
  48. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
  49. Rezatofighi, S.H.; Milan, A.; Zhang, Z.; Shi, Q.; Dick, A.; Reid, I. Joint probabilistic data association revisited. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3047–3055. [Google Scholar]
  50. Bruff, D. The assignment problem and the Hungarian method. Notes Math 2005, 20, 5. [Google Scholar]
  51. Mei, L.; Lai, J.; Xie, X.; Zhu, J.; Chen, J. Illumination-invariance optical flow estimation using weighted regularization transform. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 495–508. [Google Scholar] [CrossRef]
  52. Mei, L.; Lai, J.; Chen, Z. Geodesic-based probability propagation for efficient optical flow. Electron. Lett. 2018, 54, 758–760. [Google Scholar] [CrossRef]
  53. Pan, C.; Fang, H.; Zhang, H.; Wan, Z.; Mei, L.; Wu, S. Visual Attention-Guided Weighted Naïve Bayes for Behavior Intention Inference. In Proceedings of the 2nd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics (AIHCIR), Tianjin, China, 8–10 December 2023; pp. 569–574. [Google Scholar]
  54. He, L.; Cheng, D.; Wang, N.; Gao, X. Exploring Homogeneous and Heterogeneous Consistent Label Associations for Unsupervised Visible-Infrared Person ReID. Int. J. Comput. Vis. 2024, 1–20. [Google Scholar] [CrossRef]
  55. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  56. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  57. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  58. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  59. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 1487–1495. [Google Scholar]
  60. He, L.; Liao, X.; Liu, W.; Liu, X.; Cheng, P.; Mei, T. Fastreid: A pytorch toolbox for general instance re-identification. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9664–9667. [Google Scholar]
  61. Mei, L.; Lai, J.; Feng, Z.; Chen, Z.; Xie, X. Person re-identification using group constraint. In Proceedings of the Intelligence Science and Big Data Engineering. Visual Data Engineering: 9th International Conference, IScIDE 2019, Nanjing, China, 17–20 October 2019; Proceedings, Part I 9. Springer: Berlin/Heidelberg, Germany, 2019; pp. 459–471. [Google Scholar]
  62. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  63. Mei, L.; He, Y.; Fishani, F.; Yu, Y.; Zhang, L.; Helge, R. Learning Domain-Adaptive Landmark Detection-Based Self-Supervised Video Synchronization for Remote Sensing Panorama. Remote Sens. 2023, 15, 953. [Google Scholar] [CrossRef]
  64. Mei, L.; Fu, M.; Wang, B.; Jia, L.; Yu, M.; Zhang, Y.; Zhang, L. LSN-GTDA: Learning Symmetrical Network via Global Thermal Diffusion Analysis for Pedestrian Trajectory Prediction in Unmanned Aerial Vehicle Scenarios. Remote Sens. 2025, 17, 154. [Google Scholar] [CrossRef]
Figure 1. Illustration of aerial-ground person re-identification and group re-identification.
Figure 2. The pipeline of the proposed method. The UAGRPG framework leverages neighbor-aware collaborative learning (NCL) and the gradual graph matching (GGC) strategy to explore implicit cross-modality group associations. A collaborative cross-modality association learning (CCAL) module is then developed to address feature disparities and enable soft alignment between the aerial and ground modalities. Finally, a minimum pedestrian distance transformation strategy is designed to quantify and optimize group similarity across modalities.
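As a concrete illustration of the group-level matching summarized in the Figure 2 caption, the sketch below shows one plausible reading of a minimum-pedestrian-distance group similarity: each aerial group member is associated with its closest ground member, and the per-member minimum distances are aggregated into a single group distance. This is a minimal sketch under our own assumptions; the function name `group_distance` and the cosine-distance/mean-aggregation choices are illustrative and not taken from the paper.

```python
import numpy as np

def group_distance(aerial_members: np.ndarray, ground_members: np.ndarray) -> float:
    """Illustrative group-level distance built from member-level embeddings.

    aerial_members: (m, d) L2-normalized embeddings of the aerial group.
    ground_members: (n, d) L2-normalized embeddings of the ground group.
    Each aerial member is matched to its nearest ground member, and the group
    distance is the mean of these per-member minimum distances (a hypothetical
    reading of a minimum-pedestrian-distance strategy).
    """
    # Pairwise cosine distances between members of the two groups.
    pairwise = 1.0 - aerial_members @ ground_members.T   # shape (m, n)
    # For every aerial pedestrian, keep the distance to the closest ground pedestrian.
    per_member_min = pairwise.min(axis=1)                # shape (m,)
    return float(per_member_min.mean())

# Toy usage with random unit-norm embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 128)); a /= np.linalg.norm(a, axis=1, keepdims=True)
g = rng.normal(size=(4, 128)); g /= np.linalg.norm(g, axis=1, keepdims=True)
print(group_distance(a, g))
```

Taking the minimum over ground members makes the group distance tolerant to membership changes, since an aerial pedestrian missing from the ground view only weakens, rather than dominates, the aggregated score.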
Figure 3. Representative examples from the proposed AG-GReID dataset. Each pair (a,e), (b,f), (c,g), and (d,h) belongs to the same group identity across the two modalities. Rows 1 and 2 show the aerial and ground group images, respectively.
Figure 4. Visualizations of cross-modality aerial-ground person re-identification by the proposed method on the AG-GReID dataset. “A–G” and “G–A” represent aerial-to-ground retrieval and ground-to-aerial retrieval, respectively. Red and blue bounding boxes denote true and false retrieval results, respectively.
Figure 5. Qualitative results for randomly selected group identities. The first and second rows show the t-SNE embeddings and the similarity distributions, respectively. In the t-SNE plots, identities are marked with different colors; triangles denote the aerial modality and circles denote the ground modality. (a,b) show the results of the baseline and the proposed UAGRPG method, respectively.
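Visualizations such as the t-SNE plots in Figure 5 can be reproduced with an off-the-shelf implementation [62]. The sketch below is a generic example that assumes precomputed aerial and ground embeddings with identity labels; the placeholder arrays `aerial_feats`, `ground_feats`, and the six-identity toy setup are our own assumptions, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder embeddings: (num_samples, feat_dim) arrays with integer identity labels.
rng = np.random.default_rng(0)
aerial_feats, aerial_ids = rng.normal(size=(60, 256)), rng.integers(0, 6, 60)
ground_feats, ground_ids = rng.normal(size=(60, 256)), rng.integers(0, 6, 60)

feats = np.concatenate([aerial_feats, ground_feats])
ids = np.concatenate([aerial_ids, ground_ids])
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)

# Triangles for aerial samples, circles for ground samples, colored by identity.
n_a = len(aerial_feats)
plt.scatter(emb[:n_a, 0], emb[:n_a, 1], c=ids[:n_a], marker="^", cmap="tab10")
plt.scatter(emb[n_a:, 0], emb[n_a:, 1], c=ids[n_a:], marker="o", cmap="tab10")
plt.axis("off")
plt.show()
```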
Figure 6. Visualizations of cross-modality aerial-ground group re-identification by the proposed method. “A–G” and “G–A” represent aerial-to-ground retrieval (blue dashed boxes) and ground-to-aerial retrieval (green dashed boxes), respectively. Red and blue bounding boxes denote true and false retrieval results, respectively.
Table 1. Comparisons of different state-of-the-art unsupervised person re-identification baselines under cross-modality protocol (%). “A–G” and “G–A” represent “aerial-to-ground” protocol and “ground-to-aerial” protocol of person re-identification, respectively. Rn means Rank-n in all the following tables.
Model | Protocol | R1 | R5 | R10 | R20 | mAP | mINP | Publication
USL [40] | A–G | 31.82 | 42.35 | 47.61 | 54.30 | 19.96 | 4.30 | CVPR’2023
MBCCM [39] | A–G | 32.31 | 48.27 | 55.37 | 62.28 | 19.49 | 4.05 | MM’2023
MULT [54] | A–G | 28.29 | 45.10 | 51.82 | 57.61 | 13.30 | 2.16 | IJCV’2024
Ours | A–G | 59.20 | 69.28 | 73.48 | 77.68 | 40.94 | 14.88 | -
USL [40] | G–A | 42.62 | 54.57 | 61.95 | 66.84 | 28.72 | 7.85 | CVPR’2023
MBCCM [39] | G–A | 42.00 | 53.95 | 59.15 | 65.18 | 28.96 | 7.88 | MM’2023
MULT [54] | G–A | 17.57 | 27.65 | 34.82 | 43.04 | 10.60 | 1.91 | IJCV’2024
Ours | G–A | 50.94 | 62.06 | 66.94 | 72.04 | 36.76 | 12.52 | -
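For readers reimplementing the evaluation, the metrics reported in Table 1 and the following tables (Rank-n, mAP, and mINP) can be computed from a query-gallery distance matrix as in the sketch below. This is a generic, simplified implementation (camera-based filtering is omitted), and the function name `evaluate_reid` is our own assumption rather than the paper's evaluation code.

```python
import numpy as np

def evaluate_reid(dist, q_ids, g_ids, topk=(1, 5, 10, 20)):
    """Compute CMC Rank-k, mAP, and mINP from a (num_query, num_gallery)
    distance matrix with query/gallery identity labels."""
    cmc_hits = np.zeros(max(topk))
    aps, inps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                        # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(int)
        if matches.sum() == 0:
            continue                                       # query has no true match
        # CMC: the first rank at which a true match appears counts for all k >= that rank.
        first_hit = int(np.argmax(matches))
        cmc_hits[first_hit:] += 1
        # AP: average of precision values at every true-match position.
        hit_positions = np.where(matches == 1)[0]
        precisions = (np.arange(len(hit_positions)) + 1) / (hit_positions + 1)
        aps.append(precisions.mean())
        # INP: fraction of true matches retrieved by the rank of the hardest one.
        inps.append(len(hit_positions) / (hit_positions[-1] + 1))
    cmc = cmc_hits / len(aps)
    result = {f"Rank-{k}": float(cmc[k - 1]) for k in topk}
    result["mAP"] = float(np.mean(aps))
    result["mINP"] = float(np.mean(inps))
    return result
```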
Table 2. Comparisons of different state-of-the-art person re-identification baselines under aerial-to-aerial protocol (%). “S” and “U” denote the supervised methods and unsupervised methods, respectively.
Model | mAP | R1 | mINP | Publication | Supervision
Swin [55] | 67.37 | 68.23 | - | Arxiv’2021 | S
HRNet-18 [56] | 64.52 | 65.48 | - | TPAMI’2021 | S
SwinV2 [57] | 69.15 | 70.12 | - | CVPR’2022 | S
MGN [58] | 70.40 | 70.38 | - | MM’2018 | S
BoT [59] | 63.41 | 62.48 | - | CVPRW’2019 | S
SBS [60] | 65.93 | 66.38 | - | MM’2023 | S
V2E [12] | 71.47 | 72.75 | - | TIFS’2024 | S
USL [40] | 67.76 | 96.83 | 23.48 | CVPR’2023 | U
MBCCM [39] | 55.43 | 92.25 | 12.99 | MM’2023 | U
MULT [54] | 48.19 | 90.76 | 8.19 | IJCV’2024 | U
Ours | 76.02 | 98.13 | 37.66 | - | U
Table 3. Comparisons of state-of-the-art models under cross-modality group re-identification protocol on AG-GReID dataset (%). “A–G” and “G–A” represent “aerial-to-ground” protocol and “ground-to-aerial” protocol of group re-identification, respectively.
Model | Protocol | R1 | R5 | R10 | R20 | mAP | mINP | Publication
USL [40] | A–G | 72.22 | 90.74 | 98.15 | 100.00 | 80.32 | 40.19 | CVPR’2023
MBCCM [39] | A–G | 53.70 | 83.33 | 90.74 | 94.44 | 65.35 | 14.55 | MM’2023
MULT [54] | A–G | 40.74 | 70.37 | 81.48 | 87.04 | 55.38 | 11.46 | IJCV’2024
Ours | A–G | 90.74 | 96.30 | 96.30 | 98.15 | 92.25 | 56.88 | -
USL [40] | G–A | 57.41 | 77.78 | 85.19 | 94.44 | 66.36 | 26.44 | CVPR’2023
MBCCM [39] | G–A | 51.85 | 74.07 | 88.89 | 96.30 | 63.85 | 6.73 | MM’2023
MULT [54] | G–A | 25.93 | 48.15 | 64.81 | 79.63 | 37.76 | 6.89 | IJCV’2024
Ours | G–A | 74.07 | 94.44 | 98.15 | 100.00 | 81.87 | 42.25 | -
Table 4. Ablation study of person re-identification task (%). “A–G” and “G–A” represent “aerial-to-ground” protocol and “ground-to-aerial” protocol of person re-identification, respectively.
NCL | GGC | ACCL | SCMA | R1 (A–G) | mAP (A–G) | mINP (A–G) | R1 (G–A) | mAP (G–A) | mINP (G–A)
Yes | - | - | - | 46.78 | 29.03 | 7.59 | 35.34 | 23.93 | 6.60
Yes | Yes | - | - | 53.22 | 34.84 | 10.46 | 39.50 | 27.68 | 7.87
Yes | Yes | Yes | - | 54.72 | 36.09 | 11.75 | 43.76 | 30.80 | 9.44
Yes | Yes | Yes | Yes | 59.20 | 40.94 | 14.88 | 50.94 | 36.76 | 12.52
Table 5. Ablation study of group re-identification task (%). “A–G” and “G–A” represent “aerial-to-ground” protocol and “ground-to-aerial” protocol of group re-identification, respectively.
NCL | GGC | ACCL | SCMA | R1 (A–G) | mAP (A–G) | mINP (A–G) | R1 (G–A) | mAP (G–A) | mINP (G–A)
Yes | - | - | - | 74.07 | 80.73 | 49.10 | 57.41 | 66.79 | 31.65
Yes | Yes | - | - | 75.93 | 83.01 | 50.50 | 61.11 | 68.34 | 31.82
Yes | Yes | Yes | - | 87.04 | 90.38 | 58.24 | 70.37 | 77.70 | 40.22
Yes | Yes | Yes | Yes | 90.74 | 92.25 | 56.88 | 74.07 | 81.87 | 42.25
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
