Article

Pedestrian Re-Identification Algorithm Based on Unmanned Aerial Vehicle Imagery

1 School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China
2 Inner Mongolia Autonomous Region Key Laboratory of Intelligent Perception and System Engineering, Hohhot 010080, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1256; https://doi.org/10.3390/app15031256
Submission received: 26 November 2024 / Revised: 17 January 2025 / Accepted: 22 January 2025 / Published: 26 January 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Pedestrian re-identification in complex scenarios is often hindered by viewpoint diversity, background interference, and behavioral complexity, which traditional methods struggle to address in wide-area surveillance. Unmanned Aerial Vehicles (UAVs) offer a promising solution owing to their flexibility and extensive coverage, but UAV aerial images introduce additional challenges, including significant viewpoint variations and complex pedestrian behaviors. To address these issues, this paper proposes a Transformer-based model that integrates a multi-scale graph convolutional network (MU-GCN) with a non-local attention mechanism. The MU-GCN first extracts detailed features at multiple scales through multi-scale convolution kernels and then refines their representation with graph convolutional networks, strengthening the model’s focus on local information. Meanwhile, the non-local attention mechanism enhances the model’s capacity to capture global contextual information by modeling dependencies between distant regions of the image, making the approach well suited to the characteristics of UAV aerial imagery. Experimental results demonstrate that, compared to the baseline model, the proposed method achieves improvements of 9.5% in mean average precision (mAP) and 4.9% in Rank-1 accuracy, validating the effectiveness of the model.

1. Introduction

Pedestrian re-identification (ReID) is a crucial technology used in video surveillance, intelligent transportation, and security systems [1]. Its main objective is to accurately identify and match the same pedestrian across various cameras and perspectives, enabling numerous practical applications. Unlike Pedestrian Recognition, which categorizes pedestrians from different cameras and viewpoints, pedestrian ReID focuses on distinguishing individuals within the same category. This requires advanced feature extraction and model development, as ReID must not only differentiate between various pedestrians but also accurately match the same individuals in large-scale datasets.
To address the challenges of pedestrian ReID in large-scale and complex environments, researchers have explored more fine-grained and efficient object identification methods. Specific Object Identification (SOI) has emerged as a significant research direction. Various approaches have been proposed to achieve efficient SOI, including discriminative feature extraction using deep learning networks to capture unique information that distinguishes individuals in pedestrian images; fuzzy clustering and classification methods for handling complex and uncertain data, especially when different individuals belong to the same pedestrian class; and hierarchical clustering methods for more detailed pedestrian classification through multilevel clustering. These methods not only extract unique local features in pedestrian images but also enhance individual recognition through refined classification schemes.
However, unbalanced data distributions and complex feature spaces remain challenging. To address these issues, reference [2] introduced an enhanced gravitational classifier based on the Geometrical Divide method. This approach addresses the inefficiencies of the traditional 1 ÷ 1 composite base method in managing overlapping categories by geometrically partitioning class data particles. Using the Moons and Circles datasets, the study experimentally validated the Geometrical Divide method, employing multiple data particle quality determination algorithms alongside k-fold cross-validation. The results demonstrate that the Geometrical Divide method outperforms traditional approaches in classifying these datasets. Rybak et al. [3] further explored the Geometrical Divide (GD) method and its improved variant, the Unequal Geometrical Divide (UGD), within a gravitational classification framework. The GD method alleviates inter-class overlap by determining dividing lines through center-of-mass and geometric-center calculations. The UGD method builds upon this by enhancing the handling of minority-class samples. Experimental evaluations confirmed the superior performance of these methods on various unbalanced datasets. While originally designed for general classification tasks, the underlying theoretical framework provides valuable insights for feature optimization and pedestrian re-identification in complex environments.
In recent years, UAVs have shown significant advantages in large-scale surveillance due to their flexibility and wide-area coverage capabilities [4]. UAVs can adjust their flight height and path to capture multi-angle aerial images, making them extensively studied for target detection. Additionally, multi-sensor data fusion technology offers a novel approach for UAV target detection, significantly enhancing both the accuracy and robustness of detections. Vitiello et al. [5] introduced a two-layer fusion strategy. First, they compared visual detection results with radar detection projections in the RGB image plane to eliminate radar clutter. Next, they fused radar and visual information using extended Kalman filtering (EKF) to accurately determine the target’s position and velocity. Marques et al. [6] significantly enhanced detection accuracy in low-visibility environments by superimposing infrared images onto RGB images, creating multi-channel input data. Golcarenarenji et al. [7] addressed human body detection for UAVs in harsh conditions by applying a data hybrid weighting method to integrate visible and infrared images, thereby improving the accuracy of human target detection. Liu et al. [8] combined a twin deep learning network for semantic segmentation of RGB images with data fusion techniques to generate LiDAR point clouds with semantic labels, enabling the detection of obstacles in landing areas and accurate distance measurements.
Although multi-sensor data fusion significantly enhances UAV target detection and offers new opportunities for pedestrian re-identification, it still faces issues such as changing viewpoints [9], diverse pedestrian appearances [10], and complex backgrounds [11]. Despite these advancements, existing methods struggle with extreme viewpoint variations and the complexity of pedestrian behaviors, which severely limit identification accuracy. To overcome these obstacles, this study proposes a multi-scale graph convolutional network (MU-GCN) combined with a non-local attention mechanism for Transformer-based Person Re-identification (MNTReID). The contributions of this paper are summarized as follows:
  • The design of the multi-scale graph convolution network (MU-GCN): The MU-GCN is integrated into the stitching branch of the Transformer. This network captures detailed local features of pedestrian images using multi-scale convolutional kernels. It further enhances these features through graph convolutional networks. This approach is particularly effective at adapting to different feature scales, which is crucial for handling variable pedestrian poses and size changes in UAV views.
  • The integration of the non-local attention mechanism: The non-local attention mechanism is introduced into the Transformer’s global branch. This mechanism overcomes the limitations of the traditional self-attention mechanism in capturing long-range dependencies. It significantly improves the model’s ability to integrate global contextual information, thereby enhancing performance in complex environments.
  • Experimental validation on the UAV aerial image dataset: Experiments were conducted on a UAV aerial image dataset. The results show that the proposed model improves the mean Average Precision (mAP) by 9.5% and Rank-1 accuracy by 4.9% compared to the baseline model. These findings demonstrate the effectiveness of the proposed approach.

2. Related Work

2.1. Conventional Pedestrian Re-Identification

Early ReID methods predominantly relied on global features [12], neglecting the capture of local information and resulting in relatively low identification accuracy. To overcome this limitation, researchers have progressively incorporated local feature learning techniques [13], enabling the capture of more detailed pedestrian characteristics. However, this approach introduces new challenges, including redundancy and instability in local information.
To address these issues and further enhance the model’s identification capabilities, researchers have integrated multi-scale feature fusion [14] and attention mechanisms [15]. These methods improve the model’s ability to effectively capture and interpret features across different scales and granularities. For example, reference [16] enhances the capture of pedestrian contours and local details by introducing edge information. Reference [17] implements an attention pooling mechanism that allows the model to focus on salient regions in pedestrian images, thereby effectively extracting discriminative features and mitigating the effects of scale, pose, and occlusion. Additionally, to optimize the expressiveness of the feature space, various loss functions have been designed to improve identification accuracy. Specifically, reference [18] proposes a novel loss function, Marginal Cosine Softmax Loss (MCSL), which enhances the performance of pedestrian ReID models in metric learning.
With the expansion of datasets and the increase in annotation costs, unsupervised learning has gradually become a research hotspot for pedestrian re-identification [19,20]. SOI effectively identifies and differentiates similar pedestrians by focusing on feature information in pedestrian images. It combines clustering and optimization algorithms to deeply mine the intrinsic structure and patterns of the data, thereby enhancing the model’s adaptability.
The attention-driven framework (AFC) proposed in reference [21] integrates attention mechanisms with clustering optimization techniques. This combination further improves the model’s identification ability, especially when relying on clustering methods and pseudo-labeling optimization, effectively addressing the uncertainty problem.
The Enhanced Feature Representation and Robust Clustering (EFRRC) method introduced in reference [22] enhances feature representation among human body parts by incorporating a relational network. It extracts global features from images through the Global Comparison Pooling (GCP) module, resolving issues related to data inhomogeneity and clustering, thereby improving identification accuracy in complex scenarios. Lin et al. [23] proposed a bottom-up clustering method aimed at making clustering results more consistent with real-world distributions by jointly optimizing the relationship between the model and individual samples. Zeng et al. [24] introduced a hierarchical clustering method that employs a hard-batch triplet loss. This loss function is crucial for mining similarities between samples through hierarchical clustering and mitigating the impact of complex samples on clustering results, thereby improving label quality. Graph convolutional networks (GCNs) [25] have also been applied to pedestrian re-identification tasks to better process structural information in images and enhance identification robustness. Additionally, reference [26] proposes a new architecture that combines attribute features by mapping manually labeled pedestrian features to word vectors and integrating them with body part features. This approach enhances feature representation in occlusion situations.
To address these limitations, Transformer-based models have been proposed for pedestrian re-identification [27]. These models capture long-range dependencies more effectively and improve the understanding of global contextual information. A representative example is TransReID [28], in which the Transformer partitions images into multiple patches and optimizes feature aggregation, resulting in a robust baseline model with improved performance.

2.2. Pedestrian Re-Identification for UAVs

Meta-learning and transfer learning approaches have been utilized to enhance the model’s adaptability to UAV-captured images [29,30]. The subspace pooling convolutional feature map technique has demonstrated potential in improving identification in complex environments by extracting rich and diverse features [31]. Additionally, unsupervised learning methods have opened new avenues for enhancing model adaptability to UAV imagery by facilitating the deep exploration of the inherent structure and patterns within the data [32,33]. The integration of covariance information has further strengthened model robustness by combining traditional hand-crafted features with deep learning representations, thereby improving the model’s ability to handle viewpoint variations [34]. Huang et al. [35] proposed a multi-resolution feature-aware network that significantly enhances model performance by incorporating self-attention and cross-attention modules, which effectively address resolution variations and complex backgrounds. Moreover, independent approaches that integrate triplet loss, large-margin Gaussian mixture loss, multi-branch architectures, and channel group learning strategies have been proposed to enhance model robustness and accuracy in complex UAV environments [36]. Reference [37] proposes an innovative Masked Relation Guided Transformer (MRG-T) framework that addresses the information redundancy problem in pedestrian attribute identification through three modules: the Mask Region Relationship Module (MRRM) focuses on key regions to extract robust features; the Mask Attribute Relationship Module (MARM) mines semantic associations among attributes through attribute label masks; and the Region and Attribute Mapping Module (RAMM) aligns spatial regions and attributes via a cross-attention mechanism. Hu et al. [38] proposed a Transformer-based ReID algorithm that significantly improves fine-grained image retrieval performance. Additionally, reference [39] proposes a Vision Transformer (ViT)-based parametric instance learning method for pedestrian re-identification, in which feature alignment and similarity are optimized through self-supervised learning and instance differentiation, and robustness to occlusion and alignment problems is enhanced by zero-padding and displacement techniques. To address scale variation and occlusion in pedestrian re-identification from UAV footage, a Multi-Granularity Attention-in-Attention (MGAiA) network is proposed in reference [40]. A multi-granularity attention (MGA) module is designed to enhance the global perception of the feature extraction model and to explore discriminative features under scale variation, and an attention-in-attention mechanism (AiA) generates attention weights at different granularities, reducing the negative impact of occlusion.

3. Improvement of Transformer-Based Pedestrian Re-Identification Algorithm

The Transformer model was initially introduced for natural language processing but has been successfully applied to computer vision tasks. Its self-attention mechanism enables the capture and integration of global contextual information. This mechanism overcomes the limitations of convolutional neural networks (CNNs), which are confined to extracting features within a fixed local receptive field. Consequently, the Transformer model exhibits greater adaptability and robustness in processing pedestrian images under varying poses, dynamic lighting conditions, and complex backgrounds. Vision Transformer (ViT) is an adaptation of the Transformer architecture specifically designed for vision tasks. ViT partitions input images into smaller patches and employs positional encoding to maintain spatial relationships. This approach allows the model to efficiently process image data by capturing both global and local features.
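To make the patch-partitioning step concrete, the following minimal PyTorch sketch shows how an image can be split into fixed-size patches, linearly projected, and combined with learnable positional embeddings. The 256 × 128 input size, 16 × 16 patch size, and embedding dimension are illustrative assumptions, not the exact settings of the model described in this paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed them with positions."""
    def __init__(self, img_size=(256, 128), patch_size=16, in_ch=3, dim=768):
        super().__init__()
        num_patches = (img_size[0] // patch_size) * (img_size[1] // patch_size)
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)     # prepend a class token
        tokens = torch.cat([cls, tokens], dim=1)
        return tokens + self.pos_embed                     # positional embeddings preserve spatial order
```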
To address the challenges associated with cross-camera variations, ViT incorporates modules such as the Jigsaw Patch Module and the Side Information Embedding Module. These components enhance the model’s robustness to local perturbations and enable the integration of non-visual information, such as camera ID and viewpoint. However, despite these advancements, ViT’s ability to extract local features and global contextual information remains insufficient when handling the complex behaviors of pedestrians in multi-view UAV aerial imagery.
To address these limitations, we propose MNTReID, a novel framework that combines three core components: the Transformer network, the multi-scale graph convolutional network (MU-GCN), and the non-local attention mechanism. In the local feature extraction branch of the Transformer, the MU-GCN utilizes multi-scale convolutional kernels to capture detailed features of pedestrians effectively. The extracted features are further refined through graph convolutional operations, which enhance the model’s ability to focus on complex pedestrian behavioral patterns. In the global feature branch, the non-local attention mechanism addresses the Transformer’s limitations in capturing long-range dependencies. By establishing dependencies between arbitrary positions in the image, this mechanism allows the model to integrate global contextual information more effectively, thereby enhancing performance in multi-view UAV imagery. The structure of MNTReID is illustrated in Figure 1.
During the model training phase, this study employs a combined approach of ArcFace Loss and Triplet Loss. ArcFace Loss improves the separability of inter-class features by incorporating angular constraints during classification. It also reduces intra-class feature variations. This dual effect enhances the model’s ability to distinguish between similar pedestrians. In contrast, Triplet Loss refines the organization of the feature space by minimizing the distance between positive samples and maximizing the distance between negative samples. This process improves the model’s discriminative capability. The joint application of these loss functions aims to enhance both the identification accuracy and the generalization capability of the model.
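As a rough illustration of this joint objective, the sketch below combines an ArcFace-style angular-margin classification loss with a batch-hard triplet loss in PyTorch; the margin and scale values are illustrative assumptions rather than the paper’s exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Angular-margin classification head: adds margin m to the target-class angle, scales by s."""
    def __init__(self, feat_dim, num_ids, s=30.0, m=0.30):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        # Cosine similarity between L2-normalized features and class centers.
        cos = F.linear(F.normalize(feats), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

def batch_hard_triplet(feats, labels, margin=0.3):
    """Triplet loss using the hardest positive and hardest negative inside the batch."""
    dist = torch.cdist(feats, feats)                              # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same_id.float()).max(dim=1).values      # farthest same-identity sample
    hardest_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def joint_loss(feats, labels, arcface_head):
    """Total training objective: ArcFace classification loss plus triplet loss."""
    return arcface_head(feats, labels) + batch_hard_triplet(feats, labels)
```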
To fully exploit the complementary nature of global and local features, the global feature vectors, local feature vectors, and multi-scale graph convolutional features are concatenated during the testing phase. This concatenation results in a more comprehensive and discriminative pedestrian embedding. Pedestrian matching is efficiently conducted by computing the Euclidean distance between the integrated feature representation and the stored features in the pedestrian database.
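A minimal sketch of this test-time matching step, assuming the three feature vectors have already been extracted for the query and pre-computed for the gallery (names and shapes are illustrative):

```python
import torch

def match_query(global_f, local_f, mugcn_f, gallery_embeds, gallery_ids, top_k=10):
    """Concatenate global, local, and MU-GCN features, then rank the gallery by Euclidean distance."""
    query = torch.cat([global_f, local_f, mugcn_f], dim=-1)     # combined pedestrian embedding
    dists = torch.cdist(query.unsqueeze(0), gallery_embeds)[0]  # distances to every gallery entry
    order = dists.argsort()                                     # smaller distance = better match
    return gallery_ids[order[:top_k]]                           # IDs of the top-k candidates
```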

3.1. Multi-Scale Graph Convolutional Networks

In pedestrian re-identification tasks, especially when dealing with pedestrians exhibiting complex behaviors, the extraction of local features is crucial. While the Transformer model is effective at capturing global features, it has inherent limitations in extracting fine-grained local information. To overcome this challenge, this paper proposes a novel model, MU-GCN. MU-GCN is inspired by multi-scale feature fusion and graph convolutional networks. This model is specifically designed to address the complexities associated with pedestrian behavior in UAV aerial imagery. The structure of MU-GCN is illustrated in Figure 2.
Single-scale feature extraction typically emphasizes specific details at fixed scales. However, this approach may not adequately capture the complex behaviors of pedestrians, especially under varying UAV viewpoints. As a result, the model’s identification performance is adversely impacted. To overcome this limitation, MU-GCN employs a multi-scale feature extraction strategy. This strategy enhances the model’s ability to perceive detailed pedestrian features by using convolution kernels of varying scales.
In particular, the global feature map is divided into four distinct local regions (Local 1 to Local 4). Each region is processed using specific convolution kernels of varying sizes (1 × 1, 3 × 3, and 5 × 5). The multi-scale features are then integrated through a dynamic convolution mechanism. This design ensures that the model effectively captures pedestrian features at multiple scales. It also enhances attention to local details, particularly in scenarios involving pedestrians with complex behaviors. The structure of Local 1 is shown in Figure 3.
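The sketch below illustrates one plausible form of a single local branch: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions whose outputs are fused with input-dependent weights. The exact dynamic convolution design follows Figure 3, so the lightweight gating used here is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class MultiScaleLocal(nn.Module):
    """Extract 1x1 / 3x3 / 5x5 features from one local region and fuse them with learned weights."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)]
        )
        # A small gate predicts one fusion weight per branch from the region itself.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 3), nn.Softmax(dim=1))

    def forward(self, x):                                                     # x: (B, C, H, W) local region
        feats = torch.stack([branch(x) for branch in self.branches], dim=1)   # (B, 3, C, H, W)
        weights = self.gate(x).view(-1, 3, 1, 1, 1)                           # per-sample branch weights
        return (weights * feats).sum(dim=1)                                   # dynamically fused features
```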
The correlation among local features remains insufficiently captured following the multi-scale feature fusion process. To further enhance the representational power of local information, this study employs a GCN to process the local features after multi-scale fusion. The adjacency matrix A in the GCN is constructed by computing the similarity between feature points, representing the relationships among them. The similarity is typically defined as follows:
$$A_{ij} = \exp\left(-\frac{\|L_i - L_j\|^{2}}{2\delta^{2}}\right)$$
where $\|L_i - L_j\|$ is the Euclidean distance between the feature points $L_i$ and $L_j$, and the scale parameter $\delta$ controls how quickly the similarity decays with distance. The Euclidean distance is computationally simple, has clear geometric meaning, and intuitively reflects the differences between feature points. It also demonstrates good stability in high-dimensional feature spaces, which is crucial for effective feature aggregation in GCNs and allows the model to capture fine-grained details more effectively, thereby improving re-identification accuracy.
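A short sketch of this adjacency construction, written directly from the formula above (the value of delta is a hyperparameter and the one shown is illustrative):

```python
import torch

def gaussian_adjacency(feats, delta=1.0):
    """A_ij = exp(-||L_i - L_j||^2 / (2 * delta^2)) over a set of local feature vectors (N, D)."""
    sq_dist = torch.cdist(feats, feats) ** 2         # pairwise squared Euclidean distances
    return torch.exp(-sq_dist / (2 * delta ** 2))    # (N, N) similarity-based adjacency matrix
```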
Through the convolution operations of the GCN, the model exchanges information among feature points, enabling the aggregation of local features. The operation at each GCN layer can be expressed as:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $I$ is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the feature representation at layer $l$, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $\sigma$ is the activation function.
Through the layer-by-layer graph convolution operations of GCN, the model incrementally captures the relationships among local features. Additionally, the incorporation of dropout effectively mitigates overfitting, enhancing the model’s generalization performance. The structure of a single-layer graph convolution is shown in Figure 4.
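A compact sketch of the propagation rule above, including the self-loops, symmetric normalization, and dropout mentioned in the text; ReLU is assumed as the activation $\sigma$ and the dropout rate is illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer: H(l+1) = sigma(D~^-1/2 * A~ * D~^-1/2 * H(l) * W(l))."""
    def __init__(self, in_dim, out_dim, dropout=0.5):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)    # trainable W(l)
        self.dropout = nn.Dropout(dropout)

    def forward(self, H, A):
        A_tilde = A + torch.eye(A.size(0), device=A.device)     # add self-loops
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)               # diagonal of D~^(-1/2)
        A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
        return self.dropout(torch.relu(A_hat @ self.weight(H))) # aggregate neighbors, then activate
```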
The MU-GCN proposed in this study captures fine-grained local information at multiple scales. Additionally, it establishes global dependencies among these local features through graph convolutional networks. This approach improves the model’s identification accuracy and robustness. It is especially effective in addressing the diverse behaviors of pedestrians observed in UAV aerial imagery.

3.2. Non-Local Attention Mechanism

In UAV multi-view scenarios, pedestrians often exhibit complex postures. This results in the dispersion of features from different body parts across the image. Effectively capturing the global relationships among these dispersed features is essential for accurate pedestrian detection. Although the Transformer’s self-attention mechanism is highly effective in capturing global information, it relies on local neighborhoods or fixed context windows. This reliance limits its ability to fully leverage dependencies between distant regions in the image. To overcome this limitation, the non-local attention mechanism is integrated into the global branch of the Transformer model. This enhancement improves the model’s capability to capture and aggregate global dependencies across arbitrary positions in the image. Consequently, it enables a more comprehensive understanding of the global context.
The essence of the non-local attention mechanism lies in its ability to capture correlations between arbitrary regions of the image. This capability is crucial for achieving a holistic understanding of pedestrians. The mechanism generates weighted global features by calculating interactions between different locations in the feature map. This process enhances the model’s ability to effectively capture and utilize global information.
In the implementation, the input features are first passed through three 1 × 1 convolutional layers to generate three different feature maps: $\theta$, $\phi$, and $\gamma$. The similarity matrix $F$ between $\theta$ and $\phi$ is computed, typically using the dot product: $f(\theta_i, \phi_j) = \theta_i^{\top} \phi_j$, where $i$ and $j$ represent different positions in the feature map. The similarity matrix is then normalized using the softmax function to ensure that the sum of each row equals 1:
$$\mathrm{softmax}\left(f(\theta_i, \phi_j)\right) = \frac{\exp\left(f(\theta_i, \phi_j)\right)}{\sum_{k} \exp\left(f(\theta_i, \phi_k)\right)}$$
A weighted sum of γ is then computed using the normalized similarity matrix as weights:
$$y_i = \sum_{j} \mathrm{softmax}\left(f(\theta_i, \phi_j)\right) \gamma_j$$
Finally, the output feature maps are concatenated with the original input feature maps through a residual connection, as described by Z = X + Y . This residual connection preserves the original features and enhances the model’s representational capacity. The dimensionality of the output feature map remains consistent with that of the input feature map. This consistency enables the model to capture local information within the image. Additionally, the model integrates global contextual information through the non-local attention mechanism. Consequently, this integration improves the model’s ability to handle multi-view variations in UAV footage. The structure is illustrated in Figure 5.
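For reference, a standard non-local block consistent with the description above can be sketched as follows; the factor-of-two channel reduction inside the block is an assumption commonly used in non-local networks, not a value stated in this paper.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local attention: every position attends to every other position, with a residual output."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)
        self.gamma = nn.Conv2d(channels, inner, kernel_size=1)
        self.out = nn.Conv2d(inner, channels, kernel_size=1)     # restore the channel count

    def forward(self, x):                                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)         # (B, HW, C')
        phi = self.phi(x).flatten(2)                             # (B, C', HW)
        gamma = self.gamma(x).flatten(2).transpose(1, 2)         # (B, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)                # row-normalized similarity matrix
        y = (attn @ gamma).transpose(1, 2).reshape(b, -1, h, w)  # weighted sum of gamma features
        return x + self.out(y)                                   # residual connection Z = X + Y
```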
Although the non-local attention mechanism introduces additional computational overhead, it effectively integrates global information by capturing dependencies between distant regions within pedestrian images. This integration significantly enhances the overall structural detection capability. Additionally, it provides substantial advantages for pedestrian re-identification across multiple viewpoints in UAV footage.

4. Experiment

4.1. Experimental Environment

The methods proposed in this paper were implemented using the PyTorch 2.1.0 deep learning framework, which facilitated the construction of the network. Training was accelerated on NVIDIA GeForce RTX 3090 GPUs (NVIDIA, Santa Clara, CA, USA) with CUDA 11.3.

4.2. Dataset

The self-constructed dataset used in this study was captured with a DJI H20 aerial camera (DJI, Shenzhen, China) at altitudes ranging between 8 and 15 m. It contains a total of 580 pedestrians exhibiting a diverse range of poses. The training set comprises 12,749 images of 383 pedestrians, while the test set includes 5464 gallery images and 605 query images representing 197 pedestrians. The dataset is characterized by multiple viewpoints, complex backgrounds, clothing similarities, and diverse postures (including standing, walking, hurdling, and falling); the pedestrian images cover a wide range of viewpoints and environments captured at different heights and angles. This diversity enhances the model’s ability to adapt to dynamic scenes. Example aerial images and dataset samples are shown in Figure 6 and Figure 7.
The PRAI-1581 dataset, released in 2020, was captured using two DJI drones operating at altitudes ranging from 20 to 60 m, encompassing a diverse range of real-world drone surveillance scenarios. This dataset includes 1581 individuals, with a total of 39,461 pedestrian images.

4.3. Evaluation Metrics

In the pedestrian re-identification task, two key metrics are used to evaluate the model’s performance: retrieval accuracy (Rank-1) and mean average precision (mAP).
Rank-1 refers to the percentage of target pedestrian images that are correctly retrieved and ranked first by the model in the retrieval task. Specifically, for each query image in the test set, the model is required to retrieve the corresponding pedestrian image from the gallery that has the same ID. A retrieval is considered successful if the most similar image in the gallery matches the ID of the query image. Let $N_q$ represent the total number of images in the query set and $r$ the number of successful retrievals. Rank-1 is then computed as:
$$\mathrm{Rank\text{-}1} = \frac{r}{N_q}$$
The mean average precision (mAP) evaluates the overall retrieval performance of the model on the test set. It is computed by averaging the average precision (AP) over all query images. Assuming that there are $N_q$ query images and the average precision of the $j$-th query image is $AP(q_j)$, mAP is defined as:
$$\mathrm{mAP} = \frac{\sum_{j=1}^{N_q} AP(q_j)}{N_q}$$
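Both metrics can be computed directly from a query-to-gallery distance matrix; the sketch below is a simplified single-shot version that omits the same-camera filtering some benchmarks apply.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """dist: (num_query, num_gallery) distance matrix, smaller means more similar."""
    rank1_hits, average_precisions = 0, []
    for i, q_id in enumerate(query_ids):
        order = np.argsort(dist[i])                    # gallery indices sorted by similarity
        matches = gallery_ids[order] == q_id           # True where the retrieved ID is correct
        rank1_hits += int(matches[0])                  # top-ranked image has the right ID
        hit_positions = np.flatnonzero(matches)
        # AP: mean precision at each position where a correct image is retrieved.
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        average_precisions.append(np.mean(precisions) if precisions else 0.0)
    return rank1_hits / len(query_ids), float(np.mean(average_precisions))
```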

5. Results and Analysis

5.1. Ablation Experiments

To comprehensively evaluate the impact of MU-GCN on pedestrian re-identification, this paper conducts ablation experiments on the self-constructed dataset using both ResNet50 and Transformer networks. Additionally, using the Transformer network as a baseline, the study further assesses the effect of the multi-scale graph convolutional network combined with the non-local attention mechanism on both the self-constructed dataset and the PRAI-1581 dataset.
The ablation experiments using the ResNet50 network on the self-constructed dataset are presented in Table 1. These experiments separately evaluate the effectiveness of the multi-scale fusion, GCN, and MU-GCN modules.
The inclusion of the multi-scale fusion module results in a 2.2% improvement in mAP. This improvement demonstrates that using convolution kernels of varying scales enables the model to establish stronger correlations between local details and global features. Consequently, the model achieves improved robustness when handling pose variations. In contrast, introducing GCN alone leads to a smaller improvement in both mAP and Rank-1 accuracy. The results are slightly inferior to those achieved with the multi-scale fusion module. This suggests that while GCN enhances feature representation, its ability to capture pedestrian details is more dependent on the quality of the initial features. This dependency contrasts with the multi-scale fusion approach.
However, when both the multi-scale fusion and GCN are combined within the MU-GCN module, the two modules complement each other. This synergy leads to a 6.9% increase in mAP and a 3.8% improvement in Rank-1 accuracy. As a result, the model’s performance is significantly enhanced, especially in handling complex pedestrian poses.
In this study, the TransReID network was used as the baseline model to assess the impact of the MU-GCN and the non-local attention mechanism on model performance. The effects of the individual modules, built on the Transformer network, on both the self-constructed dataset and the PRAI-1581 dataset are shown in Table 2. Notably, integrating the multi-scale graph convolution network on the self-constructed dataset results in significant performance improvements. Specifically, Rank-1 accuracy and mAP increase by 3.4% and 6.5%, respectively. This demonstrates a substantial enhancement in the model’s ability to capture detailed pedestrian features across multiple scales. The improvement is particularly evident in scenarios involving complex pedestrian behavior.
Adding the non-local attention mechanism enhances the model’s ability to capture global information on the PRAI-1581 dataset. This is particularly effective when the model faces large changes in view angle and when pedestrian features are scattered across different regions of the image, as the integration of global information is significantly improved. With this module alone, mAP and Rank-1 accuracy reach 53.7% and 62.9%, respectively (Table 2). Additionally, the model’s robustness to changes in pedestrian view angle is markedly enhanced.
Finally, the highest accuracy is achieved when both MU-GCN and the non-local attention mechanism are integrated. In this configuration, Rank-1 and mAP are improved by 4.9% and 9.5%, respectively, on the self-constructed dataset. By combining these two mechanisms, the model effectively enhances the extraction of both local and global features. This results in substantial performance improvements in pedestrian re-identification, particularly for complex behaviors in multi-view UAV imagery. Meanwhile, Table 2 compares the performance of different models in terms of computational complexity (FLOPS) and the number of parameters. The baseline model, used as a benchmark, has lower FLOPS and fewer parameters. After introducing MU-GCN, the FLOPS and number of parameters increase to 141.6 GFLOPS and 108.8 M, respectively. However, the model performance is significantly improved. When MU-GCN is combined with the non-local attention mechanism, the FLOPS and number of parameters further increase to 143.2 GFLOPS and 110.1 M, respectively. Despite the increased computational complexity, this combination achieves the best performance. This demonstrates that the synergy between multiple modules enhances the model’s feature learning capability and identification accuracy.

5.2. Comparison Experiments

To comprehensively evaluate the effectiveness of the proposed MNTReID model, it is compared with several state-of-the-art pedestrian re-identification algorithms, including OSNet [14], ABDNet [15], AlignedReID [13], Cluster Contrast REID [33], and SVDNet [12]. These algorithms are evaluated on both the PRAI-1581 dataset and the dataset presented in this paper. The results of these comparisons are presented in Table 3 and Table 4.
As shown in the experimental results in Table 3 and Table 4, MNTReID achieves an mAP of 58.5% and a Rank-1 accuracy of 70.9% on the PRAI-1581 dataset. On the dataset presented in this paper, MNTReID achieves an mAP of 88.9% and a Rank-1 accuracy of 94.8%, outperforming existing models. Specifically, MNTReID improves Rank-1 by 6.2% on this study’s dataset and by 16.5% on the PRAI-1581 dataset compared to OSNet. While OSNet effectively captures multi-scale global features through full-scale feature learning and demonstrates strong global feature extraction capabilities, MNTReID not only optimizes the extraction of global features but also enhances the processing of local information through MU-GCN. This combination results in improved performance in aerial imagery.
Although AlignedReID excels in processing local pedestrian features, its ability to integrate global information is limited. In contrast, MNTReID achieves notable improvements in mAP over it, with increases of 10.8% on this paper’s dataset and 20.9% on the PRAI-1581 dataset. This superior performance is attributed to the integration of the non-local attention module, which substantially enhances robustness in complex scenarios. Compared to TransREID, MNTReID demonstrates significant performance enhancements, achieving improvements of 24.7% in Rank-1 accuracy and 18.8% in mAP on the PRAI-1581 dataset. This improvement is attributed to the synergistic combination of MU-GCN and the non-local attention module, which enhances the model’s capability to handle multi-view and complex pedestrian behaviors with higher robustness. By effectively balancing long-range dependencies and local features, overall identification performance is improved.
On the PRAI-1581 dataset, while MNTReID achieves a slightly lower mAP compared to LTReID, with a difference of 0.2%, it demonstrates a notable improvement of 4.6% in Rank-1 accuracy. This enhancement is attributed to the integration of the MU-GCN and the non-local attention mechanism, which effectively strengthens the model’s overall performance. LTReID uses a multi-head multi-attention mechanism to extract features from different regions of the pedestrian image from a global perspective, which provides a slight advantage in overall matching accuracy. However, MNTReID performs better in complex scenes, particularly in multi-view conditions, making it more suitable for practical applications.

5.3. Robustness Experiments

To thoroughly evaluate the model’s performance under different environmental conditions, this study conducts simulation experiments on the self-constructed dataset. These experiments cover low-light, low-resolution, and rainy and foggy weather scenarios, with the aim of validating the robustness of MNTReID in complex and suboptimal environments. The simulated scenarios are as follows: in low-light conditions, pedestrian images are simulated by adjusting image brightness and introducing noise; in low-resolution conditions, the original images are downsampled to simulate low-resolution capture; for rainy and foggy weather, rain effects are added to the images. Table 5 presents the performance of MNTReID in these environments. The experimental results demonstrate that MNTReID exhibits strong robustness under these challenging conditions, particularly in rainy and foggy weather and low-resolution environments, where its performance remains at a high level. Under low-resolution conditions, the mAP of MNTReID is 84.7%, only slightly lower than the 88.9% achieved under normal weather, indicating strong anti-interference capability on low-resolution images. In rainy and foggy weather, MNTReID combined with the non-local attention mechanism effectively integrates global contextual information, reducing environmental interference and maintaining high identification accuracy with a Rank-1 of 92.4%. Under low-light conditions, the mAP of MNTReID is 81.3%, 7.6% lower than under normal weather, yet it still maintains excellent identification accuracy. Overall, the experimental results confirm that MNTReID performs robustly across various challenging environments, demonstrating its effectiveness in pedestrian re-identification tasks under diverse and suboptimal conditions.

5.4. Visualization Results

To intuitively demonstrate the improvements achieved by the proposed model, visualization experiments were conducted on the dataset used in this study. The benchmark network and MNTReID were selected for comparative visualization.
In the visualization results, the first image represents the target pedestrian to be queried. The subsequent ten images are arranged in descending order of matching accuracy with the target pedestrian, displaying the top ten query results. Misidentifications are indicated by red boxes.
As shown in Figure 8, the first set of pedestrians exhibits dynamic posture changes across multiple viewpoints. The baseline model relies solely on fixed-scale feature extraction, which fails to capture these dynamic changes in pedestrian behavior. This limitation is particularly evident in the eighth and ninth images. Here, the model’s lack of sensitivity to local pedestrian features leads to misidentifications. In contrast, MNTReID more effectively captures local features through MU-GCN. Additionally, it integrates global contextual information via the non-local attention mechanism. This integration enables accurate identification of target pedestrians even under dynamic behavior conditions.
In the second group, where pedestrians exhibit similar clothing, the baseline model struggles to differentiate the subtle variations between the garments, leading to misidentification. However, MNTReID outperforms the baseline model by capturing the fine-grained differences in clothing through multi-scale feature extraction, effectively avoiding such errors.
For the third group, which involves pedestrians with complex behaviors, the baseline model performs reasonably well, accurately identifying pedestrians even when significant posture changes occur. However, it misidentifies pedestrians in the seventh and ninth frames, where subtle changes in posture and slight clothing differences appear. MNTReID, on the other hand, effectively integrates global contextual information via the non-local attention mechanism, which helps mitigate the interference of environmental factors such as lighting variations. Additionally, by leveraging MU-GCN, MNTReID excels in accurately extracting features in scenarios involving complex pedestrian behavior, ensuring highly accurate identification despite the challenges presented by dynamic environments.
The fourth set of images involves shadows combined with complex pedestrian behavior. Here again, the baseline model misidentifies pedestrians in frames where subtle posture changes and slight clothing differences appear, whereas MNTReID, by integrating global contextual information through the non-local attention mechanism and extracting local details through MU-GCN, maintains accurate identification despite the shadow interference.
In the fifth group of images, the pedestrian is shown performing the action of leaning over a railing. The baseline model struggles to capture detailed information effectively, resulting in misidentification in the ninth image. In contrast, MNTReID, enhanced with the MU-GCN, successfully captures local features. This enables accurate re-identification, even when the pedestrian is engaged in complex actions.
In the sixth set of images, when the pedestrian is climbing over the railing, the baseline model struggles to capture this complex action, leading to misidentification in the sixth image. In contrast, the MNTReID model introduces MU-GCN, which effectively combines the pedestrian’s gesture information. By no longer relying solely on color features, MNTReID is able to more accurately identify the pedestrian.
In the seventh set of images, the pedestrian performs the complex action of hanging from the railing. This complexity makes it difficult for the baseline model to effectively capture detailed and contextual information, resulting in misidentification in the ninth image. In contrast, MNTReID effectively integrates global contextual information by introducing a non-local attention mechanism. This enhancement enables accurate identification even when the pedestrian’s movements change.
The visualization results demonstrate that MNTReID effectively supports pedestrian re-identification under complex behaviors, particularly in response to safety events such as crossing railings or falling. This is achieved through the integration of the multi-scale graph convolutional network and the non-local attention mechanism, which enables the model to handle pedestrian re-identification across multiple UAV views with enhanced accuracy and robustness.

6. Conclusions

The MNTReID model proposed in this paper demonstrates significant effectiveness in pedestrian re-identification, particularly for complex behaviors observed from UAVs in multi-view scenarios. By combining the MU-GCN with a non-local attention mechanism, the model enhances both local and global feature extraction. The multi-scale graph convolutional network captures detailed pedestrian features at different scales, while the non-local attention mechanism incorporates global context and reduces the impact of background noise. This ensures high identification accuracy even in dynamic and varied multi-view environments. Experimental results show that the mAP and Rank-1 metrics of MNTReID outperform other pedestrian re-identification methods across different datasets, validating its effectiveness on UAV aerial images. Furthermore, experiments at different flight altitudes show that the model achieves relatively good identification performance when the flight altitude is below 40 m.
However, the model proposed in this paper has a large number of parameters, which makes direct deployment on UAVs challenging. Currently, the model relies on communication between the UAV and the ground station for real-time data transmission. Future research will focus on reducing the model size to enable more efficient real-time deployment. Additionally, the approach demonstrates significant potential for real-world applications, including UAV surveillance, intelligent transportation, and border patrol.

Author Contributions

L.S.: conceptualization, methodology, formal analysis, survey, resources, data management, validation, writing—original draft. X.J.: conceptualization, methodology, formal analysis, investigation, data management, experimental validation, writing—original draft, writing—review and editing. J.H.: supervision, methodology, survey, resources, validation. J.Y.: software, formal analysis, visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this paper can be obtained by contacting the authors of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
  2. Rybak, Ł.; Dudczyk, J. A geometrical divide of data particle in gravitational classification of moons and circles data sets. Entropy 2020, 22, 1088. [Google Scholar] [CrossRef] [PubMed]
  3. Rybak, Ł.; Dudczyk, J. Variant of data particle geometrical divide for imbalanced data sets classification by the example of occupancy detection. Appl. Sci. 2021, 11, 4970. [Google Scholar] [CrossRef]
  4. Ravindran, R.; Santora, M.J.; Jamali, M.M. Multi-object detection and tracking, based on DNN, for autonomous vehicles: A review. IEEE Sens. J. 2021, 21, 5668–5677. [Google Scholar] [CrossRef]
  5. Vitiello, F.; Causa, F.; Opromolla, R.; Fasano, G. Radar/visual fusion with fuse-before-track strategy for low altitude non-cooperative sense and avoid. Aerosp. Sci. Technol. 2024, 146, 108946. [Google Scholar] [CrossRef]
  6. Marques, T.; Carreira, S.; Miragaia, R.; Ramos, J.; Pereira, A. Applying deep learning to real-time UAV-based forest monitoring: Leveraging multi-sensor imagery for improved results. Expert Syst. Appl. 2024, 245, 123107. [Google Scholar] [CrossRef]
  7. Golcarenarenji, G.; Martinez-Alpiste, I.; Wang, Q.; Alcaraz-Calero, J.M. Illumination-aware image fusion for around-the-clock human detection in adverse environments from unmanned aerial vehicle. Expert Syst. Appl. 2022, 204, 117413. [Google Scholar] [CrossRef]
  8. Liu, F.; Shan, J.Y.; Xiong, B.Y.; Fang, Z. A real-time and multi-sensor-based landing area recognition system for UAVs. Drones 2022, 6, 118. [Google Scholar] [CrossRef]
  9. Liang, B.; Su, J.; Feng, K.; Liu, Y.; Hou, W. Cross-layer triple-branch parallel fusion network for small object detection in uav images. IEEE Access 2023, 11, 39738–39750. [Google Scholar] [CrossRef]
  10. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. Sod-yolo: Small-object-detection algorithm based on improved yolov8 for UAV images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
  11. Yue, M.; Zhang, L.; Huang, J.; Zhang, H. Lightweight and efficient tiny-object detection based on improved YOLOv8n for UAV aerial images. Drones 2024, 8, 276. [Google Scholar] [CrossRef]
  12. Sun, Y.; Zheng, L.; Deng, W.; Wang, S. Svdnet for pedestrian retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3800–3808. [Google Scholar]
  13. Zhang, X.; Luo, H.; Fan, X.; Xiang, W.; Sun, Y.; Xiao, Q.; Jiang, W.; Zhang, C.; Sun, J. Alignedreid: Surpassing human-level performance in person re-identification. arXiv 2017, arXiv:1711.08184. [Google Scholar]
  14. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar]
  15. Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. Abd-net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8351–8361. [Google Scholar]
  16. Zhu, C.; Zhou, W.; Ma, J. Person Re-Identification Network Based on Edge-Enhanced Feature Extraction and Inter-Part Relationship Modeling. Appl. Sci. 2024, 14, 8244. [Google Scholar] [CrossRef]
  17. An, F.; Wang, J.; Liu, R. Pedestrian Re-Identification Algorithm Based on Attention Pooling Saliency Region Detection and Matching. IEEE Trans. Comput. Soc. Syst. 2023, 11, 1149–1157. [Google Scholar] [CrossRef]
  18. Yun, X.; Ge, M.; Sun, Y.; Dong, K.; Hou, X. Margin CosReid Network for Pedestrian Re-Identification. Appl. Sci. 2021, 11, 1775. [Google Scholar]
  19. Khaldi, K.; Mantini, P.; Shah, S.K. Unsupervised person re-identification based on skeleton joints using graph convolutional networks. In Proceedings of the International Conference on Image Analysis and Processing, Bologna, Italy, 6–10 September 2022; pp. 135–146. [Google Scholar]
  20. Dai, Z.; Wang, G.; Yuan, W.; Zhu, S.; Tan, P. Cluster contrast for unsupervised person re-identification. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 1142–1160. [Google Scholar]
  21. Wang, X.; Sun, Z.; Chehri, A.; Jeon, G.; Song, Y. Margin CosReid Network for Pedestrian Re-Identification. Pattern Recognit. 2024, 146, 110045. [Google Scholar] [CrossRef]
  22. Luo, J.; Liu, L. Improving unsupervised pedestrian re-identification with enhanced feature representation and robust clustering. IET Comput. Vis. 2024, 18, 1097–1111. [Google Scholar] [CrossRef]
  23. Lin, Y.; Dong, X.; Zheng, L.; Yan, Y.; Yang, Y. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8738–8745. [Google Scholar]
  24. Zeng, K.; Ning, M.; Wang, Y.; Guo, Y. Hierarchical clustering with hard-batch triplet loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13657–13665. [Google Scholar]
  25. Bai, Z.; Wang, Z.; Wang, J.; Hu, D.; Ding, E. Unsupervised multi-source domain adaptation for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12914–12923. [Google Scholar]
  26. An, F.P.; Liu, J.E. Pedestrian re-identification algorithm based on visual attention-positive sample generation network deep learning model. Inf. Fusion 2022, 86, 136–145. [Google Scholar] [CrossRef]
  27. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  28. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15013–15022. [Google Scholar]
  29. Xu, L.; Peng, H.; Lu, X.; Xia, D. Meta-transfer learning for person re-identification in aerial imagery. In Proceedings of the CCF Conference on Computer Supported Cooperative Work and Social Computing, Taiyuan, China, 23–25 November 2022; pp. 634–644. [Google Scholar]
  30. Xu, L.; Peng, H.; Lu, X.; Xia, D. Learning to generalize aerial person re-identification using the meta-transfer method. Concurr. Comput. Pract. Exp. 2023, 35, e7687. [Google Scholar] [CrossRef]
  31. Zhang, S.; Zhang, Q.; Yang, Y.; Wei, X.; Wang, P.; Jiao, B.; Zhang, Y. Person re-identification in aerial imagery. IEEE Trans. Multimed. 2020, 23, 281–291. [Google Scholar] [CrossRef]
  32. Lu, Z.; Chen, H.; Lai, J.H.; Jiao, B.; Zhang, Y. Region Aware Transformer with Intra-Class Compact for Unsupervised Aerial Person Re-identification. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Wulumuqi, China, 18–20 October 2024; pp. 243–257. [Google Scholar]
  33. Khaldi, K.; Nguyen, V.D.; Mantini, P. Unsupervised person re-identification in aerial imagery. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 260–269. [Google Scholar]
  34. Khaldi, K.; Nguyen, V.D.; Mantini, P.; Shah, S. Adapted deep feature fusion for person re-identification in aerial images. In Autonomous Systems: Sensors, Vehicles, Security, and the Internet of Everything; SPIE: Orlando, FL, USA, 2018; pp. 128–133. [Google Scholar]
  35. Huang, M.; Hou, C.; Zheng, X.; Wang, Z. Multi-resolution feature perception network for UAV person re-identification. Multimed. Tools Appl. 2024, 83, 62559–62580. [Google Scholar] [CrossRef]
  36. Grigorev, A.; Tian, Z.; Rho, S.; Xiong, J.; Liu, S.; Jiang, F. Deep person re-identification in UAV images. EURASIP J. Adv. Signal Process. 2019, 2019, 1–10. [Google Scholar] [CrossRef]
  37. Zhang, S.; Li, Y.; Wu, X.; Chu, Z.; Li, L. MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery. Remote Sens. 2024, 16, 1216. [Google Scholar] [CrossRef]
  38. Hu, H.F.; Ni, Z.Y.; Zhao, H.T. Transformer based light weight person re-identification in unmanned aerial vehicle images. J. Nanjing Univ. Posts Telecommun. 2024, 44, 48–62. [Google Scholar]
  39. Peng, H.; Lu, X.; Xu, L.; Xia, D.; Xie, X. Parameter instance learning with enhanced vision transformers for aerial person re-identification. Concurr. Comput. Pract. Exp. 2024, 36, e8045. [Google Scholar] [CrossRef]
  40. Xu, S.; Luo, L.; Hong, H.; Hu, J.; Yang, B.; Hu, S. Multi-granularity attention in attention for person re-identification in aerial images. Vis. Comput. 2024, 40, 4149–4166. [Google Scholar] [CrossRef]
Figure 1. General block diagram of MNTReID network.
Figure 2. Multi-scale graph convolution network (MU-GCN) module.
Figure 3. Multi-scale fusion module.
Figure 4. Single-layer graph convolution.
Figure 5. Non-local attention mechanism module.
Figure 6. Partial aerial image data.
Figure 7. Selected sample images from this study’s self-constructed dataset.
Figure 8. The visualization results of our proposed MNTReID method are compared with the baseline model on the dataset of this paper. In the example images, the red boxes highlight the identification errors made by the baseline model during the query matching process.
Table 1. Ablation experiments in ResNet50 on the dataset of this paper.

Method | mAP/% | Rank-1/% | Rank-10/%
ResNet50 | 73.2 | 86.0 | 94.3
+Multi-scale Fusion | 75.4 | 88.5 | 95.7
+GCN | 74.3 | 87.3 | 95.3
+MU-GCN | 80.1 | 89.8 | 96.6
Table 2. Performance comparison of different methods on the dataset of this paper and the PRAI-1581 dataset.

Method | FLOPS/G | Parameter/M | Self-Constructed Dataset mAP/% | Rank-1/% | Rank-10/% | PRAI-1581 mAP/% | Rank-1/% | Rank-10/%
Baseline | 101.2 | 96.8 | 79.4 | 89.9 | 95.7 | 50.7 | 56.2 | 67.1
+Multi-scale Fusion | 124.6 | 102.6 | 84.1 | 92.9 | 97.0 | 52.4 | 60.1 | 67.4
+GCN | 134.3 | 105.8 | 83.4 | 90.9 | 97.2 | 51.6 | 61.8 | 63.4
+MU-GCN | 141.6 | 108.8 | 85.9 | 93.3 | 96.8 | 55.8 | 63.8 | 74.6
+Non-local attention mechanism | 137.5 | 106.5 | 82.3 | 91.2 | 95.9 | 53.7 | 62.9 | 71.8
MNTReID | 143.2 | 110.1 | 88.9 | 94.8 | 97.4 | 58.5 | 70.9 | 83.2
Table 3. Comparison of different methods on the PRAI-1581 dataset.

Method | mAP/% | Rank-1/%
OSNet [14] | 42.1 | 54.4
SVDNet [12] | 36.7 | 46.1
AlignedReID [13] | 37.6 | 48.5
Pretrained ViT [39] | 57.3 | 65.3
Cluster Contrast REID [33] | 21.8 | 23.5
LTReID [38] | 58.7 | 66.3
Meta-Learning [30] | 38.1 | 64.9
GCCReID [29] | 25.5 | 31.3
Subspace Pooling [36] | 39.6 | 49.8
MGAiA [40] | 42.7 | 55.3
MNTReID | 58.5 | 70.9
Table 4. Comparison of different methods on the dataset of this paper.

Method | mAP/% | Rank-1/%
OSNet [14] | 79.7 | 88.6
ABDNet [15] | 81.9 | 90.5
AlignedReID [13] | 78.1 | 88.3
TransREID [28] | 79.4 | 89.9
Cluster Contrast REID [33] | 52.1 | 49.6
MNTReID | 88.9 | 94.8
Table 5. The performance of the proposed method under various environmental conditions on the dataset of this paper.

Environmental Condition | mAP/% | Rank-1/% | Rank-10/%
Low light | 81.3 | 89.2 | 96.1
Low resolution | 84.7 | 92.8 | 95.2
Foggy and rainy weather | 84.1 | 92.4 | 96.6
Normal weather | 88.9 | 94.8 | 97.4

