Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline

Shang, Linzhi; Min, Chen; Wang, Juan; Xiao, Liang; Zhao, Dawei; Nie, Yiming

doi:10.3390/rs17152653

Open AccessArticle

Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline

by

Linzhi Shang

¹

,

Chen Min

²

,

Juan Wang

³,

Liang Xiao

¹

,

Dawei Zhao

^1,* and

Yiming Nie

¹

Defense Innovation Institute, Beijing 100071, China

²

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

³

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(15), 2653; https://doi.org/10.3390/rs17152653

Submission received: 31 May 2025 / Revised: 25 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025

(This article belongs to the Special Issue Machine Learning for Intelligent Processing and Applications of Multi-Source Remote Sensing Data)

Download

Browse Figures

Versions Notes

Abstract

Vehicle re-identification (Re-ID) is a critical computer vision task that aims to match the same vehicle across spatially distributed cameras, especially in the context of remote sensing imagery. While prior research has primarily focused on Re-ID using remote sensing images captured from similar, typically elevated viewpoints, these settings do not fully reflect complex aerial-ground collaborative remote sensing scenarios. In this work, we introduce a novel and challenging task: aerial-ground cross-view vehicle Re-ID, which involves retrieving vehicles in ground-view image galleries using query images captured from aerial (top-down) perspectives. This task is increasingly relevant due to the integration of drone-based surveillance and ground-level monitoring in multi-source remote sensing systems, yet it poses substantial challenges due to significant appearance variations between aerial and ground views. To support this task, we present AGID (Aerial-Ground Vehicle Re-Identification), the first benchmark dataset specifically designed for aerial-ground cross-view vehicle Re-ID. AGID comprises 20,785 remote sensing images of 834 vehicle identities, collected using drones and fixed ground cameras. We further propose a novel method, Enhanced Self-Correlation Feature Computation (ESFC), which enhances spatial relationships between semantically similar regions and incorporates shape information to improve feature discrimination. Extensive experiments on the AGID dataset and three widely used vehicle Re-ID benchmarks validate the effectiveness of our method, which achieves a Rank-1 accuracy of 69.0% on AGID, surpassing state-of-the-art approaches by 2.1%.

Keywords:

aerial-ground; cross-view; remote sensing; vehicle re-identification

1. Introduction

Intelligent visual surveillance has recently attracted significant attention in broad real-world applications, particularly in smart remote sensing systems [1,2]. Among these, vehicle re-identification (Re-ID) is a key task [3]. Vehicle Re-ID aims to locate images of the same vehicle instance across non-overlapping camera views, typically based on instance-level ID annotations rather than category labels [4,5]. This task has garnered considerable attention and has made significant advancements in recent years [6]. Current vehicle Re-ID research predominantly concentrates on data collected from stationary cameras positioned on urban roads, where camera perspectives are relatively consistent [7]. Although progress has been made in vehicle Re-ID in intelligent transportation in recent years [8,9], there is a lack of benchmark datasets and baselines for aerial-ground collaborative vehicle Re-ID in remote sensing.

In this work, we introduce a novel vehicle Re-ID task called aerial-ground cross-view vehicle Re-ID, which aims to locate the same vehicle in ground-view images, given query images in aerial perspective. This task is highly significant, as surveillance cameras are frequently installed on highways for security purposes, capturing target vehicles from a top-down perspective, while a large number of target vehicle images may be captured through cameras either held by pedestrians or mounted in vehicles. Aerial-ground cross-view vehicle Re-ID enhances the capability of user-end cameras to perform collaborative tasks.

Despite the growing demand for aerial-ground cross-view vehicle Re-ID applications, no benchmark dataset has yet been constructed for this domain. To stimulate the advancement of aerial-ground cross-view vehicle Re-ID, we introduce a novel dataset, named AGID (Aerial-Ground Vehicle Re-Identification), comprising 20,785 remote sensing images spanning 834 identities. This dataset is divided into a training set and a testing set. The training set includes 10,886 images with 491 identities captured from the aerial and ground viewpoints distinctly. The testing set includes a query set from the aerial view and a gallery set from the ground view.

Aerial-ground cross-view vehicle Re-ID presents distinct challenges compared to typical Re-ID tasks [10,11,12,13]. Figure 1 illustrates that the widely utilized VehicleID dataset primarily comprises images taken from the front and rear of vehicles [14]. Conversely, the VeRi dataset exhibits a broader range of perspectives [15]. Nonetheless, both datasets are composed of images captured from an overhead perspective typical of road monitoring and usually only enable visibility of the vehicle’s roof. In contrast, the proposed AGID dataset consists of aerial-view images, where all objects are captured from top-down by drones, leading to significant visual differences compared to ground-level images. For example, in aerial images, most of the vehicle’s roof and 1-2 sides are visible, while in ground images, usually 1-2 sides are seen, and the roof is rarely visible. Moreover, ground-view images often face occlusions from pedestrians and obstacles. The distinctly different viewing perspectives pose new challenges in aerial-ground cross-view vehicle Re-ID.

This task is related to cross-view image matching, particularly aerial-ground geo-localization, where the severe differences in viewpoints present significant challenges. Lin et al. [16] and Liu et al. [17] addressed scene-level alignment between aerial and ground views through deep feature learning and orientation-aware modeling. Although their approaches primarily target scene-level geo-localization, the strategies they employ offer valuable insights for handling extreme perspective discrepancies. In addition to vision-based approaches, fine-grained identification tasks have also been studied in other sensing modalities. For instance, in the field of radar signal recognition, Dudczyk et al. [18] proposed a data particle geometrical divide algorithm to improve specific emitter identification, enabling more accurate recognition of individual radar sources. These developments demonstrate the broader relevance of individual-level recognition techniques across different sensing domains and motivate further research into cross-modality Re-ID. In contrast, our work focuses on fine-grained, instance-level matching—an aspect that remains relatively underexplored in the current literature. While existing Re-ID methods can address multi-view scenarios, they are predominantly trained on images captured from cameras positioned at similar heights [12,19,20,21]. Moreover, recent methods in vehicle Re-ID often employ multi-layer transformer architectures to learn global representations. However, they frequently overlook shape-based cues, which are essential for differentiating vehicle instances across dramatically varying perspectives [11,22,23]. In typical multi-layer transformer architectures, shallow layers tend to capture low-level features such as edges and textures, whereas deeper layers encode high-level semantics and global context, enabling the modeling of long-range dependencies [24]. Despite this hierarchical capability, many transformer-based Re-ID approaches [25,26] rely exclusively on the final-layer

[c l s]

token for global feature representation. This practice neglects the rich local information embedded in earlier layers and the potentially informative outputs from other spatial positions within the feature map.

Inspired by these observations, we introduce a novel enhanced self-correlation feature computation method specifically designed to address the challenges of aerial-ground cross-view vehicle Re-ID. Our proposed self-correlation feature computation method extracts self-correlation features from model outputs across different layers, enriching the original Transformer outputs with essential shape information. Furthermore, we introduce a multi-scale convolutional enhancement module that captures multi-level local information through convolutions of varying sizes, strengthening the deep features used in self-correlation computation. The enhanced self-correlation features establish stronger spatial relationships between patches with similar semantic contexts, providing additional shape information to the global image features. This ultimately results in features better suited for aerial-ground cross-view vehicle Re-ID tasks, improving the model performance.

In summary, this work presents four key contributions:

(1) We introduce a new and essential vehicle Re-ID task, aerial-ground cross-view vehicle Re-ID, which aims to re-identify vehicles in ground-view cameras using query images captured from an elevated perspective. This task has important applications across various domains but presents unique challenges to existing Re-ID models due to the large differences between vehicles seen in aerial and ground views.

(2) To foster and drive progress in this field, we created the first aerial-ground cross-view remote sensing vehicle Re-ID dataset. The AGID dataset includes 20,785 images of 834 identities, captured by drones and ground-view cameras.

(3) We propose a novel enhanced self-correlation feature computation method, which incorporates shape information into a transformer-based method through enhanced self-correlation features, obtaining image features better suited for aerial-ground cross-view vehicle Re-ID.

(4) Through large-scale empirical evaluations, we benchmarked the performance of our model and six state-of-the-art Re-ID models on AGID, providing comprehensive performance baselines. We also conducted experiments on three widely used Re-ID datasets, demonstrating the effectiveness of our method.

2. Related Work

2.1. Vehicle Re-Identification

Current vehicle Re-ID methods primarily focus on general Re-ID tasks, particularly the Re-ID of vehicles in urban road surveillance images. They can mainly be divided into global feature-based methods [8,20,27], path-based methods [28,29], view-based methods [9,30], and local information enhancement methods [3,23]. Approaches based on global features first extract features and then process them using specific metric learning techniques to produce the final predictions. Lou et al. [12] designed a model based on Generative Adversarial Networks (GANs) [31] tailored for vehicle Re-ID. Shen et al. [20] introduced the Hybrid Pyramid Graph Network (HPGN), constructed from ResNet-50 [32] and Pyramid Graph Network (PGN) [33]. Rao et al. [8] proposed a Counterfactual Attention Learning (CAL) method to enhance attention learning based on causal reasoning. He et al. [22] applied the Vision Transformer (ViT) [34], which employs a multi-head self-attention mechanism, to Re-ID tasks. In addition, Shen et al. [27] proposed an efficient multiresolution network (EMRN) for vehicle Re-ID, which allows the model to extract fixed-dimensional features from images at different resolutions. Path-based methods generally refine retrieval results by using spatiotemporal information to eliminate unreasonable vehicles during the inference phase, following feature extraction. Shen et al. [28] proposed a two-stage method that outputs similarity scores between two query images. Wang et al. [29] introduced a spatiotemporal regularization module, modeling spatiotemporal constraints using a log-normal distribution and refining retrieval results. View-based methods aim to handle viewpoint variations and learn multi-view features through metric learning. These methods often generate other view images from a single input view or judge and align the viewpoint of input images for more robust training. Zhou et al. [30] proposed a Cross-View Generative Adversarial Network (XVGAN) that generates cross-view vehicle images from input images, combining features of the original image with those of generated views. zhang et al. [9] introduced a method that aligns and distinguishes key features by learning higher-order relationships and topological information. Local information enhancement methods typically increase inter-class differences in vehicle Re-ID by leveraging stable discriminative cues such as vehicle key points and parts. He et al. [35] proposed a part regularization method that integrates local and non-local features into a unified architecture. Khorramshahi et al. [36] designed an autoencoder that generates vehicle image templates without manufacturer logos or wheel patterns, constructing residual images by obtaining pixel-level differences from the original images. Zhao et al. [37] introduced a Heterogeneous Relation Complementary Network (HRCN), which treats region-specific features and cross-level features as complements to the original high-level features, embedding these heterogeneous features into a unified high-dimensional space via a graph-based relational module to construct more robust feature representations. Shen et al. [1] proposed an adaptively attention-driven cascade part-based graph embedding framework that effectively fuses node features with topological information on multi-scale structured part graphs.

Despite the significant progress made by these methods in vehicle Re-ID, they do not account for the difference between aerial and ground-view vehicle images, unlike our approach. This limitation leads to a considerable performance drop in aerial-ground cross-view vehicle Re-ID tasks.

2.2. Vehicle Re-ID Datasets

Recent vehicle Re-ID methods are primarily evaluated on two public datasets, VehicleID [14] and VeRi [15,38,39]. Although significant progress has been made on these datasets, real-world vehicle Re-ID challenges remain unresolved. VehicleID does not adequately account for real-world challenges, as it only includes very limited viewpoints (front and rear views). While the images in VeRi were captured from 18 cameras along a 1.0 square kilometer circular road, offering relatively diverse viewpoints, they still mostly consist of top-down perspectives. Other public datasets, such as VERI-Wild [12], have a larger scale, with vehicle images collected from a network of 174 cameras covering an extensive urban area. Additionally, the DN-Wild [13] and DN348 [13] datasets are used to study cross-domain Re-ID tasks under day-night traffic conditions. However, these datasets primarily focus on the general vehicle Re-ID problem with images captured from road surveillance cameras, where camera positions are typically high and fixed, resulting in mainly top-down viewpoints. Although VRU [2] dataset was collected using unmanned aerial vehicles, it still lacks consideration of vehicle images captured from horizontal (ground-level) viewpoints. To the best of our knowledge, there is currently no dataset that combines both top-down and horizontal views for vehicle Re-ID.

3. Cross View Dataset: AGID

To address the limitations of existing vehicle Re-ID datasets, which primarily consist of urban road surveillance data with uniform collection scenarios and fixed camera perspectives, we collected synchronized multi-platform data from both aerial and ground perspectives. The remote sensing images captured from different viewpoints were systematically organized to construct AGID, a dataset tailored for aerial-ground cross-view vehicle Re-ID. The data were collected from two distinct environments: urban and rural settings.

The data collection process includes 80 real-world urban or rural scene segments. Examples of the data collection scenes are shown in Figure 2, each containing image data captured by one aerial camera and two ground cameras (mounted on two separate vehicles) during a corresponding time frame. Each segment lasts 40 s with a frame rate of 10 Hz.

The collected data were manually annotated and processed, with privacy-sensitive elements such as license plates and faces blurred to protect privacy. The dataset was then divided into training and testing subsets. The training set includes remote sensing images captured by both aerial and ground cameras, while the testing set is further divided into a query set and a gallery set. The query set contains only aerial images captured by drones, whereas the gallery set comprises images taken by two ground cameras. The training set consists of 10,886 images from 491 identities, while the testing set includes 3946 query images from 245 identities and 5953 gallery images from 343 identities.

As shown in Table 1, compared to existing vehicle Re-ID datasets, AGID features a more diverse range of perspectives and scenes, including overhead perspectives from aerial and the ground view from ground cameras. It contains more uncertain factors, such as changes in illumination, variations in view angle and inconsistent resolution, resulting in increased recognition difficulty. Figure 3 shows example images of the same vehicle captured from different viewpoints within the dataset. The dataset includes data collected from both urban and rural environments, addressing the limitation of existing datasets, which often consist primarily of urban road surveillance data with a single scene type. Furthermore, when dividing the dataset, we separated the aerial and ground perspectives in the testing set, enabling this dataset to be suitable not only for conventional vehicle Re-ID tasks but also for aerial-ground cross-view Re-ID challenges.

4. The Proposed Method for AGID

The aerial-ground cross-view vehicle Re-ID task presents unique challenges, as aerial view provides limited appearance information and differs greatly from ground-view vehicles. As illustrated in Figure 1, these challenges stem from significant viewpoint differences, with ground view facing more real-world occlusions and camera inconsistencies. In response to the unique challenges of aerial-ground cross-view vehicle Re-ID, we propose a novel Enhanced Self-Correlation Feature Computation (ESFC) method. By computing enhanced self-correlation features, our method strengthens the spatial relationships between patches with similar semantic contexts [40] and incorporates additional shape information [41]. This approach complements global features, enabling the model to learn more discriminative representations for aerial-ground cross-view vehicle Re-ID.

4.1. Overview

We adopt a Transformer-based method [22] as our baseline. As shown in Figure 4, an image

x \in R^{H \times W \times C}

, where H, W, C represent the height, width, and channel count, is segmented into M fixed-size patches

{x_{p}^{i} | i = 1, 2, \dots, M}

. A sliding window method is employed to create patches. A supplementary learnable

[c l s]

embedding token, which matches the dimensions of the other patches, is denoted as

x_{cls}

and is added to the input sequence. Additionally, learnable positional embeddings

[p_{0}, p_{1}, . . ., p_{M}]

are included to capture spatial information. Learnable viewpoint embeddings v are introduced to capture aerial and ground viewpoint information, with these embeddings being shared across all patches within an image. The output from the

x_{cls}

token acts as the global feature representation

f_{g}

.

The sequence input to the Transformer layer can be expressed as:

X = [x_{cls} + p_{0} + v; H (x_{p}^{1}) + p_{1} + v; \dots; H (x_{p}^{M}) + p_{M} + v] .

(1)

Here,

X

represents the input sequence embeddings,

p_{i} (i \in {0, 1, \dots, M})

signifies the positional embedding, and v indicates whether the image is captured from an aerial or ground viewpoint. The function

H

performs a linear projection of the patches into D dimensions. We utilize l Transformer layers to learn feature representations.

The self-correlation feature computation method we propose can be inserted into the Transformer structure. It takes outputs from both high-level and low-level Transformer layers as inputs and utilizes the multi-scale convolutional enhancement module to obtain enhanced deep features. These enhanced deep features are then combined with low-level features to derive enhanced self-correlation features. The enhanced self-correlation features are summed with global features for classification and loss calculation.

4.2. Self-Correlation Feature Computation Method

For the aerial-ground cross-view vehicle Re-ID task, the incorporation of additional shape information is crucial. Relying solely on the global information from the final layer’s

[c l s]

token may result in the loss of essential details such as raw shapes. Therefore, we propose a Self-correlation Feature Computation (SFC) method to enhance shape information, thereby complementing the deep generalized and high-level information. As shown in Figure 5, we extract the shallow features

f_{s}

and deep features

f_{d}

from the outputs of the Transformer layers

l_{1}

and

l_{2}

(

l_{1} < l_{2}

), corresponding to the positional embeddings

[p_{1}, p_{2}, . . ., p_{M}]

:

f_{s} = {Transformer}_{l_{1}} (x + [p_{1}, p_{2}, \dots, p_{M}] + v),

(2)

f_{d} = {Transformer}_{l_{2}} (x + [p_{1}, p_{2}, \dots, p_{M}] + v),

(3)

here,

x + [p_{1}, p_{2}, \dots, p_{M}] + v

represents the addition of the input features, which include viewpoint embeddings, and the specific segment of the positional embeddings. It is important to note that although the input consists of

[p_{0}, p_{1}, \dots, p_{M}]

, we only make use of the portion from

p_{1}

to

p_{M}

.

{Transformer}_{l_{1}}

and

{Transformer}_{l_{2}}

denote Transformer layers that have passed through layers

l_{1}

and

l_{2}

, respectively.

The self-correlation features

f_{c o r r}

is then computed by performing matrix multiplication between all pairs of deep features

f_{d}

and shallow features

f_{s}

vectors, as given by the following formula:

f_{c o r r} = Softmax (\frac{1}{H W} \sum_{i = 1}^{H W} {(\frac{f_{s} \cdot f_{d}^{T}}{\sqrt{D}})}_{i}),

(4)

here,

f_{s}, f_{d} \in R^{D \times H W}

, where D represents the channel dimension and

H W

is the number of patches. The symbol T denotes the matrix transpose. The self-correlation features

f_{c o r r} \in R^{d}

are averaged across each patch, followed by normalization using the

S o f t m a x

function along the last dimension.

4.3. Multi-Scale Convolutional Enhancement Module

We propose a Multi-scale Convolution Enhancement (MCE) module that utilizes convolution to acquire additional local features, enriching the high-level features needed for calculating self-correlation features as shown in Figure 4. By employing convolution kernels of four different scales, we explore the high-level output features across multiple dimensions. The convolution features from these scales are concatenated with the pooled features to produce the concatenated feature

f_{concat}

. The

1 \times 1

convolution reduces dimensionality and introduces non-linearity, addressing channel information. The

2 \times 2

convolution focuses on capturing smaller local features, essential for identifying challenging samples in vehicle Re-ID through fine details. The

3 \times 3

convolution balances local and global feature extraction, capturing edges and textures, while the

4 \times 4

convolution targets larger local regions for recognizing complex patterns. Average pooling preserves important feature information while reducing the spatial dimensions of the features, enhancing flexibility. The concatenated feature

f_{concat}

can be formulated as:

f_{concat} = [f_{d} * K_{1 \times 1}, f_{d} * K_{2 \times 2}, f_{d} * K_{3 \times 3}, f_{d} * K_{4 \times 4}, Pool (f_{d})],

(5)

here,

* K_{n \times n} (n \in {1, 2, 3, 4})

represents the convolution performed with a

n \times n

convolution kernel. An additional

1 \times 1

convolution is applied to decrease the number of channels in the concatenated features. We combine the shallow features

f_{s}

with the multi-scale convolution and pooling feature

f_{concat}

, denoted as

f_{e}

, to augment the enhanced features with original and local information. The shallow features

f_{s}

also undergo

1 \times 1

convolution to reduce channel numbers, as their numerous channels may overshadow the relevance of high-level features and complicate computation. The enhanced feature

f_{e}

is expressed as:

f_{e} = [f_{concat} * K_{1 \times 1}, f_{s} * K_{1 \times 1}] .

(6)

In our proposed enhanced self-correlation feature computation method, the final enhanced self-correlation feature in Equation (4) is represented as:

f_{e_c o r r} = Softmax (\frac{1}{H W} \sum_{i = 1}^{H W} {(\frac{f_{s} \cdot f_{e}^{T}}{\sqrt{D}})}_{i}) .

(7)

The features extracted by the model are represented as:

f = f_{g} + λ \cdot f_{e_c o r r},

(8)

where

λ

is a tuning coefficient used to balance the influence of global features and self-correlation features.

4.4. Overall Training

The network is optimized by constructing ID loss and triplet loss. The ID loss

L_{i d}

is defined as the cross-entropy loss without label smoothing:

L_{i d} = \sum_{k = 1}^{N} - q_{k} l o g (p_{k}),

(9)

where

p_{k}

refers to the logits associated with the ID prediction for class k, and

q_{k}

is the value in the target distribution. The triplet loss

L_{t r i}

with soft-margin is expressed as follows:

L_{t r i} = l o g [1 + e x p (d_{p} - d_{n})],

(10)

where

d_{p}

and

d_{n}

denote the Euclidean distances between feature vectors for positive and negative pairs, respectively.

L = L_{i d} + L_{t r i} .

(11)

Our method is adaptable to other transformer-based architectures, offering shape information to derive more robust features. As depicted in Figure 6, it can be applied to the image encoder of CLIP-based model [42] without the need to modify the original loss function.

4.5. Computational Complexity

The time complexity of the self-correlation operation is

Θ (n^{2})

, and the space complexity is

Θ (n^{2})

. The overall computation also includes additional linear-cost components introduced by the convolutional enhancement module. Despite the presence of a quadratic term, the actual inference overhead remains modest, as confirmed by our empirical evaluation (see Section 5.6 and Table 2).

5. Experiments

5.1. Datasets

To establish performance benchmarks and assess the effectiveness of our model, we conducted a comprehensive empirical study utilizing the AGID dataset, evaluating our model alongside six advanced Re-ID models. Section 3 provides a detailed introduction to the AGID dataset. Our method was also validated on prominent general vehicle Re-ID datasets, VeRi [15], VehicleID [14] and VERI-Wild [12], to investigate its suitability within a general vehicle Re-ID framework.

VeRi [15]. The VeRi dataset contains diverse traffic scenes captured by 20 cameras within an urban area, spanning approximately one square kilometer with various perspectives. The dataset has a total of 49,357 images of 776 different vehicles. Each vehicle is captured by 2 to 18 cameras, resulting in images with different viewing perspectives, illumination, occlusion, and resolution. To facilitate evaluation, the dataset is split into a training set (37,778 images of 576 vehicles) and a test set (11,579 images of 200 vehicles). Within the test set, 1678 images of the 200 vehicles are designated as the detection set, with the remaining images serving as the candidate set.

VehicleID [14]. The VehicleID dataset contains 221,763 images of 26,267 vehicles captured in real-world traffic scenes using two cameras. Similar to VeRi, VehicleID provides various vehicle colors, types, lighting conditions, and scenes. The vehicle images in VehicleID are limited to only two views—front and back. The dataset is divided into 13,134 vehicles, with a total of 110,178 training images, and 111,585 test images which are divided into 4 subsets. The number of the test subsets is 3200, 2400, 1600, and 800. In each test subset, one image from each vehicle is chosen as the detection set, while the remaining images form the candidate sets. In line with the majority of current approaches, we employed a test set size of 800 to enable effective comparison of results.

VERI-Wild [12]. VERI-Wild is a large-scale vehicle Re-ID dataset captured over one month (30 × 24 h) using 174 surveillance cameras deployed across an urban area of more than 200 km²; under unconstrained conditions. The dataset includes images taken during both day and night. The training set consists of 277,797 images of 30,671 vehicle identities. Three testing subsets of different scales are provided: Test3000 contains 3000 probe images and 41,816 gallery images; Test5000 includes 5000 probe images and 69,389 gallery images; and Test10000 comprises 10,000 probe images and 138,517 gallery images.

5.2. Implementation Details

Our baseline utilized TransReID [22], including the Jigsaw Patch Module. In the experiments, stride values of 12 and 16 were tested. Vehicle images were resized to 256 × 256 pixels, with data augmentation applied via random horizontal flips, padding, cropping, and random erasing. The batch size was set to 64, with four images per ID. We used an SGD optimizer with a momentum of 0.9 and weight decay of 1e-4, and the learning rate was initialized at 0.008 with cosine decay. The ViT weights were pre-trained on ImageNet-21K and fine-tuned on ImageNet-1K. For the AGID dataset,

λ

was set to 0.3 with a stride of 12, and 0.8 with a stride of 16. For the VeRi dataset,

λ

was 0.3 for stride 12 and 0.5 for stride 16. For the VehicleID dataset,

λ

was 0.3 with a stride of 12 and 0.1 with a stride of 16. For all datasets,

l_{1}

is set to 3, and

l_{2}

is set to 10.

Additionally, we conducted experiments on AGID and two general Re-ID Datasets by incorporating enhanced self-correlation features into the Transformer (ViT-B/16) image encoder of the CLIP-ReID model [42]. In these experiments,

l_{1}

was set to 2,

l_{2}

to 8, and

λ

to 0.5.

Following [37], we employed widely used evaluation metrics, namely mean Average Precision (mAP) and Cumulative Match Characteristics (CMC) rank-1/5/10, as the assessment criteria in this study. For the VeRi dataset, we list mAP and rank-1 following [42], and for the VehicleID dataset, we report rank-1 and rank-5 for comparison.

5.3. Comparison on AGID Datasets

To conduct a comprehensive evaluation of the AGID dataset, we compared our proposed method with six advanced models: ResNet-50 [32], AGW [43], BoT [44], MGN [45], TransReID [22] and CLIP-ReID [42]. AGW enhances the performance of the ResNet-50 backbone through the incorporation of a Non-local Attention Block, Generalized-Mean Pooling, and Weighted Regularization Triplet Loss. BoT improves model performance by integrating six tricks into widely used baselines. MGN is characterized by a multi-branch deep network that aggregates discriminative information at various granularities. CLIP-ReID is a Re-ID approach that employs a vision-language model in a two-stage framework. TransReID, a transformer-based Re-ID method, serves as our baseline. The reported results are based on this released code, and all parameter settings for each method were consistent with those outlined in their respective publications. The evaluation results are presented in Table 3.

Our method consistently outperformed all six comparison models on the AGID dataset. We experimented with the enhanced self-correlation feature computation method under two configurations: stride-16, which divides images into non-overlapping patches, and stride-12, which employs a sliding window for overlapping [22]. The stride-12 configuration achieved a mAP of 51.4%, rank-1 of 65.2%, rank-5 of 79.0%, and rank-10 of 84.6%. Relative to the baseline, our approach improved mAP, rank-1, rank-5, and rank-10 by 1.1%, 2.3%, 0.9%, and 1.7%, respectively, demonstrating that the addition of self-correlation features enhances local information and feature effectiveness. In the stride-16 setting, we also observed improvements of 2.4% in mAP and 1.9% in rank-1, confirming our method’s effectiveness across both configurations. The overlapping patches in stride-12 enrich local adjacent structures, while our self-correlation features provide broader shape information, further enhancing model performance.

Additionally, by incorporating our proposed enhanced self-correlation features into the image encoder of the CLIP-ReID model, we significantly improved the experimental results. Specifically, the rank-1 score increased by 2.1%, reaching 69.0%, which is the best result. This demonstrates that our method can provide additional information in vision-language-based models for aerial-ground cross-view vehicle Re-ID tasks, leading to more effective feature extraction. These results highlight the strong generality and high adaptability of our approach in aerial-ground cross-view tasks.

The advantage of our method over existing approaches primarily stems from the proposed multi-scale convolution-enhanced self-correlation features, specifically designed to address the discrepancies between aerial view and ground view images. While the self-attention-based method TransReID (baseline), CLIP-ReID outperforms most other competitive methods, the incorporation of our self-correlation features significantly enhances experimental performance.

5.4. Comparison on General Re-ID Datasets

Our model was also benchmarked against various state-of-the-art Re-ID methods on two widely-used vehicle Re-ID datasets, VeRi and VehicleID, to further demonstrate its effectiveness. For comparison, baseline results for both stride 12 and stride 16 were provided. As shown in Table 4, our method achieved excellent performance, significantly outperforming previous methods on the VehicleID dataset. Based on TransReID, with the best results at stride 16, yielding a rank-1 accuracy of 85.8%. Compared to the baseline, our method improved rank-1 accuracy by 2.2% at stride 16. On the VeRi dataset, the best performance was achieved with stride 12, with a rank-1 accuracy of 97.7% and mAP of 82.1%. Although our mAP of TransReID based method is lower than CLIP-ReID, due to its use of a vision-language model that introduces extra semantic information, our method surpassed CLIP-ReID in rank-1, indicating its advantage in top-ranked retrieval. The superiority of our method demonstrates that incorporating enhanced self-correlation features significantly improves image feature representation and model performance.

In addition, we conducted experiments on the more challenging VERI-Wild dataset, with the results presented in Table 5. Due to the large scale of the dataset and limitations of the computational resources, we only evaluated the model under the stride-16 setting, which may have constrained its full performance potential. The experimental results demonstrate the effectiveness of the proposed method across different dataset scales. Compared with the baseline TransReID, our method achieves consistent improvements in mAP and Top-1 accuracy across the Small, Medium, and Large subsets. Notably, the mAP increases by 0.8%, 1.0%, and 1.1% on the Medium and Large sets, respectively, while maintaining comparable Top-5 performance. These results validate the effectiveness of the proposed ESFC in enhancing feature representation and improving retrieval accuracy under varying levels of difficulty.

Additionally, we integrated enhanced self-correlation features into the CLIP-ReID [42], achieving improvements over the original method across various metrics on the VehicleID dataset, demonstrating the effectiveness of our approach on general datasets. Although our method shows lower mAP performance on the VeRi dataset compared to the original approach, it achieves a boost in Rank-1 accuracy, indicating that our approach is more effective for top-ranked samples.

5.5. Ablation Studies and Analysis

5.5.1. Ablation Study of SFC and MCE

We performed an ablation study on the AGID dataset to investigate the contribution of each component of our model. Our framework incorporates a self-correlation feature computation method and a multi-scale convolutional enhancement module. To assess their significance, we progressively added the self-correlation feature and multi-scale convolution enhancement module to the baseline, labeled as Base, SFC, and MCE in Table 6. As illustrated in Figure 5, when we only introduced the self-correlation feature, we computed them using deep features

f_{d}

and shallow features

f_{s}

through Equation 4, without applying multi-scale convolutional enhancement module to enhance deep features. Comparing the Base + SFC to the Base, we observed substantial improvements, with mAP increasing by 0.9%, and both rank-1 and rank-5 increasing by 1.0% each. This demonstrates that adding the self-correlation feature to the baseline effectively complements additional information, improving model performance. After introducing multi-scale convolutional enhancement module to enhance the feature, the results improved further, with rank-1 and rank-5 increasing by 1.3% and 1.9%, respectively. This confirms that the multi-scale convolution enhancement module strengthens local information in the deep features, working in synergy with self-correlation feature computation to further enhance the model’s effectiveness.

5.5.2. Visualization of Enhanced Self-Correlation Features

We employed the visualization method from [41] to visualize a set of enhanced self-correlation features. The visualization results are shown in Figure 7. As observed, the enhanced self-correlation features effectively capture shape information from the input images, complementing the global features extracted by the baseline model. This enhancement improves the model’s ability to focus on the vehicle itself by reinforcing shape perception.

5.5.3. Visualization of Attention Maps

We utilized the Score-CAM [58,59] for attention visualizations, as depicted in Figure 8, where the baseline is used for comparison. In contrast to the baseline, which in some cases overly focuses on the surrounding environment or fails to adequately highlight the distinctive parts of the vehicle, our method effectively redirects scattered attention from the background to the vehicle itself, paying more attention to the vehicle regions that contribute to differentiation.

5.5.4. Ablation Study of $λ$

We analyzed the effect of the enhanced self-correlation feature weight

λ

on model performance in Figure 9. When

λ

= 0, the self-correlation feature is not used, which represents the baseline of this paper. The results show that the model achieves optimal performance when

λ

= 0.8, with mAP of 50.3% and Rank-1 of 63.7%. For the mAP metric, improvements were observed for all tested values of

λ

, indicating that the proposed enhanced self-correlation feature effectively complements image additional information and improves model performance. When

λ

= 0.5, the Rank-1 score decreased by 0.2%, but the mAP increased by 0.8%, suggesting that this proportion of feature integration is more beneficial to overall performance improvement.

5.5.5. Visualization of Rank List

Figure 10 provides a visualization of the ranking on the AGID dataset, highlighting the qualitative results achieved by our method. The results demonstrate that our approach more accurately matches identities between the query and the gallery, as evidenced by correct candidates being ranked within the top-k positions. In contrast, the baseline model frequently selects similar but incorrect samples as matches. Therefore, our method effectively identifies high-quality samples, significantly enhancing the model’s ability to distinguish between confusing identities.

5.6. Computational Efficiency Analysis

To assess the computational overhead introduced by the proposed ESFC, we compare the model complexity and inference time between our method and the baseline. As shown in Table 2, our method contains 119.2 million parameters, compared to 102.7 million in the baseline model. For inference speed, we measure the average processing time for a batch of 256 images on a single NVIDIA A6000 GPU. The results show that our method requires 1191.2 ms per batch, compared to 1109.8 ms for the baseline. These results demonstrate that the ESFC introduces only a moderate computational cost while bringing significant performance improvements in cross-view vehicle Re-ID.

6. Conclusions

In this paper, we introduce a novel and essential task in vehicle Re-ID: aerial-ground cross-view Re-ID. This task focuses on identifying individual vehicles from two distinct perspectives—namely, aerial and ground views. To advance research in this area, we present AGID, a large-scale aerial-ground cross-view remote sensing vehicle Re-ID dataset. Through extensive empirical evaluations, we benchmark a variety of state-of-the-art Re-ID models on AGID to establish performance baselines. The results demonstrate a substantial performance gap between aerial-ground cross-view vehicle Re-ID and traditional vehicle Re-ID datasets, which proves the challenge of aerial-ground cross-view vehicle Re-ID task. Furthermore, we propose a novel approach called the enhanced self-correlation feature computation method, which strengthens the spatial relationships between patches with similar semantic contexts. This method introduces additional shape information, thereby complementing existing transformer-based approaches. This method is specifically designed to address the challenges posed by aerial-ground cross-view vehicle Re-ID. Our model consistently outperforms all current SOTA methods by a large margin on the aerial-ground cross-view vehicle Re-ID dataset and achieves highly competitive performance on traditional Re-ID datasets as well.

While AGID already covers a wide range of locations and temporal conditions, we recognize that certain challenging scenarios—such as nighttime environments, adverse weather conditions (e.g., rain and fog), and explicit multi-angle views of vehicles—are not yet fully represented. We have discussed these limitations in detail and plan to address them in future versions of AGID. These enhancements will contribute to building a more comprehensive and robust benchmark for advancing aerial-ground cross-view vehicle Re-ID research.

Author Contributions

Conceptualization, methodology, and writing, L.S., D.Z., C.M. and Y.N.; software, validation, and data curation, J.W. and L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by projects of the Defense Innovation Institution.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shen, B.; Zhang, R.; Chen, H. An adaptively attention-driven cascade part-based graph embedding framework for UAV object re-identification. Remote Sens. 2022, 14, 1436. [Google Scholar] [CrossRef]
Lu, M.; Xu, Y.; Li, H. Vehicle re-identification based on UAV viewpoint: Dataset and method. Remote Sens. 2022, 14, 4603. [Google Scholar] [CrossRef]
Sheng, H.; Wang, S.; Chen, H.; Yang, D.; Huang, Y.; Shen, J.; Ke, W. Discriminative feature learning with co-occurrence attention network for vehicle ReID. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 3510–3522. [Google Scholar] [CrossRef]
Guo, H.; Zhu, K.; Tang, M.; Wang, J. Two-level attention network with multi-grain ranking loss for vehicle re-identification. IEEE Trans. Image Process. 2019, 28, 4328–4338. [Google Scholar] [CrossRef]
Zhou, X.; Zhong, Y.; Cheng, Z.; Liang, F.; Ma, L. Adaptive sparse pairwise loss for object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 18–22 June 2023; pp. 19691–19701. [Google Scholar]
Gu, J.; Wang, K.; Luo, H.; Chen, C.; Jiang, W.; Fang, Y.; Zhang, S.; You, Y.; Zhao, J. Msinet: Twins contrastive search of multi-scale interaction for object reid. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 18–22 June 2023; pp. 19243–19253. [Google Scholar]
Yao, Y.; Gedeon, T.; Zheng, L. Large-scale training data search for object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 18–22 June 2023; pp. 15568–15578. [Google Scholar]
Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 11–17 October 2021; pp. 1025–1034. [Google Scholar]
Zhang, C.; Yang, C.; Wu, D.; Dong, H.; Deng, B. Cross-view vehicle re-identification based on graph matching. Appl. Intell. 2022, 52, 14799–14810. [Google Scholar] [CrossRef]
Yan, C.; Pang, G.; Wang, L.; Jiao, J.; Feng, X.; Shen, C.; Li, J. BV-person: A large-scale dataset for bird-view person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 11–17 October 2021; pp. 10943–10952. [Google Scholar]
Yu, Z.; Pei, J.; Zhu, M.; Zhang, J.; Li, J. Multi-attribute adaptive aggregation transformer for vehicle re-identification. Inf. Process. Manag. 2022, 59, 102868. [Google Scholar] [CrossRef]
Lou, Y.; Bai, Y.; Liu, J.; Wang, S.; Duan, L. Veri-wild: A large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3235–3243. [Google Scholar]
Li, H.; Chen, J.; Zheng, A.; Wu, Y.; Luo, Y. Day-Night Cross-domain Vehicle Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Washington, USA, 17–21 June 2024; pp. 12626–12635. [Google Scholar]
Liu, H.; Tian, Y.; Yang, Y.; Pang, L.; Huang, T. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2167–2175. [Google Scholar]
Liu, X.; Liu, W.; Mei, T.; Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 869–884. [Google Scholar]
Lin, T.Y.; Cui, Y.; Belongie, S.; Hays, J. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5007–5015. [Google Scholar]
Liu, L.; Li, H. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5624–5633. [Google Scholar]
Dudczyk, J.; Rybak, Ł. Application of data particle geometrical divide algorithms in the process of radar signal recognition. Sensors 2023, 23, 8183. [Google Scholar] [CrossRef]
Sun, Z.; Nie, X.; Xi, X.; Yin, Y. CFVMNet: A multi-branch network for vehicle re-identification based on common field of view. In Proceedings of the 28th ACM International Conference on Multimedia, Online, 12–16 October 2020; pp. 3523–3531. [Google Scholar]
Shen, F.; Zhu, J.; Zhu, X.; Xie, Y.; Huang, J. Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 2021, 23, 8793–8804. [Google Scholar] [CrossRef]
Ye, M.; Chen, S.; Li, C.; Zheng, W.S.; Crandall, D.; Du, B. Transformer for object re-identification: A survey. Int. J. Comput. Vis. 2025, 133, 2410–2440. [Google Scholar] [CrossRef]
He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 11–17 October 2021; pp. 15013–15022. [Google Scholar]
Qian, W.; Luo, H.; Peng, S.; Wang, F.; Chen, C.; Li, H. Unstructured feature decoupling for vehicle re-identification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 336–353. [Google Scholar]
Shen, F.; Xie, Y.; Zhu, J.; Zhu, X.; Zeng, H. Git: Graph interactive transformer for vehicle re-identification. IEEE Trans. Image Process. 2023, 32, 1039–1051. [Google Scholar] [CrossRef]
Wei, R.; Gu, J.; He, S.; Jiang, W. Transformer-Based Domain-Specific Representation for Unsupervised Domain Adaptive Vehicle Re-Identification. IEEE Trans. Intell. Transp. Syst. 2022, 24, 2935–2946. [Google Scholar] [CrossRef]
Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4692–4702. [Google Scholar]
Shen, F.; Zhu, J.; Zhu, X.; Huang, J.; Zeng, H.; Lei, Z.; Cai, C. An efficient multiresolution network for vehicle reidentification. IEEE Internet Things J. 2021, 9, 9049–9059. [Google Scholar] [CrossRef]
Shen, Y.; Xiao, T.; Li, H.; Yi, S.; Wang, X. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1900–1909. [Google Scholar]
Wang, Z.; Tang, L.; Liu, X.; Yao, Z.; Yi, S.; Shao, J.; Yan, J.; Wang, S.; Li, H.; Wang, X. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 379–387. [Google Scholar]
Zhou, Y.; Shao, L. Cross-view GAN based vehicle generation for re-identification. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; Volume 1, pp. 1–12. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
He, B.; Li, J.; Zhao, Y.; Tian, Y. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3997–4005. [Google Scholar]
Khorramshahi, P.; Peri, N.; Chen, J.C.; Chellappa, R. The devil is in the details: Self-supervised attention for vehicle re-identification. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 369–386. [Google Scholar]
Zhao, J.; Zhao, Y.; Li, J.; Yan, K.; Tian, Y. Heterogeneous relational complement for vehicle re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 11–17 October 2021; pp. 205–214. [Google Scholar]
Liu, X.; Liu, W.; Mei, T.; Ma, H. Provid: Progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Trans. Multimed. 2017, 20, 645–658. [Google Scholar] [CrossRef]
Liu, X.; Liu, W.; Ma, H.; Fu, H. Large-scale vehicle re-identification in urban surveillance videos. In Proceedings of the IEEE International Conference on Multimedia and Expo, Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
Li, Z.Y.; Gao, S.; Cheng, M.M. Sere: Exploring feature self-relation for self-supervised transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15619–15631. [Google Scholar] [CrossRef]
Sun, B.; Yang, Y.; Zhang, L.; Cheng, M.M.; Hou, Q. Corrmatch: Label propagation via correlation matching for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 3097–3107. [Google Scholar]
Li, S.; Sun, L.; Li, Q. Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1405–1413. [Google Scholar]
Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef]
Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning Discriminative Features with Multiple Granularities for Person Re-Identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, South Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
Qian, J.; Jiang, W.; Luo, H.; Yu, H. Stripe-based and attribute-aware network: A two-branch deep model for vehicle re-identification. Meas. Sci. Technol. 2020, 31, 095401. [Google Scholar] [CrossRef]
Jin, X.; Lan, C.; Zeng, W.; Chen, Z. Uncertainty-aware multi-shot knowledge distillation for image-based object re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York Hilton Midtown, New York City, NY, USA, 7–12 February 2020; Volume 34, pp. 11165–11172. [Google Scholar]
Chu, R.; Sun, Y.; Li, Y.; Liu, Z.; Zhang, C.; Wei, Y. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 8282–8291. [Google Scholar]
Chen, T.S.; Liu, C.T.; Wu, C.W.; Chien, S.Y. Orientation-aware vehicle re-identification with semantics-guided part attention network. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 330–346. [Google Scholar]
Zhang, X.; Zhang, R.; Cao, J.; Gong, D.; You, M.; Shen, C. Part-guided attention learning for vehicle instance retrieval. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3048–3060. [Google Scholar] [CrossRef]
Meng, D.; Li, L.; Liu, X.; Li, Y.; Yang, S.; Zha, Z.J.; Gao, X.; Wang, S.; Huang, Q. Parsing-based view-aware embedding network for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 7103–7112. [Google Scholar]
Suprem, A.; Pu, C. Looking glamorous: Vehicle re-id in heterogeneous cameras networks with global and local attention. arXiv 2020, arXiv:2002.02256. [Google Scholar]
He, S.; Luo, H.; Chen, W.; Zhang, M.; Zhang, Y.; Wang, F.; Li, H.; Jiang, W. Multi-domain learning and identity mining for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Online, 14–19 June 2020; pp. 582–583. [Google Scholar]
Yan, C.; Pang, G.; Bai, X.; Liu, C.; Ning, X.; Gu, L.; Zhou, J. Beyond triplet loss: Person re-identification with fine-grained difference-aware pairwise loss. IEEE Trans. Multimed. 2021, 24, 1665–1677. [Google Scholar] [CrossRef]
Yang, L.; Luo, P.; Change Loy, C.; Tang, X. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3973–3981. [Google Scholar]
Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S.S.; Chen, J.C.; Chellappa, R. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 6132–6141. [Google Scholar]
Bai, Y.; Lou, Y.; Dai, Y.; Liu, J.; Chen, Z.; Duan, L.Y.; Pillar, I. Disentangled Feature Learning Network for Vehicle Re-Identification. In Proceedings of the International Joint Conference on Artificial Intelligence, Online, 7–15 January 2021; pp. 474–480. [Google Scholar]
Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Online, 14–19 June 2020; pp. 24–25. [Google Scholar]
Gildenblat, J.; Contributors. PyTorch Library for CAM Methods. 2021. Available online: https://github.com/jacobgil/pytorch-grad-cam (accessed on 1 September 2024).

Figure 1. Dataset examples. Figure (a)–(c) show images of a randomly selected vehicle from the VehicleID, VeRi, and AGID datasets, respectively. These figures illustrate the variations in the images of the same vehicle captured from different angles across the three datasets.

Figure 2. Examples of dataset collection scenes. The license plates in the images have been obscured for privacy.

Figure 3. Example images of the same vehicle captured from different viewpoints in AGID dataset. The leftmost column shows aerial perspective, while the remaining four columns display ground perspective.

Figure 4. The framework of our proposed Enhanced Self-Correlation Feature Computation. The input images are processed for feature extraction using a Transformer-based architecture. The proposed enhanced self-correlation feature computation method can be integrated into the Transformer structure, taking both high- and low-level Transformer layer outputs as inputs. These are then passed through the MCE module to obtain enhanced features. The enhanced features, combined with the low-level features, are used to compute enhanced self-correlation features. The enhanced self-correlation features, along with the global features, are then fused to generate the image features used for classification and loss computation.

f_{d}

denotes the deep features,

f_{s}

stands for the shallow features, and

f_{e}

refers to the enhanced features.

Figure 4. The framework of our proposed Enhanced Self-Correlation Feature Computation. The input images are processed for feature extraction using a Transformer-based architecture. The proposed enhanced self-correlation feature computation method can be integrated into the Transformer structure, taking both high- and low-level Transformer layer outputs as inputs. These are then passed through the MCE module to obtain enhanced features. The enhanced features, combined with the low-level features, are used to compute enhanced self-correlation features. The enhanced self-correlation features, along with the global features, are then fused to generate the image features used for classification and loss computation.

f_{d}

denotes the deep features,

f_{s}

stands for the shallow features, and

f_{e}

refers to the enhanced features.

Figure 5. Illustration of the self-correlation feature computation method. In the Transformer-based feature extraction architecture, both shallow and deep features are extracted and used to compute self-correlation features. These self-correlation features supplement the original image features.

Figure 6. The integration of enhanced self-correlation features in CLIP-ReID [42]. The enhanced self-correlation features are incorporated into the image encoder, overlaying the original image features without altering the other components, including the text encoder.

Figure 7. Visualization of enhanced self-correlation features. We conducted visualization experiments using the method used in [41] to illustrate the enhanced self-correlation features extracted by the model. The original input images are displayed in the first row, while the visualization results are presented in the second row.

Figure 8. Visualization of attention regions. We employed the method introduced by [58,59] to perform visualization experiments for identifying the model’s attention regions. The first row displays the original input images, the second row shows the attention regions of the baseline model (TransReID [22]) without the multi-scale convolution-enhanced self-correlation feature, and the final row illustrates the attention regions of our proposed method.

Figure 9. The impact of parameter

λ

in AGID. The variations in mAP and Rank-1 metrics for different values of the parameter

λ

are provided, the strite is 16. The case where

λ

is set to 0 corresponds to the baseline (TransReID [22]) presented in this paper.

Figure 9. The impact of parameter

λ

in AGID. The variations in mAP and Rank-1 metrics for different values of the parameter

λ

are provided, the strite is 16. The case where

λ

is set to 0 corresponds to the baseline (TransReID [22]) presented in this paper.

Figure 10. Top-10 Ranking on AGID Dataset. The top row presents the ten best retrieval results produced by the baseline model (TransReID [22]), with the bottom row showing the corresponding results from our model. Green boxes denote correct identifications, while red boxes indicate incorrect results.

Table 1. Comparisons among the VeRi [15], VehicleID [14], VERI-Wild [12], DN-Wild [13], DN348 [13], and the created AGID datasets for vehicle Re-ID.

Dataset	Identities	Aerial	Ground	Cross-View	Urban	Rural
VeRi [15]	776	✓	✗	✗	✓	✗
VehicleID [14]	26,267	✓	✗	✗	✓	✗
VERI-Wild [12]	40,671	✓	✗	✗	✓	✗
DN-Wild [13]	2286	✓	✗	✗	✓	✗
DN348 [13]	348	✓	✗	✗	✓	✗
AGID	834	✓	✓	✓	✓	✓

Table 2. Computational efficiency comparison.

Method	Parameters (M)	Inference Time (ms/Batch)
Baseline	102.7	1109.8
Baseline + ESFC	119.2	1191.2

Table 3. Comparison with State-of-the-Arts in AGID. Bold indicate the best performance.

AGID	mAP	R1	R5	R10
ResNet-50 [32]	32.9	48.2	62.0	68.5
AGW [43]	38.1	54.1	66.8	72.5
BoT [44]	34.4	49.4	62.7	69.6
MGN [45]	42.2	57.6	71.6	78.3
TransReID(stride16) [22]	47.9	61.8	76.8	82.8
TransReID(stride16) + ESFC	50.3	63.7	76.4	83.3
TransReID(stride12) [22]	50.3	62.9	78.1	82.9
TransReID(stride12) + ESFC	51.4	65.2	79.0	84.6
CLIP-ReID [42]	54.2	66.9	81.8	87.2
CLIP-ReID + ESFC	54.4	69.0	82.9	88.2

Table 4. Comparison with State-of-the-Arts in VeRi and VehicleID. Bold indicate the best performance. Underlined indicate the second-best performance.

	VeRi		VehicleID
Method	mAP	R1	R1	R5
PRRe-ID [35]	72.5	93.3	72.6	88.6
SAN [46]	72.5	93.3	79.7	94.3
UMTS [47]	75.9	95.8	80.9	87.0
VANet [48]	66.3	89.8	83.3	96.0
SPAN [49]	68.9	94.0	-	-
PGAN [50]	79.3	96.5	78.0	93.2
PVEN [51]	79.5	95.6	84.7	97.0
SAVER [36]	79.6	96.4	79.9	95.2
CFVMNet [19]	77.1	95.3	81.4	94.1
GLAMOR [52]	80.3	96.5	78.6	93.6
MDIM [53]	79.8	95.0	-	-
CAL [8]	74.3	95.4	82.5	94.7
FIDI [54]	77.6	95.7	78.5	91.9
DCAL [26]	80.2	96.9	-	-
TransReID(stride16) [22]	80.6	96.9	83.6	97.1
TransReID(stride16) + ESFC	81.5	97.0	85.8	97.5
TransReID(stride12) [22]	82.0	97.1	85.2	97.5
TransReID(stride12) + ESFC	82.1	97.7	85.7	97.4
CLIP-ReID [42]	83.3	97.4	85.3	97.6
CLIP-ReID + ESFC	83.0	97.6	85.4	97.8

Table 5. Comparison with State-of-the-Arts in VERI-Wild.

	Small			Medium			Large
Methods	mAP	R1	R5	mAP	R1	R5	mAP	R1	R5
GoogleNet [55]	24.3	57.2	75.1	24.2	53.2	71.1	21.5	44.6	63.6
FDA-Net(VGGM) [12]	35.1	64.0	82.8	29.8	57.8	78.3	22.8	49.4	70.5
FDA-Net(Resnet50) [12]	61.6	73.6	91.2	52.7	64.3	85.4	45.8	58.8	81.0
AAVER [56]	62.2	75.8	92.7	53.7	68.2	88.9	41.7	58.7	81.6
DFLNet [57]	68.2	80.7	93.2	60.1	70.7	89.3	49.0	61.6	82.7
UMTS [47]	72.7	84.5	-	66.1	79.3	-	54.2	72.8	-
BoT [44]	76.6	90.8	97.3	70.1	87.5	95.2	61.3	82.6	92.7
HPGN [20]	80.4	91.4	-	75.2	88.2	-	65.0	82.7	-
PVEN [51]	79.8	94.0	98.1	73.9	92.0	97.2	66.2	88.6	95.3
SAVER [36]	80.9	93.8	97.9	75.3	92.7	97.5	67.7	89.5	95.8
TransReID [22]	80.1	92.4	97.7	74.1	89.4	96.5	65.2	85.1	94.2
TransReID + ESFC	80.9	92.3	97.5	75.1	90.2	96.4	66.3	85.9	94.3

Table 6. Ablation study on AGID dataset.

Base	SFC	MCE	mAP	R1	R5	R10
✓			50.3	62.9	78.1	82.9
✓	✓		51.2	63.9	77.1	83.7
✓	✓	✓	51.4	65.2	79.0	84.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shang, L.; Min, C.; Wang, J.; Xiao, L.; Zhao, D.; Nie, Y. Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline. Remote Sens. 2025, 17, 2653. https://doi.org/10.3390/rs17152653

AMA Style

Shang L, Min C, Wang J, Xiao L, Zhao D, Nie Y. Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline. Remote Sensing. 2025; 17(15):2653. https://doi.org/10.3390/rs17152653

Chicago/Turabian Style

Shang, Linzhi, Chen Min, Juan Wang, Liang Xiao, Dawei Zhao, and Yiming Nie. 2025. "Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline" Remote Sensing 17, no. 15: 2653. https://doi.org/10.3390/rs17152653

APA Style

Shang, L., Min, C., Wang, J., Xiao, L., Zhao, D., & Nie, Y. (2025). Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline. Remote Sensing, 17(15), 2653. https://doi.org/10.3390/rs17152653

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline

Abstract

1. Introduction

2. Related Work

2.1. Vehicle Re-Identification

2.2. Vehicle Re-ID Datasets

3. Cross View Dataset: AGID

4. The Proposed Method for AGID

4.1. Overview

4.2. Self-Correlation Feature Computation Method

4.3. Multi-Scale Convolutional Enhancement Module

4.4. Overall Training

4.5. Computational Complexity

5. Experiments

5.1. Datasets

5.2. Implementation Details

5.3. Comparison on AGID Datasets

5.4. Comparison on General Re-ID Datasets

5.5. Ablation Studies and Analysis

5.5.1. Ablation Study of SFC and MCE

5.5.2. Visualization of Enhanced Self-Correlation Features

5.5.3. Visualization of Attention Maps

5.5.4. Ablation Study of $λ$

5.5.5. Visualization of Rank List

5.6. Computational Efficiency Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Aerial-Ground Cross-View Vehicle Re-Identification: A Benchmark Dataset and Baseline

Abstract

1. Introduction

2. Related Work

2.1. Vehicle Re-Identification

2.2. Vehicle Re-ID Datasets

3. Cross View Dataset: AGID

4. The Proposed Method for AGID

4.1. Overview

4.2. Self-Correlation Feature Computation Method

4.3. Multi-Scale Convolutional Enhancement Module

4.4. Overall Training

4.5. Computational Complexity

5. Experiments

5.1. Datasets

5.2. Implementation Details

5.3. Comparison on AGID Datasets

5.4. Comparison on General Re-ID Datasets

5.5. Ablation Studies and Analysis

5.5.1. Ablation Study of SFC and MCE

5.5.2. Visualization of Enhanced Self-Correlation Features

5.5.3. Visualization of Attention Maps

5.5.4. Ablation Study of λ

5.5.5. Visualization of Rank List

5.6. Computational Efficiency Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

5.5.4. Ablation Study of $λ$