Object Re-Identification Method for Air-to-Ground Targets Based on Neighborhood Feature Centralization Attention

Yao, Tian; Xu, Yong; Ma, Yue; Yan, Hongtao; Xu, Haihang; Wang, An

doi:10.3390/computation14050096

by Tian Yao, Yong Xu *, Yue Ma, Hongtao Yan, Haihang Xu and An Wang

Aerospace Technology Institute of CARDC, South Section of Second Ring Road, Mianyang 621000, China

* Author to whom correspondence should be addressed.

Computation 2026, 14(5), 96; https://doi.org/10.3390/computation14050096

Submission received: 19 March 2026 / Revised: 15 April 2026 / Accepted: 20 April 2026 / Published: 22 April 2026

(This article belongs to the Topic Intelligent Optimization Algorithm: Theory and Applications)

Abstract

To address the core challenges in air-to-ground target re-identification (ReID), including network focus on invalid background information, poor adaptability to nonlinear feature distribution, and insufficient cross-domain generalization, this paper proposes a novel air-to-ground ReID framework based on Neighborhood Feature Centralization Attention (NFCA). On the basis of Coordinate Attention, the framework introduces a parameter-free Neighborhood Feature Centralization mechanism to build a lightweight attention module, which enhances cross-feature semantic interaction and suppresses background noise while retaining precise position encoding. It achieves end-to-end direct optimization of sample pair similarity through binary cross-entropy loss, eliminating the proxy task bias of traditional classification loss and adapting to the nonlinear structure of feature space. A multi-source data-driven training strategy is constructed by fusing ReID datasets and general classification datasets, which expands the coverage of feature space and narrows the distribution gap between training data and real air-to-ground scenarios without additional manual annotation. Experiments show that the proposed method achieves leading mAP values on the self-developed UAV air-to-ground dataset JC-1, the public person ReID dataset Market-1501, and the public vehicle ReID dataset VehicleID. Sufficient statistical validation, ablation experiments and cross-domain tests verify the advancement, reliability and generalization of the proposed method in complex air-to-ground scenarios.

Keywords: object re-identification; cross-domain data training; adaptive metric function; attention mechanism

1. Introduction

As a core task in computer vision, air-to-ground target re-identification (ReID) [1,2] aims to achieve accurate target matching across complex scenarios such as cross-camera viewpoints, illumination variations, and object occlusions in UAV air-to-ground collaborative environments. It has critical engineering application value in fields including UAV autonomous perception [3,4], low-altitude border patrol [5], urban intelligent traffic monitoring [6,7], and target tracking in emergency rescue [8]. Compared with conventional ReID tasks with ground-fixed cameras, the high-speed maneuver of the UAV platform, drastic changes in shooting viewpoints, large-scale background interference, and significant fluctuations in target scale in air-to-ground scenarios bring exclusive technical challenges to ReID tasks. Existing mainstream methods suffer from severe performance degradation in air-to-ground scenarios, and their core bottlenecks can be summarized into three aspects.

First, the network is prone to focus on invalid background information, resulting in insufficient feature discriminability. In air-to-ground scenarios, rapid viewpoint switching, pixel-level dense targets, and large-scale dynamic background changes lead to significant shifts in the feature distribution of the same target across different domains. Traditional deep learning networks are often disturbed by background noise, redundant features and non-critical local regions, while ignoring the core discriminative features of the target. Existing mainstream attention mechanisms (such as Coordinate Attention, CA) can only achieve independent position encoding, lacking effective cross-feature semantic interaction and background noise suppression capabilities. The interference of invalid information greatly reduces the discriminability of features and restricts the accurate matching ability of ReID tasks.

Second, the similarity evaluation method has poor adaptability to the nonlinear feature space. Most existing methods rely on classification loss to indirectly optimize the feature space, and use manually designed linear metric functions such as Euclidean distance and cosine similarity to complete similarity evaluation. However, in air-to-ground scenarios, the high-speed movement of airborne cameras and drastic viewpoint changes significantly increase the nonlinear complexity of feature distribution. Linear metrics have inherent systematic bias for nonlinear feature structures, resulting in a contradiction between “low training error and poor actual inference performance”.

Third, the feature space coverage of training data is insufficient, leading to weak cross-domain generalization ability. Traditional methods are highly dependent on finely annotated ReID datasets with unique ID labels. However, in air-to-ground ReID scenarios, the acquisition of high-quality annotated samples is difficult and costly. Limited training data can hardly cover the feature diversity of real scenarios, leading to a large gap between the training distribution and the real scenario distribution, as well as high generalization error. Existing cross-domain methods mostly rely on adversarial domain adaptation mechanisms, which require a large amount of target domain data for support, and cannot adapt to open air-to-ground scenarios with unknown target distributions.

To address the above core challenges of air-to-ground target ReID, which are not fully resolved by existing attention-based and metric learning-based ReID methods, this paper proposes a novel end-to-end deep metric learning framework based on Neighborhood Feature Centralization Attention (NFCA). Unlike existing works that rely on incremental optimization of single modules, our framework constructs a three-dimensional collaborative optimization chain of “attention mechanism-adaptive metric-multi-source data supplementation”, which fundamentally breaks through the three core bottlenecks of existing methods in air-to-ground scenarios. The core contributions and fundamental technical distinctions of this paper compared with state-of-the-art attention-based and metric learning approaches are as follows:

A lightweight, parameter-free Neighborhood Feature Centralization Attention (NFCA) [9] module is proposed, which fundamentally breaks the inherent trade-off of existing attention mechanisms. Unlike Coordinate Attention (CA) that only realizes independent position encoding without cross-feature semantic interaction, and other mainstream attention modules (CBAM, ECA) that sacrifice position accuracy for channel correlation modeling, NFCA introduces the Neighborhood Feature Centralization (NFC) mechanism on the basis of CA. It retains precise position encoding while enhancing cross-feature semantic interaction, and achieves intra-class feature compactness and background noise suppression through neighborhood statistics without additional trainable parameters, significantly improving the robustness of features to extreme viewpoint changes, occlusions and illumination fluctuations in air-to-ground scenarios.

A data-driven adaptive nonlinear metric learning paradigm is constructed, which achieves full end-to-end alignment between training optimization objectives and inference task goals. Different from existing metric learning methods that rely on manually designed linear metrics (Euclidean distance, cosine similarity) with inherent systematic bias for nonlinear feature spaces, or use classification loss as a proxy task to indirectly optimize feature distribution, our method realizes direct end-to-end optimization of sample pair similarity via BCEWithLogitsLoss. It eliminates both the proxy task bias of classification loss and the systematic bias of linear metrics, and can accurately fit the complex nonlinear manifold structure of the ReID feature space in air-to-ground scenarios.

A label-free multi-source data fusion training strategy is proposed, which breaks the dependence of existing cross-domain ReID methods on target domain data and manual annotation. Without any additional manual labeling, we automatically construct positive and negative sample pairs from general classification datasets following the same pairing rules as ReID datasets, and fuse them with ReID data for training. This strategy expands the coverage of the feature space, narrows the distribution gap between training data and real air-to-ground scenarios, and enhances cross-domain generalization without relying on target domain data required by adversarial domain adaptation methods.

A dedicated UAV air-to-ground vehicle ReID dataset JC-1 is constructed, which covers complex variations of extreme viewpoints, shooting distances, illumination conditions and day-night periods. It fills the lack of dedicated benchmarks for air-to-ground target ReID, and provides a standardized test platform for follow-up research in this field.

2. Related Work

2.1. Application of Deep Metric Learning in Re-Identification

Deep metric learning directly optimizes the similarity of sample pairs/triplets through a Siamese Network [10] or a Triplet Network [11], avoiding the proxy task bias of traditional classification loss. Early methods such as Contrastive Loss and Triplet Loss optimize the feature space by explicitly constraining similar samples to have close features and dissimilar samples to have distant features, but they rely on manually designed margin parameters and metric functions (e.g., Euclidean distance), resulting in limited generalization ability. Subsequent studies proposed Lifted Structure loss [12] to solve the problem of triplet sampling efficiency through structured feature embedding; AM-Softmax [13] introduced angular margin to enhance feature discriminability; NormFace [14] improved the uniformity of the metric space through feature normalization and margin optimization. Recent studies have attempted to introduce adaptive metric learning, such as learning nonlinear similarity functions through fully connected networks, but most of them do not fully utilize the structural information of the feature space. The proposed method directly optimizes sample pair similarity through end-to-end BCEWithLogitsLoss and combines an attention mechanism to enhance feature discriminability, breaking through the limitations of traditional linear metrics.

2.2. Progress of Attention Mechanism in Visual Feature Learning

The attention mechanism improves the discriminative ability of networks by capturing the dependencies of feature maps. Coordinate Attention (CA) [15] encodes position information through global pooling along the horizontal and vertical directions. The multi-scale feature fusion strategy of Pyramid Vision Transformer [16] provides inspiration for the design of cross-channel interaction. On the basis of CA, this paper introduces the NFC mechanism to realize cross-feature interaction, proposes a Neighborhood Feature Centralization Attention (NFCA) module that fuses channel information, and at the same time retains precise position encoding and enhances channel correlation, effectively improving the robustness of re-identification features. In addition, plug-and-play feature fusion methods [17] provide design ideas for the cross-feature collaborative optimization in this paper through refined feature pooling.

2.3. Cross-Domain Generalization and Data Fusion Technology

The core challenge of cross-domain re-identification lies in the distribution difference between training data and real scenarios. Traditional domain adaptation methods such as adversarial learning (DAN [18], CDAN [19]) bridge the distribution difference through adversarial mechanisms but rely on a large amount of target domain data. Recent studies have explored the use of auxiliary data to enhance generalization ability, such as introducing unannotated images or weakly supervised information. Feature fusion invariance methods [20] improve generalization by learning domain-invariant features, while early cross-domain methods such as Person Transfer GAN [21] highlight the importance of distribution alignment. The research on day-night cross-domain vehicle re-identification [22] verifies the effectiveness of multi-source data fusion in practical scenarios. This paper proposes a composite training strategy that fuses re-identification datasets with classification datasets, significantly reducing the Hellinger distance [23] between the training distribution and the real distribution and lowering the generalization error [24] by expanding the coverage of the feature space.
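For readers unfamiliar with the Hellinger distance mentioned above, a minimal sketch for two discrete distributions follows. It only illustrates the standard definition; how the distance is estimated for the training and real-scenario distributions in this paper is not specified here, and the function name `hellinger` is chosen for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Toy histograms (hypothetical values, not taken from the experiments).
same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
partial = hellinger([0.7, 0.3], [0.4, 0.6])
```

A smaller value means the two distributions overlap more, which is the sense in which a broader training set is said to move closer to the real distribution.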

3. Cross-Domain Generalization Framework Based on Deep Metric Learning

This paper proposes an air-to-ground re-identification framework based on neighborhood feature centralization attention, which realizes accurate identification of different target objects in cross-camera air-to-ground scenarios by jointly optimizing the feature space and the adaptive similarity metric function. The following subsections introduce the overall network architecture, the attention module fusing neighborhood information, and the comparison of feature space optimization objectives.

3.1. Overall Network Architecture

The core of the method includes three parts: a feature extraction module, an attention module, and a similarity evaluation module, where the attention module adopts a plug-and-play manner. Different from previous re-identification methods that use output categories to assist convergence, classification datasets are added during training to expand the data volume and enhance the generalization ability of the network. The overall method is shown in Figure 1.

As shown in Figure 1a, the end-to-end collaborative training process of the NFCA module and the similarity learning branch is as follows:

(1): Sample Pair Construction: The training dataset is divided into positive and negative sample pairs, where positive pairs are two samples of the same ID, and negative pairs are two samples of different IDs.
(2): Weight-Shared Feature Extraction: The sample pairs are synchronously input into the Siamese network with shared weights, and the ResNet34 backbone outputs initial feature pairs.
(3): NFCA Module for Discriminative Feature Enhancement: The initial feature pairs are input into the plug-and-play NFCA module, which suppresses background noise, enhances cross-feature semantic consistency, and outputs robust, high-discriminability attention feature pairs.
(4): End-to-End Similarity Optimization: The attention feature pairs output by NFCA are directly fed into the adaptive nonlinear similarity learning branch. The similarity score of the feature pair is calculated, and the network is trained end-to-end with BCEWithLogitsLoss. The gradient generated by the loss is back-propagated to both the similarity branch and the NFCA module, guiding the attention module to focus more on the core discriminative regions that are critical for similarity matching.

The test and validation process in Figure 1b is as follows. The query image with unknown ID and the gallery sample set with registered IDs first undergo the same preprocessing as in the training phase. They are then input synchronously into the trained Siamese network with fixed, shared weights, and feature encoding is completed by the ResNet34 backbone integrated with the NFCA attention module. Next, the pairwise similarity between the query feature and each gallery feature is calculated by the adaptive nonlinear metric branch that was jointly trained end-to-end. Finally, the ID corresponding to the maximum similarity value is output as the re-identification result. Because the network architecture is integrated end-to-end, feature extraction and similarity calculation are completed in a single forward propagation. The weight sharing of the Siamese network supports real-time generation of standardized features for input images, and the joint optimization of the metric branch and the feature extraction backbone avoids the adaptation deviation caused by pre-extracted features. Therefore, the native algorithm process does not require gallery features to be extracted in advance; pre-extraction is only an optional engineering optimization that improves inference efficiency in large-scale industrial deployment, not a necessary step of this test process.

The proposed framework can be trained not only on re-identification datasets but also on general classification (non-re-identification) datasets to improve the generalization ability of the re-identification method. During the training phase, positive and negative sample pairs are automatically constructed from general classification datasets following exactly the same pairing rules as for ReID datasets (samples from the same category form positive pairs, while samples from different categories form negative pairs), without any additional manual annotation. The sample pairs from ReID datasets and classification datasets are mixed at a fixed ratio of 7:3 in each training batch to balance task alignment and feature space coverage expansion, which enhances the model's cross-domain generalization performance in complex air-to-ground scenarios.
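The pairing rule and the 7:3 batch mix described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' data loader: the helper names `build_pairs` and `mixed_batch`, the exhaustive enumeration of all pairs, the toy sample names, and the small batch size are all hypothetical.

```python
import random
from itertools import combinations

def build_pairs(samples):
    """samples: list of (image_name, label) -> list of (name_a, name_b, target).

    target is 1 when both samples share a label (same ID or same category),
    otherwise 0. The same rule serves ReID and classification data.
    """
    return [(a, b, 1 if la == lb else 0)
            for (a, la), (b, lb) in combinations(samples, 2)]

def mixed_batch(reid_pairs, cls_pairs, batch_size, reid_ratio=0.7, seed=0):
    """Draw one batch with the given share of ReID pairs (assumed sampler)."""
    rng = random.Random(seed)
    n_reid = round(batch_size * reid_ratio)
    batch = rng.sample(reid_pairs, n_reid) + rng.sample(cls_pairs, batch_size - n_reid)
    rng.shuffle(batch)
    return batch

# Toy data: four vehicle IDs with two crops each, two flower categories.
reid = [("car1_a", 1), ("car1_b", 1), ("car2_a", 2), ("car2_b", 2),
        ("car3_a", 3), ("car3_b", 3), ("car4_a", 4), ("car4_b", 4)]
cls = [("rose_1", "rose"), ("rose_2", "rose"),
       ("tulip_1", "tulip"), ("tulip_2", "tulip")]

reid_pairs = build_pairs(reid)
cls_pairs = build_pairs(cls)
batch = mixed_batch(reid_pairs, cls_pairs, batch_size=10)
```

In practice positive pairs are far rarer than negative ones under exhaustive enumeration, so a real sampler would probably balance them; that choice is not described in the text and is left out of the sketch.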

3.2. Attention Module Fusing Neighborhood Information

Re-identification not only requires deep, distinctive high-dimensional channel information to distinguish different categories but also needs to fully utilize the captured position information to enable the network to find regions of interest. Therefore, to obtain more robust re-identification features, we propose an attention module fusing neighborhood feature information, which can not only accurately capture regions of interest but also effectively capture the relationships between features.

The architecture and complete data flow of the attention module fusing neighborhood information (the NFCA module) are shown in Figure 2. The calculation process is divided into two core steps:

Step 1: Neighborhood Feature Centralization with Position Encoding. For the input feature $y$ extracted by ResNet34, the neighborhood-centralized position-aware feature is calculated via Equation (1):

$$y' = F_{NFC}\left(A_{h}(y) \oplus A_{w}(y)\right) \tag{1}$$

where $y$ is the feature extracted by ResNet34; $\oplus$ denotes feature concatenation; $A_{h}$ and $A_{w}$ are 2D global average pooling (GAP) operations, after which the feature sizes are [B, C, W, 1] and [B, C, 1, H], respectively; $F_{NFC}(\cdot)$ is the neighborhood feature centralization operation; and $y'$ is the feature after neighborhood feature centralization.

Step 2: Attention Weight Generation and Feature Re-calibration. Based on the feature obtained in Step 1, the weighted attention feature is calculated via Equation (2):

$$y'' = y \cdot C_{1}(y_{1}') \cdot C_{1}(y_{2}') \tag{2}$$

where $y_{1}'$ and $y_{2}'$ are the features split from $y'$ along the channel dimension, with split dimensions [B, C, W, 1] and [B, C, 1, H], respectively; $C_{1}$ is a convolution layer used to construct attention; $y$ is the original feature; and the weighted attention feature $y''$ is obtained by matrix multiplication of $y$ with $y_{1}'$ and $y_{2}'$ after attention construction.

On the basis of Coordinate Attention, this paper introduces NFC. The module keeps the spatial direction aggregation strategy of Coordinate Attention: it generates feature maps along the horizontal and vertical directions respectively, capturing long-range spatial dependencies and precise position information. The NFC mechanism is then used for multi-feature interaction. NFC operations are performed on the horizontal and vertical feature maps respectively, and statistics are computed over the horizontal and vertical attention neighborhoods of each feature in the feature dimension. By sharing these neighborhood statistics across features, similar features are enhanced and similar features from different domains are forced to converge to the local center, which strengthens the semantic consistency across features. At the same time, background noise is suppressed so that the network focuses more on the target. Finally, dimension adaptation and weight generation are carried out: the dimension is adjusted through the convolution layer, attention weights are generated, and the weighted attention feature $y''$ is obtained by multiplying the weights with the original feature.
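The data flow of Equations (1) and (2) can be sketched in plain Python on a tiny single-image feature map stored as nested lists of shape [C][H][W]. This is only an illustration under loud assumptions: `centralize` is a hypothetical stand-in for the NFC operation (the actual operator comes from [9] and is not reproduced here), the convolution $C_{1}$ is replaced by a plain sigmoid, and the two pooled maps are processed separately rather than concatenated and split.

```python
import math

def pool_over_width(y):
    """Directional average pooling: [C][H][W] -> [C][H]."""
    return [[sum(row) / len(row) for row in channel] for channel in y]

def pool_over_height(y):
    """Directional average pooling: [C][H][W] -> [C][W]."""
    return [[sum(col) / len(col) for col in zip(*channel)] for channel in y]

def centralize(values, k=3, alpha=0.5):
    """ASSUMED stand-in for F_NFC: blend each entry with the mean of its
    k-wide neighbourhood, pulling nearby responses toward a local centre.
    It has no trainable parameters, only neighbourhood statistics."""
    r = k // 2
    out = []
    for i, v in enumerate(values):
        window = values[max(0, i - r): i + r + 1]
        out.append((1 - alpha) * v + alpha * sum(window) / len(window))
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nfca_like(y):
    """Re-weight y by row-wise and column-wise attention, mirroring Equation (2)."""
    att_h = [[sigmoid(v) for v in centralize(p)] for p in pool_over_width(y)]
    att_w = [[sigmoid(v) for v in centralize(p)] for p in pool_over_height(y)]
    return [
        [[y[c][i][j] * att_h[c][i] * att_w[c][j] for j in range(len(y[c][i]))]
         for i in range(len(y[c]))]
        for c in range(len(y))
    ]

feature = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # C=1, H=3, W=2
weighted = nfca_like(feature)
```

The output keeps the input shape, and every position is scaled by one weight from its row and one from its column, which is how the directional encoding preserves position information.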

The advantages of this attention module are concentrated in three aspects: (1) No trainable parameters, lightweight and efficient. NFC does not need to learn convolution kernel parameters and realizes feature transformation only through neighborhood statistics. It is especially suitable for lightweight models or training-free inference scenarios, and reduces the risk of overfitting. (2) Enhanced intra-class compactness. NFC forces similar features to converge to the mean through local neighborhood centralization. (3) Retained spatial position accuracy. NFC acts on the k × k neighborhood, which retains position details and is better suited to accurately locating regions of interest. In cross-view re-identification scenarios, NFC is more robust to local feature offsets caused by viewpoint changes, and maintains feature stability through neighborhood adaptive standardization. By introducing NFC, the attention module keeps the global receptive field and position encoding of Coordinate Attention while strengthening the semantic consistency across features in a parameter-free manner, improving the discriminability and robustness of re-identification features.

3.3. Data-Driven Similarity Judgment Method

The optimization mechanism of the feature space fundamentally determines the upper limit of the discriminative power of ReID tasks. The core difference between our method and traditional approaches lies in the design logic of optimization objectives and metric learning:

Traditional methods: Indirectly constrain feature distribution via classification loss (proxy task), and use manually designed linear metrics (Euclidean distance, cosine similarity) for similarity evaluation, which have inherent systematic bias for the nonlinear feature space in air-to-ground scenarios.

Proposed method: The feature extraction backbone, NFCA module and similarity learning branch are jointly trained in an end-to-end manner, with direct optimization for the core task goal of sample pair similarity matching. Traditional linear metrics implicitly assume that the feature space is a linearly separable Euclidean space, which cannot fit the complex nonlinear manifold structure formed by the feature distribution of the same target under drastic viewpoint changes in air-to-ground scenarios, resulting in systematic measurement bias. To solve this problem, this paper constructs a data-driven nonlinear metric function, whose calculation method is shown in Equation (3):

$$S(f_{i}, f_{j}) = \sigma\left(f_{fla}\left(|f_{i} - f_{j}|\right)\right) \tag{3}$$

where $f_{i}$ and $f_{j}$ are the two features output by the Siamese network, $f_{fla}$ is the fully connected layer, and $\sigma$ is the Sigmoid function. The difference between a feature pair is represented by the absolute difference. The fully connected layer $f_{fla}$ provides the data-driven similarity evaluation standard, and the Sigmoid function finally maps the metric result to the [0, 1] interval to obtain the similarity directly.
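Equation (3) can be illustrated with a short sketch in which the fully connected layer is reduced to a single weight vector and bias. The function name `similarity` and the weight and feature values are hypothetical, chosen only to show the computation; in the framework these weights are learned jointly with the backbone.

```python
import math

def similarity(f_i, f_j, weights, bias=0.0):
    """S(f_i, f_j) = sigmoid(FC(|f_i - f_j|)) with a single linear layer."""
    diff = [abs(a - b) for a, b in zip(f_i, f_j)]
    logit = sum(w * d for w, d in zip(weights, diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical learned parameters: negative weights make larger feature
# differences lower the score, and the bias sets the score of identical features.
weights = [-2.0, -1.0, -3.0]
bias = 2.0

anchor = [0.2, 0.9, 0.4]
close = [0.25, 0.85, 0.45]
far = [0.9, 0.1, 0.95]

s_close = similarity(anchor, close, weights, bias)
s_far = similarity(anchor, far, weights, bias)
```

Because the input is an absolute difference, the score is symmetric in its two arguments, and the learned weights decide which feature dimensions matter for matching.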

The advantages of the proposed method based on direct optimization of sample pairs are reflected in two aspects: (1) Direct alignment of tasks. Instead of relying on categories for indirect constraints, the method directly optimizes the similarity of sample pairs, so the output optimized during training is the same output used at inference. It can therefore capture subtle differences between similar samples in posture and local occlusion, as well as hard cases in which dissimilar samples are difficult to distinguish; (2) Nonlinear metric capability. The fully connected network can learn complex similarity functions that adapt to the nonlinear distribution of the feature space, going beyond traditional manually designed metrics such as Euclidean distance and cosine distance.
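The pairwise objective can be made concrete with a sketch of binary cross-entropy computed on the logit, that is, the fully connected output before the Sigmoid. The numerically stable form below is the standard one; the function name `bce_with_logits` and the example logits are illustrative, not values from the experiments.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, with target 1 (same ID) or 0.

    Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))], written so
    that exp() never overflows for large |z|.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Direct form, used here only to check the stable version."""
    s = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(s) + (1 - target) * math.log(1 - s))

loss_confident_right = bce_with_logits(4.0, 1)   # positive pair scored high
loss_confident_wrong = bce_with_logits(4.0, 0)   # negative pair scored high
loss_extreme = bce_with_logits(1000.0, 1)        # no overflow
```

A confidently wrong pair is penalised much more than a confidently right one, which is what drives the similarity branch and, through back-propagation, the attention module.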

4. Experimental Verification and Analysis

This section systematically verifies the effectiveness, advancement and generalization ability of the proposed method through comparative experiments, ablation experiments, visualization analysis and cross-task generalization verification. Quantitative and qualitative analyses of the experimental results are carried out to clarify the contribution of each core module. All experiments are repeated 5 times independently with different random seeds, and the results are reported as mean ± standard deviation. The statistical significance of each performance improvement is verified by a two-tailed paired t-test at a significance level of p < 0.05.
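The significance check described above can be sketched for five paired runs as follows. The mAP values are made up for illustration and are not results from this paper; the critical value 2.776 is the two-tailed 5% threshold of Student's t distribution with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic of a paired t-test on per-seed differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

T_CRIT_DF4 = 2.776  # two-tailed, alpha = 0.05, 5 runs -> 4 degrees of freedom

# Hypothetical mAP (%) of two methods over the same five random seeds.
method_a = [70.4, 70.9, 70.1, 70.8, 70.3]
method_b = [70.0, 70.4, 69.8, 70.3, 70.0]

t_stat = paired_t(method_a, method_b)
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed removes the run-to-run variation shared by both methods, so a small but consistent gain can still be significant with only five runs.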

4.1. Preprocessing and Training Configuration Details

To ensure the reproducibility of the study, the full details of image preprocessing, backbone configuration, training hyperparameters and multi-source data fusion strategy are clearly specified as follows:

Image Preprocessing: All input images are uniformly resized to 224 × 224 pixels, followed by random horizontal flipping (flipping probability 0.5), random erasing (erasing probability 0.5, erasing ratio 0.02–0.4), and normalization consistent with ImageNet pretraining, with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].

Backbone Configuration: ResNet34 is adopted as the backbone network. The stride of the last downsampling layer of the backbone is modified from 2 to 1 to retain more detailed features, and the dimension of the final output feature is 512.

Training Hyperparameter Settings: AdamW is used as the optimizer with a weight decay of 5 × 10⁻⁴. The initial learning rate is 3 × 10⁻⁴ with a step decay strategy in which the learning rate is multiplied by 0.1 every 20 epochs. The batch size is 128, and the total number of training epochs is 120 for all datasets. All experiments are implemented in PyTorch 2.0 and run on a single NVIDIA A40 GPU.
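The step decay schedule above amounts to the following rule, sketched here under the assumption that epochs are counted from 0 and the first decay takes effect at epoch 20. The function name `step_lr` is illustrative; in PyTorch the same behaviour is normally obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.

```python
def step_lr(epoch, base_lr=3e-4, step=20, gamma=0.1):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of each 20-epoch stage of the 120-epoch run.
schedule = [step_lr(e) for e in range(0, 120, 20)]
```

Over 120 epochs the rate is reduced five times, so the last stage runs at roughly 3 × 10⁻⁹.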

Multi-source Data Fusion Strategy: During training, each batch contains 70% samples from ReID datasets and 30% samples from general classification datasets, to balance task alignment and feature space expansion. For classification datasets, samples of the same category automatically form positive sample pairs and samples of different categories automatically form negative sample pairs, which is fully consistent with the pair construction rules of ReID datasets and requires no additional manual annotation. For each category of a classification dataset, 80–100 images are randomly selected for training to avoid category imbalance. The classification datasets participate only in the training phase and are completely independent of the unknown-ID processing in the test phase.

Inference Pipeline Configuration (Corresponding to Figure 1b): The inference process is completely fixed to ensure the consistency of results:

(1): Reference Library Pre-extraction: Before formal inference, all images with registered IDs in the reference library are input into the trained network at one time to extract 512-dimensional features, which are pre-stored in the local feature library. During the inference process, there is no need to re-input the reference library images into the backbone network for repeated calculation, which greatly improves the inference efficiency.
(2): Query Image Processing: For an input unknown ID query image, the same preprocessing as the training phase is performed, and then input into the trained network to extract the query feature.
(3): Similarity Matching and ID Judgment: The similarity between the query feature and all pre-stored reference library features is calculated through the adaptive nonlinear metric branch. A similarity threshold of 0.5 is set: if the maximum similarity score is higher than the threshold, the ID corresponding to the highest similarity is output as the final re-identification result; if the maximum similarity score is lower than the threshold, the query image is judged as a new unregistered ID, and the new ID registration can be completed by adding its feature to the reference library.
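The decision rule in step (3) can be sketched as follows, assuming the similarity scores against the reference library have already been computed. The function name `identify` and the score values are hypothetical; the text does not say how a score exactly equal to the threshold is handled, so the sketch treats it as a new ID.

```python
def identify(scores, threshold=0.5):
    """scores: dict mapping registered ID -> similarity in [0, 1].

    Returns the best-matching ID, or None when the query should be
    registered as a new ID (empty library or best score not above threshold).
    """
    if not scores:
        return None
    best_id = max(scores, key=scores.get)
    return best_id if scores[best_id] > threshold else None

known = identify({"veh_017": 0.91, "veh_102": 0.34, "veh_230": 0.08})
unknown = identify({"veh_017": 0.41, "veh_102": 0.34, "veh_230": 0.08})
```

When `None` is returned, the query feature would be added to the reference library under a fresh ID, as described in step (3).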

4.2. Datasets and Experimental Details

The experiments use the vehicle re-identification dataset JC-1, a self-established dataset collected by fixed-wing UAVs. It consists of images of the same and different vehicles captured at different times and distances. Each vehicle is captured at least twice, and there may be multiple images from one camera. In total, 681 vehicles are captured, with 5470 bounding boxes. Annotation specification: all target vehicles are manually cropped from the original UAV-captured images, and a unique ID label is assigned to each vehicle target, consistent with the standard annotation process of mainstream ReID datasets. Examples are shown in Figure 3.

The experiments also use the person re-identification dataset Market-1501 [25] and the vehicle re-identification dataset VehicleID [26], which have the same characteristics. In addition to the above datasets, part of the data from the Oxford 102 Flower, Food-101, Omniglot, and FGVC-Aircraft datasets is introduced during training (no more than 80–100 images are randomly selected for each category of the classification datasets).

4.3. Comparative Experiments

The main comparative experiments are carried out on three datasets, JC-1, Market-1501 and VehicleID, to verify the overall performance advantages of the proposed method. The experimental results are shown in Table 1. This paper compares with traditional classification-driven methods (ResNet50 + cross-entropy + Euclidean distance, Baseline1), metric learning methods (TripletLoss + Euclidean distance, Baseline2), attention-enhanced methods (Coordinate Attention + Euclidean distance, Baseline3), and advanced methods (FastReID [27], TransReID [28]).

The proposed method achieves the best mAP on all three datasets, reaching 82.1% on JC-1, 92.8% on Market-1501, and 83.6% on VehicleID. Compared with the current SOTA method, TransReID, the proposed method achieves mAP improvements of 0.4%, 1.2% and 2.4% on the three datasets respectively. The complete calculation is: 82.1% − 81.7% = 0.4% (JC-1), 92.8% − 91.6% = 1.2% (Market-1501), and 83.6% − 81.2% = 2.4% (VehicleID). The two-tailed paired t-test results show that all performance improvements meet the significance requirement of p < 0.05, indicating that the performance improvement of the proposed method is statistically significant.

Compared with the traditional metric learning baseline Baseline2, the proposed method achieves mAP improvements of 1.2% (JC-1), 9.3% (Market-1501) and 11.5% (VehicleID) on the three datasets, an average improvement of about 7.3%. Compared with Baseline3, which only adopts Coordinate Attention, the proposed method achieves mAP improvements of 0.8% (JC-1), 7.7% (Market-1501) and 9.1% (VehicleID), an average improvement of about 5.9%. These gains come from combining the NFCA module with the adaptive nonlinear metric while retaining the core structure of CA, verifying the collaborative optimization effect of the two core modules.

On the self-developed air-to-ground dataset JC-1, the proposed method improves mAP over every compared method, from 0.4 percentage points over TransReID to 4.0 percentage points over Baseline1. This indicates that the proposed method adapts to the complex environment of air-to-ground scenarios, with viewpoint changes and strong background interference.

To further verify the generalization ability of the proposed method in cross-target-type scenarios, two groups of cross-domain experiments are set: “training on pedestrian dataset → validation on vehicle dataset” and “training on vehicle dataset → validation on pedestrian dataset”. The pedestrian dataset uses Market-1501, the vehicle dataset uses the VehicleID training set, and the evaluation metric is still mAP. The experimental results are shown in Table 2.
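Since mAP is the metric used throughout these experiments, a minimal sketch of how it is typically computed for retrieval may help. The function names are hypothetical, the inputs are toy values, and the sketch omits protocol details such as the same-camera filtering used in the standard Market-1501 evaluation.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if rank i+1 is a correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_ids, gallery_ids, similarity):
    """similarity[q][g] is the score between query q and gallery item g."""
    aps = []
    for q, query_id in enumerate(query_ids):
        order = sorted(range(len(gallery_ids)),
                       key=lambda g: similarity[q][g], reverse=True)
        matches = [gallery_ids[g] == query_id for g in order]
        if any(matches):  # queries without any true match are skipped
            aps.append(average_precision(matches))
    return sum(aps) / len(aps) if aps else 0.0


# Toy example: two queries, three gallery images.
toy_map = mean_average_precision(
    query_ids=["a", "b"],
    gallery_ids=["a", "b", "a"],
    similarity=[[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]],
)
```

In the toy example the first query has AP (1/1 + 2/3)/2 and the second has AP 1/2, so the mean is 2/3.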

The results in Table 2 show that the proposed method still achieves the highest mAP in both completely cross-category training-validation settings, exceeding TransReID by 0.5 percentage points (pedestrian → vehicle) and 0.1 percentage points (vehicle → pedestrian). It should be objectively noted that the performance gain in cross-target-type scenarios is relatively limited, which is mainly due to the large distribution difference between pedestrian and vehicle targets. The results are nevertheless consistent with three observations: the proposed NFCA module captures the core discriminative features of different types of targets, the adaptive nonlinear metric adapts to the complex distribution of pedestrian and vehicle feature spaces, and the multi-source data fusion strategy reduces the cross-domain distribution difference. Together, these indicate that the proposed method has good cross-target-type generalization ability.

4.4. Ablation Experiments

To verify the indispensability and independent contribution of the two core modules of this paper (the NFCA module and the adaptive nonlinear metric), ablation experiments are carried out in which one core component at a time is removed or replaced, with the full proposed method as the reference. The experiments are conducted on the Market-1501 and JC-1 datasets, and the results are shown in Table 3.

It can be seen from the experimental results in Table 3 that:

After the NFCA module is removed and replaced with the original CA module, the mAP of the model drops from 92.8% to 89.1% on Market-1501, a decrease of 3.7 percentage points (92.8% − 89.1% = 3.7%), and from 82.1% to 81.5% on JC-1, a decrease of 0.6 percentage points. This result indicates that the NFCA module improves the discriminability of features by enhancing cross-feature semantic interaction and suppressing background noise through the neighborhood feature centralization mechanism [20,21]. Without this module, the model loses the ability to model cross-feature dependencies, and the spatial positioning accuracy decreases. The feature differences between similar samples caused by viewpoint changes and local occlusions are amplified, and the false matching rate increases.

After the proposed adaptive nonlinear metric is replaced with the traditional Euclidean distance, the mAP of the model drops from 92.8% to 87.5% on Market-1501, a decrease of 5.3 percentage points (92.8% − 87.5% = 5.3%), and from 82.1% to 79.6% on JC-1, a decrease of 2.5 percentage points. This result supports the necessity of nonlinear metrics for characterizing complex feature distributions: Euclidean distance implicitly assumes that the feature space is linearly separable, whereas the feature distribution of similar samples in real scenarios often has a complex nonlinear structure, so a linear metric introduces an inherent systematic bias. The data-driven nonlinear metric function proposed in this paper improves the accuracy of similarity judgment.
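The contrast between a fixed Euclidean distance and a learned pairwise similarity trained with binary cross-entropy can be sketched as follows. This is a hypothetical illustration, not the metric function defined in Section 3.3: the one-hidden-layer head on the absolute feature difference, the function names and all weights are assumptions made for the example.

```python
import math


def euclidean_distance(a, b):
    """Fixed metric: no trainable parameters."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_similarity(a, b, w1, b1, w2, b2):
    """Hypothetical learned metric: small ReLU network on |a - b|.

    Returns the predicted probability that a and b share an identity.
    w1 is a list of hidden-unit weight rows, b1 the hidden biases,
    w2 the output weights and b2 the output bias.
    """
    diff = [abs(x - y) for x, y in zip(a, b)]
    hidden = [max(0.0, sum(w * d for w, d in zip(row, diff)) + bias)
              for row, bias in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, label, eps=1e-12):
    """BCE for one pair; label is 1 for same identity, 0 otherwise."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Untrained (all-zero) weights give probability 0.5 and a loss of ln 2.
p = pair_similarity([1.0, 2.0], [0.5, 2.5],
                    w1=[[0.0, 0.0]], b1=[0.0], w2=[0.0], b2=0.0)
loss = binary_cross_entropy(p, label=1)
```

Because the loss is defined directly on the pair label, gradient descent on the head's weights optimizes the similarity judgment itself, whereas the Euclidean distance has nothing to fit.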

4.5. Feature Heatmap Visualization Analysis

To visually verify the ability of the NFCA module to focus on the core discriminative regions of the target and suppress background noise, and to explain the internal mechanism of the module to improve feature robustness, this paper uses the Grad-CAM method to generate feature heatmaps for comparative experiments. The experiments are conducted based on the Market-1501 person ReID dataset and the self-developed JC-1 air-to-ground vehicle ReID dataset. Heatmaps are generated for the “original ResNet34 backbone without the NFCA module” and the “full proposed network with the NFCA module” respectively, and the results are shown in Figure 4.

Combined with the core objectives of the ReID task (extracting discriminative features of the target, suppressing invalid background interference, and ensuring the robustness of features to viewpoint changes and occlusions), this paper defines a three-dimensional evaluation criterion for high-quality ReID heatmaps:

Discriminative Region Focus: The high-response regions (red/yellow highlighted areas) of the heatmap should be accurately concentrated on the core discriminative parts of the target (such as the clothing contour of pedestrians, the body and identification areas of vehicles), rather than scattered in the background or non-critical local details, to ensure the inter-class discriminability of features.

Background Noise Suppression: The response intensity of the background area should be significantly lower than that of the target main body, without large-area invalid high response, to verify the network’s ability to filter background interference and avoid invalid information dominating feature encoding.

Target Integrity: The high-response regions should cover the complete main structure of the target, rather than only focusing on scattered local details, to ensure the robustness of features to viewpoint changes and local occlusions, and avoid matching errors caused by the loss of local information.
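The three criteria above are stated qualitatively; one possible way to turn them into numbers is sketched below. The scores, the 0.5 threshold for a "high response" and the function name `heatmap_scores` are assumptions for illustration and are not defined in this paper. The sketch expects a heatmap normalized to [0, 1] and a binary target mask of the same shape.

```python
def heatmap_scores(heatmap, mask, high=0.5):
    """Illustrative scores for the three heatmap criteria.

    heatmap: 2D list of responses in [0, 1].
    mask:    2D list of bools, True where the pixel belongs to the target.
    """
    target, background = [], []
    for row_h, row_m in zip(heatmap, mask):
        for value, on_target in zip(row_h, row_m):
            (target if on_target else background).append(value)
    if not target:
        raise ValueError("mask contains no target pixels")

    total = sum(target) + sum(background)
    mean_target = sum(target) / len(target)
    mean_background = sum(background) / len(background) if background else 0.0
    return {
        # Discriminative region focus: share of total response on the target.
        "focus": sum(target) / total if total else 0.0,
        # Background noise suppression: lower ratio means quieter background.
        "background_ratio": (mean_background / mean_target
                             if mean_target else float("inf")),
        # Target integrity: fraction of target pixels with a high response.
        "integrity": sum(1 for v in target if v >= high) / len(target),
    }


toy_scores = heatmap_scores(
    heatmap=[[0.9, 0.1], [0.8, 0.2]],
    mask=[[True, False], [True, False]],
)
```

A heatmap that satisfies all three criteria would show a focus close to 1, a background ratio close to 0, and an integrity close to 1.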

Based on the above evaluation criteria, the comparative analysis of the heatmap results in Figure 4 is as follows:

For the original network without the NFCA module, the high-response regions of the heatmap are scattered, mostly concentrated in the background areas (such as the mall environment in Market-1501, the road and vegetation areas in JC-1), and only respond to scattered local parts of the target, unable to cover the complete main body of the target, which completely fails to meet the evaluation criteria of high-quality heatmaps. This also explains the significant drop in mAP after removing the NFCA module in the ablation experiment: the network is easily disturbed by background noise and cannot extract stable discriminative features of the target.

For the proposed method with the NFCA module, the performance of the heatmap fully complies with the high-quality evaluation criteria: the high-response regions are accurately focused on the target body of pedestrians and vehicles, without invalid highlighting in the background; the response regions completely cover the core discriminative structure of the target, rather than scattered parts; the response intensity of the background area is significantly suppressed.

The visualization results verify that the proposed NFCA module guides the network to focus on the core discriminative regions of the target, suppress background noise, and retain the complete structural information of the target. It improves the discriminability and robustness of features at the level of feature encoding, which is consistent with the performance improvements observed in the preceding quantitative experiments.

4.6. Generalization Verification on Classification Task

To verify the adaptability of the proposed NFCA module to low-pixel images and its versatility in other computer vision tasks, this paper carries out generalization verification experiments on the CIFAR-100 image classification dataset. The experimental configuration is as follows: Batch Size 128, total training epochs 150, initial learning rate 0.01, SGD optimizer with momentum 0.9, weight decay 5 × 10⁻⁴. The experimental results are shown in Table 4.
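The training configuration above can be captured in a small configuration object. This is a minimal sketch: the class name `Cifar100TrainConfig` and the helper `total_steps` are hypothetical, and the learning-rate schedule is not stated in the text, so it is left out. The step count assumes the standard 50,000-image CIFAR-100 training split and that the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cifar100TrainConfig:
    """Hyperparameters as listed in the text."""
    batch_size: int = 128
    epochs: int = 150
    initial_lr: float = 0.01
    optimizer: str = "SGD"
    momentum: float = 0.9
    weight_decay: float = 5e-4


def total_steps(cfg, num_train_images=50_000):
    """Number of optimizer updates over the whole run (assumption: no drop_last)."""
    steps_per_epoch = math.ceil(num_train_images / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = Cifar100TrainConfig()
n_updates = total_steps(cfg)
```

Keeping the values in one frozen object makes it easy to reuse the same settings for every attention module compared in Table 4, so that only the module itself changes between runs.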

The experimental results show that the proposed NFCA module, the core component of the ReID framework, achieves the best Top-1 accuracy on the CIFAR-100 classification task, 0.40 percentage points higher than the second-best module, CBAM (77.42% versus 77.02%). This result indicates that the proposed module not only performs well in ReID tasks but also adapts to other computer vision tasks such as image classification, and retains a good feature enhancement effect in low-pixel scenarios, which supports the versatility of the method.

In summary, the proposed method comprehensively improves the performance and generalization ability of the ReID network by enhancing feature robustness through the attention mechanism, fitting nonlinear distribution through adaptive metrics, and reducing distribution differences through multi-source data fusion.

5. Conclusions

To address the core challenges of air-to-ground target ReID, this paper proposes a novel ReID framework based on neighborhood feature centralization attention. Through the three-dimensional optimization chain of “attention mechanism-adaptive metric-multi-source data supplementation”, the framework systematically solves the bottlenecks of existing methods in air-to-ground scenarios: the NFCA module enhances cross-feature semantic interaction while retaining precise position encoding, improving the robustness of features to viewpoint changes, occlusions and illumination variations; the end-to-end nonlinear metric learning breaks through the limitations of traditional linear metrics; the multi-source data fusion strategy expands the coverage of the feature space without additional annotation cost, and enhances cross-domain generalization ability. Experimental results show that the proposed method achieves leading performance on the self-developed air-to-ground dataset JC-1 and two public ReID datasets Market-1501 and VehicleID. Sufficient ablation experiments, statistical significance tests and cross-domain tests verify the effectiveness and reliability of each core component. It should be objectively noted that the performance gain of the proposed method over SOTA methods in cross-target-type generalization scenarios is relatively limited, and the generalization ability for completely unseen target types still needs to be further improved.

Author Contributions

Conceptualization, T.Y.; methodology, T.Y. and Y.X.; software, T.Y.; validation, Y.M., H.Y. and H.X.; data curation, T.Y., A.W. and Y.X.; writing—original draft preparation, T.Y. and Y.X.; writing—review and editing, T.Y. and Y.X.; funding acquisition, A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory of Cross-Domain Flight Interdisciplinary Technology. The APC was funded by the Key Laboratory of Cross-Domain Flight Interdisciplinary Technology.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liang, X.; Rawat, Y.S. DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; IEEE/CVF: Piscataway, NJ, USA, 2025; pp. 13980–13989.
2. Nguyen, K.; Fookes, C.; Sridharan, S.; Nguyen, H.; Liu, F.; Liu, X.; Rezaei, S.; Ross, A.; Endrei, T.; DeAndres-Tame, I.; et al. AG-VPReID 2025: Aerial-Ground Video-Based Person Re-Identification Challenge Results. In Proceedings of the 2025 IEEE International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 22–25 September 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–10.
3. Wu, Y.; Wang, X.; Yang, X.; Liu, M.; Zeng, D.; Ye, H.; Li, S. Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; IEEE/CVF: Piscataway, NJ, USA, 2025; pp. 17103–17113.
4. Liu, X.; Qi, J.; Chen, C.; Bin, K.; Zhong, P. UCM-VeID V2: A Richer Dataset and A Pre-training Method for UAV Cross-Modality Vehicle Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; IEEE/CVF: Piscataway, NJ, USA, 2025; pp. 22286–22295.
5. Narayana, G.V.; Swain, S.P.; Pattnayak, D.; Pradhan, M.R.; Krishna, P.A. Intelligent Drone Patrolling with Real-Time Object Detection and GPS-Based Path Adaptation. Eng. Proc. 2026, 124, 82.
6. Xiang, H.; Wang, P.; Xu, X.; Yi, K.; Zhang, X.; Sheng, Q.Z.; Fan, W. MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion. In Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), Phoenix, AZ, USA, 12–17 February 2026; AAAI Press: Palo Alto, CA, USA, 2026; Volume 40, pp. 27037–27045.
7. Shao, M.; Zhang, Z.; Wang, Y.; Dai, Y.; Shen, X.; Wang, X. HyperD: Hybrid Periodicity Decoupling Framework for Traffic Forecasting. In Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), Phoenix, AZ, USA, 12–17 February 2026; AAAI Press: Palo Alto, CA, USA, 2026; Volume 40, pp. 15689–15697.
8. Zhong, C.; Liu, B.; Zhu, W.; Dai, D.; Jiang, Y. Phase-Aware Hierarchical Reinforcement Learning with Dynamic Human–AI Authority Allocation for Mountain Search and Rescue. Drones 2026, 10, 229.
9. Yuan, C.; Zhang, G.; Ma, C.; Zhang, T.; Niu, G. From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 24409–24418.
10. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2005; pp. 539–546.
11. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2015; pp. 815–823.
12. Wang, H.; Xiang, Y.; Gong, S.; Tao, D. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 3003–3012.
13. Wang, F.; Cheng, J.; Liu, W.; Liu, H. Additive Margin Softmax for Face Verification. IEEE Signal Process. Lett. 2018, 25, 926–930.
14. Wang, F.; Xiang, T.; Gong, S.; Zhou, X. NormFace: L2 Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 1141–1149.
15. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2021; pp. 11603–11612.
16. Li, J.; Wang, W.; Hu, X.; Yang, J. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 568–578.
17. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q. Beyond Part Models: Person Re-Identification with Refined Part Pooling (and a Strong Convolutional Baseline). In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 480–496.
18. Ganin, Y.; Lempitsky, V. Domain-Adversarial Neural Networks. J. Mach. Learn. Res. 2016, 17, 2096–2130.
19. Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Conditional Adversarial Domain Adaptation. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 1640–1650.
20. Zhang, Y.; Li, X.; Wang, X.; Liu, W. Cross-Domain Person Re-Identification Based on Feature Fusion Invariance. Appl. Sci. 2024, 14, 4644.
21. Wei, L.; Zhang, S.; Sun, Y. Person Transfer GAN to Bridge Domain Gap for Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 79–88.
22. Li, H.; Chen, J.; Zheng, A.; Wu, Y.; Luo, Y. Day-Night Cross-domain Vehicle Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 12626–12635.
23. Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A. A Theory of Learning from Different Domains. Mach. Learn. 2010, 79, 151–175.
24. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
25. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Market-1501: A Dataset and Benchmark for Large-Scale Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2015; pp. 3908–3916.
26. Liu, H.; Tian, Y.; Wang, Y.; Pang, L.; Huang, T. Deep Relative Distance Learning: Tell the Difference between Similar Vehicles. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2167–2175.
27. Ge, Y.; Song, X.; Huang, X.; Wang, X. FastReID: A Deep Learning Toolbox for Person Re-Identification. arXiv 2020, arXiv:2006.02631.
28. Li, X.; Zhu, X.; Liu, H.; Wang, X. TransReID: Transformer-Based Object Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 14035–14045.

Figure 1. (a) Training process (network architecture), (b) testing (validation) process.

Figure 2. Attention module fusing neighborhood information (NFCA module).

Figure 3. Example images of the JC-1 dataset.

Figure 4. Comparison of feature heatmaps.

Table 1. Comparative experiments and results.

Method	JC-1 (mAP)	Market-1501 (mAP)	VehicleID (mAP)
Baseline1	78.1 ± 0.23%	78.2 ± 0.18%	65.3 ± 0.27%
Baseline2	80.9 ± 0.21%	83.5 ± 0.20%	72.1 ± 0.24%
Baseline3	81.3 ± 0.19%	85.1 ± 0.17%	74.5 ± 0.22%
FastReID	81.4 ± 0.18%	91.4 ± 0.15%	78.9 ± 0.20%
TransReID	81.7 ± 0.16%	91.6 ± 0.14%	81.2 ± 0.18%
Proposed Method	82.1 ± 0.12%	92.8 ± 0.11%	83.6 ± 0.10%

Table 2. Cross-target-type generalization results.

Method	Pedestrian → Vehicle (mAP)	Vehicle → Pedestrian (mAP)
FastReID	62.5 ± 0.25%	65.3 ± 0.22%
TransReID	65.8 ± 0.21%	65.5 ± 0.20%
Proposed Method	66.3 ± 0.15%	65.6 ± 0.13%

Table 3. Ablation experiment results.

Configuration	Market-1501 (mAP)	JC-1 (mAP)
Removing NFCA Module	89.1%	81.5%
Replacing with Euclidean Distance	87.5%	79.6%

Table 4. Comparative experiments on classification tasks.

Model	CIFAR-100 Top-1 accuracy (%)
ResNet34 + CA	75.07
ResNet34 + ECA	76.91
ResNet34 + CBAM	77.02
Ours	77.42

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
