1. Introduction
As a core task in computer vision, air-to-ground target re-identification (ReID) [1, 2] aims to match targets accurately across cross-camera viewpoints, illumination variations, and object occlusions in UAV air-to-ground collaborative environments. It has critical engineering value in fields including UAV autonomous perception [3, 4], low-altitude border patrol [5], urban intelligent traffic monitoring [6, 7], and target tracking in emergency rescue [8]. Compared with conventional ReID using fixed ground cameras, air-to-ground scenarios involve high-speed maneuvering of the UAV platform, drastic changes in shooting viewpoint, large-scale background interference, and significant fluctuations in target scale, all of which pose distinctive technical challenges. Existing mainstream methods suffer severe performance degradation in these scenarios, and their core bottlenecks fall into three aspects.
First, the network is prone to focus on invalid background information, resulting in insufficient feature discriminability. In air-to-ground scenarios, rapid viewpoint switching, pixel-level dense targets, and large-scale dynamic background changes lead to significant shifts in the feature distribution of the same target across different domains. Traditional deep learning networks are often disturbed by background noise, redundant features and non-critical local regions, while ignoring the core discriminative features of the target. Existing mainstream attention mechanisms (such as Coordinate Attention, CA) can only achieve independent position encoding, lacking effective cross-feature semantic interaction and background noise suppression capabilities. The interference of invalid information greatly reduces the discriminability of features and restricts the accurate matching ability of ReID tasks.
Second, the similarity evaluation method has poor adaptability to the nonlinear feature space. Most existing methods rely on classification loss to indirectly optimize the feature space, and use manually designed linear metric functions such as Euclidean distance and cosine similarity to complete similarity evaluation. However, in air-to-ground scenarios, the high-speed movement of airborne cameras and drastic viewpoint changes significantly increase the nonlinear complexity of feature distribution. Linear metrics have inherent systematic bias for nonlinear feature structures, resulting in a contradiction between “low training error and poor actual inference performance”.
Third, the feature space coverage of training data is insufficient, leading to weak cross-domain generalization ability. Traditional methods are highly dependent on finely annotated ReID datasets with unique ID labels. However, in air-to-ground ReID scenarios, the acquisition of high-quality annotated samples is difficult and costly. Limited training data can hardly cover the feature diversity of real scenarios, leading to a large gap between the training distribution and the real scenario distribution, as well as high generalization error. Existing cross-domain methods mostly rely on adversarial domain adaptation mechanisms, which require a large amount of target domain data for support, and cannot adapt to open air-to-ground scenarios with unknown target distributions.
To address the above core challenges of air-to-ground target ReID, which are not fully resolved by existing attention-based and metric learning-based ReID methods, this paper proposes a novel end-to-end deep metric learning framework based on Neighborhood Feature Centralization Attention (NFCA). Unlike existing works that rely on incremental optimization of single modules, our framework constructs a three-dimensional collaborative optimization chain of “attention mechanism-adaptive metric-multi-source data supplementation”, which fundamentally breaks through the three core bottlenecks of existing methods in air-to-ground scenarios. The core contributions and fundamental technical distinctions of this paper compared with state-of-the-art attention-based and metric learning approaches are as follows:
A lightweight, parameter-free Neighborhood Feature Centralization Attention (NFCA) [9] module is proposed, which fundamentally breaks the inherent trade-off of existing attention mechanisms. Unlike Coordinate Attention (CA) that only realizes independent position encoding without cross-feature semantic interaction, and other mainstream attention modules (CBAM, ECA) that sacrifice position accuracy for channel correlation modeling, NFCA introduces the Neighborhood Feature Centralization (NFC) mechanism on the basis of CA. It retains precise position encoding while enhancing cross-feature semantic interaction, and achieves intra-class feature compactness and background noise suppression through neighborhood statistics without additional trainable parameters, significantly improving the robustness of features to extreme viewpoint changes, occlusions and illumination fluctuations in air-to-ground scenarios.
A data-driven adaptive nonlinear metric learning paradigm is constructed, which achieves full end-to-end alignment between training optimization objectives and inference task goals. Different from existing metric learning methods that rely on manually designed linear metrics (Euclidean distance, cosine similarity) with inherent systematic bias for nonlinear feature spaces, or use classification loss as a proxy task to indirectly optimize feature distribution, our method realizes direct end-to-end optimization of sample pair similarity via BCEWithLogitsLoss. It eliminates both the proxy task bias of classification loss and the systematic bias of linear metrics, and can accurately fit the complex nonlinear manifold structure of the ReID feature space in air-to-ground scenarios.
A label-free multi-source data fusion training strategy is proposed, which breaks the dependence of existing cross-domain ReID methods on target domain data and manual annotation. Without any additional manual labeling, we automatically construct positive and negative sample pairs from general classification datasets following the same pairing rules as ReID datasets, and fuse them with ReID data for training. This strategy expands the coverage of the feature space, narrows the distribution gap between training data and real air-to-ground scenarios, and enhances cross-domain generalization without relying on target domain data required by adversarial domain adaptation methods.
A dedicated UAV air-to-ground vehicle ReID dataset JC-1 is constructed, which covers complex variations of extreme viewpoints, shooting distances, illumination conditions and day-night periods. It fills the lack of dedicated benchmarks for air-to-ground target ReID, and provides a standardized test platform for follow-up research in this field.
3. Cross-Domain Generalization Framework Based on Deep Metric Learning
This paper proposes an air-to-ground re-identification framework based on neighborhood feature centralization attention, which accurately identifies different target objects in cross-camera air-to-ground scenarios by jointly optimizing the feature space and the adaptive similarity metric function. The following subsections introduce the overall network architecture, the attention module fusing neighborhood information, and the data-driven similarity judgment method.
3.1. Overall Network Architecture
The core of the method includes three parts: a feature extraction module, an attention module, and a similarity evaluation module, where the attention module is plug-and-play. Different from previous re-identification methods that use output categories to assist convergence, classification datasets are added during training to expand the data volume and enhance the generalization ability of the network. The overall method is shown in Figure 1.
As shown in Figure 1a, the end-to-end collaborative training process of the NFCA module and the similarity learning branch is as follows: (1) Sample Pair Construction: The training dataset is divided into positive and negative sample pairs, where positive pairs are two samples of the same ID, and negative pairs are two samples of different IDs. (2) Weight-Shared Feature Extraction: The sample pairs are synchronously input into the Siamese network with shared weights, and the ResNet34 backbone outputs initial feature pairs. (3) NFCA Module for Discriminative Feature Enhancement: The initial feature pairs are input into the plug-and-play NFCA module, which suppresses background noise, enhances cross-feature semantic consistency, and outputs robust, high-discriminability attention feature pairs. (4) End-to-End Similarity Optimization: The attention feature pairs output by NFCA are directly fed into the adaptive nonlinear similarity learning branch. The similarity score of the feature pair is calculated, and the network is trained end-to-end with BCEWithLogitsLoss. The gradient generated by the loss is back-propagated to both the similarity branch and the NFCA module, guiding the attention module to focus more on the core discriminative regions that are critical for similarity matching.
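Step (1) can be illustrated with a minimal sketch, assuming samples are referenced by index and carry an ID label. The function name `build_pairs`, the toy IDs, and the choice to subsample negatives to match the number of positives are illustrative assumptions, not details stated in this paper.

```python
import itertools
import random


def build_pairs(ids, seed=0):
    """Return (index_a, index_b, label) triples: label 1 = same ID, 0 = different ID."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(len(ids)), 2))
    positives = [(a, b, 1) for a, b in all_pairs if ids[a] == ids[b]]
    negatives = [(a, b, 0) for a, b in all_pairs if ids[a] != ids[b]]
    # Assumption: keep as many negatives as positives so the binary labels are balanced.
    negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + negatives


# Hypothetical toy labels: two vehicles seen twice, one vehicle seen once.
sample_ids = ["car_01", "car_01", "car_02", "car_02", "car_03"]
pairs = build_pairs(sample_ids)
```

The binary label of each pair is what a pairwise loss such as BCEWithLogitsLoss would consume in step (4).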
The test and validation process in Figure 1b is as follows. The query image with unknown ID and the gallery set with registered IDs first undergo the same preprocessing as in the training phase and are then input into the trained Siamese network with fixed, shared weights. Feature encoding is completed by the ResNet34 backbone integrated with the NFCA attention module, the pairwise similarity between the query feature and every gallery feature is calculated by the jointly trained adaptive nonlinear metric branch, and the ID with the maximum similarity is output as the re-identification result. Because the architecture is end-to-end, feature extraction and similarity calculation are completed in a single forward pass; the shared weights of the Siamese network support real-time generation of standardized features for input images, and the joint optimization of the metric branch and the feature extraction backbone avoids the adaptation deviation caused by pre-extracted features. Therefore, the native algorithm does not require gallery features to be extracted in advance; pre-extraction is only an optional engineering optimization that improves inference efficiency in large-scale industrial deployment, not a necessary step of the test process.
During training, the proposed framework can use not only re-identification datasets but also general classification datasets to improve generalization. Positive and negative sample pairs are constructed automatically from the classification datasets following exactly the same pairing rules as for ReID datasets (samples from the same category form positive pairs, and samples from different categories form negative pairs), without any additional manual annotation. Sample pairs from ReID datasets and classification datasets are mixed at a fixed ratio of 7:3 in each training batch to balance task alignment and feature space coverage expansion, which enhances the model’s cross-domain generalization performance in complex air-to-ground scenarios.
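The 7:3 mixing rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mixed_batch` and the two toy pools are hypothetical, the batch size of 128 is borrowed from the training configuration, and rounding 0.7 × 128 to 90 ReID pairs is our own choice, since the exact rounding is not specified.

```python
import random


def mixed_batch(reid_pairs, cls_pairs, batch_size=128, reid_ratio=0.7, seed=0):
    """Draw one training batch with a fixed share of ReID and classification pairs."""
    rng = random.Random(seed)
    n_reid = round(batch_size * reid_ratio)  # 90 of 128 (rounding is an assumption)
    n_cls = batch_size - n_reid              # remaining 38 pairs
    batch = rng.sample(reid_pairs, n_reid) + rng.sample(cls_pairs, n_cls)
    rng.shuffle(batch)                       # avoid a fixed source order inside the batch
    return batch


# Hypothetical pools of pre-built sample pairs, tagged by their source.
reid_pool = [("reid", i) for i in range(1000)]
cls_pool = [("cls", i) for i in range(1000)]
batch = mixed_batch(reid_pool, cls_pool)
```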
3.2. Attention Module Fusing Neighborhood Information
Re-identification not only requires deep, distinctive high-dimensional channel information to distinguish different categories but also needs to fully utilize the captured position information to enable the network to find regions of interest. Therefore, to obtain more robust re-identification features, we propose an attention module fusing neighborhood feature information, which can not only accurately capture regions of interest but also effectively capture the relationships between features.
The architecture and complete data flow of the attention module fusing neighborhood information (NFCA) are shown in Figure 2. The calculation process is divided into two core steps:
Step 1: Neighborhood Feature Centralization with Position Encoding. For the input feature $x$ extracted by ResNet34, the neighborhood-centralized position-aware feature is calculated via Equation (1):
$$y' = \mathrm{NFC}\left(\left[\mathrm{GAP}_h(x),\ \mathrm{GAP}_w(x)\right]\right) \tag{1}$$
where $x$ is the feature extracted by ResNet34; $[\cdot,\cdot]$ denotes feature concatenation; $\mathrm{GAP}_h$ and $\mathrm{GAP}_w$ are 2D global average pooling (GAP) operations, after which the feature sizes are [B, C, W, 1] and [B, C, 1, H] respectively; $\mathrm{NFC}(\cdot)$ is the neighborhood feature centralization operation; and $y'$ is the feature after neighborhood feature centralization.
Step 2: Attention Weight Generation and Feature Re-calibration. Based on the feature obtained in Step 1, the weighted attention feature is calculated via Equation (2):
$$y'' = x \times \mathrm{Conv}(y'_h) \times \mathrm{Conv}(y'_w) \tag{2}$$
where $y'_h$ and $y'_w$ are the features split from the Step 1 feature $y'$ along the channel dimension, with sizes [B, C, W, 1] and [B, C, 1, H] respectively; $\mathrm{Conv}$ is a convolution layer used to construct attention; $x$ is the original feature; and the weighted attention feature $y''$ is obtained by matrix multiplication of $x$ with $y'_h$ and $y'_w$ after attention construction.
On the basis of Coordinate Attention, this paper introduces NFC, continues the spatial direction aggregation strategy of Coordinate Attention, generates feature maps along the horizontal and vertical directions respectively, and captures long-range spatial dependencies and precise position information. This paper uses the NFC mechanism for multi-feature interaction, performs NFC operations on the horizontal and vertical feature maps respectively, counts the horizontal and vertical attention neighborhoods of each feature in the feature dimension, realizes the enhancement of similar features between features, and forces similar features of different domains to converge to the local center through cross-feature sharing of neighborhood statistics, enhancing the semantic consistency of cross-features. At the same time, background noise suppression is performed to make the network more focused on the target. Finally, dimension adaptation and weight generation are carried out; the dimension is adjusted through the convolution layer, attention weights are generated, and the weighted attention feature y′′ is obtained by multiplying with the original feature.
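The directional aggregation and re-weighting described above can be sketched for a single channel in plain Python. This is only an illustration of the pooling-and-gating skeleton inherited from Coordinate Attention: the NFC operation and the convolution layer are replaced by the identity, and the Sigmoid gate is an assumption borrowed from Coordinate Attention rather than a detail stated here. The function name `directional_attention` and the toy feature map are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def directional_attention(fmap):
    """fmap: one channel given as a list of H rows, each a list of W floats."""
    h, w = len(fmap), len(fmap[0])
    # Average pooling along the width gives one descriptor per row.
    row_desc = [sum(row) / w for row in fmap]
    # Average pooling along the height gives one descriptor per column.
    col_desc = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    # Placeholder: NFC and the convolution layer would act on the descriptors here
    # (identity in this sketch); a Sigmoid gate is assumed for weight generation.
    row_gate = [sigmoid(v) for v in row_desc]
    col_gate = [sigmoid(v) for v in col_desc]
    # Re-calibrate every position with its row weight and its column weight.
    return [[fmap[i][j] * row_gate[i] * col_gate[j] for j in range(w)]
            for i in range(h)]


# Hypothetical 3 x 3 map with a single strong response in the centre.
toy = [[0.0, 0.0, 0.0],
       [0.0, 4.0, 0.0],
       [0.0, 0.0, 0.0]]
out = directional_attention(toy)
```

Because each position is scaled by one row weight and one column weight, the response keeps its exact location, which is the position-preserving property the module relies on.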
The advantages of this attention are mainly concentrated in the following three aspects: (1) No training parameters, lightweight and efficient. NFC does not need to learn convolution kernel parameters, and only realizes feature transformation through neighborhood statistics. It is especially suitable for lightweight models or training-free inference scenarios, avoiding the risk of overfitting. (2) Enhancing intra-class compactness. NFC forces similar features to converge to the mean through local neighborhood centralization. (3) Retaining spatial position accuracy. NFC acts on the k × k neighborhood, which can retain position details and is more suitable for accurately locating regions of interest. In cross-view re-identification scenarios, NFC has stronger robustness to local feature offsets caused by viewpoint changes, and maintains feature stability through neighborhood adaptive standardization. By introducing NFC, the attention module not only maintains the advantages of global receptive field and position encoding of Coordinate Attention but also strengthens the semantic consistency of cross-features in a parameter-free manner, improving the discriminability and robustness of re-identification features.
3.3. Data-Driven Similarity Judgment Method
The optimization mechanism of the feature space fundamentally determines the upper limit of the discriminative power of ReID tasks. The core difference between our method and traditional approaches lies in the design logic of optimization objectives and metric learning:
Traditional methods: Indirectly constrain feature distribution via classification loss (proxy task), and use manually designed linear metrics (Euclidean distance, cosine similarity) for similarity evaluation, which have inherent systematic bias for the nonlinear feature space in air-to-ground scenarios.
Proposed method: The feature extraction backbone, NFCA module and similarity learning branch are jointly trained in an end-to-end manner, with direct optimization for the core task goal of sample pair similarity matching. Traditional linear metrics implicitly assume that the feature space is a linearly separable Euclidean space, which cannot fit the complex nonlinear manifold structure formed by the feature distribution of the same target under drastic viewpoint changes in air-to-ground scenarios, resulting in systematic measurement bias. To solve this problem, this paper constructs a data-driven nonlinear metric function, whose calculation method is shown in Equation (3):
$$S(f_1, f_2) = \sigma\left(\mathrm{FC}\left(\left|f_1 - f_2\right|\right)\right) \tag{3}$$
where $f_1$ and $f_2$ are the two features output by the Siamese network, $\mathrm{FC}$ is the fully connected layer, and $\sigma$ is the Sigmoid function. The difference between the feature pair is represented by the absolute difference $|f_1 - f_2|$; the fully connected layer provides the data-driven similarity evaluation standard, and the Sigmoid function finally maps the metric result to the [0, 1] interval to directly obtain the similarity $S$.
The advantages of the proposed method based on direct optimization of sample pairs are reflected in two aspects: (1) Direct alignment of tasks. Instead of relying on categories for indirect constraints, it directly optimizes the similarity of sample pairs, achieving alignment between the output during training and the output during use, and can accurately capture subtle differences between similar samples in terms of posture and local occlusion, as well as indistinguishable cases of dissimilar samples; (2) Nonlinear metric capability. The fully connected network can learn complex similarity functions to adapt to the nonlinear distribution of the feature space, breaking through the limitations of traditional manual metric methods such as Euclidean distance and cosine distance.
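The pairwise scoring and its training loss can be sketched in plain Python. The sketch assumes a single linear layer on the absolute feature difference followed by a Sigmoid; the weights, bias and 4-dimensional toy features are hypothetical values chosen for illustration (the paper's features are 512-dimensional), and `bce_with_logits` is the standard numerically stable form of binary cross-entropy on a raw logit.

```python
import math


def similarity_logit(f1, f2, weights, bias):
    """Linear layer applied to the element-wise absolute difference |f1 - f2|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, f1, f2)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, target):
    """Stable binary cross-entropy on logit z with target 1 (same ID) or 0."""
    return max(z, 0.0) - z * target + math.log1p(math.exp(-abs(z)))


# Hypothetical parameters and features, for illustration only.
weights = [-1.0, -2.0, -0.5, -1.5]
bias = 2.0
f_a = [0.2, 0.4, 0.1, 0.9]
f_b = [0.2, 0.5, 0.1, 0.7]   # close to f_a
f_c = [0.9, 1.6, 1.1, 0.0]   # far from f_a

score_same = sigmoid(similarity_logit(f_a, f_b, weights, bias))
score_diff = sigmoid(similarity_logit(f_a, f_c, weights, bias))
loss_same = bce_with_logits(similarity_logit(f_a, f_b, weights, bias), 1.0)
```

Using the absolute difference makes the score symmetric in its two inputs, so the order of query and gallery features does not matter.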
4. Experimental Verification and Analysis
This section systematically verifies the effectiveness, advancement and generalization ability of the proposed method through comparative experiments, ablation experiments, visualization analysis and cross-task generalization verification. Quantitative and qualitative analyses of the experimental results are carried out to clarify the contribution of each core module. Every experiment is repeated 5 times independently with different random seeds, and the results are reported as mean ± standard deviation. The statistical significance of each performance improvement is verified by a two-tailed paired t-test at the significance level p < 0.05.
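The reporting protocol can be sketched as follows. The five mAP values per method are hypothetical placeholders, not results from this paper, and the sketch assumes the two methods are paired by random seed. With 5 runs the test has 4 degrees of freedom, for which the two-tailed critical value at the 0.05 level is 2.776.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """t statistic of the paired t-test on per-seed differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


T_CRIT_DF4 = 2.776  # two-tailed critical value, alpha = 0.05, 4 degrees of freedom

# Hypothetical per-seed mAP values (%), for illustration only.
ours = [82.0, 82.3, 81.9, 82.2, 82.1]
baseline = [81.6, 81.8, 81.7, 81.6, 81.8]

mean_ours, std_ours = mean_std(ours)
t_value = paired_t_statistic(ours, baseline)
significant = abs(t_value) > T_CRIT_DF4
```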
4.1. Preprocessing and Training Configuration Details
To ensure the reproducibility of the study, the full details of image preprocessing, backbone configuration, training hyperparameters and multi-source data fusion strategy are clearly specified as follows:
Image Preprocessing: All input images are uniformly resized to 224 × 224 pixels, followed by random horizontal flipping (flipping probability 0.5), random erasing (erasing probability 0.5, erasing ratio 0.02–0.4), and normalization consistent with ImageNet pretraining, with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
Backbone Configuration: ResNet34 is adopted as the backbone network. The stride of the last downsampling layer of the backbone is modified from 2 to 1 to retain more detailed features, and the dimension of the final output feature is 512.
Training Hyperparameter Settings: AdamW is used as the optimizer with a weight decay of 5 × 10⁻⁴; the initial learning rate is 3 × 10⁻⁴, and a step decay strategy is adopted, with the learning rate multiplied by 0.1 every 20 epochs; the batch size is set to 128; the total number of training epochs is 120 for all datasets; all experiments are implemented in PyTorch 2.0 and run on a single NVIDIA A40 GPU.
Multi-source Data Fusion Strategy: During training, each batch contains 70% samples from ReID datasets and 30% samples from general classification datasets, to balance task alignment and feature space expansion. For classification datasets, samples of the same category automatically form positive sample pairs and samples of different categories automatically form negative sample pairs, which is fully consistent with the pair construction rules of ReID datasets and requires no additional manual annotation. For each category of a classification dataset, 80–100 images are randomly selected for training to avoid category imbalance. The classification datasets participate only in the training phase and are completely independent of the unknown-ID processing in the test phase.
Inference Pipeline Configuration (Corresponding to Figure 1b): The inference process is completely fixed to ensure the consistency of results:
- (1) Reference Library Pre-extraction: Before formal inference, all images with registered IDs in the reference library are input into the trained network at one time to extract 512-dimensional features, which are pre-stored in the local feature library. During the inference process, there is no need to re-input the reference library images into the backbone network for repeated calculation, which greatly improves the inference efficiency.
- (2) Query Image Processing: For an input unknown ID query image, the same preprocessing as the training phase is performed, and then input into the trained network to extract the query feature.
- (3) Similarity Matching and ID Judgment: The similarity between the query feature and all pre-stored reference library features is calculated through the adaptive nonlinear metric branch. A similarity threshold of 0.5 is set: if the maximum similarity score is higher than the threshold, the ID corresponding to the highest similarity is output as the final re-identification result; if the maximum similarity score is lower than the threshold, the query image is judged as a new unregistered ID, and the new ID registration can be completed by adding its feature to the reference library.
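The decision rule in step (3) can be sketched as follows, assuming the similarity scores against the reference library have already been computed and reduced to one score per registered ID. The function name `identify`, the dictionary layout and the toy scores are illustrative assumptions; returning `None` stands for "new unregistered ID".

```python
def identify(scores_by_id, threshold=0.5):
    """scores_by_id maps each registered ID to its similarity score in [0, 1]."""
    if not scores_by_id:
        return None  # empty reference library: every query is a new ID
    best_id = max(scores_by_id, key=scores_by_id.get)
    if scores_by_id[best_id] > threshold:
        return best_id
    return None  # best match is too weak: treat the query as a new, unregistered ID


# Hypothetical similarity scores for two queries against three registered vehicles.
known_query = identify({"vehicle_007": 0.12, "vehicle_031": 0.91, "vehicle_102": 0.40})
new_query = identify({"vehicle_007": 0.12, "vehicle_031": 0.33, "vehicle_102": 0.40})
```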
4.2. Datasets and Experimental Details
The experiments use the vehicle re-identification dataset JC-1, a self-built dataset collected by fixed-wing UAVs that contains the same and different vehicles captured at different times and distances. Each vehicle is captured at least twice, and one camera may contain multiple images of the same vehicle. In total, 681 vehicles are captured, with 5470 bounding boxes. Annotation specification: all target vehicles are manually cropped from the original UAV-captured images, and a unique ID label is assigned to each vehicle target, which is consistent with the standard annotation process of mainstream ReID datasets. Examples are shown in Figure 3.
The experiments also use the person re-identification dataset Market-1501 [25] and the vehicle re-identification dataset VehicleID [26], which have the same characteristics. In addition to the above datasets, part of the data from the Oxford 102 Flower, Food-101, Omniglot, and FGVC-Aircraft datasets is introduced during training (80–100 images are randomly selected for each category of the classification datasets).
4.3. Comparative Experiments
The main comparative experiments are carried out on three datasets, JC-1, Market-1501 and VehicleID, to verify the overall performance advantages of the proposed method. The experimental results are shown in Table 1. The proposed method is compared with traditional classification-driven methods (ResNet50 + cross-entropy + Euclidean distance, Baseline1), metric learning methods (TripletLoss + Euclidean distance, Baseline2), attention-enhanced methods (Coordinate Attention + Euclidean distance, Baseline3), and advanced methods (FastReID [27], TransReID [28]).
The proposed method achieves the best mAP on all three datasets: 82.1% on JC-1, 92.8% on Market-1501, and 83.6% on VehicleID. Compared with the current SOTA method, TransReID, it improves mAP by 0.4, 1.2 and 2.4 percentage points on the three datasets respectively. The complete calculation is: 82.1% − 81.7% = 0.4% (JC-1), 92.8% − 91.6% = 1.2% (Market-1501), and 83.6% − 81.2% = 2.4% (VehicleID). The two-tailed paired t-test shows that all improvements meet the significance requirement of p < 0.05, indicating that the performance improvement of the proposed method is statistically significant.
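For readers unfamiliar with the metric, mAP is commonly computed as the mean, over all queries, of the average precision of the ranked gallery. The sketch below shows this standard definition under simplifying assumptions: each query's gallery is already sorted by descending similarity and reduced to match / non-match flags, and protocol details such as excluding same-camera gallery images are omitted. The function names and the toy rankings are illustrative, and this is not necessarily the exact evaluation code used for Table 1.

```python
def average_precision(ranked_matches):
    """ranked_matches: booleans for the gallery sorted by descending similarity."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at each correct match
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_ranked_matches) / len(all_ranked_matches)


# Hypothetical rankings for two queries.
query_rankings = [[True, False, True], [False, True]]
map_value = mean_average_precision(query_rankings)
```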
Compared with the traditional metric learning baseline Baseline2, the proposed method achieves mAP improvements of 1.2% (JC-1), 9.3% (Market-1501), 11.5% (VehicleID) on the three datasets, with an average improvement of 8.9% (about 9%). Compared with Baseline3 which only adopts Coordinate Attention, the proposed method achieves mAP improvements of 0.8% (JC-1), 7.7% (Market-1501), 9.1% (VehicleID), with an average improvement of 6.8% (about 7%) through the combination of the NFCA module and the adaptive nonlinear metric while retaining the core structure of CA, verifying the collaborative optimization effect of the two core modules.
On the self-developed air-to-ground scenario JC-1 dataset, the proposed method achieves stable performance improvement compared with all baselines, proving that the proposed method can effectively adapt to the complex environment of air-to-ground scenarios with viewpoint changes and strong background interference, and has excellent scenario adaptability.
To further verify the generalization ability of the proposed method in cross-target-type scenarios, two groups of cross-domain experiments are set: “training on pedestrian dataset → validation on vehicle dataset” and “training on vehicle dataset → validation on pedestrian dataset”. The pedestrian dataset uses Market-1501, the vehicle dataset uses the VehicleID training set, and the evaluation metric is still mAP. The experimental results are shown in Table 2.
The cross-target-type generalization experimental results show that the proposed method still achieves the best generalization performance in the completely cross-category training-validation scenario, with 0.5% and 0.1% mAP improvements over TransReID in the two groups of experiments respectively. It should be objectively noted that the performance gain in cross-target-type scenarios is relatively limited, which is mainly due to the large distribution difference between pedestrian and vehicle targets. This result verifies that the proposed NFCA module can effectively capture the core discriminative features of different types of targets, the adaptive nonlinear metric can adapt to the complex distribution of pedestrian and vehicle feature spaces, and the multi-source data fusion strategy effectively reduces the cross-domain distribution difference, proving that the proposed method has good cross-target-type generalization ability.
4.4. Ablation Experiments
To verify the indispensability and independent contribution of the two core modules (NFCA module, adaptive nonlinear metric) of this paper, ablation experiments are carried out by removing core modules and replacing key components with the full proposed method as the benchmark. The experiments are conducted on Market-1501 and JC-1 datasets, and the results are shown in Table 3.
It can be seen from the experimental results in Table 3 that:
After removing the NFCA module and replacing it with the original CA module, the mAP of the model drops from 92.8% to 89.1% on Market-1501, a decrease of 3.7% (92.8% − 89.1% = 3.7%), and from 82.1% to 81.5% on JC-1, a decrease of 0.6%. This result indicates that the NFCA module significantly improves the discriminability of features by enhancing cross-feature semantic interaction and suppressing background noise through the neighborhood feature centralization mechanism [20, 21]. After removing this module, the model loses the ability to model cross-feature dependencies, and the spatial positioning accuracy decreases. The feature differences between similar samples caused by viewpoint changes and local occlusions are amplified, and the false matching rate increases significantly.
After replacing the proposed adaptive nonlinear metric with the traditional Euclidean distance, the mAP of the model drops from 92.8% to 87.5% on Market-1501, a decrease of 5.3% (92.8% − 87.5% = 5.3%), and from 82.1% to 79.6% on JC-1, a decrease of 2.5%. This result directly supports the necessity of nonlinear metrics for characterizing complex feature distributions: Euclidean distance implicitly assumes that the feature space is linearly separable, while the feature distribution of similar samples in real scenarios often presents a complex nonlinear structure, so linear metrics produce inherent systematic bias. The data-driven nonlinear metric function proposed in this paper can effectively improve the accuracy of similarity judgment.
4.5. Feature Heatmap Visualization Analysis
To visually verify the ability of the NFCA module to focus on the core discriminative regions of the target and suppress background noise, and to explain the internal mechanism of the module to improve feature robustness, this paper uses the Grad-CAM method to generate feature heatmaps for comparative experiments. The experiments are conducted based on the Market-1501 person ReID dataset and the self-developed JC-1 air-to-ground vehicle ReID dataset. Heatmaps are generated for the “original ResNet34 backbone without the NFCA module” and the “full proposed network with the NFCA module” respectively, and the results are shown in Figure 4.
Combined with the core objectives of the ReID task (extracting discriminative features of the target, suppressing invalid background interference, and ensuring the robustness of features to viewpoint changes and occlusions), this paper defines a three-dimensional evaluation criterion for high-quality ReID heatmaps:
Discriminative Region Focus: The high-response regions (red/yellow highlighted areas) of the heatmap should be accurately concentrated on the core discriminative parts of the target (such as the clothing contour of pedestrians, the body and identification areas of vehicles), rather than scattered in the background or non-critical local details, to ensure the inter-class discriminability of features.
Background Noise Suppression: The response intensity of the background area should be significantly lower than that of the target main body, without large-area invalid high response, to verify the network’s ability to filter background interference and avoid invalid information dominating feature encoding.
Target Integrity: The high-response regions should cover the complete main structure of the target, rather than only focusing on scattered local details, to ensure the robustness of features to viewpoint changes and local occlusions, and avoid matching errors caused by the loss of local information.
Based on the above evaluation criteria, the comparative analysis of the heatmap results in Figure 4 is as follows:
For the original network without the NFCA module, the high-response regions of the heatmap are scattered, mostly concentrated in the background areas (such as the mall environment in Market-1501, the road and vegetation areas in JC-1), and only respond to scattered local parts of the target, unable to cover the complete main body of the target, which completely fails to meet the evaluation criteria of high-quality heatmaps. This also explains the significant drop in mAP after removing the NFCA module in the ablation experiment: the network is easily disturbed by background noise and cannot extract stable discriminative features of the target.
For the proposed method with the NFCA module, the performance of the heatmap fully complies with the high-quality evaluation criteria: the high-response regions are accurately focused on the target body of pedestrians and vehicles, without invalid highlighting in the background; the response regions completely cover the core discriminative structure of the target, rather than scattered parts; the response intensity of the background area is significantly suppressed.
The visualization results verify that the proposed NFCA module can effectively guide the network to focus on the core discriminative regions of the target, suppress background noise, and retain the complete structural information of the target. It improves the discriminability and robustness of features from the underlying logic of feature encoding, forming a complete logical closed loop with the performance improvement of the previous quantitative experiments.
4.6. Generalization Verification on Classification Task
To verify the adaptability of the proposed NFCA module to low-pixel images and its versatility in other computer vision tasks, this paper carries out generalization verification experiments on the CIFAR-100 image classification dataset. The experimental configuration is as follows: Batch Size 128, total training epochs 150, initial learning rate 0.01, SGD optimizer with momentum 0.9, weight decay 5 × 10⁻⁴. The experimental results are shown in Table 4.
It can be seen from the experimental results that, as the core component of the ReID framework, the proposed NFCA module achieves the best Top-1 accuracy on the CIFAR-100 classification task, with an improvement of 0.4% compared with the second-best CBAM module. This result proves that the proposed method not only has excellent performance in ReID tasks but also can adapt to other computer vision tasks such as image classification, and still has a good feature enhancement effect in low-pixel scenarios, verifying the versatility and advancement of the method.
In summary, the proposed method comprehensively improves the performance and generalization ability of the ReID network by enhancing feature robustness through the attention mechanism, fitting nonlinear distribution through adaptive metrics, and reducing distribution differences through multi-source data fusion.