Article
Peer-Review Record

Object Re-Identification Method for Air-to-Ground Targets Based on Neighborhood Feature Centralization Attention

Computation 2026, 14(5), 96; https://doi.org/10.3390/computation14050096
by Tian Yao, Yong Xu *, Yue Ma, Hongtao Yan, Haihang Xu and An Wang
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 19 March 2026 / Revised: 15 April 2026 / Accepted: 20 April 2026 / Published: 22 April 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

- The objectives of the study are generally presented; however, the novelty and the key contributions are not sufficiently clarified, and the distinction from existing approaches remains unclear.

- The overall architecture and training pipeline are described; however, important implementation details, including hyperparameter settings, pre-processing steps, and training configurations, are insufficiently specified, which limits the reproducibility of the study.

- The experimental results are reported mainly using mAP; however, statistical validation (e.g., significance testing, confidence intervals, or variance analysis) is lacking, and thus the reliability of the reported improvements is not adequately supported.

- Figures such as Figure 1 and Figure 2 help illustrate the proposed framework; however, Tables 1 and 2 contain partially redundant information, and the presentation of results is not sufficiently concise or effective in highlighting key findings.

- The conclusions are generally consistent with the experimental results; however, the claims regarding generalization appear somewhat overstated given the relatively limited performance gains.

- While the integration of the NFCA module with a data-driven metric learning framework is a meaningful approach, the technical distinction from existing methods and the significance of the performance improvements are not sufficiently convincing.

- The manuscript is generally structured; however, some sections are repetitive, and the experimental analysis is overly detailed, which reduces overall readability.

Author Response

Thank you very much for your professional, detailed and highly constructive comments on our manuscript. We have comprehensively and systematically revised and improved the manuscript in strict accordance with all your comments, and all revisions have been embedded in the corresponding sections of the main text. The core revisions include:

Comment1:The objectives of the study are generally presented; however, the novelty and the key contributions are not sufficiently clarified, and the distinction from existing approaches remains unclear.

Answer1: Dear Reviewer, thank you very much for your valuable review comments. We agree with the issues you pointed out and have made targeted revisions to the Abstract and Introduction. In the Abstract, we condensed the four innovation points of this paper, which address the three core bottlenecks of air-to-ground target re-identification. In the Introduction, we reviewed the technical limitations of existing mainstream methods in air-to-ground scenarios, explained how the three core technical contributions of this paper (the proposed NFCA module, the adaptive nonlinear metric function, and the multi-source data fusion training strategy) differ from existing approaches, and added recent works from top conferences and journals in the field (2024 to 2025) as references, so that the novelty, the core contributions, and the distinctions from existing approaches are stated explicitly.

Comment2:The overall architecture and training pipeline are described; however, important implementation details, including hyperparameter settings, pre-processing steps, and training configurations, are insufficiently specified, which limits the reproducibility of the study.

Answer2: We fully accept your comment and have added the details needed for reproducibility in Section 4.1 Preprocessing and Training Configuration Details of the manuscript. The specific revisions are as follows:

  1. We specified the fixed image preprocessing parameters: input images are uniformly resized to 224×224 pixels, random horizontal flipping is applied with probability 0.5, and random erasing is applied with probability 0.5 and an erasing ratio of 0.02-0.4; we also stated the normalization mean and standard deviation, which are consistent with ImageNet pre-training;
  2. We refined the backbone network configuration: ResNet34 is used as the backbone, the stride of the last downsampling layer is changed from 2 to 1, and the final output feature dimension is 512;
  3. We added the training hyperparameters: AdamW is used as the optimizer with weight decay 5e-4 and initial learning rate 3e-4, the learning rate is multiplied by 0.1 every 20 epochs, the batch size is 128, and training runs for 120 epochs in total. We also stated that the experiments are implemented in PyTorch 2.0 and run on a single NVIDIA A40 GPU.
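
For readers who want the settings in one place, the values listed above can be gathered into a single configuration, sketched here in plain Python. The dictionary keys and the `step_lr` helper are illustrative names and not the authors' code. The normalization values are the commonly used ImageNet statistics, which the response refers to without listing, and the 0-indexed epoch convention is likewise an assumption.

```python
# Hypothetical summary of the settings quoted in the response; key names are illustrative.
CONFIG = {
    "input_size": (224, 224),
    "hflip_prob": 0.5,
    "erase_prob": 0.5,
    "erase_area_ratio": (0.02, 0.4),
    # Assumption: standard ImageNet statistics; the response does not list the values.
    "norm_mean": (0.485, 0.456, 0.406),
    "norm_std": (0.229, 0.224, 0.225),
    "backbone": "ResNet34",
    "last_stride": 1,
    "feature_dim": 512,
    "optimizer": "AdamW",
    "weight_decay": 5e-4,
    "base_lr": 3e-4,
    "lr_decay_factor": 0.1,
    "lr_decay_every": 20,
    "batch_size": 128,
    "epochs": 120,
}


def step_lr(epoch: int, cfg: dict = CONFIG) -> float:
    """Learning rate for a 0-indexed epoch under step decay."""
    n_decays = epoch // cfg["lr_decay_every"]
    return cfg["base_lr"] * cfg["lr_decay_factor"] ** n_decays


if __name__ == "__main__":
    for epoch in (0, 19, 20, 119):
        print(epoch, step_lr(epoch))
```

Under this reading, the rate stays at 3e-4 for the first 20 epochs and has been reduced five times by the final epoch.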

Comment3:The experimental results are reported mainly using mAP; however, statistical validation (e.g., significance testing, confidence intervals, or variance analysis) is lacking, and thus the reliability of the reported improvements is not adequately supported.

Answer3:We fully accept your comment, and have supplemented the complete statistical validation details to strengthen the reliability support for performance improvement. The specific revisions are as follows:

  1. In Table 1. Comparative Experiments and Results, we added a significance mark to the optimal results of the proposed method, and clarified the statistical test rules in the table note: * denotes statistically significant difference compared with TransReID (paired two-tailed t-test, p<0.05, n=5 independent runs);
  2. In the result analysis of Section 4.3 Comparative Experiments, we added the parameters of the t-test: all experiments are repeated in 5 independent runs with different random seeds, and the statistical significance of the performance improvement is verified by a paired two-tailed t-test (degrees of freedom = 4, significance level p<0.05). We also state that the standard deviation over the 5 runs is lower than 0.25%, which supports the stability of the experimental results;
  3. All experimental results are presented in the form of "mean ± standard deviation", which fully complies with the general statistical norms of top conferences and journals in the field of computer vision.
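
The test described above can be illustrated with a minimal, standard-library sketch of a paired t-test for n=5 runs (degrees of freedom = 4). The per-seed mAP values below are placeholders chosen for illustration and are not taken from the paper; 2.776 is the two-tailed critical value of Student's t at the 0.05 level for 4 degrees of freedom.

```python
import math
import statistics

# Two-tailed critical value of Student's t for df = 4 at alpha = 0.05.
T_CRIT_DF4 = 2.776


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder mAP values (%) for 5 random seeds; NOT results from the paper.
method_a = [82.0, 82.3, 82.1, 81.9, 82.2]
method_b = [81.6, 81.9, 81.8, 81.5, 81.7]

t, df = paired_t(method_a, method_b)
significant = df == 4 and abs(t) > T_CRIT_DF4
print(f"t = {t:.2f}, df = {df}, significant at 0.05: {significant}")
```

Pairing by seed matters here: the test operates on the per-seed differences, so run-to-run variation shared by both methods cancels out.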

Comment4:Figures such as Figure 1 and Figure 2 help illustrate the proposed framework; however, Tables 1 and 2 contain partially redundant information, and the presentation of results is not sufficiently concise or effective in highlighting key findings.

Answer4:We fully accept your comment, and have comprehensively revised all tables in strict accordance with the table format specifications of MDPI journals. The specific revisions are as follows:

  1. Table 1: We deleted invalid line break tags in the header, unified the numerical format, standardized the method naming, added significance marks, and deleted redundant repeated labeling of percent signs;
  2. Table 2: We simplified the redundant column titles, deleted invalid format tags, renamed it to the standardized Table 2. Cross-Target Type Generalization Experiment Results, and simplified the column names to Pedestrian → Vehicle (mAP) and Vehicle → Pedestrian (mAP), which meets the requirements of concise and standardized MDPI tables.

Comment5:The conclusions are generally consistent with the experimental results; however, the claims regarding generalization appear somewhat overstated given the relatively limited performance gains.

Answer5:We fully accept your comment, and have objectively supplemented the limitation analysis of cross-domain generalization performance in the paper to avoid exaggerated statements. The specific revisions are as follows:

  1. In the cross-domain experiment result analysis of Section 4.3 Comparative Experiments, we added an objective explanation: "It should be objectively noted that the performance gain of the proposed method in cross-target type scenarios is relatively limited (0.1%-0.5% mAP improvement), which is mainly due to the large distribution gap between pedestrian and vehicle targets. The generalization ability of the proposed method for completely unseen target categories still needs to be further improved in future work";
  2. In Section 5. Conclusion, we objectively reiterated this limitation again, forming a logical closed loop with the previous analysis, and ensuring the rigor and objectivity of academic expression.

Comment6:While the integration of the NFCA module with a data-driven metric learning framework is a meaningful approach, the technical distinction from existing methods and the significance of the performance improvements are not sufficiently convincing.

Answer6:We have completed targeted revisions and improvements: First, we have supplemented the authoritative references corresponding to the Neighborhood Feature Centralization (NFC) mechanism. Second, we have verified the independent performance contributions of the Neighborhood Feature Centralization Attention (NFCA) module and the data-driven metric learning framework through rigorous ablation experiments. The experimental results show that the mean Average Precision (mAP) of the model decreases significantly after removing the two core modules respectively. Meanwhile, we have verified the statistical significance of the performance improvement via the paired two-tailed t-test, which fully demonstrates the uniqueness of the technical innovation and the practical significance of the performance improvement in this paper.

Comment7:The manuscript is generally structured; however, some sections are repetitive, and the experimental analysis is overly detailed, which reduces overall readability.

Answer7:We fully accept your comment, and have comprehensively revised the section structure, simplified the lengthy experimental analysis content, and improved the readability of the full text. The specific revisions are as follows:

  1. We revised the section numbering error and formed a complete experimental section structure with logical progression: 4.1 Preprocessing and Training Configuration Details, 4.2 Datasets and Experimental Details, 4.3 Comparative Experiments, 4.4 Ablation Experiments, 4.5 Feature Heatmap Visualization Analysis, 4.6 Generalization Verification on Classification Task, which completely solves the structural flaw of duplicated section numbers in the original manuscript;
  2. We simplified the repeated and redundant expressions in the experimental analysis and deleted content irrelevant to the core conclusions. Each experimental section follows the logical structure of "experimental setup → experimental results → result analysis", with concise expression and clear logic;
  3. We standardized the corresponding relationship between the numbering of all figures and tables and the citations in the main text, ensuring that the figure and table numbers mentioned in the main text are completely consistent with the actual figures and tables, which further improves the readability of the full text.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript addresses the air-to-ground target re-identification problem and proposes a method to improve accuracy even under conditions that have conventionally degraded accuracy, such as cross-camera viewpoints, illumination variations, and occlusions. The proposed method is characterized by its use of an attention mechanism, and a new Neighborhood Feature Centralization (NFC) mechanism is constructed based on it. Experimental evaluation uses two publicly available datasets in addition to a proprietary dataset, demonstrating superiority over five methods, including conventional methods. From an application perspective, the problem setting is practical, and the proposed method is novel. Furthermore, reliability is confirmed through the use of various evaluation methods, including ablation studies. However, this manuscript contains several unclear points and insufficient descriptions, as outlined below. The reviewer requests that the authors reconsider these points to improve the manuscript.

1. lines 149-150:
Regarding "not only introduce re-identification datasets but also classification non-re-identification datasets."
This should be explained in more detail using the phrase "that is,". The reviewer's understanding, although it may not be correct, is that in Fig. 1(b), if the similarity value does not match any of the registered IDs, the image is newly registered in the Reference class. This is important for the system, so please provide a more detailed description.

2. Figure 1(b):
Just to confirm, in the test, when a new Unknown ID image is input, all images with IDs registered in the Reference class are input into the Backbone network for inference, and these values are used to proceed to the Similarity check. Is this correct? Since the Backbone network is a pre-trained model, it should not be necessary to process the Reference class images again. If so, it would be better to draw Fig. 1(b) more accurately.

3. Figure 2:
The definition of the function "GAP( )" is missing. It would be helpful if you could add it.

4. 4.1 and Figure 3:
When actually creating a dataset, is it correct to manually crop the target partial images from real images captured by drones, etc., and register them in the dataset with IDs, similar to normal annotation? For images that are marked as "unknown," is it correct to manually register the required number of images in the dataset in the same way? If so, it would be helpful to describe how to handle "unknown" images in particular.

5. lines 261-262:
The calculation methods for the values "92.3%/85.6%/82.1%" and "1.2%/2.4%/0.4%" are unclear. It would be helpful to describe how these values are calculated using which values in Table 1.

6. lines 266 and 279:
The calculation methods for "improves mAP by about 9%" and "about 7%" are the same as described in No. 5.

7. line 308:
The calculation methods for "3.2% decrease" and "4.8% decrease" are unclear. Please describe how these values are obtained.

8. line 320:
Regarding "feature heatmap experiments," Fig. 4 shows heatmaps from three different methods for two datasets. The reviewer thinks that providing an explanation of what constitutes a high-accuracy heatmap would greatly improve its precision, especially, the concentration or distribution of red or yellow regions. Please describe the authors' evaluation standard for heatmaps.

Author Response

Thank you very much for your professional review of our manuscript, your recognition and affirmation of our research work, as well as your valuable and constructive comments. We fully agree with the issues you pointed out that some content in the manuscript has unclear expressions and insufficient detailed descriptions. In strict accordance with your review comments, we have carried out comprehensive and systematic revisions and improvements to the manuscript.

Comment1:lines 149-150:
Regarding "not only introduce re-identification datasets but also classification non-re-identification datasets."
This should be explained in more detail using the phrase "that is,". The reviewer's understanding, although it may not be correct, is that in Fig. 1(b), if the similarity value does not match any of the registered IDs, the image is newly registered in the Reference class. This is important for the system, so please provide a more detailed description.

Answer1:

Thank you very much for your valuable and constructive comments. We have carefully revised the manuscript in strict accordance with your suggestions, and the specific revisions are as follows:

  1. For the description of the multi-source data fusion strategy, we have revised the original sentence with repeated expressions and supplemented a detailed, clear explanation with the phrase "that is" as you requested. The revised content clarifies the specific usage rules of non-re-identification classification datasets in the training phase, including the positive and negative sample pair construction logic, the mixing ratio of different datasets in each training batch, and the core mechanism of this strategy to improve the model's generalization ability, which eliminates the ambiguity of the original expression.
  2. For the new ID registration process in the test pipeline of Figure 1(b), which is critical to the integrity of the system, we have supplemented a detailed and complete description in Section 3.1 (corresponding to Figure 1(b)) and Section 4.1 Inference Pipeline Configuration. We clearly elaborated the full judgment and execution process of new ID registration: after calculating the similarity between the query image feature and all pre-stored reference library features through the adaptive nonlinear metric branch, a fixed similarity threshold of 0.5 is set; if the maximum similarity score is higher than the threshold, the ID corresponding to the highest similarity is output as the final re-identification result; if the maximum similarity score is lower than the threshold, the query image is judged as an unregistered new ID, and the new ID registration is completed by adding its 512-dimensional feature extracted by the trained network to the reference library, forming a complete closed loop of the inference system.

All the above revisions have been embedded in the corresponding sections of the manuscript. We hope the revised content can meet your requirements, and thank you again for your professional guidance.

Comment2:2. Figure 1(b):
Just to confirm, in the test, when a new Unknown ID image is input, all images with IDs registered in the Reference class are input into the Backbone network for inference, and these values are used to proceed to the Similarity check. Is this correct? Since the Backbone network is a pre-trained model, it should not be necessary to process the Reference class images again. If so, it would be better to draw Fig. 1(b) more accurately.

Answer2: We fully accept your comment. We have expanded the description of the test process of Figure 1(b) in the paper and clarified the logic of reference library pre-extraction. The specific revisions are as follows:

  1. At the position corresponding to Figure 1(b) in Section 3.1 Overall Network Architecture, we added the core description of the test process, clarified that the native algorithm process does not require features to be extracted in advance, and explained that pre-extraction is only an engineering optimization option;
  2. In the Inference Pipeline Configuration subsection of Section 4.1 Preprocessing and Training Configuration Details, we described the complete process of reference library pre-extraction in detail: "Before formal inference, all images with registered IDs in the reference library are input into the trained network at one time to extract 512-dimensional features, which are pre-stored in the local feature library. During the inference process, there is no need to re-input the reference library images into the backbone network for repeated calculation, which greatly improves the inference efficiency". We also clarified that this step is an optional optimization in engineering deployment, not a necessary part of the native algorithm process.
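
The pre-extraction step described above amounts to caching reference features once and reusing them for every query. A minimal sketch follows, assuming a hypothetical `extract` callable that stands in for the trained network; the toy extractor, the two-element "images", and the ID names are all illustrative and not from the manuscript.

```python
from typing import Callable, Dict, List, Sequence

Feature = List[float]


def build_feature_library(
    reference_images: Dict[str, Sequence[float]],
    extract: Callable[[Sequence[float]], Feature],
) -> Dict[str, Feature]:
    """Run the extractor once per registered ID and cache the result."""
    return {ref_id: extract(img) for ref_id, img in reference_images.items()}


calls = {"n": 0}


def toy_extract(image: Sequence[float]) -> Feature:
    # Stand-in for the trained network (which would return a 512-d feature).
    calls["n"] += 1
    return [float(v) for v in image]


references = {"id_1": [0.1, 0.9], "id_2": [0.8, 0.2]}
library = build_feature_library(references, toy_extract)  # 2 extractor calls

for query in ([0.2, 0.8], [0.7, 0.3], [0.5, 0.5]):
    q_feat = toy_extract(query)  # one call per query; references are not re-processed
    # q_feat would now be compared against library.values() with the learned metric.

print(calls["n"])  # 2 reference calls + 3 query calls
```

Without the cache, each of the three queries would also re-run the extractor on both references, which is the repeated computation the reviewer questioned.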

Comment3. Figure 2:
The definition of the function "GAP( )" is missing. It would be helpful if you could add it.

Answer3: We fully accept your comment, and have fully supplemented the English full name, abbreviation definition

Comment4: 4.1 and Figure 3:
When actually creating a dataset, is it correct to manually crop the target partial images from real images captured by drones, etc., and register them in the dataset with IDs, similar to normal annotation? For images that are marked as "unknown," is it correct to manually register the required number of images in the dataset in the same way? If so, it would be helpful to describe how to handle "unknown" images in particular.

Answer4: We fully accept your comment, and have fully supplemented the relevant content in the paper. The specific revisions are as follows:

  1. In the introduction of the JC-1 dataset in Section 4.2 Datasets and Experimental Details, we clarified the annotation specification of the dataset: "Annotation Specification of JC-1 Dataset: All target vehicles are manually cropped from the original real-shot images of fixed-wing UAVs, and a unique integer ID label is assigned to each vehicle target. The same vehicle in different frames, viewpoints and shooting distances is assigned the same ID, while different vehicles are assigned non-overlapping unique IDs. The annotation process is fully consistent with the standard annotation protocol of mainstream ReID datasets such as Market-1501 and VehicleID", which clearly explains the collection and annotation methods of the dataset;
  2. In the Inference Pipeline Configuration subsection of Section 4.1 Preprocessing and Training Configuration Details, we fully supplemented the processing method of unknown ID images: "For the input unknown ID query image, the same preprocessing as the training phase is performed first, and then input into the trained network to extract 512-dimensional query features. The similarity between the query feature and all pre-stored reference library features is calculated through the adaptive nonlinear metric branch, and a similarity threshold of 0.5 is set: if the maximum similarity score is higher than the threshold, the ID corresponding to the highest similarity is output as the final ReID result; if the maximum similarity score is lower than the threshold, the query image is judged as a new unregistered ID, and the new ID registration can be completed by adding its feature to the reference library", which covers the whole process from preprocessing to final judgment of unknown ID images.
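
The threshold rule quoted above can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' implementation: `toy_similarity` is a stand-in for the learned adaptive nonlinear metric branch, the `new_<n>` naming scheme is hypothetical, and the response does not say how a score exactly equal to 0.5 is handled, so the sketch treats it as a new ID.

```python
from typing import Callable, Dict, List, Tuple

Feature = List[float]


def identify_or_register(
    query: Feature,
    library: Dict[str, Feature],
    similarity: Callable[[Feature, Feature], float],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (id, is_new); register the query when no score exceeds the threshold."""
    best_id, best_score = None, float("-inf")
    for ref_id, ref_feat in library.items():
        score = similarity(query, ref_feat)
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_id is not None and best_score > threshold:
        return best_id, False
    # Assumption: a score exactly equal to the threshold counts as "new".
    new_id = f"new_{len(library) + 1}"  # hypothetical naming scheme
    library[new_id] = query
    return new_id, True


def toy_similarity(a: Feature, b: Feature) -> float:
    """Stand-in for the learned metric: maps squared distance into (0, 1]."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + d2)


library = {"id_1": [0.0, 0.0], "id_2": [1.0, 1.0]}
known = identify_or_register([0.1, 0.0], library, toy_similarity)
unknown = identify_or_register([3.0, -2.0], library, toy_similarity)
print(known, unknown, len(library))
```

The first query is close to `id_1` and is matched; the second is far from both references, so its feature is added to the library as a new entry.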

Comment5. lines 261-262:
The calculation methods for the values "92.3%/85.6%/82.1%" and "1.2%/2.4%/0.4%" are unclear. It would be helpful to describe how these values are calculated using which values in Table 1.

Answer5: We fully accept your comment, and have supplemented the complete calculation process based on Table 1 in the result analysis of Section 4.3 Comparative Experiments. The specific content is as follows:

"Compared with the current SOTA method TransReID, the proposed method achieves mAP improvements of 0.4%, 1.2% and 2.4% on the three datasets respectively. The complete calculation process is:

  • JC-1 Dataset: 82.1% (Proposed Method) - 81.7% (TransReID) = 0.4%
  • Market-1501 Dataset: 92.8% (Proposed Method) - 91.6% (TransReID) = 1.2%
  • VehicleID Dataset: 83.6% (Proposed Method) - 81.2% (TransReID) = 2.4%"
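
The three subtractions above can be checked mechanically. The short sketch below uses only the mAP values quoted in this response; the variable names are illustrative.

```python
# mAP values (%) as quoted in the response above.
map_proposed = {"JC-1": 82.1, "Market-1501": 92.8, "VehicleID": 83.6}
map_transreid = {"JC-1": 81.7, "Market-1501": 91.6, "VehicleID": 81.2}

# Difference in percentage points, rounded to one decimal to absorb float error.
gains = {
    name: round(map_proposed[name] - map_transreid[name], 1)
    for name in map_proposed
}
print(gains)  # {'JC-1': 0.4, 'Market-1501': 1.2, 'VehicleID': 2.4}
```

These are absolute differences in percentage points, not relative improvements.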

Comment6. lines 266 and 279:
The calculation methods for "improves mAP by about 9%" and "about 7%" are the same as described in No. 5.

Answer6:

We fully accept your comment, and have supplemented the calculation benchmark, single dataset difference and complete arithmetic average calculation process of the two average improvement ranges in the result analysis of Section 4.3 Comparative Experiments. The specific content is as follows:

  1. "Compared with the traditional metric learning baseline Baseline2, the proposed method achieves mAP improvements of 1.2% (JC-1), 9.3% (Market-1501), and 11.5% (VehicleID) on the three datasets, with an arithmetic average improvement of 8.9% (≈9%). The complete calculation process is: (1.2% + 9.3% + 11.5%) / 3 = 8.9%";
  2. "Compared with Baseline3 which only adopts Coordinate Attention, the proposed method achieves mAP improvements of 0.8% (JC-1), 7.7% (Market-1501), and 9.1% (VehicleID) on the three datasets, with an arithmetic average improvement of 6.8% (≈7%). The complete calculation process is: (0.8% + 7.7% + 9.1%) / 3 = 6.8%".
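
The arithmetic averages quoted above can be recomputed directly from the listed per-dataset gains. The sketch below uses only the numbers given in this response; the variable names are illustrative.

```python
from statistics import mean

# Per-dataset mAP gains (percentage points) as quoted in the response above.
gains_vs_baseline2 = [1.2, 9.3, 11.5]  # JC-1, Market-1501, VehicleID
gains_vs_baseline3 = [0.8, 7.7, 9.1]

avg2 = round(mean(gains_vs_baseline2), 1)
avg3 = round(mean(gains_vs_baseline3), 1)
print(avg2, avg3)  # 7.3 5.9
```

Evaluated as written, the two expressions give about 7.3 and 5.9 rather than the stated 8.9 and 6.8, so either the per-dataset gains or the quoted averages may need to be rechecked against Table 1.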

Comment7. line 308:
The calculation methods for "3.2% decrease" and "4.8% decrease" are unclear. Please describe how these values are obtained.

Answer7: The specific calculation process is as follows:

  1. After removing the NFCA module and replacing it with the original CA module, the mAP drop on the Market-1501 dataset is: 92.8% (Full Proposed Method) - 89.1% (Ablation Configuration) = 3.7%; the mAP drop on the JC-1 dataset is: 82.1% (Full Proposed Method) - 81.5% (Ablation Configuration) = 0.6%;
  2. After replacing the proposed adaptive nonlinear metric with the traditional Euclidean distance, the mAP drop on the Market-1501 dataset is: 92.8% (Full Proposed Method) - 87.5% (Ablation Configuration) = 5.3%; the mAP drop on the JC-1 dataset is: 82.1% (Full Proposed Method) - 79.6% (Ablation Configuration) = 2.5%.

Comment8. line 320:
Regarding "feature heatmap experiments," Fig. 4 shows heatmaps from three different methods for two datasets. The reviewer thinks that providing an explanation of what constitutes a high-accuracy heatmap would greatly improve its precision, especially, the concentration or distribution of red or yellow regions. Please describe the authors' evaluation standard for heatmaps.

Answer8: We fully accept your comment, and have fully supplemented the three-dimensional evaluation criteria for high-quality heatmaps in the paper, standardized the annotation of Figure 4, and completed the corresponding analysis based on the evaluation criteria. The specific revisions are as follows:

  1. In Section 4.5 Feature Heatmap Visualization Analysis, we supplemented the three-dimensional evaluation criteria for high-quality ReID heatmaps: ① Discriminative Region Focus: The high-response regions of the heatmap should be accurately concentrated on the core discriminative parts of the target, rather than scattered in the background, to ensure the inter-class discriminability of features; ② Background Noise Suppression: The response intensity of the background area should be significantly lower than that of the target main body, without large-area invalid high response, to verify the network's ability to filter background interference; ③ Target Integrity: The high-response regions should cover the complete main structure of the target, rather than only focusing on scattered local details, to ensure the robustness of features to viewpoint changes and local occlusions;
  2. Based on the above three-dimensional evaluation criteria, we conducted a point-by-point corresponding analysis of the heatmap results in Figure 4, clarifying that the heatmap of the proposed method fully meets the high-precision evaluation criteria, forming a logical closed loop with the performance improvement of quantitative experiments.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

- The introduction provides sufficient background; however, the novelty of the proposed method could be articulated more clearly in relation to existing attention-based and metric learning approaches. In particular, it would be helpful to more explicitly highlight what distinguishes the NFCA module and the adaptive metric from prior work, beyond incremental improvements.

- The overall methodology is sound, and the key components are clearly described. However, some parts of the explanation could be improved for better readability. For example, the interaction between the NFCA module and the similarity learning branch could be explained more intuitively, and some overly dense or repetitive descriptions could be further simplified.

Author Response

Thank you very much for your rigorous and constructive comments on our manuscript. We have carefully studied each comment and revised the manuscript accordingly. 

Comment1:The introduction provides sufficient background; however, the novelty of the proposed method could be articulated more clearly in relation to existing attention-based and metric learning approaches. In particular, it would be helpful to more explicitly highlight what distinguishes the NFCA module and the adaptive metric from prior work, beyond incremental improvements.

Answer1:

We fully agree with this valuable comment, and have made targeted revisions to the Introduction section to clarify the fundamental novelty of our method and its core distinctions from existing attention-based and metric learning approaches, rather than only incremental improvements. The specific revisions are as follows:  
  1. Revised the transition paragraph between core challenges and contributions. We redefined the core optimization target of our work by explicitly anchoring the unresolved limitations of existing attention-based and metric learning methods in air-to-ground ReID scenarios. We also clarified that our work proposes a three-dimensional collaborative optimization chain of "attention mechanism - adaptive metric - multi-source data supplementation", which is a systematic innovation rather than incremental optimization of a single module. This revision is located in the last paragraph before the core contributions in the Introduction section.
  2. Rewrote the full core contribution section with explicit head-to-head comparisons. For each core contribution, we added direct and clear comparisons with state-of-the-art approaches:
    • For the NFCA module: We explicitly compared it with mainstream attention mechanisms (Coordinate Attention, CBAM, ECA), pointed out the inherent trade-off limitation of existing works (cannot balance position encoding and cross-feature semantic interaction), and clarified that our NFCA fundamentally breaks this trade-off via a parameter-free neighborhood feature centralization mechanism, rather than incremental modification based on Coordinate Attention.
    • For the adaptive nonlinear metric: We explicitly compared it with two types of mainstream metric learning methods (methods relying on manually designed linear metrics, methods using classification loss as a proxy task), and highlighted that our method constructs an end-to-end learning paradigm that fully aligns the training objective with the inference task, eliminating both the proxy task bias and the systematic bias of linear metrics, which is a paradigm-level innovation rather than a simple loss function replacement.
    • For the multi-source data fusion strategy and the JC-1 dataset, we also supplemented the explicit distinctions from prior works to highlight their innovation.

 

Comment2:The overall methodology is sound, and the key components are clearly described. However, some parts of the explanation could be improved for better readability. For example, the interaction between the NFCA module and the similarity learning branch could be explained more intuitively, and some overly dense or repetitive descriptions could be further simplified.

Answer2: 

We sincerely appreciate this constructive suggestion, and have carried out comprehensive revisions to the Methodology section (Section 3) to improve readability, clarify the interaction mechanism between core modules, and simplify redundant descriptions. The specific revisions are as follows:  
  1. Intuitively clarified the end-to-end collaborative interaction between the NFCA module and the similarity learning branch. We rewrote the training process description in Section 3.1 (Overall Network Architecture), split the original dense long sentence into a 4-step structured process, and explicitly elaborated the two-way collaborative optimization mechanism between the two modules:
    • The NFCA module outputs high-discriminability attention features by suppressing background noise and enhancing semantic consistency, which provides high-quality feature input for the similarity learning branch;
    • The gradient generated by the BCEWithLogitsLoss of the similarity branch is back-propagated to the NFCA module synchronously, guiding the attention module to focus more on the core discriminative regions that are critical for similarity matching.
     This revision intuitively explains the internal interaction between the two modules, rather than simply describing their serial connection, and is located in the training process subsection of Section 3.1.
  2. Simplified repetitive and redundant descriptions. We deleted the repetitive content about the limitations of traditional linear metrics in the last paragraph of Section 3.1, which was repeatedly elaborated in the opening of Section 3.3. Meanwhile, we optimized the opening of Section 3.3, split the original dense paragraph into a structured comparative description of the design logic between traditional methods and our method, which eliminates redundancy while making the core difference more prominent.
  3. Optimized the overly dense formula description of the NFCA module. We reorganized the formula explanation in Section 3.2, split the original dense, unstructured description into two core steps (Neighborhood Feature Centralization with Position Encoding, Attention Weight Generation and Feature Re-calibration), and clarified the core objective of each step and the function of each operation. This revision greatly improves the readability of the module's calculation logic, and is located in the core calculation description of the NFCA module in Section 3.2.
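
To make the gradient path mentioned in item 1 concrete, the following is a scalar, framework-free sketch of binary cross-entropy on a raw logit, which is the quantity BCEWithLogitsLoss computes per pair. It uses the standard numerically stable formula and is not the authors' implementation; the function names and the example logit are illustrative. The derivative with respect to the logit, sigmoid(logit) minus target, is the signal that would flow back from the similarity branch toward the attention module.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_with_logits(logit: float, target: float) -> float:
    """Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_grad(logit: float, target: float) -> float:
    """Derivative of the loss with respect to the logit."""
    return sigmoid(logit) - target


# Illustrative pair scores: the same logit judged against both possible labels.
same_id_loss = bce_with_logits(2.0, 1.0)  # confident and correct: small loss
diff_id_loss = bce_with_logits(2.0, 0.0)  # confident and wrong: large loss
print(same_id_loss, diff_id_loss, bce_grad(2.0, 0.0))
```

A confidently wrong pair produces both a larger loss and a larger gradient magnitude, which is how mismatched pairs exert more influence on the upstream features.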

Author Response File: Author Response.docx
