Review Reports
- Bo Liang1,*,
- Hongfu Shan1 and
- Song Feng1
- et al.
Reviewer 1: Anonymous; Reviewer 2: Anonymous; Reviewer 3: Mingxing Nie; Reviewer 4: Anonymous; Reviewer 5: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The following observations apply to this research:
- The introduction should highlight the novel contributions in bullet points. In addition, four references are cited at line 39; please revisit the introduction and include a scenario diagram as well.
- Revisit Equation 1; it is incorrect.
- Explain Equation 6 in detail.
- Figure 1 is important and should show the stages in detail; please redesign it with more clarity and detail.
- Please also add a literature review section with critical analysis before the methodology section.
- Please add a rationale for the statement on lines 196-198: "the NFE module integrates non-local attention with gradient enhancement, thereby focusing on semantic information while also reinforcing the representation of target boundaries and shapes."
- In Section 3, Experiments and Analysis, please add details of the configuration parameters.
- Please add an analysis of false positives and false negatives; a deeper analysis of these should reveal critical weaknesses.
- Please add a rationale for introducing non-local components such as CNFI and NFE, since they increase the complexity of network training and therefore require additional training data or longer training periods.
- What are the limitations of this research? Please align them with future research directions.
- Please add a complexity analysis of Algorithm 1.
- The conclusion should be rewritten so that it states the key facts of this research precisely.
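To illustrate the kind of false-positive and false-negative breakdown requested above, here is a minimal sketch. The scene categories and counts are hypothetical placeholders, not values from the manuscript; the point is only that reporting TP/FP/FN per scene type can show where a detector is weakest.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical (TP, FP, FN) counts per scene type, for illustration only.
counts = {
    "clear sky": (95, 3, 5),
    "cluttered background": (70, 20, 30),
}

report = {scene: detection_metrics(*c) for scene, c in counts.items()}

for scene, m in report.items():
    print(f"{scene}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

A large gap between categories in such a table would point to the conditions under which false alarms or missed detections concentrate.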
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper proposes a novel anti-UAV detection framework that addresses two major challenges: scale variation of UAV targets and complex background interference. The authors introduce two key modules: Cross-scale Non-local Feature Interaction (CNFI) and Non-local Feature Enhancement (NFE). The proposed modules and the overall procedure improve detection accuracy compared to state-of-the-art YOLO-based models, achieving higher precision, recall, and mAP on datasets like DUT-Anti-UAV and Det-Fly. The scientific novelty lies in explicit cross-scale non-local attention and gradient-enhanced feature fusion, which together enhance robustness in complex environments.
The introduction provides a solid background and cites a wide range of relevant references. The majority of the cited articles are recent and no older than 5 years. All articles are relevant to the investigated topic. The research design is appropriate. The methods are described in detail and are reproducible. The results are clearly presented in tables and figures. The results of the authors' method are compared with those of other methods such as YOLOv7, YOLOv8, EDTC, and YOLOv12.
The article is well written, in good English and a good academic style. Overall, the article has all the parts necessary for a scientific paper. I noticed only a few inaccuracies and would like to see additional information in a few places:
- The conclusion suggests broad applicability (“real-time unauthorized UAV detection in public safety”), but no latency or resource usage results are provided to confirm real-time feasibility. The computer and GPUs used are described, but nothing more. Does a 640×640 image require four NVIDIA GeForce RTX 4090 GPUs, or can fewer resources be used in practical tasks?
- While many YOLO-based improvements are cited, the introduction could briefly mention Transformer-based approaches or vision transformers for UAV detection, which are referenced later in the paper but not highlighted in the introduction.
- The introduction does not mention benchmark datasets (like DUT-Anti-UAV) until later sections; a short note here would strengthen context.
- The presentation of Figure 6 lacks clarity regarding which subfigure corresponds to which detection method. The figure includes multiple rows of results, but the legend and annotations are insufficient to clearly identify the methods being compared.
- In the current presentation of Algorithm 1, the headings Step 1: Encode Query Features, Step 2: Construct Shared Key and Value, Step 3: Compute Cross-scale Attention, Step 4: Aggregate Cross-scale Information, and Step 5: Decode and Apply Residual Connection are numbered and aligned as if they were part of the algorithm steps. However, these are section titles describing the main phases of the algorithm rather than individual executable steps. To improve clarity, these headings should be left-aligned and unnumbered, distinguishing them from the actual algorithm instructions. This formatting change would make the structure more readable and prevent confusion between conceptual phases and procedural steps.
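For orientation, the five phases named in the comment above can be sketched as follows. This is not the manuscript's Algorithm 1, which is not reproduced in this report; it is a generic cross-scale attention sketch under stated assumptions: each scale is already aligned to the same number of tokens, the encoder and decoder are single linear maps, and all names (`cross_scale_attention`, `w_q`, `w_k`, `w_v`, `w_o`) are hypothetical. The phase names appear as unnumbered comments, separate from the executable steps.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_scale_attention(feats, w_q, w_k, w_v, w_o):
    """feats: list of (N, C) token arrays, one per scale (assumed pre-aligned)."""
    d = w_q.shape[1]

    # Encode Query Features: one query set per scale.
    queries = [x @ w_q for x in feats]

    # Construct Shared Key and Value: built once from all scales together.
    all_tokens = np.concatenate(feats, axis=0)
    keys = all_tokens @ w_k
    values = all_tokens @ w_v

    outputs, attn_maps = [], []
    for x, q in zip(feats, queries):
        # Compute Cross-scale Attention: each token attends to every scale.
        attn = softmax(q @ keys.T / np.sqrt(d), axis=-1)
        # Aggregate Cross-scale Information.
        agg = attn @ values
        # Decode and Apply Residual Connection.
        outputs.append(x + agg @ w_o)
        attn_maps.append(attn)
    return outputs, attn_maps


rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8)) for _ in range(3)]      # 3 scales, 4 tokens, 8 channels
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
outs, attns = cross_scale_attention(feats, *weights)
print(outs[0].shape, attns[0].shape)
```

Each attention map has one row per query token and one column per token across all scales, which is what makes the interaction cross-scale in this sketch.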
Small shortcomings:
- Line 133: there is no space in “upsamplingdownsampling”.
- Line 164: there is no space in “featuresQ1”.
- Lines 177-178 are not aligned correctly.
- Equations 6 and 7 should end with “.” instead of “,” because each is the end of a sentence.
In summary, the article is of high quality and well written, and could therefore be accepted after minor corrections.
Comments on the Quality of English Language
The English quality is fine; only small errors appear.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper addresses important challenges in anti-UAV detection, namely small targets with large scale variation and strong background clutter. The proposed CNFI and NFE modules, integrated into YOLOv12 and evaluated on DUT-Anti-UAV and Det-Fly, show clear performance gains over several YOLO baselines. The work is relevant and has some novelty, but several methodological and experimental details should be clarified before acceptance.
1. CNFI introduces cross-scale non-local attention, which likely increases computation and may affect real-time performance. Please quantify the additional Params, FLOPs, and FPS compared to YOLOv12n, and add a small table comparing baseline vs. CNFI/NFE models, to demonstrate practicality for real anti-UAV applications.
2. In CNFI all feature maps are downsampled to the smallest scale for attention, which may lose high-resolution details. Please briefly discuss the impact of this design on large targets and fine structures, and indicate whether alternative alignment strategies were considered and why the current choice was adopted.
3. The description of the RG operation and gradient features in NFE/NGA is still rather high level. Please specify how gradients are computed (e.g., Sobel/Scharr or learned convs, per-channel treatment, normalization/thresholding) and, if possible, provide a compact formula or diagram to improve reproducibility.
4. On Det-Fly, adding CNFI alone increases Precision but slightly reduces Recall and mAP@0.5. Please analyze this behavior in more detail (e.g., link to dataset characteristics or possible overfitting) and comment on whether tuning CNFI hyperparameters could mitigate this effect, ideally with a brief qualitative example.
5. Current comparisons mainly involve YOLO variants. If feasible, please add 1–2 more recent and strong baselines (e.g., small-object or Transformer-based detectors); otherwise, clearly state the practical constraints that prevent broader comparisons and briefly discuss this limitation in the paper.
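As an example of the level of detail requested in comment 3, the sketch below spells out one possible gradient computation: a fixed Sobel operator applied per channel, with the magnitude normalized per channel to [0, 1]. Whether the manuscript's RG operation uses Sobel, Scharr, or learned convolutions is not stated in this report, so the operator, the edge padding, and the normalization here are all assumptions chosen for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(channel, kernel):
    """Cross-correlate one (H, W) channel with a 3x3 kernel, using edge padding."""
    h, w = channel.shape
    padded = np.pad(channel, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


def sobel_gradient(feat, eps=1e-8):
    """feat: (C, H, W). Returns per-channel gradient magnitude scaled to [0, 1]."""
    mags = []
    for ch in feat:
        gx = filter3x3(ch, SOBEL_X)
        gy = filter3x3(ch, SOBEL_Y)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        mags.append(mag / (mag.max() + eps))  # assumed per-channel max normalization
    return np.stack(mags)


# A vertical step edge: left half 0, right half 1.
feat = np.zeros((1, 6, 6))
feat[0, :, 3:] = 1.0
grad = sobel_gradient(feat)
print(np.round(grad[0, 0], 3))
```

In this toy input the response is concentrated in the two columns adjacent to the step and is zero elsewhere, which is the boundary-emphasizing behavior a gradient branch is meant to provide.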
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
1. Correct typographical errors in figures and tables; for example, "w/o NEF" in Figure 4 should be revised to "w/o NFE".
2. Refine the English expression and correct issues such as improper word usage and typographical errors.
3. Supplement the comparative analysis between the CNFI module and existing "cross-scale attention" methods, and elaborate on the theoretical basis for the module design.
4. Supplement the design rationale for the "KV pair" mechanism in the CNFI module.
5. What is the basis for selecting the number of channel groups and convolution kernel parameters in the GEC sub-module of the NFE module?
6. Engineering scenarios for Anti-UAV detection have extremely high requirements for real-time performance. It is recommended to supplement real-time performance metrics, such as FPS, model parameter count, and FLOPs.
7. Add comparative experiments, e.g., comparisons with the methods proposed in References 17 and 18.
8. Only ablation experiments have been conducted on the Det-Fly dataset, with no comparisons against other models. It is recommended to supplement comparative experiments to enhance the generalizability of the results.
9. There is a lack of hyperparameter sensitivity analysis (for instance, the impact of parameters like learning rate, batch size, and the downsampling factor of CNFI on model performance), which makes it impossible to verify the model’s robustness.
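The real-time metrics requested in comment 6 can be obtained along the following lines. This is a minimal, framework-free sketch: the layer sizes are hypothetical, FLOPs are counted as twice the multiply-accumulate operations (conventions differ between tools), and the timed workload is a stand-in for a model's forward pass. On a GPU, timing would additionally require device synchronization before reading the clock.

```python
import time


def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs of a k x k convolution, counting one MAC as two FLOPs."""
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs


def measure_fps(infer, n_warmup: int = 5, n_iters: int = 50) -> float:
    """Average calls per second of `infer` after a short warm-up."""
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


# Hypothetical layer: 3x3 conv, 16 -> 32 channels, 80x80 output map.
params = conv2d_params(16, 32, 3)
flops = conv2d_flops(16, 32, 3, 80, 80)

# Stand-in workload; replace with the model's forward pass on one image.
fps = measure_fps(lambda: sum(range(1000)))

print(f"params={params}, FLOPs={flops}, FPS={fps:.1f}")
```

Reporting these three numbers for the baseline and for each added module, on the same hardware and input size, would make the cost of the added modules directly comparable.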
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
The manuscript proposes CNIFE, an anti-UAV detection network based on non-local feature learning. It outperforms current mainstream YOLO models in some performance metrics and exhibits improvements in small-target detection and complex background suppression. However, several issues still need to be addressed:
1. The "Abstract" section mentions "improvements of 0.93%, 1.09%, and 2.21% in Precision (P), Recall (R), and mAP50, respectively", but the subsequent ablation study clearly states that the mAP50 increases by 2.12% when the two modules are combined. The numerical contradiction between "2.21%" and "2.12%" directly undermines the credibility of the work.
2. The "Introduction" section does not briefly explain the two key concepts on which the subsequent module design relies, "non-local feature learning" and "cross-scale global dependence"; it merely names the terms. This creates obstacles to understanding the subsequent sections.
3. The manuscript states that "existing multi-scale methods rely on simple addition or subtraction", yet it gives no specific examples of which recent methods still adopt this approach. Only general statements are made, so the severity of the problem lacks empirical support.
4. The annotations explaining the physical meaning of Equation (2) and Equation (3) are incomplete, which hinders readers' understanding of the computational process.
5. The "Introduction" refers to the proposed approach as an "anti-UAV detection method based on non-local feature learning", while the Method section describes it as an "end-to-end detection system". The relationship between "non-local feature learning" and the "end-to-end system" is not clarified.
6. Key parameters for module design are not provided, such as the "number of convolution kernels in the encoder (EnC)" and the "number of output channels in the decoder (DeC)" of the CNFI module. Additionally, critical software environment details, such as the versions of PyTorch/TensorFlow and the specific parameters of the Adam optimizer, are missing. The absence of these parameters will lead to poor reproducibility of the experiments.
7. In the "3. Experiments and Analysis" section, Figure 2 is labeled as "Typical scenarios from the DUT-Anti-UAV dataset", but no textual labels or categorical explanations for the scene types are provided.
8. The paper states that "the mAP50 of the proposed method outperforms the second-best model by 2.12%", yet it only calculates the numerical difference without analyzing the "significance of this improvement in practical applications".
9. No experiments are designed for the "gradient + non-local fusion" of the NFE module; this makes it impossible to prove the innovative value of the "fusion design". Only an ablation study is used to verify that "the NFE is effective", which fails to highlight its innovative dimension.
10. The conclusion only makes general references to a "significant improvement in accuracy and robustness" and does not cite key quantitative data from the experiments. Relying solely on qualitative descriptions prevents readers from intuitively perceiving the value of the achievements. Furthermore, it fails to reaffirm the innovative dimensions of the modules, so the innovation is insufficiently highlighted.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed the queries; the manuscript may be accepted for publication.
Reviewer 5 Report
Comments and Suggestions for Authors
The authors have addressed all the comments. The manuscript can be accepted in its current form.