Spatially Adaptive and Distillation-Enhanced Mini-Patch Attacks for Remote Sensing Image Object Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In the abstract, the authors should provide a more concise explanation of the uniqueness of the challenge. Since small objects constitute the majority in RSI, the rationale for designing a method targeting large-scale objects should be discussed more scientifically and with a clearer categorization.
In the introduction, the writing could be more concise. When presenting the modules and contributions in bullet points, I believe the first two points describing the modules can be incorporated into the last three contribution points. This would make the paragraph structure more compact and avoid redundancy.
Figure 2 is clear and visually appealing, but it is recommended to divide it into regions labeled as A, B, and C, which would better align with the related paragraphs.
Please check the abbreviation usage throughout the paper. Some technical metrics, such as mAP50, are mentioned without providing their full forms when first introduced.
The understanding of large-scale objects mentioned in the paper should be supported by literature. I recommend the following article for reading and citation, as it contains definitions of large, medium, and small objects in remote sensing that are relevant to this work:
https://doi.org/10.3390/rs17121965
Please check the citation of all figures in the text. For example, I noticed that Figure 3 is not cited and lacks detailed description, which affects readability.
Regarding performance comparisons, the paper evaluates against some YOLO series models, but given the rapid updates of these models, the authors could also consider including YOLOv8 series in the comparisons. Some transformer-based detectors are also interesting and worth including.
Author Response
Comment 1: In the abstract, the authors should provide a more concise explanation of the uniqueness of the challenge. Since small objects constitute the majority in RSI, the rationale for designing a method targeting large-scale objects should be discussed more scientifically and with a clearer categorization.
Response 1: Thank you for this insightful suggestion. We agree that a more concise and scientifically grounded explanation of the challenge's uniqueness is essential. To address this:
- We revised the Abstract to emphasize the impracticality of single large-scale patches on physically large, irregularly shaped high-value targets in RSI, while introducing the multi-compact-patch strategy and its two core challenges (placement and potency).
- In the Introduction, we expanded the discussion with a scientific rationale, citing Dong et al. on RSI object scale distribution (e.g., small targets ~60%), and clarified the categorization of scales, justifying our focus on large targets due to their operational significance and deployment challenges.
These changes improve the logical flow and motivation. Please refer to the revised Abstract (page 1) and Introduction (pages 2-3) in the updated manuscript.
Comment 2: In the introduction, the writing could be more concise. When presenting the modules and contributions in bullet points, I believe the first two points describing the modules can be incorporated into the last three contribution points. This would make the paragraph structure more compact and avoid redundancy.
Response 2: Thank you for this helpful recommendation. We agree that streamlining the introduction by integrating the module descriptions into the contribution points reduces redundancy and improves conciseness. To address this:
- We revised the Introduction by merging the two module-specific bullet points (describing ASAP and DMPG) into the subsequent contribution summary. This creates a more compact structure, where the contributions are now presented in a refined three-point list that explicitly incorporates the modules' roles without repetition.
- Additionally, we strengthened our research motivation in the introduction and provided relevant background support, including citations to scale distribution statistics in RSI (e.g., from Dong et al.), to further enhance the section's scientific grounding and flow.
These changes effectively reduce verbosity while maintaining completeness. Please refer to the updated Introduction (pages 2-3, lines 56-88 in the revised manuscript) for the modifications.
Comment 3: Figure 2 is clear and visually appealing, but it is recommended to divide it into regions labeled as A, B, and C, which would better align with the related paragraphs.
Response 3: Thank you for this positive feedback and practical suggestion. We agree that dividing Figure 2 into labeled regions would improve its alignment with the corresponding paragraphs, enhancing logical clarity and readability. To address this:
- We revised Figure 2 by restructuring it into two main labeled sections (A and B), directly corresponding to the paper's two key innovative modules: Section A for the Adaptive Sensitivity-Aware Positioning (ASAP) module and Section B for the Distillation-based Mini-Patch Generation (DMPG) module. This adjustment strengthens the visual-logical correspondence with the method description in Section 3.
These changes make the figure more intuitive and better integrated with the text. Please refer to the updated Figure 2 (page 5) and its caption in the revised manuscript.
Comment 4: Please check the abbreviation usage throughout the paper. Some technical metrics, such as mAP50, are mentioned without providing their full forms when first introduced.
Response 4: Thank you for pointing this out. We agree that consistent and proper introduction of abbreviations is crucial for readability and clarity. To address this, we conducted a thorough review of all abbreviations in the manuscript and ensured that each is expanded to its full form upon first appearance. Specific changes include:
- For metrics like mAP (mean Average Precision), we confirmed that its full form is provided in Section 4.1.2 (Evaluation Metrics) at its initial mention. In our experiments, we actually used mAP@0.45 (mean Average Precision at an IoU threshold of 0.45) instead of the common mAP50, as this aligns with the parameter settings in the prominent RSI patch attack paper APPA [21], ensuring consistency and comparability with state-of-the-art methods in remote sensing.
- Other abbreviations, such as ASR (Attack Success Rate) in the Abstract, RSI (Remote Sensing Image) in the Abstract, MAR20 (Military Aircraft Recognition 20 dataset) in Section 4.1.1, and RSOD (Remote Sensing Object Detection dataset) in Section 4.1.1, have been expanded at their first occurrences where needed.
- We also checked technical terms such as NMS (Non-Maximum Suppression) in Section 3.3.1 and GIoU (Generalized Intersection-over-Union) in Section 3.3.2, confirming that their full forms are provided at first mention.
These revisions ensure all abbreviations are properly introduced, improving the manuscript's accessibility. Please refer to the relevant sections (e.g., Abstract on page 1, Sections 3.2-3.3 on pages 6-10, and Section 4.1 on pages 10-11) in the updated manuscript.
Comment 5: The understanding of large-scale objects mentioned in the paper should be supported by literature. I recommend the following article for reading and citation, as it contains definitions of large, medium, and small objects in remote sensing that are relevant to this work: https://doi.org/10.3390/rs17121965
Response 5: Thank you for this excellent recommendation. We fully agree that supporting our discussion of large-scale objects with relevant literature is crucial, as it provides significant data backing for our research motivation. This is indeed a vital aspect of justifying our focus on adversarial attacks tailored to large-scale targets in remote sensing images (RSI). To address this:
- We have incorporated the suggested reference (Dong et al. [7]) into the Introduction section, citing its definitions and statistics on object scales in RSI (e.g., small objects constituting approximately 60% of instances, while large-scale objects like aircraft and ships pose unique challenges due to their physical size and high-value applications). This addition strengthens the scientific foundation of our problem statement and categorizes object scales more clearly.
These changes enhance the manuscript's rigor and motivational support. Please refer to the updated Introduction (pages 2-3, lines 41-52 in the revised manuscript) for the modifications.
Comment 6: Please check the citation of all figures in the text. For example, I noticed that Figure 3 is not cited and lacks detailed description, which affects readability.
Response 6: Thank you for this important observation. We agree that proper citation and detailed description of all figures are essential for maintaining readability and logical flow in the manuscript. To address this, we have completed a thorough check of all figure citations throughout the text. Specific changes include:
- We ensured every figure is cited at the appropriate point in the narrative where it is first relevant. For instance, Figure 3 (feature map visualization) is now explicitly cited in Section 3.2 (Adaptive Sensitivity-Aware Positioning Module), with an added detailed description explaining its components: the visualization of Mi (coarse semantic regions from LayerCAM), Ma (fine-grained adversarial sensitive areas from gradient maps), and their role in highlighting the importance of selected features for attack effectiveness (a brief illustrative sketch of this map fusion is included after this response).
- Additionally, we reviewed and enhanced captions for all figures to provide more comprehensive descriptions where needed, ensuring they align closely with the text. This includes cross-references for Figures 1, 2, 4-9, confirming no omissions remain.
These revisions improve the manuscript's coherence and accessibility. Please refer to the updated sections (e.g., Section 3.2 on page 6 for Figure 3, and throughout the manuscript for other figures) in the revised version.
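For illustration, the sketch below shows one way the two maps visualized in Figure 3 could be combined into a fused sensitivity map. It is a simplified stand-in written for this response; the smoothing kernel size and the fusion weights (w1, w2) are placeholder values, not the settings used in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fuse_sensitivity_maps(m_i, grad, kernel_size=5, w1=0.5, w2=0.5):
    """Illustrative fusion of a coarse semantic map Mi (e.g., from LayerCAM)
    with a fine-grained sensitivity map Ma derived from input gradients.
    kernel_size, w1, and w2 are placeholders, not the paper's settings."""
    # Fine-grained map Ma: gradient magnitude smoothed with an S x S kernel.
    m_a = uniform_filter(np.abs(grad), size=kernel_size)
    # Normalize both maps to [0, 1] so the weighted fusion is balanced.
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-8)
    # Weighted fusion Mf = w1 * Mi + w2 * Ma.
    return w1 * norm(m_i) + w2 * norm(m_a)

# Toy usage with random arrays standing in for LayerCAM output and gradients.
rng = np.random.default_rng(0)
m_f = fuse_sensitivity_maps(rng.random((64, 64)), rng.standard_normal((64, 64)))
print(m_f.shape, float(m_f.min()), float(m_f.max()))
```

In the actual pipeline, the fused map is subsequently clustered (e.g., with OPTICS) to select the k patch placement sites, as described in Section 3.2.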
Comment 7: Regarding performance comparisons, the paper evaluates against some YOLO series models, but given the rapid updates of these models, the authors could also consider including YOLOv8 series in the comparisons. Some transformer-based detectors are also interesting and worth including.
Response 7: Thank you for this thoughtful suggestion. We agree that incorporating recent advancements like the YOLOv8 series would strengthen the comprehensiveness of our performance comparisons, given the rapid evolution of detection models. However, after careful consideration, we have decided not to include transformer-based detectors, as our study and cited baseline methods primarily target CNN-based detectors. Adding transformer-based models could dilute the focus of our arguments, which center on vulnerabilities in CNN architectures commonly used in remote sensing object detection.
- We have added comparisons with YOLOv8 (specifically YOLOv8n for efficiency) to Table 2 (Performance Comparison on ASR and mAP) in Section 4.2 and Table 3 (Cross-model Transferability) in Section 4.3. The results demonstrate that SDMPA achieves competitive performance, e.g., an ASR of 79.83% on YOLOv8 on the MAR20 dataset, outperforming baselines such as Thys et al. and APPA.
These revisions improve the experimental robustness while maintaining thematic coherence. Please refer to the updated Table 2 (page 12) and Table 3 (page 13) in the revised manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
Reviewer Comments
This paper proposes SDMPA, a two-module framework for adversarial patch attacks on remote sensing object detectors. The first module, Adaptive Sensitivity-Aware Positioning (ASAP), fuses an explainability-based attention map and an adversarial-gradient map to select multiple patch locations; the second, Distillation-based Mini-Patch Generation (DMPG), transfers “adversarial knowledge” from larger teacher patches to small student patches so that several compact patches can jointly defeat a detector. The abstract and introduction clearly state the motivation—single large patches are impractical for large physical targets in RSI and downsizing a single patch weakens attack strength—while positioning the contribution as a multi-mini-patch attack tailored to object detection in RS imagery, with reported gains on RSOD and MAR20 and cross-model transferability (YOLOv3/5 and Faster R-CNN).
On novelty, the paper’s main claim is that it is “the first to propose a multi-mini-patch attack strategy for the remote sensing object detection task,” contrasting with prior multi-patch work oriented to classification rather than detection. That positioning is plausible and the manuscript does cite Huang et al. as the closest antecedent in classification, then argues that detection remains open due to coupled localization and classification challenges. I encourage tempering the priority claim to “to the best of our knowledge” and adding a short comparative paragraph that systematically differentiates SDMPA from existing patch-based aerial/detection attacks (e.g., APPA, benchmarking studies) in terms of the number of patches, placement policy, and whether any prior work distilled patch content. That will prevent over-claiming and make the contribution boundary crisper.
Methodologically, the ASAP module is a sound idea and is described with helpful equations (LayerCAM-based Mi, gradient-derived Ma, weighted fusion Mf, OPTICS to select k sites). For reproducibility, please report the exact hyper-parameters: the fusion weights (w1, w2) and how they were tuned; the smoothing kernel size S in Ma; and the OPTICS settings (min_samples, min_cluster_size/xi). A sensitivity plot showing ASR as a function of (w1, w2) and k (number of patches) would also illuminate how robust the placement policy is. Similarly, the DMPG losses are well motivated: teacher loss combines objectness suppression (Ld), total variation, and saliency terms; student loss adds a three-part distillation (confidence, class, box) with a final weighted sum (Lstu). Here too, please specify the NMS settings used when computing Ld (IoU threshold and score threshold), the temperature τ in Lcls, the matching rule for K pairs in Lconf/Lbbox, and the weights (λconf, λcls, λbbox, and γ in Lstu). The ablation in Table 5 intriguingly shows that adding the bounding-box term can sometimes hurt performance; a brief analysis (e.g., gradient interference between Lbbox and Lconf/Lcls) and a pointer to cases where it helps would be useful.
The experimental section is comprehensive and, in general, convincing. The datasets are appropriate and clearly described, with aircraft emphasized on RSOD and 7:3 train/val splits; images are resized to 640×640, and detectors are trained only on the train set, which is good practice. The strength of the paper is the use of both ASR and mAP to quantify attack strength, along with white-box and black-box transfer tests; SDMPA outperforms the two chosen baselines under equal perturbation budgets (e.g., 9% total area) and exhibits good cross-model transferability, which supports the claims around efficacy and generalization. However, there is a threshold inconsistency that needs to be reconciled: the metric definition section evaluates with IoU = 0.5 and confidence = 0.25, whereas the implementation section later states validation with IoU = 0.45 and confidence = 0.40. Please align these thresholds across the paper (and tables/figures) and re-report if necessary, so that ASR/mAP are computed consistently.
The ablations are helpful and support both modules: enabling ASAP yields substantial ASR gains at fixed patch counts, and adding DMPG further improves ASR/mAP while smoothing optimization instability as the number of patches grows. The additional explorations—e.g., diminishing returns with many patches, optimal teacher-to-student size ratios, and the observation that confidence+class distillation is the strongest pair—are insightful and should remain prominent in the final version.
Two aspects deserve further strengthening. First, baselines: APPA and the method of Thys et al. are reasonable choices, but given the claim of advancing deployability, consider adding at least one modern detection-oriented patch baseline (e.g., DPatch or Naturalistic Physical Patch) re-implemented under the same perturbation budget to help readers situate improvements; your related-work section cites these families, which suggests they are in scope even if originally developed on natural images. Second, physical-world validation: the “printed patch attached to images and re-captured by camera” test demonstrates some rotational robustness but is still a lab proxy for aerial observation. Please be explicit about distances, capture angles, print resolution/material, and how scale changes were handled; discuss the feasibility of deploying multiple small patches on large physical targets (e.g., aircraft) and the intended real-world surface(s). Outlining these constraints will contextualize claims of deployability and better guide downstream defensive research.
Reproducibility would benefit from additional detail and, ideally, code release. You provide hardware/software versions and some training settings (learning rate = 0.1, 100 epochs, Adam; Table 1 loss weights; brightness adjustment), which is helpful; please add random seeds, patch initialization scheme, exact patch size schedule (area % per patch and mask geometry), and ASAP/DMPG hyper-parameters in a single table or appendix. Including per-experiment runtime and the number of optimization steps per patch would also aid practitioners in comparing computing budgets and scalability.
Author Response
Comment 1: On novelty, the paper’s main claim is that it is “the first to propose a multi-mini-patch attack strategy for the remote sensing object detection task,” contrasting with prior multi-patch work oriented to classification rather than detection. That positioning is plausible and the manuscript does cite Huang et al. as the closest antecedent in classification, then argues that detection remains open due to coupled localization and classification challenges. I encourage tempering the priority claim to “to the best of our knowledge” and adding a short comparative paragraph that systematically differentiates SDMPA from existing patch-based aerial/detection attacks (e.g., APPA, benchmarking studies) in terms of the number of patches, placement policy, and whether any prior work distilled patch content. That will prevent over-claiming and make the contribution boundary crisper.
Response 1: We greatly appreciate your insightful comment on tempering the novelty claim and adding differentiation, as it indeed helps clarify the boundaries of our contribution. In response, we have made the following revisions:
- Modified the priority statement in Introduction to "to the best of our knowledge."
- Added a short comparative paragraph in Section 2.1 (Related Work), discussing prior patch attack modalities, academic attempts at multi-patch attacks, and explicitly defining our innovation boundaries: the first systematic study of multi-patch joint attacks in RSI object detection scenarios, and the first use of knowledge distillation for miniaturized patch generation.
Comment 2: Methodologically, the ASAP module is a sound idea and is described with helpful equations (LayerCAM-based Mi, gradient-derived Ma, weighted fusion Mf, OPTICS to select k sites). For reproducibility, please report the exact hyper-parameters: the fusion weights (w1, w2) and how they were tuned; the smoothing kernel size S in Ma; and the OPTICS settings (min_samples, min_cluster_size/xi). A sensitivity plot showing ASR as a function of (w1, w2) and k (number of patches) would also illuminate how robust the placement policy is. Similarly, the DMPG losses are well motivated: teacher loss combines objectness suppression (Ld), total variation, and saliency terms; student loss adds a three-part distillation (confidence, class, box) with a final weighted sum (Lstu). Here too, please specify the NMS settings used when computing Ld (IoU threshold and score threshold), the temperature τ in Lcls, the matching rule for K pairs in Lconf/Lbbox, and the weights (λconf, λcls, λbbox, and γ in Lstu). The ablation in Table 5 intriguingly shows that adding the bounding-box term can sometimes hurt performance; a brief analysis (e.g., gradient interference between Lbbox and Lconf/Lcls) and a pointer to cases where it helps would be useful.
Response 2: We sincerely thank you for your valuable suggestions on detailing hyperparameters, adding sensitivity analysis, and discussing the ablation results, which greatly enhance the reproducibility and depth of our methodology. In response, we have implemented the following revisions:
- Introduced experimental values for relevant parameters (e.g., w1, w2; S; OPTICS settings; NMS IoU/score thresholds; τ; matching rules; λconf, λcls, λbbox, γ) immediately after their proposal in Sections 3.2 and 3.3 (a brief illustrative sketch of how the student-loss terms combine is provided after this response).
- Added a table in the Appendix summarizing all hyperparameters for better readability.
- Expanded discussion in Section 4.4.2 on why adding the bounding-box term sometimes hurts performance (e.g., potential gradient interference with Lconf and Lcls leading to conflicting optimization), with pointers to cases where it helps (e.g., in scenarios with high localization errors).
These changes enhance reproducibility and methodological depth. Please refer to updated Sections 3.2-3.3 (pages 5-10), Section 4.4 (pages 15-17), and Appendix.
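As a further reading aid, the sketch below illustrates how the student-loss hyperparameters named above interact in a weighted objective of the form Lstu = λconf·Lconf + λcls·Lcls + λbbox·Lbbox. The specific loss forms, the temperature τ, and the weights are placeholders assumed for this example (the γ-weighted term of Lstu is omitted here), so it should be read as a schematic rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def student_distillation_loss(stu, tea, tau=2.0,
                              lam_conf=1.0, lam_cls=1.0, lam_bbox=0.5):
    """Schematic Lstu over K matched teacher/student prediction pairs.
    All weights and tau are placeholder values, not the paper's settings."""
    # Confidence distillation: student objectness tracks teacher objectness.
    l_conf = F.mse_loss(stu["obj"], tea["obj"])
    # Class distillation: KL divergence between temperature-softened class
    # distributions (the standard soft-label distillation form).
    l_cls = F.kl_div(F.log_softmax(stu["cls"] / tau, dim=-1),
                     F.softmax(tea["cls"] / tau, dim=-1),
                     reduction="batchmean") * tau ** 2
    # Box distillation: here a simple L1 term on matched boxes (a GIoU-style
    # term could be substituted).
    l_bbox = F.l1_loss(stu["box"], tea["box"])
    return lam_conf * l_conf + lam_cls * l_cls + lam_bbox * l_bbox

# Toy usage: K = 8 matched pairs, 20 classes (as in MAR20), xywh boxes.
K, C = 8, 20
make = lambda: {"obj": torch.rand(K), "cls": torch.randn(K, C), "box": torch.rand(K, 4)}
print(float(student_distillation_loss(make(), make())))
```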
Comment 3: The experimental section is comprehensive and, in general, convincing. The datasets are appropriate and clearly described, with aircraft emphasized on RSOD and 7:3 train/val splits; images are resized to 640×640, and detectors are trained only on the train set, which is good practice. The strength of the paper is the use of both ASR and mAP to quantify attack strength, along with white-box and black-box transfer tests; SDMPA outperforms the two chosen baselines under equal perturbation budgets (e.g., 9% total area) and exhibits good cross-model transferability, which supports the claims around efficacy and generalization. However, there is a threshold inconsistency that needs to be reconciled: the metric definition section evaluates with IoU = 0.5 and confidence = 0.25, whereas the implementation section later states validation with IoU = 0.45 and confidence = 0.40. Please align these thresholds across the paper (and tables/figures) and re-report if necessary, so that ASR/mAP are computed consistently.
Response 3: Thank you very much for your positive evaluation of the experimental section and for kindly pointing out the inconsistency in the IoU and confidence thresholds. We deeply apologize for this oversight during the previous revisions of the manuscript and appreciate your attention to detail. To address this issue, we have unified the thresholds to the values actually used in our experiments: IoU = 0.45 and confidence = 0.4. This choice aligns with the parameters employed in the excellent paper APPA, ensuring overall comparability with baseline methods.
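For transparency about how the unified thresholds enter the evaluation, the following minimal sketch shows one simplified way an image-level ASR could be computed with a confidence threshold of 0.4 and an IoU threshold of 0.45. The ASR definition used here (fraction of ground-truth objects no longer matched by any confident detection after the attack) is an illustrative stand-in, not our exact evaluation code.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def attack_success_rate(gt_boxes, detections, conf_thr=0.40, iou_thr=0.45):
    """Simplified ASR: share of ground-truth objects that no detection with
    score >= conf_thr still overlaps at IoU >= iou_thr after the attack."""
    kept = [d for d in detections if d["score"] >= conf_thr]
    missed = sum(1 for g in gt_boxes
                 if not any(iou(g, d["box"]) >= iou_thr for d in kept))
    return missed / max(len(gt_boxes), 1)

# Toy example: two targets, only one still detected after the attack.
gts = [(10, 10, 60, 60), (100, 100, 180, 160)]
dets = [{"box": (12, 11, 58, 62), "score": 0.55}]
print(attack_success_rate(gts, dets))  # -> 0.5
```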
Comment 4: The ablations are helpful and support both modules: enabling ASAP yields substantial ASR gains at fixed patch counts, and adding DMPG further improves ASR/mAP while smoothing optimization instability as the number of patches grows. The additional explorations—e.g., diminishing returns with many patches, optimal teacher-to-student size ratios, and the observation that confidence+class distillation is the strongest pair—are insightful and should remain prominent in the final version.
Response 4: Thank you sincerely for your encouraging feedback on the ablation studies and for highlighting their value. We are grateful for your recognition of these analyses as insightful. In response to your comments and to further strengthen the discussion, we have expanded the analysis in the revised manuscript to include a detailed explanation of why the bounding box loss (Lbbox) may reduce attack effectiveness in certain configurations. This addition builds on your suggestions and helps provide a more comprehensive understanding.
Comment 5: Two aspects deserve further strengthening. First, baselines: APPA and the method of Thys et al. are reasonable choices, but given the claim of advancing deployability, consider adding at least one modern detection-oriented patch baseline (e.g., DPatch or Naturalistic Physical Patch) re-implemented under the same perturbation budget to help readers situate improvements; your related-work section cites these families, which suggests they are in scope even if originally developed on natural images. Second, physical-world validation: the “printed patch attached to images and re-captured by camera” test demonstrates some rotational robustness but is still a lab proxy for aerial observation. Please be explicit about distances, capture angles, print resolution/material, and how scale changes were handled; discuss the feasibility of deploying multiple small patches on large physical targets (e.g., aircraft) and the intended real-world surface(s). Outlining these constraints will contextualize claims of deployability and better guide downstream defensive research.
Response 5: Thank you for these insightful and constructive suggestions, which have significantly strengthened our baseline comparisons and physical-world validation. We greatly appreciate your guidance on advancing the deployability claims. In response, we have made the following detailed modifications:
- We have added an additional baseline experiment using NAP, which is the only baseline method that generates patches with a GAN, enabling a more comprehensive comparison in our paper.
- We have provided a detailed description of the physical experiment's parameters and analyzed the feasibility of applying the attack method to real targets.
These changes strengthen the paper's claims on deployability and provide better context for future research. Please refer to the updated Section 4.2, Section 4.4.3, Table 2, and Figure 9 in the revised manuscript.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The quality of the article has improved after the authors' revision, so it is recommended for acceptance.
Reviewer 2 Report
Comments and Suggestions for Authors
The necessary changes have been applied. The article draft is now ready for publication.