Research on Multi-Modal Fusion Detection Method for Low-Slow-Small UAVs Based on Deep Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. The differences between DPAM and the existing dual-path attention modules have not been clearly defined. Only the module stacking is briefly described. It is suggested to clarify the innovativeness.
2. There is a lack of innovative argumentation at the mechanism level, and the effectiveness of module combinations is only inferred from experimental results.
3. Complete and supplement the explanations of formulas 2 and 3. Some subscripts are written incorrectly. It is recommended to check the overall content to reduce low-level mistakes.
4. The explanation of Formula 11 in Section 3.1.1 is incorrect; Formulas 12 and 13 are incorrect.
5. Some of the chart information is complex, and the content in Figures 3 and 4 is exactly the same.
6. It is suggested to supplement the dataset scenarios and add real scenarios to verify the model performance.
7. The experimental section is not sufficient. It is suggested to add more experiments and improve the experimental content to support the methods proposed in the paper.
8. In the overall performance comparison experiment, the two methods of "feature weighted average fusion" and "voting decision fusion" are too outdated. It is suggested to add representative algorithms from recent years.
9. It is suggested that more comparison algorithm detection result image comparisons be added to the visualization.
10. The ablation experiment is not perfect enough. It is suggested to supplement it.
Author Response
Reviewer#1, Concern # 1: The differences between DPAM and the existing dual-path attention modules have not been clearly defined. Only the module stacking is briefly described. It is suggested to clarify the innovativeness.
Author response: We sincerely appreciate this insightful comment. We have now explicitly clarified the innovativeness of the Dual-Path Attention Module (DPAM) by comparing it with existing attention mechanisms and highlighting its parallel structure and dynamic gating mechanism.
Author action: We have added a detailed explanation in Section 2.2.1 (Page 4), comparing DPAM with traditional serial structures like CBAM [33] and parallel structures like Dual Attention Network [35], and emphasizing its advantages through "parallel, non-serial processing" and a "dynamic gating mechanism" for preserving weak and multi-scale features of LSS-UAVs.
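To make the contrast with serial designs such as CBAM concrete, below is a minimal PyTorch sketch of a parallel dual-path attention block with a learned gate. The structure, layer choices, and names are illustrative assumptions for exposition, not the paper's exact DPAM implementation.

```python
import torch
import torch.nn as nn

class DualPathAttentionSketch(nn.Module):
    """Illustrative parallel channel/spatial attention with a dynamic gate.

    A hedged sketch of the "parallel, non-serial processing" plus
    "dynamic gating" ideas described above; not the paper's exact DPAM.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel path: squeeze-and-excitation style global descriptor.
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial path: 7x7 conv over pooled per-pixel channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Dynamic gate: learns an input-dependent mix of the two paths.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chan_att = self.channel_fc(x) * x                  # channel-refined features
        avg_map = x.mean(dim=1, keepdim=True)              # spatial statistics
        max_map = x.max(dim=1, keepdim=True).values
        spat_att = self.spatial_conv(torch.cat([avg_map, max_map], dim=1)) * x
        g = self.gate(x)                                   # (B, 2, 1, 1) mixing weights
        # Parallel fusion: unlike serial CBAM, neither path sees the other's output,
        # so weak responses suppressed by one path can survive through the other.
        return g[:, 0:1] * chan_att + g[:, 1:2] * spat_att
```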
Reviewer#1, Concern # 2: There is a lack of innovative argumentation at the mechanism level, and the effectiveness of module combinations is only inferred from experimental results.
Author response: We agree that deeper theoretical justification is needed. We have now supplemented the methodological rationale for the module combinations.
Author action: We have enhanced the theoretical discussion in Section 2.2.1 (Page 4) by adding a paragraph that explicitly states the innovation of DPAM relative to prior work [33, 35]. We have also strengthened the rationale throughout Section 3.1 for the EADW mechanism.
Reviewer#1, Concern # 3: Complete and supplement the explanations of formulas 2 and 3. Some subscripts are written incorrectly. It is recommended to check the overall content to reduce low-level mistakes.
Author response: We thank the reviewer for pointing out these issues. We have thoroughly checked all formulas, corrected errors, and provided more detailed explanations.
Author action: We have revised Equations (2) and (3) in Section 2.2.1 (Page 4), ensuring correct subscripts and adding clarifying text. For example, the IOU calculation in Eq. (2) is now explicitly defined, and the components of Eq. (3) are clearly explained.
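The IoU referenced in Eq. (2) follows the standard definition; a minimal reference implementation for axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Standard intersection-over-union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0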
Reviewer#1, Concern # 4: The explanation of Formula 11 in Section 3.1.1 is incorrect; Formulas 12 and 13 are incorrect.
Author response: We apologize for these errors. We have re-derived and corrected these equations and provided accurate descriptions.
Author action: We have revised Equations (11), (12), and (13) in Section 3.1.1 (Page 9). The illumination factor e1 in Eq. (11) is now correctly normalized. Eqs. (12) and (13) have been corrected, and the source of the environmental data (onboard sensors, weather API) is now clearly stated in the accompanying text.
Reviewer#1, Concern # 5: Some of the chart information is complex, and the content in Figures 3 and 4 is exactly the same.
Author response: We apologize for this oversight. We have ensured that Figures 3 and 4 are distinct and clearly illustrate different parts of the fusion process.
Author action: We have provided distinct and clearly labeled diagrams for Figure 3 (Page 8) and Figure 4 (Page 10). Their captions have been updated to accurately describe the feature-level fusion architecture and the decision-level fusion process, respectively.
Reviewer#1, Concern # 6: It is suggested to supplement the dataset scenarios and add real scenarios to verify the model performance.
Author response: We agree that more diverse scenarios are needed. We have added details on the dataset's environmental coverage and the sampling strategy to ensure diversity.
Author action: We have supplemented the dataset description in Section 4.1 (Page 12), detailing the "systematic sampling strategy" that combines fixed intervals and optical-flow-based selection to capture temporal diversity and significant events, covering urban, suburban, and near-water areas.
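As a rough illustration of such a hybrid sampling strategy, the following sketch combines fixed-interval selection with Farneback optical flow to flag high-motion "event" frames. The interval and flow threshold are illustrative assumptions, not the paper's settings.

```python
import cv2
import numpy as np

def select_keyframes(video_path, interval=30, flow_thresh=2.0):
    """Hedged sketch: fixed-interval sampling plus optical-flow-triggered frames.

    `interval` and `flow_thresh` are illustrative, not the paper's values.
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        keep = (idx % interval == 0)  # temporal diversity via fixed intervals
        if prev_gray is not None and not keep:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # A large mean flow magnitude marks a significant-motion frame.
            keep = np.linalg.norm(flow, axis=2).mean() > flow_thresh
        if keep:
            keyframes.append(idx)
        prev_gray, idx = gray, idx + 1
    cap.release()
    return keyframes
```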
Reviewer#1, Concern # 7: The experimental section is not sufficient. It is suggested to add more experiments and improve the experimental content to support the methods proposed in the paper.
Author response: We have significantly expanded the experimental section to include more comprehensive evaluations.
Author action: We have added a "Computational Cost Analysis" in Section 4.2 (Page 14) with Table 2 reporting Params, FLOPs, and FPS. We have also added recent SOTA methods for comparison in Table 1 and stated that results include "standard deviations over three independent runs" in Section 4.1 (Page 13).
Reviewer#1, Concern # 8: In the overall performance comparison experiment, the two methods of "feature weighted average fusion" and "voting decision fusion" are too outdated. It is suggested to add representative algorithms from recent years.
Author response: We thank the reviewer for this suggestion. We have updated the comparison baseline to include recent state-of-the-art methods.
Author action: We have added TransFuser [33] and CMX [34] as comparison methods in Table 1 (Page 13) and in the corresponding text in Section 4.2.
Reviewer#1, Concern # 9: It is suggested that more comparison algorithm detection result image comparisons be added to the visualization.
Author response: We have enriched the visual comparisons to better illustrate the performance differences.
Author action: We have expanded the captions for Figures 5 and 6 (Page 15) to provide a more detailed and concrete analysis of the visual results, explaining why the proposed method succeeds where single-modal methods fail in specific scenarios (e.g., urban thermal interference, low-light conditions).
Reviewer#1, Concern # 10: The ablation experiment is not perfect enough. It is suggested to supplement it.
Author response: We have enhanced the ablation study to clearly show the contribution of each key component.
Author action: We have maintained the structured ablation study in Table 4 (Page 16) and provided a detailed analysis of the performance gain from each component (EADW, D-S, CA) in Section 4.4.
Author Response File:
Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This paper analyzes the challenges of LSS-UAV detection in complex environments, such as weak features and environmental interference, and proposes a specialized solution. It combines complementary information from visible light and infrared images to improve detection performance under various lighting conditions. A hierarchical architecture fusing feature and decision layers is designed, balancing information preservation and computational efficiency. The Environment-Aware Dynamic Weighting (EADW) mechanism adjusts the weights of different modalities based on real-time environmental conditions, improving adaptability. D-S evidence theory is used to handle decision conflicts and quantification uncertainties, enhancing the system's robustness. Validation on the Anti-UAV-RGBT dataset shows that this method outperforms other methods.
Questions and suggestions for the authors:
1. Despite using the Anti-UAV-RGBT dataset, are the number of training and testing images sufficient to represent the diversity of real-world scenarios?
2. What are the computational resources and time costs required for training the complex deep learning model and multimodal fusion?
3. How does EADW accurately perceive various environmental factors (such as haze, nighttime lighting, etc.)?
4. How does the accuracy of the environmental perception module affect the final detection result?
5. What is the numerical range of the dynamic weight α in formula (15)?
6. The paper mainly focuses on detection accuracy. Has the real-time performance (detection speed) of the proposed method been evaluated?
7. Deep learning models are susceptible to adversarial attacks. How can the impact of adversarial attacks on system performance be prevented?
8. How do the number of model parameters and computational cost compare to other models?
9. Mutual occlusion of targets in a drone swarm can affect detection accuracy. Has this factor been considered in the paper?
10. Can the architecture proposed in the paper be extended to the fusion of other modalities, such as fusion of radar and visible light imagery?
11. The paper only mentions that the theory fuses two levels of data. What is the formula?
Comments on the Quality of English Language
Please make good use of relative pronouns to improve the readability of the article.
Author Response
Reviewer#2, Concern # 1: Despite using the Anti-UAV-RGBT dataset, are the number of training and testing images sufficient to represent the diversity of real-world scenarios?
Author response: We have supplemented the dataset description to justify the sample size and diversity.
Author action: We have detailed the dataset construction process in Section 4.1 (Page 12), explaining the keyframe extraction strategy (10 frames/video) designed to ensure diversity and mitigate redundancy, resulting in a total of 6360 image pairs covering multiple environments and conditions.
Reviewer#2, Concern # 2: What are the computational resources and time costs required for training the complex deep learning model and multimodal fusion?
Author response: We have added a detailed analysis of computational costs.
Author action: We have added a dedicated "Computational Cost Analysis" paragraph in Section 4.2 (Page 14) and Table 2 summarizing Parameters, FLOPs, and Inference Speed (FPS) for all models, including the proposed method and baselines.
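For reference, Params and FPS figures of the kind reported in Table 2 are typically measured along the following lines in PyTorch; FLOPs usually come from a profiler such as fvcore or ptflops. This is an illustrative sketch, not the paper's benchmarking script.

```python
import time
import torch

def count_params_and_fps(model, input_shape=(1, 3, 640, 640), warmup=10, runs=100):
    """Illustrative measurement of parameter count (M) and inference FPS."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):           # warm up kernels and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()      # exclude queued async work from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    fps = runs / (time.perf_counter() - start)
    return params_m, fps
```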
Reviewer#2, Concern # 3: How does EADW accurately perceive various environmental factors (such as haze, nighttime lighting, etc.)?
Author response: We have clarified the environmental perception mechanism of EADW.
Author action: We have expanded Section 3.1.1 (Page 8) to explicitly state the real-time sources for each environmental factor: "onboard light sensors, public weather API data (for visibility and humidity), and system clock information."
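As a hedged sketch of how such heterogeneous inputs might be normalized into an environment descriptor, consider the following; the field names and normalization constants are assumptions for illustration, not the paper's Equations (11)-(13).

```python
from datetime import datetime

def build_env_vector(light_lux, visibility_km, humidity_pct, now=None):
    """Hedged sketch: map the inputs named above (light sensor, weather API,
    system clock) into [0, 1] environmental factors.

    The reference constants (10,000 lux, 10 km) are illustrative assumptions.
    """
    now = now or datetime.now()
    e_light = min(light_lux / 10000.0, 1.0)        # ~10k lux as daylight reference
    e_visibility = min(visibility_km / 10.0, 1.0)  # ~10 km as clear-air reference
    e_humidity = humidity_pct / 100.0
    e_daytime = 1.0 if 6 <= now.hour < 18 else 0.0
    return [e_light, e_visibility, e_humidity, e_daytime]
```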
Reviewer#2, Concern # 4: How does the accuracy of the environmental perception module affect the final detection result?
Author response: The EADW mechanism is designed to be robust. The dynamic weights are learned based on these inputs and the modal confidence, allowing the system to adapt even if sensor readings have minor inaccuracies.
Author action: The effect is implicitly demonstrated through the significant performance improvement in challenging environments (Nighttime, Haze) shown in Table 3 (Page 14) and discussed in Section 4.3, which validates the effectiveness of the overall environment-aware fusion strategy.
Reviewer#2, Concern # 5: What is the numerical range of the dynamic weight α in formula (15)?
Author response: We have clarified that α is normalized via Softmax and falls within [0,1].
Author action: We have added an explicit sentence after Equation (15) in Section 3.1.2 (Page 9): "The dynamic weight α_m for each modality m, computed by Equation (15), is normalized via Softmax and thus ranges between 0 and 1, with the sum of weights for all modalities equal to 1."
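The stated normalization property can be checked directly; a minimal sketch with illustrative raw scores for the two modalities:

```python
import torch

# Raw modality scores from the environment-aware branch (illustrative values).
scores = torch.tensor([1.3, -0.4])        # [visible, infrared]
alpha = torch.softmax(scores, dim=0)      # each alpha_m lies in (0, 1)
assert torch.isclose(alpha.sum(), torch.tensor(1.0))
print(alpha)                              # tensor([0.8455, 0.1545])
```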
Reviewer#2, Concern # 6: The paper mainly focuses on detection accuracy. Has the real-time performance (detection speed) of the proposed method been evaluated?
Author response: We have added real-time performance evaluation.
Author action: We have included Inference Speed (FPS) in the new Table 2 (Page 16) and discussed the real-time performance and the performance-efficiency trade-off in the "Computational Cost Analysis" part of Section 4.2.
Reviewer#2, Concern # 7: Deep learning models are susceptible to adversarial attacks. How can the impact of adversarial attacks on system performance be prevented?
Author response: We have added a discussion on adversarial robustness as a future research direction.
Author action: We have added a new future work item in Section 5 (Page 20): "(3) Robustness to Adversarial Attacks: Investigate the vulnerability... and develop corresponding defense mechanisms [36]...", citing a seminal work on adversarial attacks [36].
Reviewer#2, Concern # 8: How do the number of model parameters and computational cost compare to other models?
Author response: We have added a comparative analysis of model parameters and computational cost.
Author action: This information is now comprehensively presented in Table 2 (Page 14), which compares the Params (M) and FLOPs (G) of all methods.
Reviewer#2, Concern # 9: Mutual occlusion of targets in a drone swarm can affect detection accuracy. Has this factor been considered in the paper?
Author response: We have addressed target occlusion through an enhanced NMS strategy.
Author action: We describe the Density-Aware NMS strategy designed to handle bounding box overlap from cluster targets in Section 2.2.2 (Page 5) and mention its application in the infrared branch in Section 2.3.2 (Page 7).
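The paper's exact rule is given in Section 2.2.2; as a hedged sketch of the general idea, the NumPy version below raises the suppression threshold where detections are densely packed, so closely spaced cluster targets are not discarded as duplicates. It reuses the `iou` helper from the earlier sketch, and all parameters are illustrative assumptions.

```python
import numpy as np

def density_aware_nms(boxes, scores, base_thresh=0.5, density_gain=0.05, radius=50.0):
    """Hedged sketch of density-aware NMS for cluster targets.

    `boxes` is an (N, 4) array of (x1, y1, x2, y2); `scores` is length N.
    `base_thresh`, `density_gain`, and `radius` are illustrative values.
    """
    order = np.argsort(scores)[::-1]               # highest-confidence first
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Local density: remaining boxes within `radius` pixels of this centre.
        dists = np.linalg.norm(centers[order[1:]] - centers[i], axis=1)
        density = (dists < radius).sum()
        # Denser neighbourhoods tolerate more overlap before suppression.
        thresh = min(base_thresh + density_gain * density, 0.9)
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious <= thresh]
    return keep
```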
Reviewer#2, Concern # 10: Can the architecture proposed in the paper be extended to the fusion of other modalities, such as fusion of radar and visible light imagery?
Author response: We have discussed the extensibility of the proposed framework.
Author action: We have added a statement in the future work part of Section 5 (Page 17): "The proposed hierarchical fusion architecture is designed to be extensible for integrating more than two modalities."
Reviewer#2, Concern # 11: The paper only mentions that the theory fuses two levels of data. What is the formula?
Author response: We have clarified the fusion formulas at both levels.
Author action: The feature-level fusion is concretized by Equations (16) and (17) in Section 3.1.3 (Page 10). The decision-level fusion is detailed through the entire Section 3.2, culminating in the final decision rule in Equation (25).
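For readers unfamiliar with the machinery, below is a minimal sketch of standard Dempster combination over a binary frame Θ = {UAV, background}, where mass on Θ itself encodes uncertainty. This is textbook D-S theory; the paper's conflict-adaptive correction (Section 3.2.4) is not reproduced here.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for BPAs keyed by frozenset focal elements."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: evidence cannot be combined")
    # Normalize by (1 - K), redistributing the conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Frame of discernment Theta = {UAV, background}.
UAV, BG = frozenset({"uav"}), frozenset({"bg"})
THETA = UAV | BG
m_visible = {UAV: 0.7, BG: 0.1, THETA: 0.2}    # illustrative BPAs
m_infrared = {UAV: 0.6, BG: 0.2, THETA: 0.2}
print(dempster_combine(m_visible, m_infrared))
# {UAV: 0.85, BG: 0.10, THETA: 0.05} — agreement sharpens belief in "uav".
```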
Reviewer#2, Comments on the Quality of English Language: Please make good use of relative pronouns to improve the readability of the article.
Author response: We have thoroughly polished the English language throughout the manuscript.
Author action: We have performed a language edit, improving sentence structure, pronoun usage, and logical flow. All changes are reflected in the revised manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
As attached!
Comments for author File:
Comments.pdf
Author Response
Reviewer#3, Concern # 1: Lack of Clarity in Network Architecture and Fusion Process.
Author response: We have redesigned the network diagrams and added detailed explanations.
Author action: We have provided more detailed schematic diagrams for Figures 1 and 2 (Page 5, 6) and added descriptive captions. The fusion process is now clearly illustrated in Figures 3 and 4 (Page 8, 10) and explained step-by-step in Sections 3.1 and 3.2.
Reviewer#3, Concern # 2: Incomplete and Potentially Unfair Comparative Analysis.
Author response: We have updated the baseline methods to include recent SOTA algorithms.
Author action: We have included TransFuser [33] and CMX [34] in the performance comparison (Table 1, Page 13) and computational cost comparison (Table 2, Page 14).
Reviewer#3, Concern # 3: Insufficient Detail on the D-S Evidence Theory Implementation.
Author response: We have provided full details on D-S hyperparameters and conflict handling.
Author action: We have added specific example values for all D-S hyperparameters in Section 3.2.2 (Page 10) and detailed the conflict-adaptive handling mechanism, including its tuning factor, in Section 3.2.4 (Page 11).
Reviewer#3, Concern # 4: Dataset and Experimental Rigor.
Author response: We have justified the dataset sampling strategy and added computational complexity analysis.
Author action: We have detailed the keyframe sampling strategy in Section 4.1 (Page 12) and added a comprehensive computational complexity analysis (Table 2, Page 14). We also now report results as "mean ± standard deviation" (Section 4.1, Page 12).
Reviewer#3, Minor Concerns: Literature Review; Writing and Presentation; Ablation Study; Nighttime Sensor Effectiveness.
Author action: All minor concerns have been addressed throughout the manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript introduces a deep learning-based multimodal fusion detection technique that utilizes the complementary properties of visible and infrared light to address both weak signals from low-speed LSS-UAV cluster targets and complex environmental interference. Morphological features from visible-light images and thermal-radiation features from infrared images are extracted independently, then integrated within a hierarchical fusion framework incorporating attention mechanisms, dynamic weighting, and D–S evidence theory.
- How does the proposed method differ specifically from previous attention-based fusion methods? Under what conditions does it achieve quantitative superiority?
- What is the precise decision frame you adopt, and how are basic probability assignments obtained via confidence-to-BPA transformation from network outputs? Describe the procedure you use to verify evidence independence and to manage conflicts during combination, and explain how you correct for conditional non-independence between visible and infrared features. Please also detail how the dynamic weights interact with the D-S coupling pipeline, including your explicit overconfidence-suppression rule. For the infrared branch, justify the choice of dual-branch ConvNeXt-Tiny and UNet in terms of representational benefits and computational cost. Finally, specify how feature-level and decision-level modules are linked end-to-end, and what concrete procedures you employ to reduce cross-modal dependence and resolve conflicts between visible and infrared evidence using the dynamic weights and D-S coupling.
- Should the introduction be limited to the problem statement, research necessity, conceptual innovation, and summary of results with all comparisons and technical details moved to the Related Work section? Please revise to ensure this structure.
- Are all equations and symbols accurate and complete? Have you defined the Adaptive-SPP kernel size, SCA formula, and all output parameters in the thermal object detection head? Have you specified the output tensor dimensions, scaling factor, and projection matrix size? Can you organize these systematically and summarize tensor shapes and hyperparameters in a table? Has all irrelevant content been removed?
- What are the precise definitions of backbone, feature resolution, and modal paths in the manuscript? How do output resolutions relate to the removal of ResNet-50 Stage 4? Can you clarify the FPN configuration, the parameter budget inconsistencies, and the placement of the ConvNeXt-Tiny and UNet references?
- Are frame construction, BPA calculation, regularization, and collision mitigation procedures aligned with standard definitions? How are thresholds and correction factors validated or analyzed for sensitivity?
- Please specify the criteria used to calculate accuracy, precision, recall, and FAR in the experiments. The results currently report only mean values, without standard deviations, confidence intervals, or significance tests. Without these statistical measures, it is difficult to assess whether the observed differences between methods are statistically significant or reproducible.
- The manuscript lacks a causal interpretation of the experimental comparisons. The baseline and comparison methods utilize the same backbone, learning schedule, and parameter count, which reduces clarity regarding their differences. Although the benefits of dynamic weights and D-S/collision adaptation are claimed to be separately verified, there are no experiments addressing potential information leakage. Additionally, please review the manuscript for duplicate entries.
Author Response
Reviewer#4, Concern # 1: How does the proposed method differ specifically from previous attention-based fusion methods? Under what conditions does it achieve quantitative superiority?
Author response: We have added a comparative analysis with existing attention methods.
Author action: We have added explicit comparisons in Section 2.2.1 (Page 4), citing CBAM [33] and Dual Attention Network [35], and highlighting DPAM's "parallel, non-serial processing" and "dynamic gating mechanism". Quantitative superiority is demonstrated under all tested conditions in Table 1 (Page 13), and particularly under challenging conditions in Table 3 (Page 14).
Reviewer#4, Concern # 2: What is the precise decision frame...? ...how are basic probability assignments obtained...?
Author response: We have provided a detailed derivation of the BPA construction and conflict management.
Author action: We have detailed the entire D-S evidence theory process in Section 3.2 (Pages 10-12), including the precise definition of the frame of discernment Θ and power set 2^Θ in Section 3.2.1, the BPA construction rules in Section 3.2.2, the combination rule in Section 3.2.3, and the conflict-adaptive handling in Section 3.2.4.
Reviewer#4, Concern # 3: Should the introduction be limited to the problem statement, research necessity, conceptual innovation, and summary of results...?
Author response: We have restructured the introduction to focus on these elements.
Author action: We have revised Section 1 to focus on the problem, necessity, and our core innovations, moving detailed technical comparisons of related work into the body of the text (e.g., Section 2.2.1).
Reviewer#4, Concern # 4: Are all equations and symbols accurate and complete?... Can you organize these systematically...?
Author response: We have thoroughly checked all equations and symbols.
Author action: We have corrected all identified errors and ensured consistent notation throughout. The sequence of equations is now logically organized and explained.
Reviewer#4, Concern # 5: What are the precise definitions of backbone, feature resolution, and modal paths...?
Author response: We have clarified the network configurations.
Author action: We have specified the backbone modifications (e.g., "ResNet-50 with Stage4 removed") and output feature resolutions in Sections 2.2.1 and 2.3.1 (Pages 4, 6).
Reviewer#4, Concern # 6: Are frame construction, BPA calculation... aligned with standard definitions?... How are thresholds and correction factors validated...?
Author response: Our procedures are aligned with standard D-S theory. The thresholds were determined empirically via validation.
Author action: The D-S framework construction in Section 3.2 follows standard definitions. We have stated that the hyperparameters were set to specific values based on empirical validation on our dataset, which is standard practice.
Reviewer#4, Concern # 7: Please specify the criteria used to calculate accuracy, precision, recall, and FAR... The results currently report only mean values...
Author response: We have added definitions of the evaluation metrics and statistical measures.
Author action: We have added a precise definition of all evaluation metrics in Section 4.1 (Page 13) and stated that results are reported as "mean ± standard deviation over three independent runs".
Reviewer#4, Concern # 8: The manuscript lacks a causal interpretation... Please review the manuscript for duplicate entries.
Author action: We have enhanced the causal interpretation in the results analysis (e.g., linking EADW weight adjustment to performance in nighttime/haze conditions) and removed duplicate content after a thorough review of the manuscript.
Reviewer#4, Comments on the Quality of English Language: Inconsistent terminology and a lack of logical connection between sentences reduce readability.
Author response: We have unified terminology and improved logical flow.
Author action: A comprehensive language polish has been conducted to ensure terminology consistency (e.g., consistent use of "modality", "EADW", "D-S theory") and to improve sentence connectivity throughout the manuscript.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have incorporated all feedback; I recommend acceptance in the present form.
Reviewer 3 Report
Comments and Suggestions for Authors
No further comments.
Reviewer 4 Report
Comments and Suggestions for Authors
Overall, the authors have made a genuine effort to address the comments, both in structure and content. They now explain more clearly how DPAM differs from previous attention-based fusion methods in the results section, have reorganized the description of the D-S evidence theory into a more systematic form, and have specified the network backbone, feature resolutions, evaluation metrics, and the statistical summary of repeated experiments. Reproducibility would still benefit from a bit more detail, especially if the experimental conditions under which DPAM shows performance gains were described more explicitly, the rationale for D-S–related hyperparameter choices were stated, and the validation or verification procedure was briefly outlined. Even so, the current set of revisions seems broadly sufficient to address the main points raised in the original review.
Comments on the Quality of English Language
The text is generally clear but contains some awkward paragraph transitions. I recommend proofreading to refine these before publication.
