MSDF-Mamba: Mutual-Spectrum Perception Deformable Fusion Mamba for Drone-Based Visible–Infrared Cross-Modality Vehicle Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This work proposes a mutual-spectrum perception deformable fusion Mamba model, namely MSDF-Mamba, for drone-based visible-infrared cross-modality vehicle detection. The subject is meaningful and interesting, and the design of the ablation and comparison studies is comprehensive. There are minor issues that need to be explained before publication.
- In the introduction, the author needs to provide some results of previous studies to support their viewpoint, such as mAP@0.5.
- It is recommended to pay attention to some improved strategies for small object detection, such as CPD-YOLO.
- The contributions of this paper need further clarification. Is the module an improvement or a previous study? If it is an improvement, please provide a reference.
- Which YOLO version was chosen for the comparison experiments with the YOLO series?
- The authors should clearly state the limitations of the MSDF-Mamba.
Author Response
Comments 1: In the introduction, the author needs to provide some results of previous studies to support their viewpoint, such as mAP@0.5.
Response 1: Thank you for pointing this out. We agree with this comment and have incorporated the corresponding revisions in the Introduction section. Please refer to the attached file for details.
Comments 2: It is recommended to pay attention to some improved strategies for small object detection, such as CPD-YOLO.
Response 2: Thank you for pointing this out. We also greatly appreciate your recommendation of the CPD-YOLO method. We will carefully review this literature to explore how its strategies for small object detection might inform our work.
Comments 3: The contributions of this paper need further clarification. Is the module an improvement or a previous study? If it is an improvement, please provide a reference.
Response 3:
Thank you for your valuable comments. The clarification regarding the core contributions and citations of this paper is as follows:
The first module is an independent and innovative design based on the foundational theories of deformable convolution (Reference [28]) and attention mechanisms (Reference [19]). The overall architecture of this module represents an original contribution of this paper.
The second module involves targeted design and adaptive improvements based on the visual Mamba method (Reference [26]), and in the specific parameter design of the fusion strategy, we also referenced Reference [29].
Comments 4: Which was the YOLO version chosen in the comparison experiment of the method with the YOLO series?
Response 4: Thank you for pointing this out. In the network architecture design, we adopt the YOLOv8 (CSPDarkNet-S) architecture. This is explained in Section 3.1 (Overall Framework of MSDF-Mamba), with the specific reference being [27].
Comments 5: The authors should clearly state the limitations of the MSDF-Mamba.
Response 5: Thank you for your valuable suggestions. In the Discussion section, we examine the issue of computational efficiency degradation caused by the introduction of the bidirectional cross-attention mechanism in MSDF-Mamba. Please refer to the attached file for details.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript proposes a multimodal UAV object detection framework named Mutual Spectral Perception Deformable Fusion Mamba (MSDF-Mamba) based on YOLOv8 and Mamba. The MSDA module employs a bidirectional cross-attention mechanism to achieve deep mutual spectral perception between visible and infrared modalities, generating complementary enhanced features, while the SSF module projects the aligned bi-modal features onto a unified hidden state space. The experimental results appear reasonable. However, there are still some issues the authors should pay attention to.
- It should be h_t in Equation (2).
- The authors have mentioned that the proposed method offers a feasible fusion detection approach for all-weather object detection in the Section of “what are the implications of the main finding?”, and mentioned all-time detection performance in the Abstract. However, although the two datasets used in the manuscript cover various weather conditions and lighting conditions, it is better to analyze the detection results of different weather and lighting conditions specifically.
- There are some grammatical and spelling mistakes in the manuscript and it still needs to be polished.
- Authors need to unify the citation format of the references.
- If there is no Section 2.2, 2.3, etc., it should not use the subsection 2.1 alone.
- It is best to mark the top three results in the tables.
Author Response
Comments 1: It should be ht in Equation (2).
Response 1: Thank you for your correction. The issue you pointed out was indeed an error in our text. We have now made the necessary revision based on your feedback. Please refer to the attached file for the updated version.
Comments 2: The authors have mentioned that the proposed method offers a feasible fusion detection approach for all-weather object detection in the Section of “what are the implications of the main finding?”, and mentioned all-time detection performance in the Abstract. However, although the two datasets used in the manuscript cover various weather conditions and lighting conditions, it is better to analyze the detection results of different weather and lighting conditions specifically.
Response 2:
Thank you for your highly constructive feedback. Your suggestion regarding analyzing detection performance under different weather and lighting conditions is indeed valuable for comprehensively evaluating the model's robustness from an application perspective. We fully agree with this point and will prioritize it in future research.
Regarding the core contribution of this study, we would like to offer a brief clarification: the main innovation of this paper lies in addressing the cross-modal target misalignment issue between infrared and visible images and proposing a general feature alignment and fusion framework. To validate the model's generalization capability in real-world complex scenarios when solving cross-modal alignment problems, we utilized publicly available datasets (DroneVehicle and DVTOD) that encompass various lighting conditions.
Since these datasets do not provide clear subset divisions based on weather or lighting conditions, it is challenging to conduct detailed performance analyses for specific environmental conditions. Nevertheless, the strong overall results on these datasets indicate robustness across the varied conditions they contain.
Comments 3: There are some grammatical and spelling mistakes in the manuscript and it still needs to be polished.
Response 3: Thank you for pointing this out. We have conducted a comprehensive revision of the manuscript, and the updated content is provided in the attachment.
Comments 4: Authors need to unify the citation format of the references.
Response 4: Thank you for your feedback. We have thoroughly reviewed and standardized the citation format of all references.
Comments 5: If there is no Section 2.2, 2.3, etc., it should not use the subsection 2.1 alone.
Response 5: Thank you for your correction. We have revised Section 2.1 as per your suggestion. Please refer to the attachment for the specific modifications.
Comments 6: It is best to mark the top three results in the tables.
Response 6: Thank you for your suggestion. We have considered it carefully. However, we observed that in many studies within this field it is common practice to highlight only the top one or two results; we believe this approach maintains rigor while ensuring the clarity and readability of the tables. Therefore, we have chosen to mark only the top two results.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Overall Recommendation: Major Revision
This manuscript presents MSDF-Mamba, a novel framework for visible-infrared object detection in UAV imagery. The core innovation lies in addressing the critical challenge of spatial misalignment between modalities through a novel Mutual-Spectrum Deformable Alignment (MSDA) module and leveraging the efficient global modeling of a simplified Mamba architecture via the Selective Scan Fusion (SSF) module. The work is technically sound, well-motivated, and represents a significant contribution to the field of multimodal remote sensing. The experimental design is comprehensive, featuring extensive comparisons on two public benchmarks (DroneVehicle and DVTOD), detailed ablation studies, and efficiency analysis.
The manuscript is generally well-written and structured. However, it requires revisions to address issues related to clarity in the methodological description, justification of certain design choices, and the correction of numerous minor errors and inconsistencies that detract from its professionalism.
Abstract
The abstract mentions a "3.1% improvement on benchmark tests" but does not specify the baseline for this comparison, which should be clarified.
The term "Mutual-Spectrum" in the title and throughout the paper is innovative but not explicitly defined. A brief explanation of what "Spectrum" refers to in this context (e.g., the electromagnetic spectrum represented by RGB and IR) would be helpful.
Introduction
The description of the Mamba-related works (DMM, MGMF, etc.) is good, but the claim that they "directly adopt the original visual Mamba module without structural optimization" could be slightly softened or better referenced, as some of the cited works (e.g., DMM, Fusion-Mamba) do introduce task-specific modifications.
Methods
Issues: Clarity in MSDA: The distinction between the "Mutual Spectral Perception (MSP) module" and the full MSDA module is somewhat blurred. Figure 3 and the text would benefit from a clearer delineation: the MSP handles feature enhancement via cross-attention, and the subsequent steps handle offset generation and deformable convolution.
Justification in SSF: The rationale for the specific design of the SSF module, particularly the splitting and re-fusing of modalities (y_vis = y_vis^ε + y_c^ε), needs a more detailed explanation. Why is this structure more effective for fusion than a standard VSS block?
Mathematical Typo: Equation (5) references \overline{\Delta} but should likely be \overline{A} to be consistent with Equation (3). Equation numbering is also inconsistent (jumps from (3) to (5)).
Figure Quality: Figures 2, 3, and 4 are provided but are low-resolution and contain small, hard-to-read text. Higher-quality, more legible figures are essential.
Results and Discussion
Issues: Reference Inconsistency: In Table 2, the text states "The test results are shown in Table 2," but the following table is unnumbered in the text. Based on content, it should be Table 2, and the multimodal comparison should be Table 3. This numbering must be corrected throughout the text (e.g., the reference to "Table 3" in section 4.3.1 should be to "Table 4").
Missing Context for FPS: The dramatic drop in FPS from the Baseline (337.05) to Baseline+MSDA (68.15) is noted but not discussed. A brief comment on the computational cost introduced by the cross-attention and deformable convolution in MSDA would be valuable.
Figure 6 Resolution: The scatter plot in Figure 6 is informative but has low resolution and axis labels that are difficult to read.
Conclusion
Issues: It claims a "40%" reduction in model parameters. This should be clarified relative to which model (presumably RemoteDet-Mamba from Table 3). The claim is plausible (51.34 MB vs. ~85 MB for RemoteDet-Mamba) but needs a specific reference.
References, Language, and Formatting
Minor Language Errors: There are several typographical and grammatical errors that require proofreading. Examples include:
"Misalignment between visible and infrared images. These two images..." (Sentence fragment).
"It makes deploying corresponding models on a resource-constrained UAV a challenge." (Awkward phrasing).
"So, given the center sampling value..." (Informal "So").
"It gets an mAP@0.5 of 81.8%..." (Informal "gets").
Formatting: The document contains version headers ("Version November 21, 2025 submitted to Journal Not Specified") on every page, which should be removed for the final submission. The publisher's note and disclaimer are included, which is good practice.
To further strengthen the manuscript and better situate your work within the very latest developments in the field, I strongly recommend citing and discussing several recent high-impact studies that are highly relevant to your work. Integrating these references will enhance the background and demonstrate the timeliness of your contribution.
(1) Rose-Mamba-YOLO; (2) Mamba-based super-resolution and semi-supervised YOLOv10…; (3) ChangeMamba; (4) Succulent-YOLO; (5) Drone-based RGB-infrared cross-modality vehicle detection…
Comments for author File:
Comments.pdf
Author Response
Abstract
Comments 1: The abstract mentions a "3.1% improvement on benchmark tests" but does not specify the baseline for this comparison, which should be clarified.
Response 1: Thank you for your valuable comments. We have revised the manuscript accordingly as requested. Please refer to the attached file for the specific changes and results.
Comments 2: The term "Mutual-Spectrum" in the title and throughout the paper is innovative but not explicitly defined. A brief explanation of what "Spectrum" refers to in this context (e.g., the electromagnetic spectrum represented by RGB and IR) would be helpful.
Response 2: We sincerely appreciate your comment and fully agree with your point. We have addressed this issue in the revised manuscript, and the specific explanations can be found in the attached document.
Here, "spectrum" refers to the characteristic feature space of each modality. "Mutual-spectrum perception" means that, through a bidirectional attention mechanism, the model can actively explore and fuse complementary information within this multi-scale feature space to achieve feature enhancement.
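To make this explanation more concrete, a minimal illustrative sketch of bidirectional cross-attention between visible and infrared feature maps is given below. It is not our actual MSP implementation; the class name, the use of a single standard multi-head attention block per direction, and the shapes are assumptions made purely for exposition.

```python
# Illustrative sketch only: bidirectional cross-attention between two modalities.
# Names, shapes, and the single-block design are assumptions, not the MSP module.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # One attention block per direction: visible queries infrared, and vice versa.
        self.vis_from_ir = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ir_from_vis = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, f_vis: torch.Tensor, f_ir: torch.Tensor):
        # f_vis, f_ir: (B, C, H, W) feature maps from the two backbone branches.
        b, c, h, w = f_vis.shape
        vis = f_vis.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        ir = f_ir.flatten(2).transpose(1, 2)

        # Each modality's tokens attend to the other modality's features.
        vis_enh, _ = self.vis_from_ir(query=vis, key=ir, value=ir)
        ir_enh, _ = self.ir_from_vis(query=ir, key=vis, value=vis)

        # Residual enhancement, reshaped back to feature maps.
        f_vis_out = f_vis + vis_enh.transpose(1, 2).reshape(b, c, h, w)
        f_ir_out = f_ir + ir_enh.transpose(1, 2).reshape(b, c, h, w)
        return f_vis_out, f_ir_out
```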
Introduction
Comments 3: The description of the Mamba-related works (DMM, MGMF, etc.) is good, but the claim that they "directly adopt the original visual Mamba module without structural optimization" could be slightly softened or better referenced, as some of the cited works (e.g., DMM, Fusion-Mamba) do introduce task-specific modifications.
Response 3: Thank you for pointing this out. You are correct that the original statement was overly absolute. We have revised the relevant section accordingly. Please refer to the attached document for the specific changes.
Methods
Comments 4: Issues: Clarity in MSDA: The distinction between the "Mutual Spectral Perception (MSP) module" and the full MSDA module is somewhat blurred. Figure 3 and the text would benefit from a clearer delineation: the MSP handles feature enhancement via cross-attention, and the subsequent steps handle offset generation and deformable convolution.
Response 4: Thank you for your feedback. We have updated Figure 3 with larger text/symbols and improved image resolution.
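As a rough illustration of the delineation the reviewer describes (cross-attention enhancement first, then offset generation and deformable convolution), the mutually enhanced features could drive offset prediction for a deformable convolution as sketched below. This uses torchvision's deform_conv2d and hypothetical layer names; it is only a sketch under these assumptions, not our MSDA code.

```python
# Illustrative sketch only: predict offsets from cross-modally enhanced features
# and apply deformable convolution to resample the infrared branch onto the
# visible grid. Layer names and shapes are hypothetical.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableAlignHead(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Offsets (2 values per sampling point) are predicted from the
        # concatenated enhanced visible/infrared features.
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.weight = nn.Parameter(torch.randn(channels, channels,
                                               kernel_size, kernel_size) * 0.01)

    def forward(self, f_vis_enh: torch.Tensor, f_ir_enh: torch.Tensor):
        # Offset field conditioned on both modalities.
        offsets = self.offset_conv(torch.cat([f_vis_enh, f_ir_enh], dim=1))
        # Resample infrared features so that they align with the visible grid.
        f_ir_aligned = deform_conv2d(f_ir_enh, offsets, self.weight,
                                     padding=self.k // 2)
        return f_ir_aligned
```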
Comments 5: Justification in SSF: The rationale for the specific design of the SSF module, particularly the splitting and re-fusing of modalities (y_vis = y_vis^ε + y_c^ε), needs a more detailed explanation. Why is this structure more effective for fusion than a standard VSS block?
Response 5: Thank you for your valuable comment. We have provided a detailed explanation of this issue in Section 5 (Discussion) of the manuscript. Please refer to the attached document for the specific content.
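For intuition only, the split-and-refuse pattern the reviewer refers to (modality-specific outputs enriched by a shared component, as in y_vis = y_vis^ε + y_c^ε) can be sketched abstractly as follows. The scan operators below are simple placeholders standing in for the selective scan, and the overall structure is an assumption for illustration rather than the actual SSF module.

```python
# Abstract sketch only: split-and-refuse fusion of two modality streams plus a
# shared stream. The 1D convolutions are placeholders for the selective scan.
import torch
import torch.nn as nn

class SplitRefuseFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Placeholder sequence operators standing in for selective scans.
        self.scan_vis = nn.Conv1d(channels, channels, 3, padding=1)
        self.scan_ir = nn.Conv1d(channels, channels, 3, padding=1)
        self.scan_common = nn.Conv1d(2 * channels, channels, 3, padding=1)

    def forward(self, x_vis: torch.Tensor, x_ir: torch.Tensor):
        # x_vis, x_ir: (B, C, L) flattened, spatially aligned token sequences.
        y_vis = self.scan_vis(x_vis)                                  # modality-specific path
        y_ir = self.scan_ir(x_ir)
        y_common = self.scan_common(torch.cat([x_vis, x_ir], dim=1))  # shared path
        # Re-fuse: each modality output is enriched with the shared component.
        return y_vis + y_common, y_ir + y_common
```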
Comments 6: Mathematical Typo: Equation (5) references \overline{\Delta} but should likely be \overline{A} to be consistent with Equation (3). Equation numbering is also inconsistent (jumps from (3) to (5)).
Response 6: Thank you for pointing this out. The issue has been addressed.
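For reference, the standard zero-order-hold discretization and recurrence used in Mamba-style state-space models are given below; this is the convention the corrected equations are intended to match, and it shows why the recurrence involves the discretized state matrix rather than the step size itself.

```latex
% Standard zero-order-hold discretization and recurrence (Mamba/S4 convention).
\overline{A} = \exp(\Delta A), \qquad
\overline{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,
\qquad
h_t = \overline{A}\, h_{t-1} + \overline{B}\, x_t, \qquad
y_t = C\, h_t .
```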
Comments 7: Figure Quality: Figures 2, 3, and 4 are provided but are low-resolution and contain small, hard-to-read text. Higher-quality, more legible figures are essential.
Response 7: Thank you for your suggestion. We have updated Figures 2, 3, and 4 with larger text/symbols and improved image resolution.
Results and Discussion
Comments 8: Issues: Reference Inconsistency: In Table 2, the text states "The test results are shown in Table 2," but the following table is unnumbered in the text. Based on content, it should be Table 2, and the multimodal comparison should be Table 3. This numbering must be corrected throughout the text (e.g., the reference to "Table 3" in section 4.3.1 should be to "Table 4").
Response 8: Thank you for your note. Our tables are complete, and the numbering is consistent throughout. Any discrepancy in the positioning of content (above or below the tables) was likely due to formatting issues during typesetting. We have made every effort to revise the layout to ensure consistency in presentation.
Comments 9: Missing Context for FPS: The dramatic drop in FPS from the Baseline (337.05) to Baseline+MSDA (68.15) is noted but not discussed. A brief comment on the computational cost introduced by the cross-attention and deformable convolution in MSDA would be valuable.
Response 9: Thank you for your suggestion. We fully agree with your comment and acknowledge that this part indeed requires further discussion. Accordingly, we have conducted a detailed analysis on this issue in Section 5 (Discussion) of the revised manuscript. Please refer to the attached document for the specific content.
Comments 10: Figure 6 Resolution: The scatter plot in Figure 6 is informative but has low resolution and axis labels that are difficult to read.
Response 10: Thank you for your suggestion. We have updated Figure 6 with larger text/symbols and improved image resolution.
Conclusion
Comments 11: Issues: It claims a "40%" reduction in model parameters. This should be clarified relative to which model (presumably RemoteDet-Mamba from Table 3). The claim is plausible (51.34 MB vs. ~85 MB for RemoteDet-Mamba) but needs a specific reference.
Response 11:
Thank you for your suggestion. To clarify, this study compares our method with the DMM model rather than RemoteDet-Mamba, as DMM has been widely adopted as a common baseline in many related studies, ensuring consistency and comparability. The parameter count of DMM is 87.97 MB, while our MSDF-Mamba has 51.34 MB of parameters, representing a reduction of approximately 41%. Accordingly, we have revised the conclusion in the manuscript to state that MSDF-Mamba outperforms existing state-of-the-art methods, achieving about a 3.1% improvement in mAP and a reduction in parameters of around 40% compared with the baseline DMM model.
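The stated reduction follows directly from the reported parameter sizes:

```latex
% Reduction relative to the DMM baseline, using the reported sizes (in MB).
\frac{87.97 - 51.34}{87.97} \approx 0.416 \;(\approx 41.6\%).
```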
References, Language, and Formatting
Comments 12: Minor Language Errors: There are several typographical and grammatical errors that require proofreading.
Response 12: Thank you for your valuable suggestions. We have thoroughly reviewed and revised the manuscript accordingly. Please refer to the attached document for the specific changes.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made revisions in response to the comments raised by the reviewers. However, there are still some issues in the manuscript that need attention.
- There are still grammatical errors in the revised manuscript. The English should be improved.
- The author has not uploaded the response letter.
- The titles of each table in the manuscript should be written in a uniform way.
There are still grammatical errors in the revised manuscript. The English should be improved. For example:
- "An infrared-visible fusion network", not "a";
- In Section I, "While achieving an mAP of 74.2% on the UAV benchmark dataset." this sentence is not complete.
Author Response
Comments 1: There are still grammatical errors in the revised manuscript. The English should be improved.
Response 1: Thank you for your feedback. We have made comprehensive revisions. For details, please refer to the attachment.
Comments 2: The author has not uploaded the response letter.
Response 2: Thank you for pointing this out. Our response was entered on the web interface, so the response letter was not uploaded. We apologize for this. This time, we will upload the response letter along with the revised manuscript.
Comments 3: The titles of each table in the manuscript should be written in a uniform way.
Response 3: Thank you for your feedback. We have made comprehensive revisions. For details, please refer to the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript can be accepted.
Author Response
We sincerely thank you for the constructive comments throughout the review process and for your final positive assessment.