MCDet: Target-Aware Fusion for RGB-T Fire Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Weaknesses
- The synthetic IR data simulation method is not clearly justified, which risks affecting the realism of the results.
- Some minor grammar issues and typographical errors throughout.
- No mention of computational complexity or model latency compared to baselines.
- Lack of qualitative interpretability (e.g., no attention maps or failure cases).
- Limited discussion of real-world deployment constraints.
Suggestions for Improvement
- Clarify the infrared simulation process used to extend the RGB-only datasets, detailing how the synthetic IR images were generated and validated.
- Add visualizations of attention maps or fusion outputs to demonstrate MRCF and CGAN effectiveness qualitatively (the second sketch after this list shows one way to extract such maps).
- Include a computational performance table (FLOPs, runtime per image) to show feasibility for edge deployment or real-time systems (the first sketch after this list illustrates one way to measure these).
- Perform basic copyediting to fix spelling/grammar (e.g., “traget-aware,” “proformance,” “therby”).
- Add a short subsection discussing limitations and future work regarding transferability to other environments (e.g., wildfires, urban fires).
- Report standard deviations or confidence intervals for key metrics to improve statistical rigor.
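Regarding the computational-performance point, a minimal sketch of how FLOPs and per-image latency could be measured with PyTorch and fvcore. The DummyTwoStream module, input resolution (640×512), and iteration counts are illustrative assumptions, not details from the manuscript.

```python
# Minimal sketch: FLOPs and per-image latency for a two-stream RGB-T detector.
# DummyTwoStream is a stand-in for the actual model; only the measurement
# procedure is the point. Requires: pip install fvcore
import time
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

class DummyTwoStream(nn.Module):
    """Placeholder two-stream network (not the authors' MCDet)."""
    def __init__(self):
        super().__init__()
        self.rgb_branch = nn.Conv2d(3, 16, 3, padding=1)
        self.ir_branch = nn.Conv2d(1, 16, 3, padding=1)
        self.head = nn.Conv2d(32, 8, 1)  # assumed fusion/detection head

    def forward(self, rgb, ir):
        fused = torch.cat([self.rgb_branch(rgb), self.ir_branch(ir)], dim=1)
        return self.head(fused)

model = DummyTwoStream().eval()
rgb = torch.randn(1, 3, 640, 512)  # assumed input resolution
ir = torch.randn(1, 1, 640, 512)

with torch.no_grad():
    flops = FlopCountAnalysis(model, (rgb, ir)).total()
    for _ in range(10):              # warm-up iterations
        model(rgb, ir)
    start = time.perf_counter()      # on GPU, call torch.cuda.synchronize()
    for _ in range(100):             # before reading the clock
        model(rgb, ir)
    latency_ms = (time.perf_counter() - start) / 100 * 1e3

print(f"GFLOPs: {flops / 1e9:.3f}  latency: {latency_ms:.2f} ms/image")
```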
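For the qualitative-visualization point, a sketch of one simple approach: hook an intermediate layer and render its channel-mean activation as a heatmap. It reuses model, rgb, and ir from the sketch above; tapping model.head is an assumption, and a real analysis would hook the MRCF or CGAN outputs instead.

```python
import torch
import torch.nn.functional as F

# Capture the output of an intermediate layer with a forward hook.
# "model.head" is a placeholder; in the actual network one would hook the
# MRCF or CGAN module outputs.
features = {}
handle = model.head.register_forward_hook(
    lambda module, inputs, output: features.update(fusion=output.detach())
)
with torch.no_grad():
    model(rgb, ir)
handle.remove()

# Channel-mean activation as a coarse attention map, upsampled to input size
# and normalized to [0, 1] for overlaying on the RGB frame.
act = features["fusion"].mean(dim=1, keepdim=True)            # (1, 1, H', W')
act = F.interpolate(act, size=rgb.shape[-2:], mode="bilinear",
                    align_corners=False)
heatmap = (act - act.min()) / (act.max() - act.min() + 1e-8)
```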
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
1. While the manuscript introduces MCDet with new components such as MRCF and CGAN, the novelty compared to existing multimodal fire detection frameworks (e.g., ICAFusion, METAFusion) could be made clearer. The authors should better articulate what makes MRCF and CGAN fundamentally different from or superior to previous attention-based or state-space-based fusion mechanisms. Adding a table or paragraph directly contrasting the proposed modules with related works would be helpful.
2. The manuscript presents several technical modules (MRCF, CGAN, BVSSM, TSFF), but their descriptions are often lengthy and occasionally repetitive. For example, the explanation of CGAN appears duplicated in Section 3.3. Streamlining these explanations with clear subheadings, consistent terminology, and diagrams (especially for TSFF and FRM) would greatly improve readability and comprehension.
3. The use of simulated infrared images in the D-Fire and Fire-dataset experiments raises potential concerns about the realism and generalizability of the results. The paper should provide more detail on how the synthetic thermal images were generated and whether the model was evaluated with real-world infrared data. Clarifying whether the model generalizes to truly captured multimodal datasets (beyond LLVIP) would strengthen the validation.
4. The paper presents a thorough set of experiments, but it could benefit from deeper analysis in some areas. For instance, although ablation studies are presented, they mostly focus on accuracy metrics. The authors are encouraged to include computational efficiency (FLOPs, latency) comparisons for each configuration (e.g., with/without CGAN), since the model targets real-time detection in UAV scenarios. Also, reporting standard deviations or confidence intervals for key metrics would support the claimed robustness (see the bootstrap sketch after these comments).
5. The manuscript contains multiple grammatical errors and typographical mistakes (e.g., "proformance" instead of "performance" in the abstract; "traget-aware" instead of "target-aware" in Section 1). A thorough language proofreading is recommended. Furthermore, section titles such as "Related Word" should be corrected to "Related Work".
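On the request for confidence intervals (raised both here in comment 4 and in Reviewer 1's last suggestion), a minimal sketch of a percentile bootstrap over per-seed scores. The mAP values below are placeholders, not results from the paper.

```python
import numpy as np

# Placeholder per-seed mAP@0.5 scores from repeated training runs;
# these values are illustrative, not results from the manuscript.
map_scores = np.array([0.861, 0.854, 0.867, 0.858, 0.863])

rng = np.random.default_rng(0)
boots = [rng.choice(map_scores, size=map_scores.size, replace=True).mean()
         for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])

print(f"mAP@0.5 = {map_scores.mean():.3f} ± {map_scores.std(ddof=1):.3f} "
      f"(95% bootstrap CI: [{lo:.3f}, {hi:.3f}])")
```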
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors replied to all comments.