Review Reports - Infrared–Visible Image Fusion via Cross-Modal Guided Dual-Branch Networks

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a Cross-Modal Guided Dual-Branch Network (CGDBN) for infrared and visible image fusion. The proposed method integrates CNN-based local feature extraction with Transformer-based global modeling in an asymmetric dual-branch architecture, aiming to address modal discrepancies and improve cross-modal interaction. The idea is meaningful and the results are promising. However, there are several areas where the paper could be further improved.

1.The introduction should provide a clearer statement of the main challenges in infrared-visible image fusion and explicitly emphasize the paper’s unique contributions compared with existing CNN-Transformer hybrid methods.

2.The related work section could be strengthened by including more recent studies on Transformer-based and cross-modal fusion models published in 2023–2025, such as Restormer, GeminiFusion, or other efficient ViT variants.

3.The Method section would benefit from a more detailed explanation of the design motivation behind each module, especially the rationale for asymmetric processing and how the TMFEM, CMIM, and DAMF modules interact to enhance cross-modal information flow.

4.The experimental section should include comparisons with more recent state-of-the-art Transformer-based fusion models. Some of the current baselines (e.g., DenseFuse, FusionGAN) are relatively outdated and do not fully reflect the latest advances in deep multimodal fusion.

5.The dataset scale is relatively limited (the AVMS dataset contains only 600 image pairs). The authors are encouraged to clarify the data diversity, discuss potential overfitting risks, or include additional experiments on larger or more challenging datasets to demonstrate robustness.

6.The conclusion section could be expanded with a deeper discussion of potential future directions, such as lightweight model deployment, temporal fusion for video, or extensions to downstream tasks (e.g., detection or segmentation) to guide ongoing research.

Author Response

I'm very sorry, I just uploaded the wrong file. I have corrected this error. Please refer to the attached Responses to Reviewer 1.pdf for details

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

COMMENTS TO AUTHORS

(Review applsci-3932350)

Title: Infrared-Visible Image Fusion via Cross-Modal Guided Dual-Branch Networks

General comments: The authors investigated the fusion of infrared and visible images with a cross-guided dual branch network (CGDBN) that combines CNN for local details with a simplified linear attention block (SLAB) of Transformer style for global context; the process uses an asymmetric design with a specialized infrared branch (TMFEM), cross-modal interchange (CMIM), and a density-adaptive merge head (DAMF). Experiments on AVMS (600 RGB-IR pairs), M3FD, and TNO report maximum or competitive PSNR/SSIM compared to DenseFuse, FusionGAN, YDTR, SwinFuse, ITFuse, and CDDFuse, with ablations indicating the contribution of each module. Training converts inputs to grayscale, uses 120×120 overlapping patches, Adam (lr=1e-3), batch 16 for 20 epochs, and claims improvements such as +6.5% PSNR on AVMS compared to YDTR. The paper also acknowledges limitations (computation for ultra-high resolution, loss design, need for many registered pairs).

The authors consider the following revisions:

Comment 1: What is the precise, verifiable novelty of SLAB within your CNN–Transformer hybrid compared with prior hybrids (e.g., shifted-window attention and efficient/linear attention), and why should SLAB be expected to outperform those mechanisms theoretically and empirically?

Comment 2: Your abstract reports relative PSNR gains (+6.5%, +4.7%, +4.7%); against which specific method(s) and datasets are these computed, and what are the corresponding absolute PSNR/SSIM/SF values used to derive those percentages?

Comment 3: Why is the specialized asymmetric processing applied only to the IR branch rather than also introducing a VI-specific specialization (e.g., chroma/texture attention), and what empirical evidence supports this design choice?

Comment 4: How does SLAB compare to 2024–2025 efficient/linear attention and lightweight ViT-based fusion approaches in terms of theory (complexity/memory) and practice (accuracy/latency/parameters), and what differentiates it decisively?

Comment 5: Was sub-pixel registration performed or verified, and how does the grayscale-only preprocessing impact methods that leverage color—does this choice affect the fairness and comparability of baselines?

Comment 6: Are the loss weights truly learnable end-to-end or fixed after tuning, what values do they converge to during training, and why were gradient/structure or perceptual losses (e.g., TV, VIF, MS-SSIM) excluded?

Comment 7: How are the train/validation/test splits defined on AVMS, were validation signals used for early stopping and model selection, and how do results vary across different seeds or splits?

Comment 8: The text claims “first place in four indicators,” yet Table 1 appears to show lower SSIM and higher MSE for your method than a comparator; is this a reporting or evaluation inconsistency, and how do you reconcile the discrepancy?

Comment 9: In which concrete scenarios does your method fail or degrade (e.g., mis-registration, heavy noise, small hot targets, low contrast), and what diagnostic evidence (e.g., attention/weight maps) explains these failure modes?

Comment 10: Will you standardize the manuscript’s English and nomenclature (e.g., consistent “AVMS” naming, figure/table labels, hardware names), and what specific edits are planned to correct typographical and stylistic issues?

Comments on the Quality of English Language

Will you standardize the manuscript’s English and nomenclature (e.g., consistent “AVMS” naming, figure/table labels, hardware names), and what specific edits are planned to correct typographical and stylistic issues?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Comment1 :

The abstract makes strong claims but lacks precision:
- "large modal differences, insufficient cross-modal alignment, and limited global context modeling" These are generic challenges. What specifically distinguishes your work?
-"Simplified Linear Attention Blocks (SLABs) to improve global context capture" SLAB is cited from reference [24] (Guo et al., 2024). The abstract should clarify this is an adapted existing component, not a novel contribution.
-"PSNR improvements of 6.5%, 4.7%, and 4.7%" These percentages are misleading. Looking at Table 2, your PSNR is 16.2497 vs YDTR's 15.2526, which is actually a 6.5% relative improvement but only 0.997 absolute improvement. For image quality metrics, absolute differences are more meaningful.

Comment 2:
-Conduct a more thorough literature review showing exactly what gaps exist
-Clearly distinguish "novel contributions" from "engineering improvements"

Comment 3 - Figure 1 caption states there are panels (a)-(e), but the organization is unclear. The overall flow needs better visual presentation. Why not process visible images through a specialized Visible Modal Feature Extraction Module (VMFEM) with texture-specific operations? Your argument (lines 238-243) that "visible imagery contains standardized textural patterns...suitable for conventional operations" is weak. Visible images have their own challenges (lighting variations, shadows, highlights) that could benefit from specialized processing.

Commetn 4:

TMFEM Module (Lines 237-268)
This is your main novel contribution but has several issues:
"Dilated convolutions expand receptive fields without spatial resolution reduction"
What dilation rates are used? This is a critical implementation detail missing from the paper. Figure 3b shows the structure but doesn't specify dilation parameters.
Equations 5-7: The cascaded structure is described but:

-How many convolutional layers in the cascade?
-What are the kernel sizes?
-What are the channel dimensions at each stage?

Alos, "This specialized processing enables enhanced thermal target definition while preserving boundary characteristics"
-No ablation study specifically validates TMFEM's effectiveness for thermal features vs. a symmetric processing baseline. Table 1 removes TMFEM entirely but doesn't compare against symmetric processing of both modalities.

Compare three architectures:

-Symmetric processing (both IR and Vis through TMFEM-like modules)
-Your asymmetric approach (only IR through TMFEM)
-No specialized processing (neither through TMFEM)

This would validate your core design choice.

Comment 5:

The description of "three distinct types" (channel, spatial, texture) is confusing. Looking at Equations 8-10:
-Channel attention uses AvgPool - standard
-Texture attention uses Conv on F_guide - what makes this specifically "texture" attention?
-Spatial attention uses Conv on F_in - how does this differ from standard spatial attention?

Equation 11 (line 296): The 0.5 weighting factor appears arbitrary.
Was this hyperparameter tuned? What is its impact? This should be in ablation studies.

Comment 6:
SLAB is from reference [24] (Guo et al., 2024). Your contribution is applying it to image fusion.
Is this your modification or part of the original SLAB? If it's from [24], cite it explicitly. If it's your modification, this is a contribution worth highlighting.

Comment 7:
Line 365-367: "Visible light density receives scaling factor of 1.2, reflecting the observation that visible imagery often provides essential structural information"
This appears to be a manually set hyperparameter based on intuition rather than principled design. How sensitive is performance to this value?
Why not learn this weight adaptively?

Comment 8:
You mention object categories and annotations, but fusion is typically evaluated without ground truth. How are these annotations used? Only for visual assessment?

Comment 9:
Line 460-461: "Training employed overlapping patches of 120×120 with 50% overlap"
Line 462-463: "batch size 16, 20 epochs, Adam optimizer with learning rate 0.001"

-Learning rate schedule? Fixed or decaying?
-Early stopping criteria?
-What are w₁, w₂, w₃, w₄ values?
-Training time per epoch?
-Total training time?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript proposes a Cross-Modal Guided Dual-Branch Network (CGDBN) for infrared-visible image fusion, primarily applied to UAV low-altitude aerial photography data fusion scenarios. The authors designed an asymmetric dual-branch architecture incorporating multiple modules. I believe they demonstrate innovation in their asymmetric processing strategy. They recognize the fundamental differences between infrared thermal radiation and visible texture, applying specialized processing only to infrared images, thereby improving computational efficiency. However, the current manuscript still contains aspects regarding hyperparameter sensitivity, expression methods, and writing clarity that may confuse readers. Below are my revision suggestions, which I hope will be helpful to the authors.

In the Abstract, you state your method is a "cross-modal booted dual-branch network (CGDBN)". Please verify whether it should be "booted" or "boosted". These two words have completely different meanings.
Regarding the overall writing style, when reporting percentage improvements, you should simultaneously provide absolute values with baseline names, e.g., "+0.74 dB (+4.7%) over YDTR". Avoid subjective terms such as "excellent/promising/competitive" and replace them with objective metrics. In the abstract, use full name + abbreviation for module names on first mention, then only use abbreviations thereafter.
You mention "Visible images offer high-resolution textures, color information, and well-defined boundaries under adequate lighting, but their effectiveness diminishes in low-light conditions, poor weather, or when thermal contrast provides the dominant discriminative signal [2]." This motivation is inconsistent with your method, as your pipeline converts to grayscale in the final stage, yet here you cite color advantages as motivation. You need to explain why color is discarded and supplement with "color vs. grayscale" comparisons or discussion.
Please cite this paper: "Medical image denoising via explainable AI feature preserving loss". Your proposed method uses SSIM + Spatial Frequency (SF), which are structural and frequency domain terms. This paper uses explainable feature-preserving loss, similarly explicitly constraining visual structure through training objectives, which is conceptually very close to your loss design.
You mention "Training employed overlapping patches of 120×120 with 50% overlap, while inference used full-resolution images. Training configuration: batch size 16, 20 epochs, Adam optimizer with learning rate 0.001." My concern is that training with small patches truncates global dependencies, preventing linear attention (SLAB) from seeing the full image; while inference uses full images, creating train/inference distribution inconsistency (different position/normalization statistics). Therefore, global context advantages may only appear during inference but are not learned during training, easily causing boundary artifacts/domain shift. Have you conducted experiments with enlarged training views (at least window coverage over actual dependency range)? Or windowing consistency experiments (same size for train/inference or using sliding windows with overlap fusion)? Do you have experimental results to support this?
Regarding Equation 11: F_processed = F_in ⊙ A_channel ⊙ A_spatial + F_in ⊙ A_texture ⊙ 0.5. Where does this hard-coded 0.5 come from? Also, Equation 21: Softmax([Density_IR, Density_VI·1.2]). "1.2" is a manually set bias. What is the basis for setting these parameters? Additionally, Equation 17: Z = (QKV) / max(Q'·∑_i K'_i, ε). What is the specific value of ε? Why?
In Method 3.1, you mention "pixels normalized to [0,1]". However, the MSE values in several tables are in the thousands. If images are in [0,1], MSE should not be this large. Please check this.
The "Visual analysis" paragraphs in sections 4.3, 4.4, and 4.5 use overly subjective terms like "white noise," "artifact," and "overly smooth". You should ideally provide ×4/×8 zoomed comparisons at fixed locations, which would be more convincing.

In summary, I believe this manuscript has room for improvement in many aspects, so I recommend major revision at this stage. I hope my suggestions can help the authors further improve the paper.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

thanks for addressing all comments.

Author Response

We sincerely thank the reviewer for their positive feedback and for acknowledging our efforts in addressing the previous comments. We are pleased that the revisions have met with the reviewer's approval.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors answered all questions.

Author Response

Reviewer 3 Report

Comments and Suggestions for Authors

Comments:

1:

Ours: PSNR=17.3825 (BEST) but MSE=2285.9549 (WORST)
VI+TMFEM: PSNR=17.1483 but MSE=2074.5719 (BEST)

This is contradictory because higher PSNR should mean lower MSE!

Lines 562-565 say only:

"In terms of MSE index, there is only a difference of 211.3830 compared to the first place VI+TMFEM model (2074.5719)."

This isn't "just" a difference, it's a 10% improvement! There is no explanation why this exists. How do you explain that you have the best PSNR but the worst MSE in the same experiment?

2:

"three key issues: (1) relying on manually designed features and fixed fusion rules, lacking adaptability to different scenarios; (2) Unable to model semantic relationships between modalities; (3) Lack of targeted processing for differentiating features between infrared thermal radiation and visible light texture."

"(1) the local receptive field of convolution operations fundamentally limits their ability to model global contextual relationships... (2) Most methods treat infrared and visible light modes symmetrically, ignoring their fundamentally different characteristics; (3) The existing methods lack clear adaptive and content aware fusion mechanisms..."

But:

CDDFuse (2023) already has "dual-branch feature decomposition" - it's asymmetric!
CMTFusion (2024) already has "cross-modal guidance"
ResSCFuse (lines 159-165) was added in v2 and already has "shared convolutional kernels to extract cross modal shared features and private kernels to capture modality specific information"

Gap analysis is still weak. They do not clearly show what is NEW compared to ResSCFuse which combines Restormer + CNN with modality-specific extraction.

3:

"the proposed model does not pay attention to the image registration problem in the image fusion preprocessing step, but defaults to using the already registered image fusion dataset. When the model is transferred to an unregistered data mart, additional registration steps are required for preprocessing, otherwise it will lead to a significant decrease in performance."

This is a SIGNIFICANT limitation! In real-world UAV applications, IR and visible cameras are often not perfectly registered. Robustness should be tested on misaligned images

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

I have received and carefully reviewed the authors' response and the revised manuscript. I am pleased to see that the authors have addressed my concerns in great detail and thoroughness. The revisions are comprehensive and well-executed, including improved precision in the abstract with absolute metric values, enhanced clarity in the introduction and related work sections, detailed explanations of module design rationale and SLAB principles in the methods section, strengthened experimental validation with additional ablation studies and the inclusion of CMTFusion as a baseline, corrected technical inconsistencies such as grayscale processing and pixel normalization clarifications, added dedicated sections to justify hyperparameter settings with experimental evidence, and replaced subjective descriptions with objective metric-based analysis throughout the results section. The authors have also demonstrated a constructive attitude by acknowledging areas for future improvement. In my assessment, the manuscript now meets the publication standards and I recommend acceptance of this manuscript for publication. The proposed cross-modal guided dual-branch network presents notable innovations in infrared-visible image fusion. I encourage the authors to continue their excellent work in this field and look forward to their future contributions. I would also like to thank the editor for their assistance throughout this review process.

Author Response