Semantic-Guided Multi-Level Collaborative Fusion Network for Visible and Infrared Images
Highlights
- We propose a semantic-guided collaborative network for visible and infrared image fusion that maintains semantic guidance throughout the fusion process, generating semantically consistent and detail-preserving fused representations.
- Stage-wise cross-modal calibration and a three-level interaction strategy are constructed to strengthen early-stage information exchange and progressively inject semantic priors into cross-modal fusion and intra-modal feature learning.
- Preserving global semantic context during upsampling is essential to mitigate semantic dilution and to guide task-oriented fusion model design.
- This study suggests that semantic-guided collaborative fusion represents a practical direction for enhancing perceptual quality and downstream object detection.
Abstract
1. Introduction
2. Related Works
2.1. Semantic-Guided Image Fusion
2.2. Task-Oriented Image Fusion
3. Methods
3.1. Framework Overview
3.2. Cross-Modal Feature Calibration
3.2.1. Channel-Wise Calibration
3.2.2. Spatial-Wise Calibration
3.3. Three-Level Interaction Strategy
3.3.1. Cross-Modal Fusion Block
3.3.2. Intra-Modal Interaction Block
3.4. Semantic Compensation Block
3.5. Fusion Loss
4. Experiments
4.1. Experimental Protocol
4.1.1. Datasets
4.1.2. Comparison Methods
4.1.3. Evaluation Metrics
4.1.4. Training Details
4.2. Fusion Comparison and Analysis
4.2.1. Results of Infrared–Visible Image Fusion
- (1) Qualitative Comparison and Analysis: Figures 5, 6 and 7 present qualitative results for six representative image pairs from the MSRS, M3FD, and TNO datasets. For clearer comparison, prominent object regions are marked with red, green, and blue rectangles and shown in enlarged views. In the two MSRS scenes of Figure 5, pedestrians in dark areas of the DSIFuse result show high contrast and sharp contours, and vehicle texture details are preserved. Under challenging nighttime conditions with weak objects and bright light sources, DSIFuse suppresses overexposure and preserves local structural details, maintaining both object saliency and structural fidelity. In contrast, FusionGAN, DRF, and UMF-CMGR fail to highlight discriminative objects, while LRRNet and CDDFuse cannot preserve rich texture details. SeAFusion and MFIFusion produce distinct objects with well-preserved textures, but their contrast and clarity fall short of DSIFuse. Overall, DSIFuse yields clearer salient objects, sharper boundaries, and more faithful texture preservation under varying illumination. This advantage stems from three aspects. First, multi-level cross-modal interaction and intra-modal enhancement ensure effective extraction, precise alignment, and deep fusion of the features of each modality. Second, the semantic compensation block and global semantic priors guide the fusion process, maintaining semantic consistency while compensating for and refining details that might otherwise be lost, so that prominent objects are highlighted and fine textures are preserved at the same time. Third, a refined loss function based on contrast masks and salient object masks maintains the visual appeal of the fused images.
As shown in Figure 6, under challenging visibility conditions, most fusion algorithms fail to produce satisfactory results. In the smoke-filled scene, most methods preserve salient infrared objects but cannot clearly recover the buildings obscured by smoke. In the low-illumination scene with strong glare, competing methods tend to blur objects and lose local details in bright regions, whereas DSIFuse preserves clearer pedestrian contours and richer structural information.
- (2) Quantitative Comparison and Analysis: Tables 1, 2 and 3 report the average values of the quantitative metrics for DSIFuse and the other mainstream fusion methods on the three datasets. DSIFuse attains the best AG, VIF, and SSIM in all three tables; its SD is the best in Table 1 and a close second to SeAFusion in Tables 2 and 3 (33.346 vs. 33.352 and 39.263 vs. 39.556). These results indicate that the fused images generated by DSIFuse achieve excellent structural consistency and high perceptual quality. Notably, MFIFusion outperforms DSIFuse in EN on all three datasets, indicating that its fused images contain more information in the entropy sense. However, higher EN does not necessarily imply superior fusion quality. The elevated EN of MFIFusion arises primarily from its pixel-level multi-scale feature superposition strategy, which widens the grayscale distribution and enhances local textures. While this approach increases information density, the lack of high-level semantic constraints may introduce redundant information and local texture fluctuations. For instance, the green-boxed area in Figure 7j shows poor contrast and reduced clarity in the windows and branches, along with unnecessary texture fluctuations. Similarly, the red-boxed region fails to preserve the texture of the human figure, resulting in blurred contours and degraded structural integrity.
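The point that a higher EN does not by itself mean a better fused image can be checked on toy data. The sketch below is illustrative only: it uses common textbook definitions of EN (Shannon entropy of the gray-level histogram), AG (mean forward-difference gradient magnitude), and SD, which are assumed here and may differ in detail from the implementations used for Tables 1–3. The `clean` and `noisy` images are hypothetical test inputs, not data from the paper.

```python
import numpy as np


def entropy(img: np.ndarray) -> float:
    """Shannon entropy in bits of the gray-level histogram of a uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing (0 * log 0 is taken as 0)
    return float(-(p * np.log2(p)).sum())


def average_gradient(img: np.ndarray) -> float:
    """Mean of sqrt((dx^2 + dy^2) / 2) using forward differences."""
    f = img.astype(np.float64)
    dx = f[:-1, 1:] - f[:-1, :-1]
    dy = f[1:, :-1] - f[:-1, :-1]
    return float(np.sqrt((dx ** 2 + dy ** 2) / 2.0).mean())


def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation of the pixel intensities."""
    return float(img.astype(np.float64).std())


# Toy data (hypothetical): a two-level scene and the same scene with added noise.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 64, dtype=np.uint8)
clean[:, 32:] = 192
noise = rng.integers(-20, 21, size=clean.shape)
noisy = np.clip(clean.astype(np.int64) + noise, 0, 255).astype(np.uint8)

for name, img in (("clean", clean), ("noisy", noisy)):
    print(name,
          round(entropy(img), 3),
          round(average_gradient(img), 3),
          round(standard_deviation(img), 3))
```

The noisy image scores higher on EN and AG even though it carries no additional scene structure, which is why these metrics are usually read together with VIF and SSIM rather than in isolation.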
4.2.2. Results of Infrared–Visible Object Detection
- (1) Qualitative Comparison and Analysis: As seen in Figure 8, neither sensor alone (infrared or visible) detects objects reliably. Infrared images offer high contrast and make distant objects clearly visible, but lack fine texture details. Visible images provide richer texture and color information, yet perform poorly in low-light and long-range conditions, where they lack the contrast advantage of infrared images. By exploiting complementary information from both modalities, nearly all fusion methods improve detection, but their fusion results still show several problems. In the upper scene, SeAFusion, FusionGAN, and DRF tend to over-smooth object and background regions, weakening the discriminability of the distant pedestrian. In the lower rainy scene, DenseFuse and CDDFuse enhance local responses but also introduce distracting activations around traffic structures. By comparison, DSIFuse produces a more semantically coherent representation: the primary pedestrian in the upper scene is preserved with clearer contours, and a better balance is achieved among distant pedestrian cues, vehicle structures, and rainy background details in the lower scene. Consequently, fewer erroneous detections are produced overall. Nevertheless, a limitation of DSIFuse should be noted: for several objects, the detection confidence remains slightly lower than that of CDDFuse, indicating that the discriminability of certain local object features has not yet been fully optimized. This observation is consistent with the quantitative results, in which CDDFuse performs marginally better on mAP50, whereas DSIFuse is more robust at higher IoU thresholds.
- (2) Quantitative Comparison and Analysis: Table 4 presents the quantitative detection results on the M3FD dataset. Nearly all fusion methods improve detection performance, with mAP values above those obtained from visible or infrared images alone. DSIFuse ranks first for person (Per), car, and motorcycle (Mot), and stays within the top three for bus, streetlight (Lam), and truck (Tru), where it is slightly behind the best method. On mAP50, CDDFuse is marginally better than DSIFuse (81.8 vs. 81.6). On the stricter mAP50:95, however, DSIFuse achieves the highest score (54.8, versus 54.2 for CrossFuse and 53.9 for CDDFuse), showing greater stability as the localization requirement tightens. Overall, DSIFuse performs strongly across categories and metrics and is particularly robust at high IoU thresholds. This supports its capacity to preserve clear details and deliver sufficient semantic information for object detection.
- (3) Computational Complexity Comparison and Analysis: To evaluate DSIFuse comprehensively, the FLOPs and the number of trainable parameters of all compared methods were analyzed. As shown in Figure 9, more complex models (e.g., LRRNet and CrossFuse) showed superior performance but incurred higher computational costs. In contrast, simpler models (e.g., UMF-CMGR and MFIFusion) had lower computational complexity but delivered relatively weaker detection performance. DSIFuse strikes a balance between computational efficiency and performance in terms of both FLOPs and trainable parameters. Although its performance was slightly lower than that of the best-performing CDDFuse, DSIFuse achieved comparable performance at a significantly lower computational cost, demonstrating its suitability for deployment in resource-constrained applications.
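The difference between mAP50 and mAP50:95 noted in the detection analysis comes down to how strictly a predicted box must overlap the ground truth. The following sketch, with hypothetical boxes and helper names, shows intersection-over-union (IoU) and the ten COCO-style thresholds from 0.50 to 0.95 in steps of 0.05 over which mAP50:95 is averaged. It illustrates the matching criterion only and is not the evaluation code used for Table 4.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# IoU thresholds averaged by mAP50:95; mAP50 uses only the first one.
IOU_THRESHOLDS: List[float] = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred: Box, gt: Box) -> List[float]:
    """Thresholds at which `pred` would count as a match for `gt`."""
    overlap = iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical example: both predictions match at IoU 0.50,
# but only the tighter one still matches at strict thresholds.
gt = (0.0, 0.0, 100.0, 100.0)
loose = (20.0, 0.0, 120.0, 100.0)  # IoU = 8000 / 12000
tight = (5.0, 0.0, 105.0, 100.0)   # IoU = 9500 / 10500

print(len(thresholds_passed(loose, gt)), len(thresholds_passed(tight, gt)))
```

A detector whose boxes resemble `loose` can score well on mAP50 yet lose ground on mAP50:95, which is the pattern the comparison between CDDFuse and DSIFuse points to.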
4.3. Ablation Studies and Discussion
- (1) Quantitative Comparison and Analysis: Table 5 presents the quantitative ablation results of DSIFuse on the MSRS dataset. The complete model achieves the best value on all six metrics, demonstrating the effectiveness of the overall architecture. Among all components, cross-modal feature calibration (CFC) has the largest impact: removing it causes the most severe degradation on every metric, highlighting its critical role in early cross-modal feature calibration. Performance also degrades consistently when the cross-modal fusion block (CFB) or the intra-modal interaction block (IMB) is omitted, indicating that both cross-modal fusion and intra-modal interaction are essential for effective feature aggregation. By contrast, removing the semantic compensation block (SCB) or semantic guidance (SG) leads to smaller yet still noticeable declines, suggesting that semantic compensation and semantic guidance provide complementary benefits for detail preservation and structural consistency. Overall, these results confirm that each component contributes positively to the final performance.
- (2) Qualitative Comparison and Analysis: As shown in Figure 11, (1) removing CFC causes significant degradation, mainly detail loss and object blurring, which highlights the importance of effective cross-modal feature calibration. (2) CFB enables deep collaborative interaction and integration of multimodal information; without it, the multimodal integration mechanism is weakened and detail preservation degrades. (3) Without IMB, the fused outputs lack sharpness and clarity because the raw information of each modality is underutilized. (4) SCB and SG provide semantic consistency and detail compensation. Their absence causes relatively subtle degradation, visible mainly in the fineness of semantic boundaries, the smoothness within objects, and overall semantic robustness. For example, in the first group (w/o SCB and w/o SG), the streetlight area in the red box is notably brighter than in the complete model, and the surrounding leaf details are blurred by overexposure, losing the clarity and color gradation of the full model. This suggests that, without global semantic guidance, the network cannot effectively regulate the brightness of light sources: it is less able to adapt high-intensity information to the semantic context of the scene, which leads to saturation and overexposure. (5) The full model consistently delivers the best results. DSIFuse combines the strengths of all modules, integrating the rich details of visible images with the strong semantic information of infrared images, and its outputs achieve the best balance of detail, semantic representation, and clarity. These results support the rationality and effectiveness of each DSIFuse block.
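The relative importance of the components can be read off Table 5 with a little arithmetic. The sketch below copies the Table 5 values and computes, for each ablated variant, the percentage drop of every metric relative to the full model. Averaging those drops with equal weight per metric is an assumption made here purely for illustration; it is not a procedure described in the paper.

```python
METRICS = ["EN", "MI", "AG", "SD", "VIF", "SSIM"]

# Values copied from Table 5 (MSRS ablation).
FULL = [6.636, 3.596, 3.089, 39.972, 1.050, 0.915]
ABLATIONS = {
    "w/o CFC": [5.813, 3.431, 2.928, 37.394, 0.798, 0.741],
    "w/o CFB": [6.323, 3.520, 3.003, 39.120, 0.990, 0.809],
    "w/o IMB": [6.026, 3.465, 2.994, 38.681, 0.947, 0.795],
    "w/o SCB": [6.575, 3.572, 3.027, 39.912, 1.019, 0.842],
    "w/o SG": [6.510, 3.591, 3.081, 39.970, 1.035, 0.865],
}


def relative_drop(full, ablated):
    """Per-metric drop in percent of the full-model value."""
    return [100.0 * (f - a) / f for f, a in zip(full, ablated)]


drops = {name: relative_drop(FULL, vals) for name, vals in ABLATIONS.items()}

# Equal-weight mean over the six metrics (an illustrative choice).
mean_drop = {name: sum(d) / len(d) for name, d in drops.items()}
ranking = sorted(mean_drop, key=mean_drop.get, reverse=True)

for name in ranking:
    per_metric = ", ".join(
        f"{m} {d:.2f}%" for m, d in zip(METRICS, drops[name])
    )
    print(f"{name}: mean {mean_drop[name]:.2f}% ({per_metric})")
```

Every drop is positive, which matches the full model being best on all six metrics, and removing CFC gives the largest mean drop. Under this equal-weight average, removing IMB costs somewhat more than removing CFB, and the SCB and SG ablations show the smallest changes.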
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Jang, J.; Park, C.; Kim, H.; Lee, J.; Paik, J. Multispectral Object Detection Enhanced by Cross-Modal Information Complementary and Cosine Similarity Channel Resampling Modules. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA; IEEE: Piscataway, NJ, USA, 2025; pp. 9437–9446.
2. Tan, Y.; Sun, K.; Wei, J.; Gao, S.; Cui, W.; Duan, Y.; Liu, J.; Zhou, W. STFNet: A Spatiotemporal Fusion Network for Forest Change Detection Using Multi-Source Satellite Images. Remote Sens. 2024, 16, 4736.
3. Guo, L.; Luo, X.; Liu, Y.; Zhang, Z.; Wu, X. SAM-guided multi-level collaborative Transformer for infrared and visible image fusion. Pattern Recognit. 2025, 162, 111391.
4. He, Y.; Ma, Z.; Wei, X.; Gong, Y. Knowledge Synergy Learning for Multi-Modal Tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5519–5532.
5. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada; IEEE: Piscataway, NJ, USA, 2017; pp. 5108–5115.
6. Shivakumar, S.S.; Rodrigues, N.; Zhou, A.; Miller, I.D.; Kumar, V.; Taylor, C.J. PST900: RGB-Thermal Calibration, Dataset and Segmentation Network. In 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France; IEEE: Piscataway, NJ, USA, 2020; pp. 9441–9447.
7. Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-Thermal Fusion Network for Semantic Segmentation of Urban Scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583.
8. Zhou, H.; Tian, C.; Zhang, Z.; Huo, Q.; Xie, Y.; Li, Z. Multispectral Fusion Transformer Network for RGB-Thermal Urban Scene Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
9. Wu, W.; Chu, T.; Liu, Q. Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation. Pattern Recognit. 2022, 131, 108881.
10. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. arXiv 2023, arXiv:2203.04838.
11. Luo, Z.; Lv, X.; Wen, X.; Zhang, X. EFNet: Multiscale Edge Fusion Network for Camouflaged Object Detection. IEEE Signal Process. Lett. 2025, 32, 2957–2961.
12. Zheng, H.; Ji, M.; Wang, H.; Liu, Y.; Fang, L. CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping. arXiv 2018.
13. Xu, H.; Ma, J.; Yuan, J.; Le, Z.; Liu, W. RFNet: Unsupervised Network for Mutually Reinforcing Multi-modal Image Registration and Fusion. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA; IEEE: Piscataway, NJ, USA, 2022; pp. 19647–19656.
14. Li, H.; Chen, Y.; Zhang, Q.; Zhao, D. BiFNet: Bidirectional Fusion Network for Road Segmentation. IEEE Trans. Cybern. 2022, 52, 8617–8628.
15. Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870.
16. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83–84, 79–92.
17. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA; IEEE: Piscataway, NJ, USA, 2022; pp. 5792–5801.
18. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42.
19. Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding from Object Detection. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada; IEEE: Piscataway, NJ, USA, 2023; pp. 13955–13965.
20. Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-Interactive Feature Learning and a Full-Time Multi-Modality Benchmark for Image Fusion and Segmentation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France; IEEE: Piscataway, NJ, USA, 2023; pp. 8081–8090.
21. Liu, J.; Lin, R.; Wu, G.; Liu, R.; Luo, Z.; Fan, X. CoCoNet: Coupled Contrastive Learning Network with Multi-Level Feature Ensemble for Multi-Modality Image Fusion. Int. J. Comput. Vis. 2024, 132, 1748–1775.
22. Zheng, N.; Zhou, M.; Huang, J.; Hou, J.; Li, H.; Xu, Y.; Zhao, F. Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA; IEEE: Piscataway, NJ, USA, 2024; pp. 26374–26385.
23. Liu, X.; Huo, H.; Li, J.; Pang, S.; Zheng, B. A Semantic-Driven Coupled Network for Infrared and Visible Image Fusion. Inf. Fusion 2024, 108, 102352.
24. Liu, J.; Zhang, B.; Mei, Q.; Li, X.; Zou, Y.; Jiang, Z.; Ma, L.; Liu, R.; Fan, X. DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA; IEEE: Piscataway, NJ, USA, 2025; pp. 2226–2235.
25. Li, Z.; Zeng, Z.; Xiao, Z.; Wen, M.; Zhang, Z.; Tian, Y. CSSA-Fusion: Channel Selective and Spatial Alignment Infrared-Visible Image Fusion. In Proceedings of the 7th ACM International Conference on Multimedia in Asia, Kuala Lumpur, Malaysia, 9–12 December 2025; ACM: New York, NY, USA, 2025; pp. 1–7.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
27. Toet, A. Progress in color night vision. Opt. Eng. 2012, 51, 010901.
28. Li, H.; Wu, X.-J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623.
29. Xu, H.; Wang, X.; Ma, J. DRF: Disentangled Representation for Visible and Infrared Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 1–13.
30. Dong, A.; Wang, L.; Liu, J.; Lv, G.; Zhao, G.; Cheng, J. MFIFusion: An infrared and visible image enhanced fusion network based on multi-level feature injection. Pattern Recognit. 2024, 152, 110445.
31. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria; International Joint Conferences on Artificial Intelligence Organization: Darmstadt, Germany, 2022; pp. 3508–3515.
32. Li, H.; Xu, T.; Wu, X.-J.; Lu, J.; Kittler, J. LRRNet: A Novel Representation Learning Guided Fusion Network for Infrared and Visible Images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052.
33. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26.
34. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada; IEEE: Piscataway, NJ, USA, 2023; pp. 5906–5916.
35. Li, H.; Wu, X.-J. CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach. Inf. Fusion 2024, 103, 102147.
36. Van Aardt, J. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522.
37. Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315.
38. Cui, G.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt. Commun. 2015, 341, 199–209.
39. Rao, Y.-J. In-fibre Bragg grating sensors. Meas. Sci. Technol. 1997, 8, 355–375.
40. Han, Y.; Cai, Y.; Cao, Y.; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion 2013, 14, 127–135.
41. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
42. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.

Table 1. Quantitative comparison of DSIFuse with other fusion methods (↑: higher is better).

| Methods | EN ↑ | MI ↑ | AG ↑ | SD ↑ | VIF ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| DenseFuse [28] | 5.937 | 2.643 | 2.058 | 23.568 | 0.692 | 0.911 |
| FusionGAN [33] | 5.432 | 1.886 | 1.451 | 17.07 | 0.443 | 0.502 |
| DRF [29] | 5.769 | 2.069 | 1.421 | 20.991 | 0.494 | 0.470 |
| SeAFusion [18] | 6.658 | 3.226 | 2.584 | 35.375 | 0.704 | 0.904 |
| UMF-CMGR [31] | 5.597 | 1.920 | 2.134 | 20.745 | 0.427 | 0.538 |
| CDDFuse [34] | 5.942 | 3.244 | 2.397 | 29.683 | 0.752 | 0.655 |
| LRRNet [32] | 5.945 | 3.603 | 2.405 | 29.725 | 0.755 | 0.905 |
| MFIFusion [30] | 7.120 | 2.558 | 2.005 | 26.342 | 0.636 | 0.797 |
| CrossFuse [35] | 5.900 | 2.42 | 1.918 | 27.521 | 0.757 | 0.833 |
| Ours | 6.636 | 3.596 | 3.089 | 39.972 | 1.050 | 0.915 |

Table 2. Quantitative comparison of DSIFuse with other fusion methods (↑: higher is better).

| Methods | EN ↑ | MI ↑ | AG ↑ | SD ↑ | VIF ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| DenseFuse [28] | 6.796 | 2.930 | 3.303 | 32.398 | 0.762 | 0.868 |
| FusionGAN [33] | 6.513 | 2.560 | 2.679 | 27.160 | 0.656 | 0.759 |
| DRF [29] | 6.758 | 2.688 | 2.876 | 30.649 | 0.771 | 0.675 |
| SeAFusion [18] | 6.887 | 2.765 | 3.697 | 33.352 | 0.706 | 0.902 |
| UMF-CMGR [31] | 6.734 | 2.236 | 2.143 | 24.467 | 0.568 | 0.855 |
| CDDFuse [34] | 5.772 | 3.235 | 2.378 | 30.873 | 0.712 | 0.776 |
| LRRNet [32] | 6.372 | 3.364 | 2.563 | 29.496 | 0.768 | 0.865 |
| MFIFusion [30] | 6.978 | 2.684 | 3.564 | 26.487 | 0.725 | 0.767 |
| CrossFuse [35] | 6.834 | 3.263 | 3.383 | 27.865 | 0.783 | 0.822 |
| Ours | 6.866 | 3.353 | 4.210 | 33.346 | 0.791 | 0.911 |

Table 3. Quantitative comparison of DSIFuse with other fusion methods (↑: higher is better).

| Methods | EN ↑ | MI ↑ | AG ↑ | SD ↑ | VIF ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| DenseFuse [28] | 6.181 | 2.133 | 2.255 | 22.568 | 0.610 | 0.819 |
| FusionGAN [33] | 6.461 | 2.356 | 2.352 | 25.368 | 0.422 | 0.679 |
| DRF [29] | 6.627 | 1.932 | 2.341 | 29.085 | 0.323 | 0.516 |
| SeAFusion [18] | 6.896 | 2.658 | 4.683 | 39.556 | 0.595 | 0.909 |
| UMF-CMGR [31] | 6.307 | 2.177 | 2.684 | 26.274 | 0.533 | 0.832 |
| CDDFuse [34] | 6.842 | 2.569 | 4.331 | 34.565 | 0.611 | 0.520 |
| LRRNet [32] | 6.836 | 2.613 | 3.716 | 35.576 | 0.532 | 0.866 |
| MFIFusion [30] | 6.913 | 2.332 | 2.456 | 30.089 | 0.505 | 0.834 |
| CrossFuse [35] | 6.835 | 2.203 | 3.813 | 37.992 | 0.622 | 0.872 |
| Ours | 6.893 | 2.779 | 4.732 | 39.263 | 0.629 | 0.916 |
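
Table 4. Quantitative detection results on the M3FD dataset (Per: person; Mot: motorcycle; Lam: lamp (streetlight); Tru: truck).

| Methods | Per | Car | Bus | Mot | Lam | Tru | mAP50 | mAP50:95 |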
|---|---|---|---|---|---|---|---|---|
| Visible | 61.5 | 81.2 | 88.6 | 62.1 | 83.9 | 72.8 | 71.9 | 52.6 |
| Infrared | 63.7 | 73.6 | 83.4 | 60.9 | 86.6 | 65.2 | 68.3 | 48.9 |
| DenseFuse [28] | 66.1 | 88.3 | 95.2 | 69.6 | 89.7 | 73.0 | 80.5 | 53.3 |
| FusionGAN [33] | 64.2 | 85.4 | 88.8 | 69.2 | 89.5 | 69.4 | 76.8 | 52.7 |
| DRF [29] | 63.3 | 81.8 | 89.7 | 66.5 | 84.8 | 67.3 | 72.1 | 51.9 |
| SeAFusion [18] | 64.6 | 83.5 | 91.1 | 67.6 | 88.7 | 72.7 | 77.0 | 52.8 |
| UMF-CMGR [31] | 66.8 | 84.7 | 92.3 | 68.8 | 89.8 | 72.2 | 77.9 | 53.2 |
| CDDFuse [34] | 66.5 | 86.5 | 95.8 | 71.8 | 89.5 | 71.5 | 81.8 | 53.9 |
| LRRNet [32] | 64.7 | 85.1 | 90.9 | 69.4 | 86.2 | 73.3 | 80.2 | 52.6 |
| MFIFusion [30] | 64.8 | 87.7 | 92.3 | 67.5 | 89.3 | 73.5 | 79.3 | 53.6 |
| CrossFuse [35] | 65.5 | 87.9 | 93.9 | 70.7 | 91.3 | 72.2 | 80.3 | 54.2 |
| Ours | 67.3 | 88.6 | 94.2 | 72.3 | 91.0 | 73.1 | 81.6 | 54.8 |

Table 5. Quantitative ablation results of DSIFuse on the MSRS dataset (↑: higher is better).

| Methods | EN ↑ | MI ↑ | AG ↑ | SD ↑ | VIF ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| w/o CFC | 5.813 | 3.431 | 2.928 | 37.394 | 0.798 | 0.741 |
| w/o CFB | 6.323 | 3.520 | 3.003 | 39.120 | 0.990 | 0.809 |
| w/o IMB | 6.026 | 3.465 | 2.994 | 38.681 | 0.947 | 0.795 |
| w/o SCB | 6.575 | 3.572 | 3.027 | 39.912 | 1.019 | 0.842 |
| w/o SG | 6.510 | 3.591 | 3.081 | 39.970 | 1.035 | 0.865 |
| Ours | 6.636 | 3.596 | 3.089 | 39.972 | 1.050 | 0.915 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yuan, L.; Xie, C.; Yang, M.; Tu, X.; Li, Q.; Zhu, X. Semantic-Guided Multi-Level Collaborative Fusion Network for Visible and Infrared Images. Sensors 2026, 26, 2577. https://doi.org/10.3390/s26092577

