Visible–Infrared Fusion Based on CNN and Deformable Transformer
Abstract
1. Introduction
2. Related Work
2.1. Infrared and Visible Image Fusion
2.2. Multi-Modal Object Detection
2.3. Perception Based on CNN and Transformer
3. Methods
3.1. Overall Framework
3.2. Multi-Scale Feature Extraction
3.3. Mutual Deformable Attention
3.4. Image Reconstruction
3.5. Loss Function
3.5.1. Image Fusion Loss
3.5.2. Detection-Driven Loss
4. Experiments
4.1. Experimental Settings
4.1.1. Datasets
4.1.2. Evaluation Metrics
- Detection Accuracy Metrics
- Model Efficiency and Complexity Metrics
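The result tables report mAP50 and mAP75. In the common detection convention these are mean average precision computed at intersection-over-union (IoU) thresholds of 0.50 and 0.75. The sketch below shows only the box-matching step behind those thresholds. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, and the helper names `iou` and `is_true_positive` are illustrative. The evaluation protocol actually used in the paper is not shown in this excerpt.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Sequence[float], gt: Sequence[float], threshold: float) -> bool:
    """A prediction counts as a match when its IoU with the ground truth reaches the threshold."""
    return iou(pred, gt) >= threshold


# Illustrative boxes (not taken from any dataset): IoU = 80 / 120.
pred = (0.0, 0.0, 10.0, 10.0)
gt = (2.0, 0.0, 12.0, 10.0)
overlap = iou(pred, gt)
match_at_50 = is_true_positive(pred, gt, 0.50)
match_at_75 = is_true_positive(pred, gt, 0.75)
print(f"IoU={overlap:.3f}, match@0.50={match_at_50}, match@0.75={match_at_75}")
```

The same prediction can count as a hit at the 0.50 threshold and a miss at 0.75, which is why mAP75 is the stricter localization measure of the two.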
4.1.3. Implementation Details
4.2. Comparison with State-of-the-Art Models
4.2.1. Qualitative Comparison
4.2.2. Quantitative Comparison
4.2.3. Quantitative Analysis of Model Efficiency and Complexity
4.3. Ablation Study
- (1) Baseline (Row 1): Without any enhancement modules, the model achieved 60.3% mAP@50 and 33.9% mAP@75, which serves as the reference point for the comparisons below.
- (2) Adding the MFE module (Row 2): mAP@50 increased to 64.1% and mAP@75 to 34.3%, indicating that Multi-scale Feature Extraction improves the model’s ability to capture target contours and edges across scales, especially against complex backgrounds.
- (3) Adding the MDA module (Row 3): mAP@50 rose to 68.9% and mAP@75 to 40.1%, the largest single-step gain in the study (+4.8 and +5.8 points). This suggests that MDA guides the model to learn complementary features between the visible and infrared modalities, improving the quality of multi-modal fusion.
- (4) Adding the feature-fusion loss (Row 4): mAP@50 improved further to 71.5% and mAP@75 to 41.4%, indicating that this loss term helps optimize cross-modal feature alignment and reduce the redundancy caused by inter-modal discrepancies.
- (5) Adding the object-detection loss (Row 5): The full configuration reached 74.2% mAP@50 and 42.1% mAP@75, the best result among all settings. This indicates that the discriminative enhancement constraint further strengthens the model’s ability to separate targets from the background, leading to higher detection accuracy, particularly in complex scenes.
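The per-step contributions in the ablation can be checked with a few lines of arithmetic. The sketch below takes the five reported (mAP@50, mAP@75) pairs and computes the gain of each row over the previous one. The row labels and the helper `step_gains` are illustrative names, not part of the paper.

```python
# (configuration, mAP@50 %, mAP@75 %) as reported for ablation Rows 1 to 5.
ablation = [
    ("Baseline", 60.3, 33.9),
    ("+ MFE", 64.1, 34.3),
    ("+ MDA", 68.9, 40.1),
    ("+ feature-fusion loss", 71.5, 41.4),
    ("+ object-detection loss", 74.2, 42.1),
]


def step_gains(rows):
    """Return (name, gain in mAP@50, gain in mAP@75) for each row versus the row before it."""
    gains = []
    for (_, prev50, prev75), (name, cur50, cur75) in zip(rows, rows[1:]):
        # Round to one decimal to hide floating-point noise in the differences.
        gains.append((name, round(cur50 - prev50, 1), round(cur75 - prev75, 1)))
    return gains


gains = step_gains(ablation)
total_50 = round(ablation[-1][1] - ablation[0][1], 1)
total_75 = round(ablation[-1][2] - ablation[0][2], 1)

for name, g50, g75 in gains:
    print(f"{name}: {g50:+.1f} mAP@50, {g75:+.1f} mAP@75")
print(f"Total over baseline: {total_50:+.1f} mAP@50, {total_75:+.1f} mAP@75")
```

On these numbers the full model gains 13.9 points of mAP@50 and 8.2 points of mAP@75 over the baseline, with the MDA step accounting for the largest share of both.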
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Method | Modality | mAP50 (%) | mAP75 (%) |
|---|---|---|---|
| YOLOv5 | VIS | 67.8 | 25.9 |
| YOLOv5 | IR | 73.9 | 35.7 |
| YOLOv7 | VIS | 70.2 | 32.7 |
| YOLOv7 | IR | 75.6 | 32.2 |
| IFCNN | VIS + IR | 73.8 | 34.9 |
| SwinFusion | VIS + IR | 74.2 | 35.8 |
| PIAFusion | VIS + IR | 75.3 | 35.7 |
| Ours | VIS + IR | 77.8 | 34.9 |

| Method | Modality | mAP50 (%) | mAP75 (%) |
|---|---|---|---|
| YOLOv5 | VIS | 66.9 | 37.8 |
| YOLOv5 | IR | 68.7 | 39.2 |
| YOLOv7 | VIS | 69.3 | 38.1 |
| YOLOv7 | IR | 70.8 | 40.7 |
| IFCNN | VIS + IR | 67.3 | 38.7 |
| SwinFusion | VIS + IR | 68.9 | 40.5 |
| PIAFusion | VIS + IR | 69.9 | 41.6 |
| Ours | VIS + IR | 74.2 | 42.1 |

| Method | Modality | mAP50 (%) | mAP75 (%) |
|---|---|---|---|
| YOLOv5 | VIS | 90.8 | 51.9 |
| YOLOv5 | IR | 94.6 | 72.2 |
| YOLOv7 | VIS | 91.9 | 52.9 |
| YOLOv7 | IR | 96.0 | 72.9 |
| IFCNN | VIS + IR | 94.8 | 71.4 |
| SwinFusion | VIS + IR | 95.2 | 72.3 |
| PIAFusion | VIS + IR | 96.1 | 72.6 |
| Ours | VIS + IR | 98.6 | 73.4 |

| Method | Parameters (M) | FLOPs (G) | Speed (FPS) |
|---|---|---|---|
| SwinFusion | 3.09 | 40.57 | 1.90 |
| PIAFusion | 1.18 | 77.01 | 117.89 |
| Ours | 3.25 | 18.50 | 4.50 |
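
The efficiency figures are easier to compare as ratios. The sketch below divides the proposed model's parameters, FLOPs, and frame rate by those of each compared method, using only the values reported in the table. The dictionary keys and the helper `relative_to` are illustrative names.

```python
# Reported values: parameters in millions, FLOPs in giga-operations, speed in frames per second.
models = {
    "SwinFusion": {"params_m": 3.09, "flops_g": 40.57, "fps": 1.90},
    "PIAFusion": {"params_m": 1.18, "flops_g": 77.01, "fps": 117.89},
    "Ours": {"params_m": 3.25, "flops_g": 18.50, "fps": 4.50},
}


def relative_to(table, reference, target="Ours"):
    """Ratio target / reference for every metric (1.0 means the two are equal)."""
    ref, tgt = table[reference], table[target]
    return {metric: tgt[metric] / ref[metric] for metric in tgt}


ratios = {name: relative_to(models, name) for name in ("SwinFusion", "PIAFusion")}

for name, r in ratios.items():
    print(
        f"Ours vs {name}: "
        f"{r['params_m']:.2f}x parameters, "
        f"{r['flops_g']:.2f}x FLOPs, "
        f"{r['fps']:.2f}x FPS"
    )
```

Read this way, the proposed model needs under half the FLOPs of SwinFusion and about a quarter of those of PIAFusion. Its reported frame rate is higher than SwinFusion's but well below PIAFusion's, so lower FLOPs do not translate directly into higher speed here.
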

| Index | Baseline | MFE | MDA | Feature-fusion loss | Object-detection loss | mAP50 (%) | mAP75 (%) |
|---|---|---|---|---|---|---|---|
| 1 | √ | | | | | 60.3 | 33.9 |
| 2 | √ | √ | | | | 64.1 | 34.3 |
| 3 | √ | √ | √ | | | 68.9 | 40.1 |
| 4 | √ | √ | √ | √ | | 71.5 | 41.4 |
| 5 | √ | √ | √ | √ | √ | 74.2 | 42.1 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, X.; Gu, X.; Li, B.; Zhang, M.; Yang, P.; Fu, Q. Visible–Infrared Fusion Based on CNN and Deformable Transformer. J. Imaging 2026, 12, 219. https://doi.org/10.3390/jimaging12060219
