Language-Driven Cross-Attention for Visible–Infrared Image Fusion Using CLIP
Abstract
1. Introduction
2. Related Work
2.1. Conventional Image Fusion Methods
2.2. Deep Learning-Based Image Fusion Methods
2.3. Language–Vision Models
3. Method
3.1. Image Encoder
3.2. Cross-Modality Feature Fusion
3.3. Language-Driven Image Fusion
3.4. Loss Function
4. Experiment
4.1. Implementation Details and Datasets
- MI (Mutual Information): Evaluates the degree to which information from both source images is preserved and integrated in the fused result, indicating how well the fusion process combines complementary details.
- VIF (Visual Information Fidelity): Measures the fidelity of the fused image relative to the source images, focusing on how accurately essential visual content is conveyed.
- SF (Spatial Frequency): Examines the spatial-frequency content of the fused image, reflecting the level of detail and sharpness retained.
- Qabf: Quantifies how much edge information from each source image is transferred to the fused result, indicating how effectively structural details are preserved.
- SD (Standard Deviation): Assesses the overall contrast of the fused image, highlighting its dynamic range and its ability to separate different intensity levels. A minimal computational sketch of SF, SD, and MI is given after this list.
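To make three of these measures concrete, the following is a minimal NumPy sketch of SF, SD, and MI, assuming 8-bit grayscale inputs; the function names and the histogram-based MI estimate are illustrative choices, not the exact evaluation code used for the tables below.

```python
# Illustrative sketch of three fusion metrics (SF, SD, MI) for 8-bit grayscale images.
import numpy as np

def spatial_frequency(img: np.ndarray) -> float:
    """SF: root-mean-square of horizontal and vertical intensity differences."""
    img = img.astype(np.float64)
    rf = np.diff(img, axis=1)   # horizontal (row-direction) differences
    cf = np.diff(img, axis=0)   # vertical (column-direction) differences
    return float(np.sqrt(np.mean(rf ** 2) + np.mean(cf ** 2)))

def standard_deviation(img: np.ndarray) -> float:
    """SD: global contrast, i.e., the standard deviation of pixel intensities."""
    return float(np.std(img.astype(np.float64)))

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """MI between two images, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def fusion_mi(ir: np.ndarray, vis: np.ndarray, fused: np.ndarray) -> float:
    """Fusion MI is commonly reported as MI(ir, fused) + MI(vis, fused)."""
    return mutual_information(ir, fused) + mutual_information(vis, fused)
```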
4.2. Analysis of Fusion Results with Textual Guidance
4.3. Comparison of Visible and Infrared Image Fusion
4.3.1. Saliency and Detail Preservation of Pedestrians
4.3.2. Clarity of Background and Noise Suppression
4.3.3. Distant Targets and Scene Structure
4.3.4. Overall Visual Consistency and Layered Appearance
4.3.5. Overall Performance Assessment
4.4. Impact of Caption Variability on Fusion Performance
4.5. Dataset-Specific Differences in Structural Metrics
4.6. Ablation Study
4.7. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep multimodal fusion by channel exchanging. Adv. Neural Inf. Process. Syst. 2020, 33, 4835–4845. [Google Scholar]
- Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
- Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 5009513. [Google Scholar] [CrossRef]
- Li, S.; Kang, X.; Hu, J. Image fusion with guided filtering. IEEE Trans. Image Process. 2013, 22, 2864–2875. [Google Scholar] [CrossRef]
- Llordés, A.; Garcia, G.; Gazquez, J.; Milliron, D.J. Tunable near-infrared and visible-light transmittance in nanocrystal-in-glass composites. Nature 2013, 500, 323–326. [Google Scholar] [CrossRef]
- Kim, J.U.; Park, S.; Ro, Y.M. Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1510–1523. [Google Scholar] [CrossRef]
- Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
- Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 5005014. [Google Scholar] [CrossRef]
- Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
- Zhu, Z.; Yang, X.; Lu, R.; Shen, T.; Xie, X.; Zhang, T. CLF-Net: Contrastive learning for infrared and visible image fusion network. IEEE Trans. Instrum. Meas. 2022, 71, 5021015. [Google Scholar] [CrossRef]
- Lin, J.; Zhang, J.; Lu, G. Keypoint Detection and Description for Raw Bayer Images. arXiv 2025, arXiv:2503.08673. [Google Scholar]
- Zhang, J.; Lu, G. Underground mapping and localization based on ground-penetrating radar. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 2018–2033. [Google Scholar]
- Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, S.; Wang, Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf. Fusion 2015, 24, 147–164. [Google Scholar] [CrossRef]
- Zhu, P.; Ma, X.; Huang, Z. Fusion of infrared-visible images using improved multi-scale top-hat transform and suitable fusion rules. Infrared Phys. Technol. 2017, 81, 282–295. [Google Scholar] [CrossRef]
- Dogra, A.; Goyal, B.; Agrawal, S. Bone vessel image fusion via generalized reisz wavelet transform using averaging fusion rule. J. Comput. Sci. 2017, 21, 371–378. [Google Scholar] [CrossRef]
- Dogra, A.; Goyal, B.; Agrawal, S.; Ahuja, C.K. Efficient fusion of osseous and vascular details in wavelet domain. Pattern Recognit. Lett. 2017, 94, 189–193. [Google Scholar] [CrossRef]
- Shen, R.; Cheng, I.; Basu, A. QoE-based multi-exposure fusion in hierarchical multivariate Gaussian CRF. IEEE Trans. Image Process. 2012, 22, 2469–2478. [Google Scholar] [CrossRef]
- Therrien, C.W.; Scrofani, J.W.; Krebs, W.K. An adaptive technique for the enhanced fusion of low-light visible with uncooled thermal infrared imagery. In Proceedings of the International Conference on Image Processing, Santa Barbara, CA, USA, 14–17 July 1997; Volume 1, pp. 405–408. [Google Scholar]
- Xue, Z.; Blum, R.S. Concealed weapon detection using color image fusion. In Proceedings of the 6th International Conference on Information Fusion, Cairns, Australia, 8–11 July 2003; Volume 1, pp. 622–627. [Google Scholar]
- Tang, H.; Liu, G.; Tang, L.; Bavirisetti, D.P.; Wang, J. MdedFusion: A multi-level detail enhancement decomposition method for infrared and visible image fusion. Infrared Phys. Technol. 2022, 127, 104435. [Google Scholar] [CrossRef]
- Li, H.; Liu, L.; Huang, W.; Yue, C. An improved fusion algorithm for infrared and visible images based on multi-scale transform. Infrared Phys. Technol. 2016, 74, 28–37. [Google Scholar] [CrossRef]
- Zhang, X.; Dai, X.; Zhang, X.; Jin, G. Joint principal component analysis and total variation for infrared and visible image fusion. Infrared Phys. Technol. 2023, 128, 104523. [Google Scholar] [CrossRef]
- Cvejic, N.; Bull, D.; Canagarajah, N. Region-based multimodal image fusion using ICA bases. IEEE Sens. J. 2007, 7, 743–751. [Google Scholar] [CrossRef]
- Mitianoudis, N.; Stathaki, T. Pixel-based and region-based image fusion schemes using ICA bases. Inf. Fusion 2007, 8, 131–142. [Google Scholar] [CrossRef]
- Kong, W.; Lei, Y.; Zhao, H. Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization. Infrared Phys. Technol. 2014, 67, 161–172. [Google Scholar] [CrossRef]
- Wang, J.; Peng, J.; Feng, X.; He, G.; Fan, J. Fusion method for infrared and visible images by using non-negative sparse representation. Infrared Phys. Technol. 2014, 67, 477–489. [Google Scholar] [CrossRef]
- Wang, S.; Yue, J.; Liu, J.; Tian, Q.; Wang, M. Large-scale few-shot learning via multi-modal knowledge discovery. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X. [Google Scholar]
- Bavirisetti, D.P.; Dhuli, R. Two-scale image fusion of visible and infrared images using saliency detection. Infrared Phys. Technol. 2016, 76, 52–64. [Google Scholar] [CrossRef]
- Cui, G.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt. Commun. 2015, 341, 199–209. [Google Scholar] [CrossRef]
- Han, J.; Pauwels, E.J.; Zeeuw, P.D. Fast saliency-aware multi-modality image fusion. Neurocomputing 2013, 111, 70–80. [Google Scholar] [CrossRef]
- Liu, C.H.; Qi, Y.; Ding, W.R. Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Phys. Technol. 2017, 83, 94–102. [Google Scholar] [CrossRef]
- Zhang, B.; Lu, X.; Pei, H.; Zhao, Y. A fusion algorithm for infrared and visible images based on saliency analysis and non-subsampled Shearlet transform. Infrared Phys. Technol. 2015, 73, 286–297. [Google Scholar] [CrossRef]
- Zhang, H.; Xu, H.; Xiao, Y.; Guo, X.; Ma, J. Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12797–12804. [Google Scholar]
- Zhang, H.; Ma, J. SDNet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis. 2021, 129, 2761–2785. [Google Scholar] [CrossRef]
- Zhang, J.; Lu, G. Vision-language embodiment for monocular depth estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 29479–29489. [Google Scholar]
- Zhang, J.; Reddy, P.K.; Wong, X.; Aloimonos, Y.; Lu, G. Embodiment: Self-supervised depth estimation based on camera models. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 7809–7816. [Google Scholar]
- Zhang, J.; Li, Z.; Lu, G. Language-depth navigated thermal and visible image fusion. arXiv 2025, arXiv:2503.08676. [Google Scholar]
- Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83, 79–92. [Google Scholar] [CrossRef]
- Tang, L.; Xiang, X.; Zhang, H.; Gong, M.; Ma, J. DIVFusion: Darkness-free infrared and visible image fusion. Inf. Fusion 2023, 91, 477–493. [Google Scholar] [CrossRef]
- Zhang, J.; Xu, N.; Zhang, H.; Lu, G. Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus. arXiv 2024, arXiv:2409.12323. [Google Scholar]
- Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, Y.; Shao, W.; Li, H.; Zhang, L. SwinFuse: A residual swin transformer fusion network for infrared and visible images. IEEE Trans. Instrum. Meas. 2022, 71, 5016412. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtually, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2085–2094. [Google Scholar]
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
- Kim, G.; Kwon, T.; Ye, J.C. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Yi, X.; Xu, H.; Zhang, H.; Tang, L.; Ma, J. Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27026–27035. [Google Scholar]
- Wang, Y.; Miao, L.; Zhou, Z.; Zhang, L.; Qiao, Y. Infrared and visible image fusion with language-driven loss in CLIP embedding space. arXiv 2024, arXiv:2402.16267. [Google Scholar]
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
- Hu, J.C.; Cavicchioli, R.; Capotondi, A. Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2173–2182. [Google Scholar]
- Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Toet, A. The TNO Multiband Image Data Collection. Data Brief 2017, 15, 249–251. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 1–2. [Google Scholar] [CrossRef]
- Han, Y.; Cai, Y.; Cao, Y.; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion 2013, 14, 127–135. [Google Scholar] [CrossRef]
- Eskicioglu, A.M.; Fisher, P.S. Image quality measures and their performance. IEEE Trans. Commun. 1995, 43, 2959–2965. [Google Scholar] [CrossRef]
- Xydeas, C.S.; Petrovic, V. Objective image fusion performance measure. Electron. Lett. 2000, 36, 308–309. [Google Scholar] [CrossRef]
- Li, H.; Wu, X.-J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
- Tang, W.; He, F.; Liu, Y. YDTR: Infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans. Multimed. 2022, 25, 5413–5428. [Google Scholar] [CrossRef]
- Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
- Tang, W.; He, F.; Liu, Y.; Duan, Y. MATR: Multimodal medical image fusion via multiscale adaptive transformer. IEEE Trans. Image Process. 2022, 31, 5134–5149. [Google Scholar] [CrossRef]
- Xu, H.; Ma, J.; Zhang, X.-P. MEF-GAN: Multi-exposure image fusion via generative adversarial networks. IEEE Trans. Image Process. 2020, 29, 7203–7216. [Google Scholar] [CrossRef]
- Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv 2022, arXiv:2205.11876. [Google Scholar] [CrossRef]
- Prabhakar, K.R.; Srikar, V.S.; Babu, R.V. Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4714–4722. [Google Scholar]
- Li, H.; Wu, X. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
- Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
- Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
- Rasheed, H.; Maaz, M.; Shaji, S.; Shaker, A.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Xing, E.; Yang, M.; Khan, F.S. GLaMM: Pixel Grounding Large Multimodal Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 13009–13018. [Google Scholar]
- Lin, W.; Wei, X.; An, R.; Ren, T.; Chen, T.; Zhang, R.; Guo, Z.; Zhang, W.; Zhang, L.; Li, H. Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos. arXiv 2025, arXiv:2506.05302. [Google Scholar] [CrossRef]
Method | SF | Qab/f | MI | SD | VIF |
---|---|---|---|---|---|
DeepFuse [65] | 8.3500 | 0.3847 | 13.2205 | 66.8872 | 0.5752 |
DenseFuse [66] | 9.3238 | 0.4735 | 13.7053 | 81.7283 | 0.6875 |
RFN-Nest [59] | 5.8457 | 0.3292 | 13.4547 | 67.8765 | 0.5404 |
PMGI [59] | 8.7195 | 0.3787 | 13.7376 | 69.2364 | 0.6904 |
U2Fusion [67] | 11.0368 | 0.3934 | 13.4453 | 66.5035 | 0.7680 |
IFCNN [61] | 11.8590 | 0.4962 | 13.2909 | 73.7053 | 0.6090 |
FusionGAN [68] | 8.0476 | 0.2682 | 13.0817 | 61.6339 | 0.4928 |
MEFGAN [63] | 7.8481 | 0.2076 | 13.9454 | 43.7332 | 0.7330 |
SeAFusion [9] | 11.9355 | 0.4908 | 14.0663 | 93.3851 | 0.8919 |
YDTR [60] | 3.2567 | 0.1410 | 12.3865 | 56.0668 | 0.2792 |
MATR [62] | 5.3632 | 0.2723 | 13.0705 | 78.0720 | 0.3920 |
UMF-CMGR [64] | 8.2388 | 0.3671 | 12.6301 | 60.7236 | 0.3934 |
Ours | 11.3149 | 0.5863 | 13.9676 | 94.7203 | 0.7746 |
Method | SF | Qab/f | MI | VIF | SD |
---|---|---|---|---|---|
DeepFuse [65] | 12.4175 | 0.4620 | 14.0444 | 0.4586 | 38.3328 |
DenseFuse [66] | 12.5900 | 0.4700 | 14.0723 | 0.4669 | 38.7011 |
RFN-Nest [59] | 10.6825 | 0.3844 | 14.1284 | 0.4658 | 39.7194 |
PMGI [59] | 12.0997 | 0.3951 | 14.0737 | 0.4487 | 37.9572 |
U2Fusion [67] | 17.2889 | 0.4985 | 13.4141 | 0.4917 | 37.4284 |
IFCNN [61] | 21.7698 | 0.6092 | 14.4835 | 0.6762 | 44.0938 |
FusionGAN [68] | 9.2062 | 0.0600 | 12.8981 | 0.1141 | 26.9133 |
MEFGAN [63] | 15.1905 | 0.3644 | 13.9575 | 0.8720 | 59.7947 |
SeAFusion [9] | 20.9194 | 0.6181 | 14.9016 | 0.8392 | 51.8096 |
YDTR [60] | 7.0755 | 0.1961 | 13.3858 | 0.3365 | 33.1625 |
MATR [62] | 13.5066 | 0.4282 | 11.9989 | 0.4575 | 34.1515 |
UMF-CMGR [64] | 13.4481 | 0.3707 | 13.4037 | 0.3841 | 35.1731 |
Ours | 22.1381 | 0.6329 | 14.2305 | 0.8527 | 45.1842 |
Method | SF | Qab/f | MI | SD | VIF |
---|---|---|---|---|---|
GLaMM [69] | 8.3500 | 0.3847 | 13.2205 | 66.8872 | 0.5752 |
Perceive [70] | 9.3238 | 0.4735 | 13.7053 | 81.7283 | 0.6875 |
ExpansionNet v2 [52] | 11.3149 | 0.5863 | 13.9676 | 94.7203 | 0.7746 |
| Cross | Lan | SF | Qab/f | MI | SD | VIF |
|---|---|---|---|---|---|---|
|  |  | 7.3240 | 0.4432 | 8.6232 | 51.4228 | 0.6238 |
| ✓ |  | 9.4612 | 0.5196 | 12.4140 | 72.4543 | 0.7426 |
| ✓ | ✓ | 11.3149 | 0.5863 | 13.9676 | 94.7203 | 0.7746 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Wang, X.; Wu, J.; Zhang, P.; Yu, Z. Language-Driven Cross-Attention for Visible–Infrared Image Fusion Using CLIP. Sensors 2025, 25, 5083. https://doi.org/10.3390/s25165083