Correlation and Semantic Prior-Guided Multi-Scale Cross-Modal Interaction Network for SAR-OPT Image Fusion
Highlights
- A correlation and semantic prior-guided multi-scale cross-modal interaction network, termed CSP-MCIN, is proposed to accurately align and aggregate complementary features from SAR and optical images. CSP-MCIN consists of two modality-specific encoders based on ResNet-18 and a multi-scale interactive decoder integrating cross-modal Transformers and multi-modal gated fusion units.
- A novel loss function combining a pixel-domain correlation loss and a CLIP-guided semantic consistency loss is constructed to enhance the representation of source-modal information in the fused results. Furthermore, a PCGrad-based optimization strategy is introduced to effectively mitigate modality bias and enable balanced learning across multiple modality-specific loss objectives.
- Experimental results on public datasets demonstrate that CSP-MCIN outperforms state-of-the-art methods in terms of both fusion performance and computational efficiency. Accordingly, CSP-MCIN can provide more reliable fusion representations for downstream remote sensing image interpretation tasks.
- Corresponding ablation studies verify the effectiveness of the cross-modal Transformers and gated fusion units in aligning and fusing low-level details and high-level semantic features. Moreover, within the PCGrad-based multi-objective optimization scheme, incorporating pixel-domain correlations and CLIP-derived semantic priors enhances detail fidelity and semantic consistency between the fused results and the source modalities. The proposed network architecture and loss function design provide new insights and guidance for future research on multi-modal image fusion methods.
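The PCGrad-based strategy named in the highlights follows the gradient-surgery idea: when the gradients of two loss objectives conflict (negative inner product), each is projected onto the normal plane of the other before the gradients are summed. The following is a minimal NumPy sketch of that projection step, assuming each objective's gradient has been flattened into a 1-D vector. The function name `pcgrad` and the toy gradients are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pcgrad(grads, rng=None):
    """Combine per-objective gradients with a PCGrad-style projection.

    grads: list of 1-D arrays, one flattened gradient per loss term.
    Returns the sum of the projected gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = [np.asarray(g, dtype=float) for g in grads]
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        others = [j for j in range(len(grads)) if j != i]
        # Visit the other objectives in random order.
        for j in rng.permutation(others):
            g_j = grads[j]
            dot = g @ g_j
            if dot < 0:  # conflicting directions
                # Remove the component of g that points against g_j.
                g = g - dot / (g_j @ g_j) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)

# Toy example with two conflicting objective gradients (hypothetical values).
g_a = np.array([1.0, 0.0])
g_b = np.array([-1.0, 1.0])
update = pcgrad([g_a, g_b])
```

In this toy case the two projected gradients become orthogonal to each other's original direction, so neither objective's update is cancelled by the other, which is the behaviour a modality-balanced training scheme relies on.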
Abstract
1. Introduction
- A SAR-OPT image fusion network, termed CSP-MCIN, is proposed for generating fused images with enhanced cross-modal complementary features.
- A multi-scale interaction decoder (MID) is built from cross-modal Transformers and gated fusion units (GFUs) to align and aggregate high-level semantic features and low-level detail features from the SAR and optical images.
- A novel loss function, composed of a pixel-domain correlation loss and a CLIP-guided semantic consistency loss, is developed to enhance the representation of source modalities. Furthermore, to alleviate the effect of modality bias during training, a PCGrad-based multi-objective optimization strategy is incorporated into the loss function.
- Extensive experimental results on public SAR-OPT image datasets demonstrate the effectiveness and high computational efficiency of the proposed method.
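The exact formulation of the gated fusion unit is not given in this outline. As a rough illustration of the general idea, a common gated fusion form computes a sigmoid gate from the concatenated SAR and optical features and uses it to blend the two. The sketch below assumes `(C, H, W)` feature maps and a 1x1-convolution-style gate; the names `gated_fusion`, `w`, and `b` are hypothetical and the design is not claimed to match the paper's GFU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(f_sar, f_opt, w, b):
    """Blend two feature maps with a learned per-pixel, per-channel gate.

    f_sar, f_opt: arrays of shape (C, H, W).
    w: gate weights of shape (C, 2C), acting like a 1x1 convolution.
    b: gate bias of shape (C,).
    """
    stacked = np.concatenate([f_sar, f_opt], axis=0)  # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", w, stacked) + b[:, None, None]
    gate = sigmoid(logits)                            # values in (0, 1)
    # gate -> 1 keeps the SAR feature, gate -> 0 keeps the optical feature.
    return gate * f_sar + (1.0 - gate) * f_opt

# Random stand-in features (assumption: 4 channels, 8x8 spatial size).
rng = np.random.default_rng(0)
f_sar = rng.normal(size=(4, 8, 8))
f_opt = rng.normal(size=(4, 8, 8))
fused = gated_fusion(f_sar, f_opt, np.zeros((4, 8)), np.zeros(4))
```

With zero weights and bias the gate is 0.5 everywhere, so the output is the plain average of the two inputs; training would move the gate toward whichever modality is more informative at each location.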
2. Related Works
2.1. Traditional Methods
2.2. Deep Learning-Based Methods
3. Proposed Method
3.1. Network Architecture Overview
3.2. Modality-Specific Encoder
3.3. Multi-Scale Interaction Decoder
3.3.1. High-Level Semantic Fusion Module
3.3.2. Low-Level Detail Fusion Module
3.4. Loss Function
3.4.1. Pixel-Domain Correlation Loss
3.4.2. CLIP-Guided Semantic Consistency Loss
3.4.3. Multi-Objective Optimization
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Compared Methods
4.1.3. Evaluation Metrics
4.2. Implementation Details
4.3. Training Stability Analysis
4.4. Experimental Results
4.5. Effectiveness of High-Level Semantic Alignment Strategy
4.6. Model Complexity Analysis
4.7. Ablation Study
4.7.1. Effectiveness of the Multi-Scale Interaction Decoder
4.7.2. Effectiveness of Loss Formulations and Optimization Strategies
4.7.3. Effectiveness of Cross-Modal Interaction and Gated Fusion Mechanisms
4.7.4. Necessity of PCGrad
4.8. Generalization and Transferability Analysis
4.8.1. Generalization to Diverse SAR-OPT Datasets
4.8.2. Cross-Dataset Transferability Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| 0.3 | 0.7 | 0.7 | 0.3 | 0.4 | 0.6 | 0.7 | 0.3 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1 |

| Method | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|
| LP | 0.077 | 0.366 | 5.926 | 0.033 | 1.443 | 0.486 | 2.467 |
| VSFF | 0.072 | 0.631 | 5.870 | 0.083 | 2.809 | 0.519 | 2.274 |
| FusionGAN | 0.112 | 0.254 | 6.484 | 0.204 | 1.548 | 0.354 | 1.600 |
| DDcGAN | 0.062 | 0.369 | 5.816 | 0.067 | 2.703 | 0.595 | 1.947 |
| GANMcC | 0.104 | 0.606 | 6.395 | 0.412 | 2.156 | 0.417 | 1.329 |
| U2Fusion | 0.095 | 0.590 | 6.235 | 0.321 | 2.154 | 0.361 | 1.519 |
| SwinFusion | 0.166 | 0.537 | 6.910 | 0.617 | 2.134 | 0.342 | 1.674 |
| TUFusion | 0.106 | 0.623 | 6.431 | 0.416 | 2.284 | 0.382 | 1.544 |
| FreeFusion | 0.193 | 0.642 | 7.233 | 0.754 | 2.530 | 0.490 | 1.409 |
| CSP-MCIN | 0.184 | 0.691 | 7.319 | 0.687 | 2.948 | 0.612 | 1.133 |
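Three of the reported metrics (SD, EN, CC) have widely used closed-form definitions. The sketch below shows how they are commonly computed, assuming single-channel images scaled to [0, 1], a 256-bin histogram for entropy, and CC taken as the mean Pearson correlation between the fused image and each source. These conventions and the function names are assumptions; the paper's evaluation code may differ.

```python
import numpy as np

def standard_deviation(img):
    """SD: spread of pixel intensities (higher suggests more contrast)."""
    return float(np.std(np.asarray(img, dtype=float)))

def entropy(img, bins=256):
    """EN: Shannon entropy (bits) of the intensity histogram."""
    hist, _ = np.histogram(np.asarray(img, dtype=float), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def _pearson(x, y, eps=1e-12):
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + eps))

def correlation_coefficient(fused, src_a, src_b):
    """CC: mean Pearson correlation of the fused image with both sources."""
    return 0.5 * (_pearson(fused, src_a) + _pearson(fused, src_b))

# Tiny synthetic check image: half the pixels at 0, half at 1.
img = np.array([[0.0, 1.0], [0.0, 1.0]])
sd_value = standard_deviation(img)
en_value = entropy(img)
```

For an image whose pixels are split evenly between 0 and 1, this gives an SD of 0.5 and an entropy of 1 bit, which is a quick sanity check before applying the functions to real fused images.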
| Method | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|
| LP | 0.079 | 0.407 | 5.602 | 0.032 | 1.325 | 0.391 | 2.308 |
| VSFF | 0.141 | 0.506 | 6.870 | 0.210 | 2.113 | 0.464 | 0.899 |
| FusionGAN | 0.120 | 0.478 | 6.824 | 0.195 | 1.904 | 0.440 | 0.780 |
| DDcGAN | 0.103 | 0.419 | 6.299 | 0.090 | 2.277 | 0.527 | 1.543 |
| GANMcC | 0.111 | 0.629 | 6.739 | 0.267 | 1.568 | 0.359 | 0.743 |
| U2Fusion | 0.109 | 0.656 | 6.697 | 0.280 | 1.671 | 0.365 | 0.745 |
| SwinFusion | 0.156 | 0.603 | 7.255 | 0.358 | 1.650 | 0.414 | 0.606 |
| TUFusion | 0.119 | 0.653 | 6.831 | 0.322 | 1.578 | 0.358 | 0.643 |
| FreeFusion | 0.144 | 0.539 | 6.080 | 0.376 | 1.205 | 0.294 | 1.630 |
| CSP-MCIN | 0.194 | 0.641 | 7.364 | 0.440 | 2.111 | 0.519 | 0.574 |

| Metric | FusionGAN | DDcGAN | GANMcC | U2Fusion | SwinFusion | TUFusion | FreeFusion | CSP-MCIN |
|---|---|---|---|---|---|---|---|---|
| Params↓ | 0.9 | 1.1 | 1.9 | 0.7 | 1.0 | 19.1 | 5.7 | 37.7 |
| GFLOPs↓ | 51.4 | 211.6 | 28.6 | 43.2 | 76.0 | 27.9 | 96.8 | 24.1 |
| Time↓ | 17.5 | 16.4 | 27.2 | 17.2 | 194.9 | 139.2 | 25.3 | 12.9 |
| Mem↓ | 205.6 | 1231.7 | 241.2 | 260.6 | 706.6 | 1339.3 | 341.5 | 306.7 |

| Method | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|
| w/o HSFM | 0.177 | 0.673 | 7.152 | 0.651 | 2.744 | 0.601 | 1.139 |
| w/o LDFM | 0.198 | 0.601 | 7.354 | 0.629 | 1.365 | 0.401 | 1.688 |
| w/o Atten | 0.187 | 0.691 | 7.312 | 0.703 | 2.881 | 0.599 | 1.135 |
| w/o GFU | 0.186 | 0.688 | 7.301 | 0.709 | 2.808 | 0.597 | 1.165 |
| w/o | 0.196 | 0.681 | 7.330 | 0.869 | 2.493 | 0.565 | 1.787 |
| w/o | 0.191 | 0.666 | 7.282 | 0.890 | 2.419 | 0.540 | 1.807 |
| w/o PCGrad | 0.196 | 0.595 | 7.375 | 0.573 | 1.319 | 0.372 | 1.670 |
| vanilla loss | 0.157 | 0.557 | 6.954 | 0.600 | 2.295 | 0.468 | 2.184 |
| CSP-MCIN | 0.184 | 0.691 | 7.319 | 0.687 | 2.948 | 0.612 | 1.133 |

|  |  | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|---|
| 0.2 | 0.7 | 0.184 | 0.694 | 7.315 | 0.645 | 2.944 | 0.603 | 1.019 |
| 0.3 | 0.7 | 0.184 | 0.691 | 7.319 | 0.687 | 2.948 | 0.612 | 1.133 |
| 0.4 | 0.7 | 0.181 | 0.685 | 7.264 | 0.740 | 2.824 | 0.585 | 0.908 |
| 0.5 | 0.7 | 0.181 | 0.676 | 7.263 | 0.794 | 2.804 | 0.586 | 1.108 |
| 0.6 | 0.7 | 0.181 | 0.675 | 7.246 | 0.789 | 2.752 | 0.573 | 1.087 |
| 0.7 | 0.7 | 0.182 | 0.667 | 7.209 | 0.826 | 2.644 | 0.554 | 1.064 |
| 0.3 | 0.3 | 0.124 | 0.651 | 6.818 | 0.248 | 2.993 | 0.581 | 0.970 |
| 0.3 | 0.4 | 0.147 | 0.667 | 7.043 | 0.382 | 2.970 | 0.586 | 0.862 |
| 0.3 | 0.5 | 0.161 | 0.681 | 7.166 | 0.481 | 3.006 | 0.588 | 0.898 |
| 0.3 | 0.6 | 0.172 | 0.689 | 7.254 | 0.586 | 2.965 | 0.596 | 0.962 |
| 0.3 | 0.8 | 0.185 | 0.681 | 7.175 | 0.780 | 2.709 | 0.571 | 1.017 |

| Method | CC↑ | EN↑ | VIFF↑ | CLIP-IQA↑ | Params↓ | GFLOPs↓ | Time↓ | Mem↓ |
|---|---|---|---|---|---|---|---|---|
| CFCFNet | 0.667 | 7.114 | 0.641 | 0.587 | 41.0 | 18.4 | 20.8 | 317.6 |
| CSP-MCIN | 0.691 | 7.319 | 0.687 | 0.612 | 37.7 | 24.1 | 12.9 | 306.7 |

| Method |  |  | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|---|---|
| w/WL | 0.2 | 0.8 | 0.184 | 0.687 | 7.320 | 0.529 | 2.922 | 0.610 | 1.242 |
| w/WL | 0.3 | 0.7 | 0.188 | 0.595 | 7.318 | 0.573 | 1.319 | 0.372 | 1.670 |
| w/WL | 0.4 | 0.6 | 0.182 | 0.692 | 7.312 | 0.682 | 2.906 | 0.603 | 1.133 |
| w/WL | 0.5 | 0.5 | 0.184 | 0.682 | 7.307 | 0.683 | 2.791 | 0.594 | 1.311 |
| w/WL | 0.6 | 0.4 | 0.180 | 0.669 | 7.231 | 0.669 | 2.689 | 0.557 | 1.236 |
| w/WL | 0.7 | 0.3 | 0.180 | 0.651 | 7.212 | 0.614 | 2.563 | 0.529 | 1.369 |
| w/PCGrad | 0.3 | 0.7 | 0.184 | 0.691 | 7.319 | 0.687 | 2.948 | 0.612 | 1.133 |

| Train | Test | SD↑ | CC↑ | EN↑ | VIFF↑ | LIQE↑ | CLIP-IQA↑ | CMMDF↓ |
|---|---|---|---|---|---|---|---|---|
| SEN1-2 | WOS | 0.179 | 0.689 | 7.321 | 0.579 | 3.156 | 0.668 | 2.136 |
| QS | WOS | 0.166 | 0.685 | 7.210 | 0.624 | 2.787 | 0.610 | 2.114 |
| WOS | WOS | 0.184 | 0.691 | 7.319 | 0.687 | 2.948 | 0.612 | 1.133 |
| WOS | SEN1-2 | 0.203 | 0.628 | 7.407 | 0.440 | 1.856 | 0.504 | 0.946 |
| QS | SEN1-2 | 0.200 | 0.631 | 7.438 | 0.431 | 1.884 | 0.527 | 1.249 |
| SEN1-2 | SEN1-2 | 0.194 | 0.641 | 7.364 | 0.440 | 2.111 | 0.519 | 0.574 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Hou, X.; Zhou, L.; Feng, C.; Cha, H.; Liu, Y.; Liu, L.; Liu, H. Correlation and Semantic Prior-Guided Multi-Scale Cross-Modal Interaction Network for SAR-OPT Image Fusion. Remote Sens. 2026, 18, 975. https://doi.org/10.3390/rs18070975

