Harnessing Foundation Models for Optical–SAR Object Detection via Gated–Guided Fusion
Abstract
1. Introduction
- We introduce a Foundation Model-Guided Feature Injection strategy, which seamlessly embeds transferable representations into the detection network to augment generalization and adaptability across diverse environments.
- We design an adaptive Dual-Stream Fusion Architecture that incorporates Low- and High-Frequency Mamba (LHF-Mamba) blocks together with a Gated–Guided Fusion mechanism. This design models long-range dependencies while dynamically balancing information from the SAR and optical inputs, suppressing noise and highlighting reliable cues.
- We conduct extensive evaluations on the large-scale M4-SAR dataset [22]. Our approach achieves state-of-the-art performance, significantly improving robustness and accuracy under complex sensing conditions.
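The gating idea behind the Gated–Guided Fusion mechanism can be illustrated with a minimal, self-contained sketch. Everything below (function names, the channel-wise linear gate, the toy feature values and weights) is an illustrative assumption, not the authors' implementation, which additionally involves LHF-Mamba blocks and foundation-model priors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_opt, f_sar, w_opt, w_sar, b):
    """Channel-wise gate g in (0, 1) decides how much each modality
    contributes: fused = g * optical + (1 - g) * SAR."""
    fused = []
    for o, s, wo, ws, bc in zip(f_opt, f_sar, w_opt, w_sar, b):
        g = sigmoid(wo * o + ws * s + bc)   # gate computed from both modalities
        fused.append(g * o + (1.0 - g) * s)
    return fused

# Illustrative 4-channel features (values are made up)
f_opt = [0.8, -0.2, 0.5, 1.0]   # optical stream
f_sar = [0.1, 0.9, -0.4, 0.3]   # SAR stream (e.g., speckle-noisy)
w_opt, w_sar, b = [1.0] * 4, [1.0] * 4, [0.0] * 4
fused = gated_fusion(f_opt, f_sar, w_opt, w_sar, b)
# Each fused channel is a convex combination, so it always lies between
# the two modality values for that channel.
assert all(min(o, s) <= f <= max(o, s) for o, s, f in zip(f_opt, f_sar, fused))
print([round(v, 3) for v in fused])
```

Because the gate is a convex weight, an unreliable channel in one modality can be down-weighted without discarding the other modality's signal, which is the intuition behind "suppress noise and highlight reliable cues" above.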
2. Related Work
2.1. Application of Foundation Models
2.2. Optical and SAR Image Fusion
3. Methods
3.1. Network Structure
3.2. Low- and High-Frequency Mamba Block
3.3. Modality Fusion Module
3.4. Adaptive Prior Gating Module
3.5. Loss Function
4. Experiments and Results
4.1. Dataset
4.2. Experimental Settings
4.3. Performance Comparison
4.4. Generalization Experiments
4.5. Ablation Experiments
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
- Yang, J.; Liang, Z.; Li, J.; Gan, Y.; Zhong, J. A Novel Copy–Move Forgery Detection Algorithm via Gradient-Hash Matching and Simplified Cluster-Based Filtering. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2350011.
- Andrew, O.; Apan, A.; Paudyal, D.R.; Perera, K. Convolutional Neural Network-Based Deep Learning Approach for Automatic Flood Mapping Using NovaSAR-1 and Sentinel-1 Data. ISPRS Int. J. Geo-Inf. 2023, 12, 194.
- Guo, P.; Celik, T.; Liu, N.; Li, H.C. Piecewise Self-Adaption Weighted Attention for the Detection of Concentrated Distributions of Ships in SAR Images. Remote Sens. Lett. 2025, 16, 200–210.
- Wan, S.; Yeh, M.L.; Ma, H.L. An Innovative Intelligent System with Integrated CNN and SVM: Considering Various Crops through Hyperspectral Image Data. ISPRS Int. J. Geo-Inf. 2021, 10, 242.
- Yu, L.; Wu, H.; Liu, L.; Hu, H.; Deng, Q. TWC-AWT-Net: A Transformer-Based Method for Detecting Ships in Noisy SAR Images. Remote Sens. Lett. 2023, 14, 512–521.
- Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860.
- Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143.
- Wang, J.; Li, H.; Li, Y.; Qin, Z. A Lightweight CNN-Transformer Implemented via Structural Re-Parameterization and Hybrid Attention for Remote Sensing Image Super-Resolution. ISPRS Int. J. Geo-Inf. 2025, 14, 8.
- Ding, K.; Wang, Y.; Wang, C.; Ma, J. A New Subject-Sensitive Hashing Algorithm Based on Multi-PatchDrop and Swin-Unet for the Integrity Authentication of HRRS Image. ISPRS Int. J. Geo-Inf. 2024, 13, 336.
- Jiang, M.; Shao, H. A CNN-Transformer Combined Remote Sensing Imagery Spatiotemporal Fusion Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13995–14009.
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 103031–103063.
- Liao, J.; Wang, L. SpecSpatMamba: An Efficient Hyperspectral Image Classification Method Integrating Spectral–Spatial Dual-Path and State Space Model. Egypt. J. Remote Sens. Space Sci. 2025, 28, 628–644.
- Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification with State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605.
- Wang, Q.; Ye, H.; Liang, D.; Huang, S.J. Diffusion-Noise-Based Augmentation for Long-Tailed Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5626114.
- Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827.
- Wang, D.; Hu, M.; Jin, Y.; Miao, Y.; Yang, J.; Xu, Y.; Qin, X.; Ma, J.; Sun, L.; Li, C.; et al. HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6427–6444.
- Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642123.
- Wu, K.; Zhang, Y.; Ru, L.; Dang, B.; Lao, J.; Yu, L.; Luo, J.; Zhu, Z.; Sun, Y.; Zhang, J.; et al. A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation. Nat. Mach. Intell. 2025, 7, 1235–1249.
- Wang, C.; Lu, W.; Li, X.; Yang, J.; Luo, L. M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection. arXiv 2025, arXiv:2505.10931.
- Li, K.; Cao, X.; Meng, D. A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610112.
- Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611711.
- Wang, G.; Ma, Y.; Zhou, F.; Wang, Y.; Yan, Y.; Geng, H. RFHP-CD: A Prompt-Driven Fine-Tuning Framework of Remote Sensing Foundation Model for Building and Cropland Change Detection. IEEE Access 2025, 13, 121601–121615.
- Wang, K.; Li, Z.; Guo, J.; Wang, Y. Incremental Classification of Cross-Scene Hyperspectral Images Based on Dual Constraints and Knowledge Transfer. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5505005.
- Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal Object Detection by Channel Switching and Spatial Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 403–411.
- Zhang, J.; Cao, M.; Xie, W.; Lei, J.; Li, D.; Huang, W.; Li, Y.; Yang, X. E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NIPS ’24), Vancouver, BC, Canada, 9–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024.
- Liu, B.; Ren, B.; Hou, B.; Gu, Y. Multi-Source Fusion Network for Remote Sensing Image Segmentation with Hierarchical Transformer. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6318–6321.
- Mao, R.; Li, H.; Ren, G.; Yin, Z. Cloud Removal Based on SAR-Optical Remote Sensing Data Fusion via a Two-Flow Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7677–7686.
- Song, K.; Xue, X.; Wen, H.; Ji, Y.; Yan, Y.; Meng, Q. Misaligned Visible-Thermal Object Detection: A Drone-Based Benchmark and Baseline. IEEE Trans. Intell. Veh. 2024, 9, 7449–7460.
- Wei, T.; Chen, H.; Wang, J.; Liu, W. MDFNet: Multimodal Feature Decomposition and Fusion Network for Multimodal Remote Sensing Image Semantic Segmentation. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Zhuhai, China, 20–22 December 2024; pp. 1–5.
- Fang, Q.; Wang, Z. Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery. Pattern Recognit. 2022, 130, 108786.
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
- Lin, Z.; Nikishin, E.; He, X.; Courville, A. Forgetting Transformer: Softmax Attention with a Forget Gate. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025; Volume 2025, pp. 69704–69738.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20), Online, 6–12 December 2020; pp. 21002–21012.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zürich, Switzerland, 6–12 September 2014; pp. 740–755.
- He, X.; Tang, C.; Zou, X.; Zhang, W. Multispectral Object Detection via Cross-Modal Conflict-Aware Learning. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 29 October–3 November 2023; pp. 1465–1474.
- Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection. Pattern Recognit. 2024, 145, 109913.
- Zeng, Y.; Liang, T.; Jin, Y.; Li, Y. MMI-Det: Exploring Multi-Modal Integration for Visible and Infrared Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11198–11213.
Performance comparison on the M4-SAR dataset [22]. The six category columns (BD, HB, OT, PG, AP, WT) report per-category scores; AP50, AP75, and mAP are overall metrics. #P: number of parameters; Inf.T: inference time.

| Method | #P (M) | Inf.T (ms) | BD (%) | HB (%) | OT (%) | PG (%) | AP (%) | WT (%) | AP50 (%) | AP75 (%) | mAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CFT [33] | 53.8 | 40.6 | 75.8 | 92.5 | 61.3 | 91.6 | 90.3 | 96.3 | 84.6 | 68.9 | 59.9 |
| CLANet [39] | 48.2 | 37.1 | 74.8 | 92.2 | 60.7 | 91.3 | 91.6 | 97.2 | 84.6 | 68.5 | 59.6 |
| CSSA [27] | 13.5 | 29.1 | 73.3 | 91.7 | 59.3 | 88.9 | 91.6 | 95.8 | 83.4 | 66.4 | 58.0 |
| CMADet [31] | 41.5 | 12.3 | 70.9 | 90.7 | 52.0 | 86.4 | 91.7 | 97.1 | 81.5 | 63.5 | 55.7 |
| ICAFusion [40] | 29.0 | 23.6 | 74.7 | 91.9 | 60.9 | 91.0 | 91.8 | 96.7 | 84.5 | 67.3 | 58.8 |
| MMIDet [41] | 53.8 | 41.9 | 74.9 | 92.6 | 61.1 | 91.7 | 91.4 | 97.0 | 84.8 | 68.6 | 59.8 |
| E2E-MFD [28] | 31.3 | 37.1 | 76.1 | 91.9 | 61.1 | 91.8 | 91.3 | 97.2 | 84.9 | 69.5 | 60.5 |
| E2E-OSDet [22] | 27.5 | 20.9 | 77.7 | 90.7 | 64.3 | 91.8 | 92.1 | 97.8 | 85.7 | 70.3 | 61.4 |
| Ours | 31.4 | 28.4 | 77.6 | 96.8 | 62.9 | 94.7 | 99.1 | 96.7 | 88.0 | 74.9 | 65.6 |
Generalization comparison when training with 25%, 50%, and 70% of the M4-SAR training data.

| Method | AP50 (%) @25% | mAP (%) @25% | AP50 (%) @50% | mAP (%) @50% | AP50 (%) @70% | mAP (%) @70% |
|---|---|---|---|---|---|---|
| CFT [33] | 51.8 | 31.1 | 64.8 | 42.4 | 76.4 | 54.1 |
| CLANet [39] | 50.9 | 30.5 | 63.9 | 41.5 | 75.8 | 53.4 |
| CSSA [27] | 48.2 | 28.5 | 61.1 | 39.4 | 73.2 | 50.9 |
| CMADet [31] | 47.9 | 27.2 | 59.7 | 38.1 | 71.8 | 49.5 |
| ICAFusion [40] | 49.5 | 29.8 | 62.5 | 40.8 | 74.5 | 52.2 |
| MMIDet [41] | 52.1 | 31.5 | 65.1 | 42.9 | 76.9 | 54.5 |
| E2E-MFD [28] | 53.4 | 32.4 | 66.2 | 43.7 | 77.5 | 55.2 |
| E2E-OSDet [22] | 54.5 | 33.2 | 67.1 | 44.5 | 78.8 | 56.3 |
| Ours | 55.9 | 34.6 | 68.4 | 45.8 | 80.1 | 57.7 |
Ablation study of the HyperSIGMA-guided feature injection, the Adaptive Prior Gating (APG) module, and the Modality Fusion Module (MFM).

| HyperSIGMA | APG | MFM | AP50 (%) | AP75 (%) | mAP (%) |
|---|---|---|---|---|---|
| × | × | × | 77.9 | 59.1 | 49.8 |
| ✓ | × | × | 83.7 | 68.4 | 59.3 |
| ✓ | ✓ | × | 85.9 | 69.7 | 61.4 |
| × | × | ✓ | 83.4 | 64.7 | 56.8 |
| ✓ | × | ✓ | 84.8 | 69.0 | 59.7 |
| ✓ | ✓ | ✓ | 88.0 | 74.9 | 65.6 |
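The tables report AP50 and AP75 (AP at IoU thresholds of 0.50 and 0.75) alongside COCO-style mAP, which averages AP over the ten IoU thresholds 0.50:0.05:0.95. As a quick sketch of how the three summary metrics relate (the per-threshold AP values below are made up for illustration; in practice they come from precision–recall curves at each threshold):

```python
# COCO-style summary metrics from per-IoU-threshold AP values.
thresholds = [0.50 + 0.05 * i for i in range(10)]          # 0.50, 0.55, ..., 0.95
# Hypothetical per-threshold APs for one class: AP shrinks as the
# localization requirement (IoU threshold) tightens.
aps = [max(0.0, 0.9 - 0.6 * (t - 0.5)) for t in thresholds]

ap50 = aps[0]               # AP at IoU 0.50
ap75 = aps[5]               # AP at IoU 0.75
m_ap = sum(aps) / len(aps)  # mean over all ten thresholds

print(round(ap50, 3), round(ap75, 3), round(m_ap, 3))
```

This is why mAP is always the strictest of the three numbers in the tables above: it folds the high-IoU thresholds, where AP drops, into the average.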
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Jiang, Q.; Liao, J.; Lin, Q.; Zhang, J. Harnessing Foundation Models for Optical–SAR Object Detection via Gated–Guided Fusion. ISPRS Int. J. Geo-Inf. 2026, 15, 160. https://doi.org/10.3390/ijgi15040160
