SAM–Attention Synergistic Enhancement: SAR Image Object Detection Method Based on Visual Large Model
Abstract
Highlights
- A SAM-based visual large model detector for SAR is proposed, combining an Adaptive Channel Interaction Attention (ACIA) module aimed specifically at speckle suppression rather than generic robustness with a Dynamic Tandem Attention (DTA) decoder for multi-scale spatial focusing and task decoupling (a schematic sketch of the channel-attention idea follows these highlights).
- The model generalizes well and remains stable, performing strongly in cross-domain detection across different ship datasets and in few-shot detection on aircraft datasets.
- It validates the transferability and advantages of visual large model (VLM)-based networks for SAR object detection, mitigating baselines’ reliance on SAR texture cues and providing new technical support for SAR interpretation and microwave remote sensing image processing.
- The approach reduces labeled-data demands and remains robust under domain shift, improving the practicality of SAR object detection for military operations, marine supervision, disaster monitoring, and similar applications.
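The ACIA module itself is specified in Section 3.3; purely as an illustration of the global–local channel-interaction idea named in the first highlight, the following is a minimal PyTorch sketch that fuses an SE-style global squeeze branch with an ECA-style local 1D cross-channel interaction. The class name, branch layout, and hyperparameters are assumptions for illustration, not the authors' exact ACIA.

```python
import torch
import torch.nn as nn

class ChannelInteractionAttention(nn.Module):
    """Illustrative global-local channel attention (not the paper's exact ACIA).

    Fuses a global squeeze-excitation branch with a local cross-channel
    interaction branch (1D conv over the pooled channel descriptor), then
    rescales the input feature map channel-wise.
    """

    def __init__(self, channels: int, reduction: int = 16, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Global branch: bottleneck MLP over the pooled channel descriptor.
        self.global_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local branch: 1D conv captures interactions among nearby channels.
        self.local_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)                     # (B, C) channel descriptor
        g = self.global_fc(s)                           # global channel weights
        l = self.local_conv(s.unsqueeze(1)).squeeze(1)  # local channel weights
        w = torch.sigmoid(g + l).view(b, c, 1, 1)       # fused attention map
        return x * w                                    # channel-wise reweighting
```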
1. Introduction
- 1. The image encoder of SAM, pre-trained on natural images, is used as the backbone for SAR object detection, exploiting its generalized representation ability learned from large-scale data to extract detailed features from SAR images while alleviating overfitting under few-shot annotation.
- 2. Adaptive Channel Interaction Attention (ACIA) and Dynamic Tandem Attention (DTA) are combined to suppress SAR speckle noise and enhance object scattering characteristics through global–local channel fusion and a three-level attention tandem (a structural sketch of this pipeline follows the list).
- 3. Taking ships and aircraft as two representative object types, experiments on public SAR datasets demonstrate the strong potential of visual large models and multi-dimensional attention synergy for SAR object detection.
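To make the wiring of these contributions concrete, the sketch below shows one plausible way to assemble the pipeline: a pre-trained image encoder as backbone, the channel-attention block sketched after the Highlights for speckle suppression, and a simple spatial-attention stage standing in for the DTA decoder. `SpatialAttention` is a generic CBAM-style stand-in and the detection heads are minimal placeholders; none of this is the published implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic CBAM-style spatial attention; a stand-in for the DTA decoder."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # per-pixel channel mean
        mx, _ = x.max(dim=1, keepdim=True)   # per-pixel channel max
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                      # focus responses on object regions

class SamAttentionDetector(nn.Module):
    """Schematic wiring only: encoder -> channel attn -> spatial attn -> heads."""

    def __init__(self, encoder: nn.Module, channels: int, num_classes: int):
        super().__init__()
        self.encoder = encoder  # e.g., a SAM ViT image encoder whose neck
                                # returns a (B, C, H, W) feature map
        self.channel_attn = ChannelInteractionAttention(channels)  # earlier sketch
        self.spatial_attn = SpatialAttention()
        self.cls_head = nn.Conv2d(channels, num_classes, 1)  # class logits
        self.box_head = nn.Conv2d(channels, 4, 1)            # box regression

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)        # generalized pre-trained features
        feats = self.channel_attn(feats)    # suppress speckle, keep scattering cues
        feats = self.spatial_attn(feats)    # spatial focusing stand-in
        return self.cls_head(feats), self.box_head(feats)
```

With any backbone mapping images to (B, 256, H, W) features, `SamAttentionDetector(backbone, channels=256, num_classes=1)` yields dense class and box maps; a real detector would add anchor or query decoding on top of these heads.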
2. Related Work
2.1. SAR Object Detection
2.2. Cross-Domain Detection and Few-Sample Detection
2.3. Large Model Development
3. Methods
3.1. Overall
3.2. SAM Image Encoder
3.3. Adaptive Channel Interaction Attention
3.4. Decoder Combined with Dynamic Tandem Attention
4. Experimental Results and Analysis
4.1. Experimental Details
4.1.1. Datasets
4.1.2. Relevant Details
4.1.3. Evaluation Index
4.2. Experimental Results
4.2.1. Cross-Domain Detection on Ship Datasets
4.2.2. Few-Shot Detection on Aircraft Datasets
4.2.3. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhu, Q.; Zhang, Y.; Li, Z.; Yan, X.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Oil spill contextual and boundary-supervised detection network based on marine SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5213910. [Google Scholar] [CrossRef]
- Amitrano, D.; Di Martino, G.; Di Simone, A.; Imperatore, P. Flood detection with SAR: A review of techniques and datasets. Remote Sens. 2024, 16, 656. [Google Scholar] [CrossRef]
- Brenner, A.R.; Ender, J.H. Demonstration of advanced reconnaissance techniques with the airborne SAR/GMTI sensor PAMIR. IEE Proc.-Radar Sonar Navig. 2006, 153, 152–162. [Google Scholar] [CrossRef]
- Ikeuchi, K.; Shakunaga, T.; Wheeler, M.D.; Yamazaki, T. Invariant histograms and deformable template matching for SAR target recognition. In Proceedings of the CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 18–20 June 1996; pp. 100–105. [Google Scholar]
- Jianxiong, Z.; Zhiguang, S.; Xiao, C.; Qiang, F. Automatic target recognition of SAR images based on global scattering center model. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3713–3729. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Ding, S.; Wang, Q.; Guo, L.; Li, X.; Ding, L.; Wu, X. Wavelet and adaptive coordinate attention guided fine-grained residual network for image denoising. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6156–6166. [Google Scholar] [CrossRef]
- Gao, G.; Liu, L.; Zhao, L.; Shi, G.; Kuang, G. An adaptive and fast CFAR algorithm based on automatic censoring for target detection in high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2008, 47, 1685–1697. [Google Scholar] [CrossRef]
- Al-Hussaini, E.K. Performance of the greater-of and censored greater-of detectors in multiple target environments. IEE Proc. F (Commun. Radar Signal Process.) 1988, 135, 193–198. [Google Scholar]
- Bakirci, M.; Bayraktar, I. Assessment of YOLO11 for ship detection in SAR imagery under open ocean and coastal challenges. In Proceedings of the 2024 21st International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 23–25 October 2024; pp. 1–6. [Google Scholar]
- Li, K.; Wang, D.; Hu, Z.; Zhu, W.; Li, S.; Wang, Q. Unleashing channel potential: Space-frequency selection convolution for SAR object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17323–17332. [Google Scholar]
- Zhang, L.; Zheng, J.; Li, C.; Xu, Z.; Yang, J.; Wei, Q.; Wu, X. CCDN-DETR: A detection transformer based on constrained contrast denoising for multi-class synthetic aperture radar object detection. Sensors 2024, 24, 1793. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Zhou, X. Refined Deformable-DETR for SAR target detection and radio signal detection. Remote Sens. 2025, 17, 1406. [Google Scholar] [CrossRef]
- Fu, Y.; Wang, Y.; Pan, Y.; Huai, L.; Qiu, X.; Shangguan, Z.; Liu, T.; Fu, Y.; Van Gool, L.; Jiang, X. Cross-domain few-shot object detection via enhanced open-set object detector. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 247–264. [Google Scholar]
- Huang, H.; Li, B.; Zhang, Y.; Chen, T.; Wang, B. Joint distribution adaptive-alignment for cross-domain segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401214. [Google Scholar] [CrossRef]
- Han, G.; Lim, S.N. Few-shot object detection with foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28608–28618. [Google Scholar]
- Lin, H.; Li, N.; Yao, P.; Dong, K.; Guo, Y.; Hong, D.; Zhang, Y.; Wen, C. Generalization-enhanced few-shot object detection in remote sensing. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5445–5460. [Google Scholar] [CrossRef]
- Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. [Google Scholar]
- Wang, Y.; Hernández, H.H.; Albrecht, C.M.; Zhu, X.X. Feature guided masked autoencoder for self-supervised learning in remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 321–336. [Google Scholar] [CrossRef]
- Li, W.; Yang, W.; Liu, T.; Hou, Y.; Li, Y.; Liu, Z.; Liu, Y.; Liu, L. Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture. ISPRS J. Photogramm. Remote Sens. 2024, 218, 326–338. [Google Scholar] [CrossRef]
- Pu, X.; Jia, H.; Zheng, L.; Wang, F.; Xu, F. ClassWise-SAM-adapter: Parameter efficient fine-tuning adapts segment anything to SAR domain for semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4791–4804. [Google Scholar] [CrossRef]
- Ren, T.; Liu, S.; Zeng, A.; Lin, J.; Li, K.; Cao, H.; Chen, J.; Huang, X.; Chen, Y.; Yan, F.; et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv 2024, arXiv:2401.14159. [Google Scholar] [CrossRef]
- Baraha, S.; Sahoo, A.K. Synthetic aperture radar image and its despeckling using variational methods: A review of recent trends. Signal Process. 2023, 212, 109156. [Google Scholar] [CrossRef]
- Xian, S.; Zhirui, W.; Yuanrui, S.; Wenhui, D.; Yue, Z.; Kun, F. AIR-SARShip-1.0: High-resolution SAR ship detection dataset. J. Radars 2019, 8, 852–863. [Google Scholar]
- Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
- Zhirui, W.; Yuzhuo, K.; Xuan, Z.; Yuelei, W.; Ting, Z.; Xian, S. SAR-AIRcraft-1.0: High-resolution SAR aircraft detection and recognition dataset. J. Radars 2023, 12, 906–922. [Google Scholar]
- Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 260–275. [Google Scholar]
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
- Chai, B.; Nie, X.; Zhou, Q.; Zhou, X. Enhanced cascade R-CNN for multiscale object detection in dense scenes from SAR images. IEEE Sens. J. 2024, 24, 20143–20153. [Google Scholar] [CrossRef]
| Models | Proportion of SSDD (%) | Precision | Recall | mAP50 | mAP50–95 |
|---|---|---|---|---|---|
| Yolo11L [7,33] | 0 | 0.543 | 0.390 | 0.388 | 0.131 |
|  | 5 | 0.806 | 0.696 | 0.783 | 0.332 |
|  | 10 | 0.758 | 0.713 | 0.786 | 0.319 |
| Faster-RCNN [6] | 0 | 0.145 | 0.203 | 0.068 | 0.014 |
|  | 5 | 0.244 | 0.302 | 0.198 | 0.063 |
|  | 10 | 0.750 | 0.752 | 0.802 | 0.364 |
| Cascade-RCNN [34] | 0 | 0.467 | 0.405 | 0.387 | 0.205 |
|  | 5 | 0.773 | 0.661 | 0.757 | 0.329 |
|  | 10 | 0.750 | 0.707 | 0.786 | 0.456 |
| DETR [8,35] | 0 | 0.600 | 0.495 | 0.464 | 0.141 |
|  | 5 | 0.754 | 0.792 | 0.774 | 0.306 |
|  | 10 | 0.803 | 0.780 | 0.812 | 0.351 |
| Dynamic-RCNN [36] | 0 | 0.535 | 0.408 | 0.390 | 0.144 |
|  | 5 | 0.794 | 0.727 | 0.794 | 0.400 |
|  | 10 | 0.831 | 0.788 | 0.857 | 0.521 |
| RT-DETR v2 [37] | 0 | 0.362 | 0.432 | 0.352 | 0.213 |
|  | 5 | 0.788 | 0.693 | 0.782 | 0.409 |
|  | 10 | 0.766 | 0.756 | 0.814 | 0.543 |
| EC-RCNN [38] | 0 | 0.271 | 0.428 | 0.265 | 0.181 |
|  | 5 | 0.744 | 0.578 | 0.671 | 0.419 |
|  | 10 | 0.731 | 0.638 | 0.724 | 0.461 |
| Ours | 0 | 0.560 | 0.527 | 0.540 | 0.214 |
|  | 5 | 0.818 | 0.738 | 0.820 | 0.450 |
|  | 10 | 0.840 | 0.744 | 0.838 | 0.444 |
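For reference when reading these result tables, the reported indices follow the standard object-detection definitions (standard formulas, not specific to this paper): with TP, FP, and FN the true positives, false positives, and false negatives at a given IoU threshold,

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{AP} = \int_0^1 p(r)\,\mathrm{d}r
```

where p(r) is precision as a function of recall. mAP50 averages AP over classes at an IoU threshold of 0.5; mAP50–95 additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.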
| Models | Precision | Recall | mAP50 | mAP50–95 |
|---|---|---|---|---|
| Yolo11L [7,33] | 0.764 | 0.420 | 0.479 | 0.214 |
| Faster-RCNN [6] | 0.523 | 0.252 | 0.285 | 0.153 |
| Cascade-RCNN [34] | 0.676 | 0.331 | 0.428 | 0.261 |
| DETR [8,35] | 0.310 | 0.382 | 0.294 | 0.143 |
| Dynamic-RCNN [36] | 0.648 | 0.388 | 0.446 | 0.269 |
| RT-DETR v2 [37] | 0.348 | 0.443 | 0.339 | 0.226 |
| EC-RCNN [38] | 0.336 | 0.432 | 0.324 | 0.216 |
| Ours | 0.618 | 0.453 | 0.503 | 0.290 |
| Models | Val/Test | Precision | Recall | mAP50 | mAP50–95 |
|---|---|---|---|---|---|
| Yolo11L [7,33] | val | 0.795 | 0.557 | 0.594 | 0.389 |
|  | test | 0.764 | 0.420 | 0.479 | 0.314 |
|  | difference (val − test) | 0.031 | 0.137 | 0.115 | 0.075 |
| Ours | val | 0.674 | 0.554 | 0.600 | 0.332 |
|  | test | 0.618 | 0.453 | 0.503 | 0.290 |
|  | difference (val − test) | 0.056 | 0.101 | 0.097 | 0.042 |
| Models | Proportion of SSDD (%) | Precision | Recall | mAP50 | mAP50–95 |
|---|---|---|---|---|---|
| Only ACIA | 0 | 0.494 | 0.518 | 0.491 | 0.171 |
|  | 5 | 0.795 | 0.684 | 0.777 | 0.475 |
|  | 10 | 0.776 | 0.733 | 0.797 | 0.492 |
| Only DTA | 0 | 0.578 | 0.377 | 0.408 | 0.187 |
|  | 5 | 0.780 | 0.741 | 0.800 | 0.407 |
|  | 10 | 0.748 | 0.745 | 0.803 | 0.345 |
| ACIA + DTA | 0 | 0.560 | 0.527 | 0.540 | 0.214 |
|  | 5 | 0.818 | 0.738 | 0.820 | 0.450 |
|  | 10 | 0.840 | 0.744 | 0.838 | 0.444 |
| Group | Focal Entropy (H) ↓ (1) | H (2) | H (3) | H (4) | H (5) | Noise Suppression (NSR) ↑ (1) | NSR (2) | NSR (3) | NSR (4) | NSR (5) |
|---|---|---|---|---|---|---|---|---|---|---|
| b1–b5 | 0.735 | 0.849 | 0.969 | 0.909 | 0.935 | 104.928 | 32.174 | 3.733 | 28.084 | 8.409 |
| c1–c5 | 0.731 | 0.829 | 0.969 | 0.907 | 0.931 | 111.216 | 43.030 | 3.862 | 28.608 | 8.558 |
| e1–e5 | 0.914 | 0.848 | 0.934 | 0.907 | 0.922 | 14.105 | 46.512 | 7.148 | 6.096 | 23.248 |
| f1–f5 | 0.908 | 0.843 | 0.933 | 0.905 | 0.914 | 14.604 | 47.056 | 7.326 | 6.109 | 23.512 |
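The exact definitions of the focal entropy H (lower is better) and noise suppression ratio NSR (higher is better) reported above are given in the paper. As a plausible minimal computation, the sketch below assumes H is the normalized Shannon entropy of a feature or attention map and NSR is the mean in-target response divided by the mean background response given a target mask; both definitions are assumptions for illustration, not verified against the paper.

```python
import numpy as np

def focal_entropy(feature_map: np.ndarray) -> float:
    """Normalized Shannon entropy of a feature map (assumed definition).

    Values lie in [0, 1]; lower means response energy is concentrated
    on fewer pixels, i.e., sharper focus on targets.
    """
    p = np.clip(feature_map.ravel().astype(np.float64), 0.0, None)
    p = p / p.sum()
    n = p.size              # normalize by log of total pixel count
    nz = p[p > 0]           # drop zero bins (0 * log 0 := 0)
    return float(-(nz * np.log(nz)).sum() / np.log(n))

def noise_suppression_ratio(feature_map: np.ndarray, target_mask: np.ndarray) -> float:
    """Mean target response over mean background response (assumed definition).

    Higher means object responses dominate speckle/background clutter.
    """
    mask = target_mask.astype(bool)
    return float(feature_map[mask].mean() / (feature_map[~mask].mean() + 1e-12))
```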
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yuan, Y.; Yang, J.; Shi, L.; Zhao, L. SAM–Attention Synergistic Enhancement: SAR Image Object Detection Method Based on Visual Large Model. Remote Sens. 2025, 17, 3311. https://doi.org/10.3390/rs17193311