Article

SAM–Attention Synergistic Enhancement: SAR Image Object Detection Method Based on Visual Large Model

1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
2 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3311; https://doi.org/10.3390/rs17193311
Submission received: 6 August 2025 / Revised: 17 September 2025 / Accepted: 25 September 2025 / Published: 26 September 2025
(This article belongs to the Special Issue Big Data Era: AI Technology for SAR and PolSAR Image)


Highlights

What are the main findings?
  • A SAM-based visual large model detector for SAR is proposed, combining an Adaptive Channel Interaction Attention (ACIA) module aimed specifically at speckle suppression (rather than generic robustness) and a Dynamic Tandem Attention (DTA) decoder for multi-scale spatial focusing and task decoupling.
  • The model shows strong generalization ability and stability, performing well in cross-domain detection across different ship datasets and in few-shot detection on an aircraft dataset.
What is the implication of the main finding?
  • It validates the transferability and advantages of visual large model (VLM)-based networks for SAR object detection, mitigating baselines’ reliance on SAR texture cues and providing new technical support for SAR interpretation and microwave remote sensing image processing.
  • The approach reduces labeled-data demands and remains robust under domain shift, improving the practicality of SAR object detection for military operations, marine supervision, disaster detection, etc.

Abstract

Object detection models for synthetic aperture radar (SAR) images require strong generalization ability and stable detection performance because of the complex scattering mechanisms, high sensitivity to orientation angle, and susceptibility to speckle noise of SAR imagery. Visual large models possess strong generalization capabilities for natural image processing, but their application to SAR imagery remains relatively rare. This paper attempts to introduce a visual large model into the SAR object detection task, aiming to alleviate the weak cross-domain generalization and poor adaptability to few-shot samples that the characteristics of SAR images cause in existing models. The proposed model comprises an image encoder, an attention module, and a detection decoder. The image encoder leverages the pre-trained Segment Anything Model (SAM) for effective feature extraction from SAR images. An Adaptive Channel Interaction Attention (ACIA) module is introduced to suppress SAR speckle noise. Further, a Dynamic Tandem Attention (DTA) mechanism is proposed in the decoder to integrate scale perception, spatial focusing, and task adaptation, while decoupling classification from detection for improved accuracy. Leveraging the strong representational and few-shot adaptation capabilities of large pre-trained models, this study evaluates their cross-domain and few-shot detection performance on SAR imagery. For cross-domain detection, the model was trained on AIR-SARShip-1.0 and tested on SSDD, achieving an mAP50 of 0.54. For few-shot detection on SAR-AIRcraft-1.0, using only 10% of the training samples, the model reached an mAP50 of 0.503.

1. Introduction

Synthetic aperture radar (SAR) plays an important role in tasks such as marine surveillance, disaster monitoring, and military reconnaissance owing to its all-weather, day-and-night imaging capability [1,2,3]. Early SAR object detection studies relied mainly on template matching [4] and scattering center analysis [5], but these approaches struggled to meet practical accuracy requirements because of speckle noise and geometric deformation in the images. Recently, deep learning-based methods, such as Faster R-CNN [6], YOLO [7], and Transformer [8], have significantly improved detection performance through end-to-end feature learning. However, these methods usually require large amounts of training data and assume that training and test data come from the same distribution. In real-world applications, by contrast, SAR data are limited and expensive to acquire, and model performance degrades drastically under differences in sensor imaging band, resolution, and viewing angle. Existing object detection methods mainly alleviate this problem through data augmentation or by stacking additional network modules [9]. Although augmentation can improve sample diversity, it struggles to capture the physical complexity of real scenes, and overly complex network designs also limit the model’s capacity to generalize.
In recent years, the development of large model technology has provided new insights for addressing the challenges of model generalization and stability. In Natural Language Processing (NLP), the GPT series models [10] have demonstrated strong cross-task generalization through the “pre-training and fine-tuning” paradigm. In vision, the Segment Anything Model (SAM) [11] has become a new benchmark for generalized image segmentation thanks to its large-scale pre-training and prompt-driven, flexible architecture. Nevertheless, SAR scattering properties differ greatly from the textural characteristics of natural images, and the generalized features learned by existing large vision models trained on everyday photographs are difficult to adapt directly to SAR imagery. However, these large models have strong global semantic understanding and contextual information extraction capabilities and can effectively capture the spatial structure and contour information of objects. Although some of their generalized features are not fully consistent with the physical scattering mechanisms of SAR images, after appropriate transfer and adaptation they can still effectively reduce the dependence on labeled SAR data and improve robustness against complex backgrounds, speckle noise, and diverse object morphologies. Many scholars adopt attention modules to better adapt SAR images to networks, such as SE [12] and wavelet attention [13]; although the module designs vary, the aim is to enhance effective features and suppress clutter and noise. Therefore, this paper constructs a SAM-based object detection model for SAR images, with the following primary contributions:
1. The image encoder obtained by pre-training SAM on natural images is used as the backbone for SAR object detection, exploiting the generalized representation ability learned from large-scale data to extract detailed features from SAR images while alleviating overfitting under few-shot annotation.
2. Adaptive Channel Interaction Attention (ACIA) and Dynamic Tandem Attention (DTA) are combined to suppress SAR speckle noise and enhance object scattering characteristics through global–local channel fusion and a three-level attention tandem.
3. Taking ships and airplanes as example objects, experiments on public SAR datasets fully demonstrate the potential of visual large models and multi-dimensional attention synergy for SAR object detection.

2. Related Work

2.1. SAR Object Detection

In recent decades, synthetic aperture radar has emerged as a crucial object detection technique in remote sensing because of its all-weather, day-and-night observation of the ground and its independence from natural conditions such as clouds and fog. Initially, SAR object detection relied on traditional processing methods. Gao et al. [14] proposed a fast CFAR detection algorithm based on the G0 distribution, addressing the difficulty of traditional CFAR in accurately modeling SAR clutter distributions, but it remained inadequate for multi-scale object detection. Al-Hussaini [15] introduced a P-G0 distributed CFAR model under polarized matched filtering, which improves detection accuracy in complex backgrounds but has limited robustness to non-uniform clutter and very small targets.
As deep learning has gained popularity, more scholars have adopted neural network architectures for SAR object detection. Bakirci et al. [16] conducted a systematic evaluation of SAR ship detection in open-sea areas and complex nearshore backgrounds based on YOLO11 and drew conclusions on the speed–accuracy trade-off and scene sensitivity. Li et al. [17] proposed spatial-frequency selective convolution and lightweighting for SAR target detection, enhancing feature diversity and suppressing nearshore clutter within a single-layer convolution through a “split–perception–selection” strategy. The Transformer [8] handles long-range dependencies and complex semantics well through its self-attention mechanism, and many researchers have introduced the Vision Transformer (ViT) to SAR object detection tasks. Zhang et al. [18] developed CCDN-DETR, which accelerates convergence and improves multi-category detection performance through a cross-scale encoder and contrastive denoising training, but still suffers from missed detections in extreme multi-object scenarios; Li et al. [19] proposed Refined Deformable-DETR, which fuses half-window filtering with multi-scale adapters to fully exploit SAR signal priors, significantly improving overall detection accuracy but greatly increasing computational overhead.
In summary, SAR object detection methods, whether based on physical mechanisms or deep learning, can achieve high accuracy in specific scenarios, but few studies have explored model performance across different data domains and with fewer training samples.

2.2. Cross-Domain Detection and Few-Sample Detection

Cross-domain detection concerns data from a source domain and a target domain: although the training and test sets are labeled with the same broad object classes, their marginal distributions in the data space differ significantly because of factors such as sensor type, imaging angle, or resolution. To address this marginal distribution shift across domains, Fu et al. [20] proposed CD-ViTO based on an open-set detector; it first quantifies inter-domain differences with inter-class variance and then enhances cross-domain generalization through learnable instance alignment, instance reweighting, and domain suggester synthesis. Huang et al. [21] proposed the Joint Distribution Adaptive Alignment Framework (JDAF), which enhances the cross-domain transfer performance of remote sensing image segmentation by combining different alignment distributions with an uncertainty-adaptive learning strategy; however, its adaptation ability under incomplete class overlap has not yet been verified. To address scenarios with little training data, Han et al. [22] proposed a few-shot object detection framework based on a foundation model, which performs context-driven classification of candidate boxes through carefully designed language instructions, significantly simplifying the cumbersome design of traditional metric learning networks; Lin et al. [23] proposed GE-FSOD, which enhances multi-scale representation through cross-layer fused pyramid attention, improves candidate quality through multi-stage refinement, and strengthens few-shot classification with a generalized classification loss. It has demonstrated outstanding accuracy in few-shot object detection, but whether it suits SAR images remains to be verified.
Most of the above methods require repeated tuning for different scenarios to construct accurate alignment strategies, which makes it difficult to achieve stable and efficient detection in SAR object detection tasks with large sensor differences and high intra-class variability.

2.3. Large Model Development

Recently, large-scale models have developed rapidly. Large language models represented by GPT-3 [10], through self-supervised training on massive corpora and fine-tuning on multi-turn dialogues, have demonstrated excellent generalization and few-shot learning ability, bringing a paradigm-level change to natural language understanding and generation. Inspired by this, many large model frameworks have also emerged in the vision field. Meta AI released SAM [11], which builds a promptable segmentation framework on large-scale labeling and self-supervised pre-training and can perform zero-shot segmentation of arbitrary objects with only a simple point, box, or textual cue, greatly reducing the reliance on specialized labeled data; Siméoni et al. [24] proposed the self-supervised visual foundation model DINOv3, which, through large-scale data and model training and by introducing Gram anchoring to stabilize dense features under long-term training, demonstrates strong cross-task capabilities without fine-tuning.
In SAR imaging applications, self-supervised pre-training improves the generalization of models to complicated signals. The Feature-Guided Masked Autoencoder (FG-MAE) [25] greatly improves ViT’s performance on SAR downstream classification and segmentation tasks by using the histogram of oriented gradients (HOG) of SAR images as the reconstruction target; Li et al. [26] proposed SAR-JEPA, a self-supervised framework for SAR object recognition that constructs high-quality learning signals through a joint-embedding prediction strategy of “local masking and multi-scale SAR gradient characterization” and shows certain advantages in multi-object detection. For existing visual large models, ClassWise-SAM-Adapter [27] enables efficient fine-tuning for SAR land cover classification by inserting a lightweight adapter and designing a category mask decoder while freezing most of the SAM parameters; Ren et al. [28] optimized the synergy between SAM and Grounding DINO for remote sensing zero-shot segmentation by generating candidate boxes through textual prompts and invoking SAM to complete the segmentation, demonstrating the open-domain adaptation potential of foundation models in remote sensing scenarios.
Although significant progress has been made in applying visual large models to remote sensing image processing both domestically and internationally, research on object detection for SAR images remains limited. In this paper, we adopt a visual large model framework to leverage the richer and broader feature representations of large models and their strong generalization ability, using SAM as an example to explore the role of visual large models in SAR object detection under cross-domain conditions and with few training samples.

3. Methods

3.1. Overall

Figure 1 depicts the design of the proposed model. In the encoder stage, we adopt the image encoder of SAM [11], retaining its Vision Transformer backbone to extract geometric, textural, and scattering features from the input SAR image. Subsequently, an adaptive channel interaction attention mechanism is applied to integrate both local and global information, enhancing the scattering-dominant channels while suppressing those dominated by speckle noise. The decoder consists of two detection heads for regression and classification. First, Dynamic Tandem Attention is computed on the channel-weighted feature maps to adapt to variations in object size and to focus on strong scattering points. Then, the features are decoupled into two branches that decode the deep representations into bounding boxes containing both object location and category information.
The design of the model structure takes the SAM image encoder as the backbone, aiming to utilize the powerful generalization ability and stability of the visual large model to achieve SAR object detection. To enable the model to better adapt to and process the input of SAR images, this paper incorporates an attention mechanism. The noise inherent to SAR images is suppressed through the ACIA, and the key features of the object are highlighted through the DTA, enabling precise object detection. The visualization results of the feature maps at each stage are shown in Figure 2.
In the entire detection model, the input SAR image first extracts hierarchical features through the SAM encoder. Subsequently, the ACIA interacts and weights these features in both channel and spatial dimensions to suppress speckle noise, etc. Then, the DTA dynamically aggregates positional, spatial, and task-aware attention to highlight the object-related features. Finally, the processed features are decoupled into classification and regression branches, focusing respectively on the discrimination of object categories and the localization of bounding boxes.

3.2. SAM Image Encoder

The image encoder in the SAM model is built upon the Transformer [8] architecture. The Transformer globally models contextual information through self-attention, enhancing detection robustness in complex scenarios; its parallelized structure can also capture long-range dependencies and strengthen the feature representation of multi-scale objects. The encoder begins by partitioning the input image into patches and adding learnable positional embeddings to generate a sequence of tokens. This sequence then passes through a series of Vision Transformer (ViT) blocks, comprising four layers of global attention and eight layers of windowed local attention. After the ViT blocks, a 1 × 1 convolution reduces the channel dimension from 768 to 256, followed by a 3 × 3 convolution and layer normalization that perform local smoothing and feature fusion. The final output is an image embedding with a shape of 256 × 64 × 64.
The SAM model was extensively trained on the SA-1B dataset, which primarily consists of natural images rather than remote sensing imagery. Nevertheless, the model is capable of learning rich visual and semantic features through large-scale data. Although natural and SAR images are not the same, the texture and structure of various objects in SAR images exhibit consistent variation, especially under ultra-high-resolution imaging techniques such as microwave photon-based systems. These characteristics render SAR images visually similar to natural images in certain aspects. Therefore, in this work, we utilize the officially released SAM weights by isolating and retaining the image encoder’s parameters, which are loaded during training. This approach provides two advantages: first, it endows the network with strong spatial awareness and structured feature extraction capabilities from the outset; second, it significantly reduces the adaptation time required to address issues unique to SAR, including speckle noise, object scattering characteristics, and texture patterns, thereby improving both training efficiency and detection performance.
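For concreteness, the sketch below shows how the released SAM image encoder can be isolated and reused as the feature backbone described above. This is a minimal sketch, assuming the ViT-B variant (consistent with the twelve-block description) and the official checkpoint file name, neither of which the paper states explicitly.

```python
import torch
from segment_anything import sam_model_registry  # official SAM package

# Load the officially released SAM weights (ViT-B variant and checkpoint name are assumptions).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
encoder = sam.image_encoder  # ViT backbone + 1x1/3x3 convolutional neck

# A SAR patch resized to the encoder's expected 1024 x 1024 input
# (single-channel SAR is assumed to be replicated to 3 channels).
x = torch.randn(1, 3, 1024, 1024)
with torch.no_grad():
    embedding = encoder(x)  # image embedding of shape (1, 256, 64, 64)

print(embedding.shape)
```

The embedding produced here is the feature map that the ACIA module and the decoder operate on in the following sections.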

3.3. Adaptive Channel Interaction Attention

SAR images are characterized by speckle noise and strong multipath scattering [29], while the initial visual features extracted by the encoder primarily focus on texture and structural information. This mismatch makes it difficult for the encoder to effectively suppress the high-noise background typical of SAR images. To address this challenge and maintain robust model performance, we introduce an adaptive channel interaction attention (ACIA) mechanism tailored to the noise characteristics and scattering structures of SAR data. Specifically, the global channel dependence helps the model to obtain the scattering feature patterns in the global field of view and effectively capture the overall structural profile of the object, while the local channel dependence increases the model’s capacity to obtain the local scattering and texture information of the object. This adaptive multi-scale fusion of channel information enables the model to selectively amplify feature channels associated with object scattering while effectively suppressing interference from non-object regions, thereby enhancing the model’s robustness in SAR images with a high-noise background.
This attention is inspired by the Squeeze-and-Excitation (SE) module [12]. The SE attention captures global information via fully connected layers but lacks effective interaction with local information, which may lead to biased feature weighting for image regions relevant to the object. The attention mechanism proposed in this study effectively integrates both global and local information, enabling a more rational allocation of channel weights and extraction of more informative features. Specifically, for a given feature map $F \in \mathbb{R}^{C \times H \times W}$, global average pooling produces a channel descriptor $U$ of length $C$ (1):
$$U_n = \frac{1}{H \times W} \sum_{\alpha=1}^{H} \sum_{\beta=1}^{W} F_n(\alpha, \beta)$$
$F_n(\alpha, \beta)$ denotes the feature value at spatial position $(\alpha, \beta)$ in the n-th channel. To capture local dependencies between channels without increasing computational complexity, a one-dimensional convolution is applied to the spatially flattened, channel-rearranged tensor, yielding the local dependency vector $U_L$. Similarly, to model long-range dependencies across channels, a two-dimensional convolution is used to generate the global dependency vector $U_G$. To enable effective interaction between local and global information, we perform a correlation operation that combines the two dependency vectors, capturing the multi-scale relevance between them. The correlation matrix M is computed as follows (2):
$$M = U_G \cdot U_L^{T}$$
To accurately map the obtained correlation representations into feature weights for each channel, this paper adopts an adaptive fusion strategy, and the specific process is shown in Figure 3.
In this process, the two correlated representations are first individually compressed and then normalized using a sigmoid activation. Subsequently, a learnable parameter λ is introduced, and a second sigmoid operation is applied to fuse the global and local weights in a hybrid manner. This approach not only avoids redundant cross-correlation operations between local and global information but also further enhances their interaction. The fusion process is formulated as (3):
$$W = \sigma\big(\sigma(\lambda) \times \sigma(U_G) + (1 - \sigma(\lambda)) \times \sigma(U_L)\big)$$
$U_G$ and $U_L$ represent the adaptive attention scores for each channel. At the beginning of training, $\lambda$ is initialized to 0, so that $\sigma(\lambda) = 0.5$ serves as an unbiased starting point; $\lambda$ is then optimized end-to-end jointly with the remaining network weights. $\lambda$ is added to the AdamW parameter group but is excluded from weight decay. The total loss is backpropagated to $\sigma(\lambda)$ and then, via the chain rule, to $\lambda$ itself. The gradient with respect to $\lambda$ is given by (4):
$$\frac{\partial L}{\partial \lambda} = \left\langle \frac{\partial L}{\partial W},\ \sigma(U_G) - \sigma(U_L) \right\rangle \cdot \sigma(\lambda)\big(1 - \sigma(\lambda)\big)$$
Finally, the fused channel weights are multiplied element-wise with the input feature map to produce the enhanced feature representation (5):
$$F^{*} = W \odot F$$
Through the adaptive channel interaction attention mechanism, the model selectively emphasizes response channels associated with object scattering features while suppressing interference channels related to sea surface, ground scenes, and other background clutter.
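The following PyTorch sketch illustrates one possible implementation of ACIA following Equations (1)–(5); it is a minimal sketch, not the authors’ code. The 1-D kernel size, the use of a 1 × 1 convolution for the “two-dimensional convolution” of the global branch, and the row/column means used to compress the correlation matrix M into per-channel scores are all assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn as nn


class ACIA(nn.Module):
    """Minimal sketch of Adaptive Channel Interaction Attention (Eqs. 1-5), with assumed details."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Local channel dependency: 1-D conv over the channel descriptor (kernel size k is an assumption).
        self.local_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # Global channel dependency: 1x1 2-D conv mixing all channels (one reading of the "2-D convolution").
        self.global_conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # Learnable fusion parameter lambda, initialised to 0 so that sigmoid(lambda) = 0.5 (unbiased start).
        self.lam = nn.Parameter(torch.zeros(1))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        u = f.mean(dim=(2, 3))                             # Eq. (1): channel descriptor U, shape (B, C)
        u_l = self.local_conv(u.unsqueeze(1)).squeeze(1)   # local dependency vector U_L, shape (B, C)
        u_g = self.global_conv(f).mean(dim=(2, 3))         # global dependency vector U_G, shape (B, C)
        m = torch.bmm(u_g.unsqueeze(2), u_l.unsqueeze(1))  # Eq. (2): correlation matrix M, shape (B, C, C)
        s_g = torch.sigmoid(m.mean(dim=2))                 # compress rows -> per-channel global score
        s_l = torch.sigmoid(m.mean(dim=1))                 # compress cols -> per-channel local score
        lam = torch.sigmoid(self.lam)
        w = torch.sigmoid(lam * s_g + (1.0 - lam) * s_l)   # Eq. (3): adaptive fusion of the two scores
        return f * w.view(b, c, 1, 1)                      # Eq. (5): element-wise channel reweighting


# Usage: enhance the 256-channel SAM image embedding.
if __name__ == "__main__":
    acia = ACIA(channels=256)
    out = acia(torch.randn(2, 256, 64, 64))                # -> (2, 256, 64, 64)
```

Because λ enters only through σ(λ), its gradient in this sketch follows the chain rule of Equation (4) automatically under autograd, and excluding it from weight decay is handled in the optimizer configuration (see Section 4.1.2).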

3.4. Decoder Combined with Dynamic Tandem Attention

A significant obstacle in SAR target detection lies in the diversity of object scales, the sparsity of spatial distribution, and the variability in task types. To address these issues, we design a decoder that incorporates Dynamic Tandem Attention (DTA). While its structure aligns with mainstream object detection networks, a distinctive feature is the insertion of a DTA block before the decoupling of regression and classification tasks. This block applies a high-dimensional refinement to the feature maps through the coordinated operation of three attention strategies: scale awareness, spatial focusing, and task adaptivity. The scale dimension introduces a multi-scale perception mechanism to enhance the ability of the model to detect and locate targets at different scales, which adapts to the variable scale characteristics of targets in SAR images. The spatial dimension utilizes sparse sampling across levels and domains to guide the model to focus on regions with significant spatial heterogeneity, effectively capturing the target edges and key discriminative features. The task dimension dynamically adjusts the channel weights according to different task objectives, enabling the model to flexibly adapt to the differing areas of concern of sub-tasks such as classification and localization, thereby enhancing the overall detection performance and task discrimination accuracy. The detailed architecture is illustrated in Figure 4.
In this module, the input feature vector is decomposed into a three-stage sequence, corresponding respectively to the scale dimension, spatial dimension, and task-specific dimension. This process can be formulated as (6):
$$W(Y) = A_C\Big(A_S\big(A_L(Y) \cdot Y\big) \cdot Y\Big) \cdot Y$$
where $A(\cdot)$ denotes an attention function and the subscripts L, S, and C refer to the scale, spatial, and task dimensions, respectively.
The scale dimension adaptively fuses features from different semantic levels based on the varying sizes of targets in SAR images. This improves the model’s ability to accurately detect and localize objects across a wide range of scales, thereby improving recall and localization precision in multi-scale object detection tasks. This process can be expressed as (7):
$$A_L(Y) \cdot Y = \sigma\!\left(f\!\left(\frac{1}{SC}\sum_{S,C} Y\right)\right) \cdot Y$$
where $f(\cdot)$ is a linear function implemented by a 1 × 1 convolution.
The spatial dimension performs adaptive sparse sampling across both hierarchical levels and neighboring positions at the same spatial location. By leveraging learnable offsets and sampling weights, it enables sparse focusing within the same channel across levels and neighborhoods, dynamically extracting the most discriminative spatial features. This process can be expressed as (8):
$$A_S(Y) \cdot Y = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot Y\big(l;\ p_k + \Delta p_k;\ c\big) \cdot \Delta m_k$$
where $K$ is the number of sparse spatial samples, $p_k + \Delta p_k$ is a self-learned shifted spatial position whose offset $\Delta p_k$ concentrates on discriminative regions, and $\Delta m_k$ is a learned importance weight at $p_k$.
The task dimension automatically adjusts the allocation of feature channel weights to meet the distinct feature requirements of the two sub-tasks in SAR detection: bounding box localization and object classification. This allows the classification branch to focus more on structural and scattering characteristics of the targets, while the regression branch emphasizes spatial precision in delineating object contours and boundaries. This process can be expressed as (9):
$$A_C(Y) \cdot Y = \max\big(\alpha^{1}(Y) \cdot Y_c + \beta^{1}(Y),\ \alpha^{2}(Y) \cdot Y_c + \beta^{2}(Y)\big)$$
In the formula, $Y_c$ is the feature slice corresponding to the c-th channel, and $\alpha^{1}$, $\beta^{1}$, $\alpha^{2}$, and $\beta^{2}$ are learnable parameters that control the activation thresholds.
Through the effective coordination of the three attention dimensions described above, the model’s capacity to adapt to SAR object detection is greatly improved. Subsequently, the classification and regression branches are fully decoupled, further improving both bounding box localization accuracy and category prediction precision. In the regression branch, the model employs three convolution layers to output the bounding box parameters. In the classification branch, the traditional 3 × 3 convolution is replaced with depthwise separable convolution, which substantially reduces the number of parameters while ultimately producing category confidence scores. A sketch of the tandem composition is given below.
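To make the tandem structure concrete, the sketch below composes the three attention stages of Equation (6) on a single feature level; it is a simplified sketch rather than the authors’ implementation. In particular, the deformable sparse sampling of Equation (8) is replaced by a plain 3 × 3 convolution, and the per-channel (α, β) parameters of Equation (9) are predicted from globally pooled features; both are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicTandemAttention(nn.Module):
    """Single-level sketch of the scale -> spatial -> task tandem (Eq. 6), with simplifying assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Scale attention (Eq. 7): linear function f(.) realised as a 1x1 convolution.
        self.scale_fc = nn.Conv2d(channels, 1, kernel_size=1)
        # Spatial step: a plain 3x3 conv stands in for the deformable sparse sampling of Eq. (8).
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Task attention (Eq. 9): predicts (alpha1, beta1, alpha2, beta2) per channel from pooled features.
        self.task_fc = nn.Linear(channels, 4 * channels)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = y.shape
        # Scale dimension A_L: gate the feature map with a sigmoid of the pooled, linearly projected feature.
        y = torch.sigmoid(self.scale_fc(F.adaptive_avg_pool2d(y, 1))) * y
        # Spatial dimension A_S (simplified): local aggregation instead of learnable offsets and weights.
        y = self.spatial_conv(y)
        # Task dimension A_C: channel-wise piecewise-linear gating, max of two affine responses.
        a1, b1, a2, b2 = self.task_fc(y.mean(dim=(2, 3))).view(b, 4, c, 1, 1).unbind(dim=1)
        return torch.maximum(a1 * y + b1, a2 * y + b2)


# Usage: refine features before the decoupled classification/regression heads.
if __name__ == "__main__":
    dta = DynamicTandemAttention(channels=256)
    out = dta(torch.randn(2, 256, 64, 64))  # -> (2, 256, 64, 64)
```

As indicated in Figure 4, the block can be stacked N times before the features are split into the two decoupled heads.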

4. Experimental Results and Analysis

4.1. Experimental Details

4.1.1. Datasets

The datasets used in this study include AIR-SARShip-1.0 [30], SSDD [31], and SAR-AIRcraft-1.0 [32]. Among them, AIR-SARShip-1.0 and SSDD are ship datasets, while SAR-AIRcraft-1.0 consists of aircraft imagery. To assess how well the proposed model works, two sets of experiments are conducted. The AIR-SARShip-1.0 and SSDD datasets are used in the first experiment to evaluate the model’s cross-domain detection capability across different ship datasets. The SAR-AIRcraft-1.0 dataset is used for the second experiment. Given the limited availability of SAR aircraft datasets, we adopt a few-shot detection setting to verify the model’s robustness under data-scarce conditions. Representative image samples from these datasets are shown in Figure 5.
The GaoFen-3 satellite’s imagery is used to create the AIR-SARShip-1.0 dataset. It includes 461 ship occurrences altogether across 31 large-scale SAR pictures. The spatial resolutions are 1 m and 3 m, and the imaging modes include spotlight and Ultra-fine strip. All images are in single-polarization mode. The SSDD dataset comprises 1160 SAR images collected from multiple satellites, including RadarSat-2, TerraSAR-X, and Sentinel-1. The resolution ranges from 1 to 10 m, with an average image size of approximately 480 × 330 pixels. It includes multiple polarizations such as HH, VV, VH, and HV. The SAR-AIRcraft-1.0 dataset is also based on GaoFen-3 and consists of 4368 images with a total of 16,463 annotated aircraft targets. It has a spatial resolution of 1 m and employs single-polarization imaging. The dataset includes seven categories of aircraft: A220, Boeing 787, Boeing 737, A320/321, ARJ21, A330, and others.
The object sizes vary across the aforementioned datasets, providing a valuable basis for evaluating the capacity of the model to detect objects of different scales. The specific distribution of object sizes is illustrated in Figure 6, where the horizontal and vertical axes represent the objects’ width and height, respectively, measured in pixels.

4.1.2. Relevant Details

All experiments in this study were carried out on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU, using PyTorch 2.1.2 as the deep learning framework. Images are input at a size of 1024 × 1024; considering GPU memory, the batch size is set to 4. Hybrid augmentations such as Mosaic and MixUp are turned off. The initial value of λ in ACIA is 0.0, and the initial values of α and β in DTA are all 0.0. Training uses the AdamW optimizer with a weight decay of 0.0001 and an initial learning rate of 0.00035. The SAM encoder’s pre-trained weights were retained, and the first three ViT blocks were frozen throughout training. An early stopping strategy based on validation mAP was adopted. All experiments were carried out under identical training configurations.
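As a concrete illustration of this training configuration, the sketch below freezes the first three ViT blocks of the SAM encoder and places the ACIA parameter λ in an AdamW parameter group without weight decay. The module and attribute names (`encoder.blocks`, a parameter name ending in `lam`) are hypothetical stand-ins for the actual implementation.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    # Freeze the first three ViT blocks of the SAM image encoder (attribute names are assumed).
    for block in model.encoder.blocks[:3]:
        for p in block.parameters():
            p.requires_grad = False

    # The ACIA fusion parameter lambda is trained without weight decay, as described in Section 3.3.
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if name.endswith("lam") else decay).append(p)

    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 1e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=3.5e-4,
    )
```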

4.1.3. Evaluation Index

To objectively assess the proposed model’s performance, we adopt several standard metrics, including precision, recall, mean average precision at an Intersection-over-Union threshold of 0.5 (mAP50), and mean average precision averaged over IoU thresholds from 0.5 to 0.95 (mAP50–95). Precision and recall are defined as follows (10):
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
In these definitions, TP denotes the number of correctly detected objects, FP the number of false detections, and FN the number of missed objects. mAP50 is the average precision across all classes at an IoU threshold of 0.5, while mAP50–95 is obtained by averaging AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
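As a minimal illustration of these definitions, the sketch below computes precision and recall from detection counts and averages per-threshold AP values into mAP50–95; it assumes detections have already been matched to ground truth to obtain TP, FP, and FN.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    # Eq. (10): precision and recall from detection counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def map50_95(ap_values: list[float]) -> float:
    # mAP50-95: mean of AP evaluated at IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
    assert len(ap_values) == 10, "expected one AP value per IoU threshold"
    return sum(ap_values) / len(ap_values)


# Example: 90 correct detections, 10 false alarms, 20 missed objects -> precision 0.9, recall ~0.818.
p, r = precision_recall(tp=90, fp=10, fn=20)
```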

4.2. Experimental Result

4.2.1. Cross-Domain Detection on Ship Datasets

To assess the efficacy of the visual large model under marginal distribution shift across domains, experiments are conducted on the AIR-SARShip-1.0 and SSDD datasets. First, the AIR-SARShip-1.0 data is cropped into 1024 × 1024 image patches and used for training. The trained model is then tested on the SSDD dataset. Three experimental settings are designed: in the first, the model is trained solely on AIR-SARShip-1.0 and tested on the full SSDD dataset; the second and third add 5% and 10% of the SSDD data, respectively, to the training data of the first setting. The comparative results are presented in Table 1 and Figure 7. Bold values in the table indicate the best result within each group of experiments.
The detection results for different SSDD data proportions are visualized in Figure 8.
The results demonstrate that when the training data consists only of AIR-SARShip-1.0, the SAM-based model exhibits a clear advantage across all metrics in pure cross-domain detection, achieving an mAP50 of 0.54. As shown in the detection visualizations, the proposed model produces fewer missed detections than other models, both in nearshore regions and in open-sea areas with relatively clean backgrounds. When 5% of the SSDD dataset is added to the training set, the performance of all models improves due to partial overlap between training and testing domains. Under this condition, the proposed model still maintains high precision, achieving an mAP50 of 0.82. While the inclusion of SSDD data introduces some false positives across models, the proposed method exhibits relatively fewer false alarms, as observed in the detection results. Further increasing the proportion of SSDD data in the training set to 10% yields similar trends: detection accuracy continues to improve, and the proposed model maintains its strong performance. Across all three experimental conditions, CNN-based models tend to perform well when the domain gap between training and testing data is small. However, in scenarios with complete cross-domain shift or minimal domain overlap, Transformer-based models demonstrate superior detection performance. In particular, fine-tuning a large vision model such as SAM significantly boosts detection effectiveness, thereby validating the robustness and cross-domain generalization capability of the proposed model in SAR object detection tasks.

4.2.2. Few-Shot Detection on Aircraft Datasets

To evaluate the detection performance of the proposed model under few-shot training conditions, experiments are conducted on the SAR-AIRcraft-1.0 dataset. Following the original dataset split ratio, only 10% of the original training set is used for training in this study, while the proportions of the validation and test sets remain unchanged. The experimental results are presented in Table 2 and Figure 9.
Under few-shot training conditions, the large model demonstrates a significant advantage in SAR object detection, achieving an mAP50 of 0.503 on the test set. Compared to other baseline models, it shows a clear improvement across overall metrics, indicating that large models are capable of learning more informative representations from fewer images.
The detection visualizations in Figure 10 further confirm that the proposed model exhibits fewer missed and false detections, particularly in complex environments. In particular, for the A320/321 and A330 categories, the mAP50 values are 0.711 and 0.636, respectively, and the recall rates are 0.731 and 0.479. This can be attributed to the fact that these aircraft types have larger fuselage lengths and wingspans than other categories, occupying more pixels in SAR images and exhibiting more prominent scattering point distributions. Moreover, the fuselages of these two aircraft types are constructed with traditional aluminum alloy skins, which strongly reflect radar waves and generate stable scattering centers. In contrast, the Boeing series incorporates large amounts of carbon fiber composite materials, which tend to absorb radar waves rather than reflect them, resulting in weaker backscatter signals.
Additionally, the best-performing baseline model, YOLOv11, is selected for further comparison with the proposed model. The performance differences on both the validation and test sets are analyzed, and the results are presented in Table 3.
The two models exhibit a noticeable performance difference between the validation and test sets, primarily due to significant differences in image distribution across the two subsets. While this distribution shift causes a decline in detection accuracy for both models, the proposed model mitigates the discrepancy better, exhibiting smaller performance degradation. Specifically, the proposed model shows a 1.8% smaller mAP50 drop and a 3.6% smaller recall drop than YOLOv11. Although the precision of our model is lower than that of YOLOv11L, this is because YOLOv11L performs extremely well only in high-confidence regions, while its performance fluctuates greatly elsewhere. In contrast, our model covers valid targets more completely and produces more stable detection results. These results confirm that the proposed method maintains more stable detection and recognition performance under few-shot SAR data conditions.

4.2.3. Ablation Experiment

To validate the effectiveness of the proposed Adaptive Channel Interaction Attention (ACIA) mechanism and the Dynamic Tandem Attention (DTA) module in the decoder, ablation experiments are conducted using the cross-domain ship detection task as a representative case. These experiments aim to quantify the contribution of each component to the overall model performance. The detailed results are presented in Table 4.
When only the ACIA is used, the model achieves an mAP50 of 0.491 in the fully cross-domain detection setting, a 4.9% decrease compared to the full model. In contrast, when only the DTA in the decoder is used, the mAP50 drops to 0.408 under the same setting. These results indicate that ACIA contributes more significantly to cross-domain detection performance when there is no domain overlap. However, as a portion of the SSDD training data is incorporated, the contribution of DTA becomes increasingly evident. For example, with 10% of the SSDD data included in training, the mAP50 reaches 0.797 when only ACIA is used, while using only DTA yields a slightly higher mAP50 of 0.803. Figure 11 visualizes the heat maps before and after the attention mechanism is added.
As observed in Figure 11, the model is capable of roughly localizing the object regions even before the attention layers are applied. This indirectly demonstrates the effectiveness of the large model encoder in extracting meaningful features from SAR imagery. However, at this stage, the model also tends to learn background noise within the object window and may miss objects with less distinctive features. By adding the attention modules, the model can alleviate the interference caused by speckle noise in SAR images and further highlight the target features. To further quantify the effect of attention, this paper compares the results before and after adding attention using focus entropy and the noise suppression ratio, defined in (11) and (12):
$$H = -\sum_{i} p_i \log p_i, \qquad p_i = \frac{w_i}{\sum_{j} w_j}$$
$$\text{NSR} = \frac{\mu_{\text{fg}}}{\mu_{\text{bg}}}$$
where $w_i$ is the attention intensity at position i, $\mu_{\text{fg}}$ is the mean attention over the foreground region, and $\mu_{\text{bg}}$ is the mean attention over the background region. The specific indicator results are shown in Table 5:
Among them, b, c, e, and f correspond to the row labels in Figure 11, and 1 to 5 denote the image order from left to right. After integrating the proposed attention mechanisms, the model’s focus becomes more concentrated on the object itself, with noticeably reduced attention to background noise. This effectively suppresses the speckle noise commonly present in SAR images, enabling the model to better capture object-relevant features and improving overall detection performance.
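The sketch below shows one way to compute the two indicators from a non-negative attention (heat) map and a binary foreground mask; how the foreground mask is derived from the ground-truth boxes is an assumption not detailed in the paper.

```python
import numpy as np


def focus_entropy(attn: np.ndarray) -> float:
    # Eq. (11): Shannon entropy of the normalized attention map (lower = more concentrated focus).
    p = attn.ravel() / attn.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def noise_suppression_ratio(attn: np.ndarray, fg_mask: np.ndarray) -> float:
    # Eq. (12): mean foreground attention divided by mean background attention (higher = stronger suppression).
    mu_fg = attn[fg_mask].mean()
    mu_bg = attn[~fg_mask].mean()
    return float(mu_fg / mu_bg)
```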
Building on this, this paper replaces the proposed attention with wavelet attention as a comparison to verify the advantages of the constructed modules. Wavelet attention divides the input features into low-frequency and high-frequency components through a wavelet transform, applies channel or spatial attention on the different frequency subbands to guide the model toward important frequency-domain information, and finally re-fuses the attention-modulated low- and high-frequency features to restore the spatial-domain features. Its overall logic is consistent with that of ACIA and other attention modules. The experimental results are shown in Figure 12:
The experimental results show that wavelet attention is weaker than ACIA in concentrating attention on the object and suppressing background interference, but it still has a positive effect compared with the encoder-only results in Figure 11. Structurally, wavelet attention first decomposes the features into low-frequency and high-frequency subbands in the frequency domain, models attention on those subbands, and then reconstructs them, whereas ACIA conducts cross-branch interaction and adaptive fusion in the spatial–channel domain. Both share the same goal, but wavelet attention relies more on frequency-domain transforms and thresholds and is better at detail and edge recovery, while ACIA relies mainly on channel interaction, directly adjusting saliency and suppressing redundant textures among multi-scale features, which better suppresses environmental interference.
From an overall perspective, the core of this paper is to construct a SAR object detection network based on a visual large model. Adding the attention modules enables the network to better adapt to the characteristics of SAR images, specifically by suppressing background noise and highlighting target features. This section conducts a detailed ablation comparison before and after adding attention and also swaps in an alternative attention module; the results all confirm its positive effect on network performance. In the future, we will continue to research new modules or methods to better adapt SAR images to visual large models.

5. Conclusions

In this paper, we introduced a visual large model into the SAR image object detection task. By constructing a SAM-based deep learning architecture, we exploited the powerful general feature representation of the SAM image encoder, introduced Adaptive Channel Interaction Attention, and combined Dynamic Tandem Attention with decoupled regression and classification branches to achieve accurate SAR object detection. The superiority of a deep learning network based on a visual large model for SAR image object detection is verified through experiments on AIR-SARShip-1.0, SSDD, and SAR-AIRcraft-1.0.
With the growing number of remote sensing satellites, the increasing availability of remote sensing data, and the continuous improvement of computational resources, better adapting visual large models to SAR image applications has become a mainstream research trend. Although the model constructed in this paper only scratches the surface of what visual large models can offer, the experiments show that visual large models are highly effective for SAR object detection, which suggests broader possibilities for downstream remote sensing tasks, especially the interpretation and processing of SAR images. At the same time, introducing large models provides new technical ideas for the development of microwave vision theory and new application references for marine supervision, disaster monitoring, and military reconnaissance, and is therefore of considerable research significance and practical value.

Author Contributions

Conceptualization, Y.Y., J.Y. and L.S.; methodology, Y.Y., J.Y. and L.Z.; software, Y.Y.; validation, Y.Y.; formal analysis, Y.Y.; investigation, Y.Y., J.Y.; resources, Y.Y.; data curation, L.Z.; writing—original draft preparation, Y.Y.; writing—review and editing, J.Y., L.S.; visualization, Y.Y., L.S.; supervision, J.Y., L.S. and L.Z.; project administration, J.Y.; funding acquisition, L.S., L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program Project: Remote Sensing Precision Monitoring and Safety Early Warning Platform for Key Border and Coastal Areas 2022YFB3903605, National Natural Science Foundation of China under Grant No. 62471337, and in part by the National Key Research and Development Program of China under Grant 2024YFC3810804.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, Q.; Zhang, Y.; Li, Z.; Yan, X.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Oil spill contextual and boundary-supervised detection network based on marine SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5213910. [Google Scholar] [CrossRef]
  2. Amitrano, D.; Di Martino, G.; Di Simone, A.; Imperatore, P. Flood detection with SAR: A review of techniques and datasets. Remote Sens. 2024, 16, 656. [Google Scholar] [CrossRef]
  3. Brenner, A.R.; Ender, J.H. Demonstration of advanced reconnaissance techniques with the airborne SAR/GMTI sensor PAMIR. IEE Proc.-Radar Sonar Navig. 2006, 153, 152–162. [Google Scholar] [CrossRef]
  4. Ikeuchi, K.; Shakunaga, T.; Wheeler, M.D.; Yamazaki, T. Invariant histograms and deformable template matching for SAR target recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 18–20 June 1996; pp. 100–105. [Google Scholar]
  5. Jianxiong, Z.; Zhiguang, S.; Xiao, C.; Qiang, F. Automatic target recognition of SAR images based on global scattering center model. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3713–3729. [Google Scholar] [CrossRef]
  6. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  9. Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368. [Google Scholar] [CrossRef]
  10. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  11. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  12. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  13. Ding, S.; Wang, Q.; Guo, L.; Li, X.; Ding, L.; Wu, X. Wavelet and adaptive coordinate attention guided fine-grained residual network for image denoising. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6156–6166. [Google Scholar] [CrossRef]
  14. Gao, G.; Liu, L.; Zhao, L.; Shi, G.; Kuang, G. An adaptive and fast CFAR algorithm based on automatic censoring for target detection in high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2008, 47, 1685–1697. [Google Scholar] [CrossRef]
  15. Al-Hussaini, E.K. Performance of the greater-of and censored greater-of detectors in multiple target environments. In Proceedings of the IEE Proceedings F (Communications, Radar and Signal Processing); IET: Stevenage, UK, 1988; Volume 135, pp. 193–198. [Google Scholar]
  16. Bakirci, M.; Bayraktar, I. Assessment of YOLO11 for ship detection in SAR imagery under open ocean and coastal challenges. In Proceedings of the 2024 21st International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 23–25 October 2024; pp. 1–6. [Google Scholar]
  17. Li, K.; Wang, D.; Hu, Z.; Zhu, W.; Li, S.; Wang, Q. Unleashing channel potential: Space-frequency selection convolution for SAR object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17323–17332. [Google Scholar]
  18. Zhang, L.; Zheng, J.; Li, C.; Xu, Z.; Yang, J.; Wei, Q.; Wu, X. Ccdn-detr: A detection transformer based on constrained contrast denoising for multi-class synthetic aperture radar object detection. Sensors 2024, 24, 1793. [Google Scholar] [CrossRef] [PubMed]
  19. Li, Z.; Zhou, X. Refined Deformable-DETR for SAR target detection and radio signal detection. Remote Sens. 2025, 17, 1406. [Google Scholar] [CrossRef]
  20. Fu, Y.; Wang, Y.; Pan, Y.; Huai, L.; Qiu, X.; Shangguan, Z.; Liu, T.; Fu, Y.; Van Gool, L.; Jiang, X. Cross-domain few-shot object detection via enhanced open-set object detector. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 247–264. [Google Scholar]
  21. Huang, H.; Li, B.; Zhang, Y.; Chen, T.; Wang, B. Joint distribution adaptive-alignment for cross-domain segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401214. [Google Scholar] [CrossRef]
  22. Han, G.; Lim, S.N. Few-shot object detection with foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28608–28618. [Google Scholar]
  23. Lin, H.; Li, N.; Yao, P.; Dong, K.; Guo, Y.; Hong, D.; Zhang, Y.; Wen, C. Generalization-enhanced few-shot object detection in remote sensing. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5445–5460. [Google Scholar] [CrossRef]
  24. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. [Google Scholar]
  25. Wang, Y.; Hernández, H.H.; Albrecht, C.M.; Zhu, X.X. Feature guided masked autoencoder for self-supervised learning in remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 321–336. [Google Scholar] [CrossRef]
  26. Li, W.; Yang, W.; Liu, T.; Hou, Y.; Li, Y.; Liu, Z.; Liu, Y.; Liu, L. Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture. ISPRS J. Photogramm. Remote Sens. 2024, 218, 326–338. [Google Scholar] [CrossRef]
  27. Pu, X.; Jia, H.; Zheng, L.; Wang, F.; Xu, F. ClassWise-SAM-adapter: Parameter efficient fine-tuning adapts segment anything to SAR domain for semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4791–4804. [Google Scholar] [CrossRef]
  28. Ren, T.; Liu, S.; Zeng, A.; Lin, J.; Li, K.; Cao, H.; Chen, J.; Huang, X.; Chen, Y.; Yan, F.; et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv 2024, arXiv:2401.14159. [Google Scholar] [CrossRef]
  29. Baraha, S.; Sahoo, A.K. Synthetic aperture radar image and its despeckling using variational methods: A review of recent trends. Signal Process. 2023, 212, 109156. [Google Scholar] [CrossRef]
  30. Xian, S.; Zhirui, W.; Yuanrui, S.; Wenhui, D.; Yue, Z.; Kun, F. AIR-SARShip-1.0: High-resolution SAR ship detection dataset. J. Radars 2019, 8, 852–863. [Google Scholar]
  31. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  32. Zhirui, W.; Yuzhuo, K.; Xuan, Z.; Yuelei, W.; Ting, Z.; Xian, S. SAR-AIRcraft-1.0: High-resolution SAR aircraft detection and recognition dataset. J. Radars 2023, 12, 906–922. [Google Scholar]
  33. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  34. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  35. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  36. Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 260–275. [Google Scholar]
  37. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
  38. Chai, B.; Nie, X.; Zhou, Q.; Zhou, X. Enhanced cascade R-CNN for multiscale object detection in dense scenes from SAR images. IEEE Sens. J. 2024, 24, 20143–20153. [Google Scholar] [CrossRef]
Figure 1. Overall structure for object detection in SAR images based on SAM. The model is generally divided into three parts: SAM image encoder, Adaptive Channel Interaction Attention (ACIA), and decoder combined with Dynamic Tandem Attention. Some network details have been explained at the bottom of the figure.
Figure 2. Example of visualization results of model feature maps. (a) Feature map of the image passing through the SAM image encoder. (b) Feature map of the image passing through the ACIA block. (c) Feature map of the image passing through the DTA block.
Figure 3. Adaptive fusion strategy for channel interaction attention.
Figure 4. Structure of the Dynamic Tandem Attention. The module is divided into scale dimension, spatial dimension, and task dimension, and the three dimensions are connected in series. This module can be used multiple times, and N represents the number of times the module is repeated.
Figure 5. AIR-SARShip-1.0, SSDD, and SAR-AIRcraft-1.0 dataset sample examples.
Figure 6. Object size distributions of the AIR-SARShip-1.0, SSDD, and SAR-AIRcraft-1.0 datasets.
Figure 7. Analysis of cross-domain detection results on AIR-SARShip-1.0 and SSDD.
Figure 8. Visual comparison of cross-domain detection results on AIR-SARShip-1.0 and SSDD. (a) SSDD data accounts for 0% of the training data; (b) SSDD data accounts for 5% of the training data; (c) SSDD data accounts for 10% of the training data.
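For the settings in Figure 8, the sketch below shows one way a mixed training list with a fixed SSDD share could be assembled. The file names, dataset sizes, and random-sampling scheme are illustrative assumptions; the paper's exact split protocol is not reproduced here.

```python
import random

def build_cross_domain_train_list(air_sarship_imgs, ssdd_imgs, ssdd_fraction, seed=0):
    """Mix target-domain (SSDD) images into the source-domain (AIR-SARShip-1.0)
    training list so that SSDD makes up `ssdd_fraction` of the final set."""
    random.seed(seed)
    if ssdd_fraction == 0:
        return list(air_sarship_imgs)
    # number of SSDD images needed for the requested share of the mixed set
    n_ssdd = round(len(air_sarship_imgs) * ssdd_fraction / (1 - ssdd_fraction))
    n_ssdd = min(n_ssdd, len(ssdd_imgs))
    mixed = list(air_sarship_imgs) + random.sample(list(ssdd_imgs), n_ssdd)
    random.shuffle(mixed)
    return mixed

# toy usage with placeholder file names and dataset sizes
source = [f"air_sarship_{i:04d}.png" for i in range(900)]
target = [f"ssdd_{i:04d}.png" for i in range(1160)]
for frac in (0.0, 0.05, 0.10):
    train = build_cross_domain_train_list(source, target, frac)
    print(frac, len(train))
```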
Figure 9. Analysis of few-shot detection results on SAR-AIRcraft-1.0.
Figure 10. Visual comparison of few-shot detection results on SAR-AIRcraft-1.0. Different aircraft categories are marked with boxes of different colors; the color-to-category correspondence is shown in the legend.
Figure 11. Heat map visualizations before and after the attention modules are added. (a,d) Original ship and aircraft images. (b,e) Results after the ship and aircraft images are processed by the image encoder, respectively. (c,f) Results after processing by the complete attention pipeline.
Figure 12. Heat map comparison between wavelet attention and the attention constructed in this paper. (a,d) Original ship and aircraft images. (b,e) Results of the ship and aircraft images after wavelet attention. (c,f) Results after the attention constructed in this paper.
Table 1. Cross-domain detection experiments on AIR-SARShip-1.0 and SSDD.
Models             | Proportion of SSDD (%) | Precision | Recall | mAP50 | mAP50–95
Yolo11L [7,33]     | 0  | 0.543 | 0.390 | 0.388 | 0.131
Yolo11L [7,33]     | 5  | 0.806 | 0.696 | 0.783 | 0.332
Yolo11L [7,33]     | 10 | 0.758 | 0.713 | 0.786 | 0.319
Faster-RCNN [6]    | 0  | 0.145 | 0.203 | 0.068 | 0.014
Faster-RCNN [6]    | 5  | 0.244 | 0.302 | 0.198 | 0.063
Faster-RCNN [6]    | 10 | 0.750 | 0.752 | 0.802 | 0.364
Cascade-RCNN [34]  | 0  | 0.467 | 0.405 | 0.387 | 0.205
Cascade-RCNN [34]  | 5  | 0.773 | 0.661 | 0.757 | 0.329
Cascade-RCNN [34]  | 10 | 0.750 | 0.707 | 0.786 | 0.456
DETR [8,35]        | 0  | 0.600 | 0.495 | 0.464 | 0.141
DETR [8,35]        | 5  | 0.754 | 0.792 | 0.774 | 0.306
DETR [8,35]        | 10 | 0.803 | 0.780 | 0.812 | 0.351
Dynamic-RCNN [36]  | 0  | 0.535 | 0.408 | 0.390 | 0.144
Dynamic-RCNN [36]  | 5  | 0.794 | 0.727 | 0.794 | 0.400
Dynamic-RCNN [36]  | 10 | 0.831 | 0.788 | 0.857 | 0.521
RT-DETR v2 [37]    | 0  | 0.362 | 0.432 | 0.352 | 0.213
RT-DETR v2 [37]    | 5  | 0.788 | 0.693 | 0.782 | 0.409
RT-DETR v2 [37]    | 10 | 0.766 | 0.756 | 0.814 | 0.543
EC-RCNN [38]       | 0  | 0.271 | 0.428 | 0.265 | 0.181
EC-RCNN [38]       | 5  | 0.744 | 0.578 | 0.671 | 0.419
EC-RCNN [38]       | 10 | 0.731 | 0.638 | 0.724 | 0.461
Ours               | 0  | 0.560 | 0.527 | 0.540 | 0.214
Ours               | 5  | 0.818 | 0.738 | 0.820 | 0.450
Ours               | 10 | 0.840 | 0.744 | 0.838 | 0.444
Table 2. Few-shot detection experiments on SAR-AIRcraft-1.0. The bolded values indicate the best result for each metric in this comparison.
Models             | Precision | Recall | mAP50 | mAP50–95
Yolo11L [7,33]     | 0.764 | 0.420 | 0.479 | 0.214
Faster-RCNN [6]    | 0.523 | 0.252 | 0.285 | 0.153
Cascade-RCNN [34]  | 0.676 | 0.331 | 0.428 | 0.261
DETR [8,35]        | 0.310 | 0.382 | 0.294 | 0.143
Dynamic-RCNN [36]  | 0.648 | 0.388 | 0.446 | 0.269
RT-DETR v2 [37]    | 0.348 | 0.443 | 0.339 | 0.226
EC-RCNN [38]       | 0.336 | 0.432 | 0.324 | 0.216
Ours               | 0.618 | 0.453 | 0.503 | 0.290
Table 3. Comparison of metric differences between the validation set and the test set. The bolded values indicate the best result within each comparison group.
Models          | Val/Test              | Precision | Recall | mAP50 | mAP50–95
Yolo11L [7,33]  | val                   | 0.795 | 0.557 | 0.594 | 0.389
Yolo11L [7,33]  | test                  | 0.764 | 0.420 | 0.479 | 0.314
Yolo11L [7,33]  | Difference (val-test) | 0.031 | 0.137 | 0.115 | 0.075
Ours            | val                   | 0.674 | 0.554 | 0.600 | 0.332
Ours            | test                  | 0.618 | 0.453 | 0.503 | 0.290
Ours            | Difference (val-test) | 0.056 | 0.101 | 0.097 | 0.042
Table 4. Results of the ablation experiments.
Models      | Proportion of SSDD (%) | Precision | Recall | mAP50 | mAP50–95
Only ACIA   | 0  | 0.494 | 0.518 | 0.491 | 0.171
Only ACIA   | 5  | 0.795 | 0.684 | 0.777 | 0.475
Only ACIA   | 10 | 0.776 | 0.733 | 0.797 | 0.492
Only DTA    | 0  | 0.578 | 0.377 | 0.408 | 0.187
Only DTA    | 5  | 0.780 | 0.741 | 0.800 | 0.407
Only DTA    | 10 | 0.748 | 0.745 | 0.803 | 0.345
ACIA + DTA  | 0  | 0.560 | 0.527 | 0.540 | 0.214
ACIA + DTA  | 5  | 0.818 | 0.738 | 0.820 | 0.450
ACIA + DTA  | 10 | 0.840 | 0.744 | 0.838 | 0.444
Table 5. Statistics of the focal entropy (H) and noise suppression ratio (NSR) indicators for the different groups. The arrows indicate whether lower (↓) or higher (↑) values are better.
Group  | H↓ (1) | H↓ (2) | H↓ (3) | H↓ (4) | H↓ (5) | NSR↑ (1) | NSR↑ (2) | NSR↑ (3) | NSR↑ (4) | NSR↑ (5)
b1–b5  | 0.735 | 0.849 | 0.969 | 0.909 | 0.935 | 104.928 | 32.174 | 3.733 | 28.084 | 8.409
c1–c5  | 0.731 | 0.829 | 0.969 | 0.907 | 0.931 | 111.216 | 43.030 | 3.862 | 28.608 | 8.558
e1–e5  | 0.914 | 0.848 | 0.934 | 0.907 | 0.922 | 14.105 | 46.512 | 7.148 | 6.096 | 23.248
f1–f5  | 0.908 | 0.843 | 0.933 | 0.905 | 0.914 | 14.604 | 47.056 | 7.326 | 6.109 | 23.512
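For readers who want to compute indicators of the kind reported in Table 5, the NumPy sketch below evaluates a normalized-entropy focus measure and a foreground-to-background response ratio from a heat map and a target mask. These formulas are plausible stand-ins and assumptions on our part, not necessarily the paper's exact definitions of H and NSR.

```python
import numpy as np

def focal_entropy(heatmap):
    """Normalized Shannon entropy of a heat map; lower values mean the
    response is concentrated on fewer pixels (assumed definition)."""
    p = heatmap.astype(np.float64).ravel()
    p = np.clip(p, 0, None)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(heatmap.size))

def noise_suppression_ratio(heatmap, target_mask):
    """Mean response inside the target region divided by the mean response
    in the background (assumed definition); higher is better."""
    fg = heatmap[target_mask].mean()
    bg = heatmap[~target_mask].mean()
    return float(fg / (bg + 1e-12))

# toy example: a weak background with a 10x10 high-response target region
hm = np.random.rand(64, 64) * 0.05
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 20:30] = True
hm[mask] += 1.0
print(round(focal_entropy(hm), 3), round(noise_suppression_ratio(hm, mask), 3))
```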