Article

Ship Detection in SAR Images Using Sparse R-CNN with Wavelet Deformable Convolution and Attention Mechanism

1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China
3 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(23), 3794; https://doi.org/10.3390/rs17233794
Submission received: 5 September 2025 / Revised: 10 November 2025 / Accepted: 20 November 2025 / Published: 22 November 2025
(This article belongs to the Special Issue Microwave Remote Sensing on Ocean Observation)

Highlights

What are the main findings?
  • A novel wavelet deformable convolution (WDC) module is proposed, which incorporates wavelet-domain information and adaptively models multi-scale ship targets with improved edge and boundary representation.
  • A position-encoded multi-head attention mechanism (PEMA) is introduced to replace the original dynamic head in Sparse R-CNN, enabling more effective focus on spatially and semantically relevant regions for sparse target detection.
What are the implications of the main findings?
  • The proposed method significantly improves detection accuracy for sparse, multi-scale, and irregularly distributed ships in SAR images, particularly under complex background conditions.
  • By combining wavelet-domain representation, deformable convolution, and attention mechanisms, the framework provides a robust solution that advances SAR-based maritime surveillance and monitoring applications.

Abstract

This paper proposes a synthetic aperture radar (SAR) ship detection method based on wavelet-domain deformable convolution (WDC) and multi-head attention, built upon the Sparse R-CNN framework. First, a wavelet-domain convolution module is introduced to enhance the modeling of ship targets with diverse scales and shapes while incorporating frequency-domain information. Deformable convolution adaptively adjusts sampling locations, overcoming the limitations of traditional convolution in capturing target edges and blurred boundaries. Next, a position encoding module is employed to normalize candidate bounding box coordinates and integrate them into region-of-interest features. By providing spatial context, position encoding strengthens spatial perception and enables the subsequent multi-head attention mechanism to more effectively capture associations between targets and candidate regions, thereby improving localization accuracy under arbitrary spatial distributions. Furthermore, the original dynamic head is replaced with a multi-head attention mechanism. Through position-encoded multi-head attention, the model more accurately emphasizes regions with spatial and semantic correlations to the target, enhancing both focus and discrimination for sparse targets. Extensive experiments conducted on two benchmark datasets (SSDD and HRSID) demonstrate the effectiveness and superiority of the proposed method. Overall, the method significantly improves the detection of sparse, multi-scale, and randomly distributed ship targets in SAR images.

1. Introduction

Ship detection in synthetic aperture radar (SAR) imagery has garnered significant attention due to its vital role in maritime surveillance, national defense, and marine resource management. Unlike optical sensors, SAR operates independently of illumination and weather conditions, offering all-weather, day-and-night imaging capabilities that are indispensable for reliable maritime monitoring. Accurate ship detection supports a wide range of applications, including traffic regulation, illegal fishing monitoring, maritime search and rescue, and early warning of potential security threats. Nevertheless, SAR-based ship detection remains challenging due to the inherent complexities of SAR data, such as speckle noise, low signal-to-clutter ratio, and strong interference from sea clutter and coastal environments. Furthermore, ships in SAR imagery exhibit substantial variability in size, orientation, and spatial distribution, spanning from densely packed harbor scenes to sparsely distributed vessels in open waters. These challenges highlight the technical difficulty and practical importance of developing robust and efficient SAR ship detection algorithms [1].
Early approaches to SAR ship detection were dominated by the constant false-alarm rate (CFAR) method [2,3,4,5]. Although widely adopted, CFAR suffers from limitations in detecting small ships and handling complex scenes. It also relies heavily on modeling sea clutter distributions and requires computationally expensive sliding-window processing [1]. To address these issues, subsequent research introduced machine learning techniques [6,7,8], including sparse representation [9], dictionary learning [10,11,12], and Fisher vectors [13,14]. In general, ship detection is formulated as a binary classification problem, distinguishing between background clutter and targets. However, the limited expressive power of these models constrains their ability to generalize across complex scenes and diverse datasets [15].
With the rapid progress of deep learning in computer vision, its application to remote sensing has demonstrated remarkable success [16,17,18,19,20]. The increasing availability of SAR image data has ushered ship detection into the deep learning era [21,22]. Deep learning-based methods can be broadly categorized into two groups: two-stage and one-stage detectors. Two-stage approaches first generate candidate regions and then refine their localization, exemplified by region convolutional neural networks (R-CNN) [23], Fast R-CNN [24], Faster R-CNN [21,25], cascade R-CNN [26], Mask R-CNN [27], and feature pyramid networks [28,29]. In contrast, one-stage methods directly regress target locations and predict class probabilities in a single step, including single-shot multibox detectors [30], RetinaNet [31,32], and the You Only Look Once (YOLO) family of models [33,34,35,36,37,38,39,40,41].
In addition to conventional deep learning-based detectors, recent studies have emphasized the importance of scattering feature fusion for enhancing ship detection performance in SAR imagery. Traditional convolutional networks often rely solely on intensity or texture features and thus fail to fully exploit the intrinsic scattering characteristics of ship targets. To address this limitation, Pan et al. proposed SFFNet [42], a dual-branch network that reconstructs scattering-center feature maps and fuses them through a scattering feature attention fusion module, significantly improving the detection of small ships in complex sea environments. Similarly, Wang et al. introduced SIFNet [43], which integrates a multiscale contextual semantic information fusion module and a scattering-point learning branch to enhance robustness under varying imaging conditions and ship orientations. Moreover, Gao et al. [44] designed SCANet, a scattering-characteristic-aware detection framework based on a four-component decomposition model for fully polarimetric SAR data, achieving superior discrimination between ships and sea clutter. Together, these studies highlight the growing trend of multi-domain feature fusion—combining scattering, intensity, texture, and contextual information—to improve detection reliability in complex maritime environments.
Beyond scattering, hydrodynamic phenomena such as ship wakes also provide important contextual cues for ship detection and motion estimation. While Kelvin wakes have traditionally been modeled using simplified linear approximations [45], recent work by Xu et al. [46] demonstrated that time-varying scattering and decoherence significantly affect wake visibility, particularly in L-band SAR. More recently, Ding et al. [47] proposed a lightweight YOLO-based model capable of jointly detecting ships and wakes, achieving real-time performance on Gaofen-3 imagery.
Considering the unique characteristics of SAR imagery compared with optical images, ship detection still faces several challenges, including complex background interference, multi-scale variations, and diverse ship shapes [1,48]. To improve near-shore detection accuracy, Wu et al. [49] proposed a bow classification network in the candidate region extraction stage, generating smaller and more precise regions for ship targets. To mitigate the interference of strong land scattering, Sun et al. [50] introduced an attention module to enhance texture features and suppress false alarms caused by land clutter. Fan et al. [51] proposed a UNet-based segmentation approach to reduce false alarms from sea clutter, while Jiao et al. [52] designed a new training strategy that emphasized hard examples by adjusting loss weights.
To address scale variability, dense feature pyramid networks (FPNs) have been widely adopted [53,54,55]. Wan et al. [56] introduced a YOLOX-based multiscale enhancement method, while Yang et al. [57] developed a feature refinement and reuse module to improve small ship detection in an enhanced FCOS framework. More recently, Gao et al. [58] proposed a YOLOv5-based method that incorporates contextual attention and task-specific context decoupling, achieving significant performance gains.
Most existing detection algorithms rely on dense anchor-box mechanisms, which place a large number of anchors with fixed sizes and aspect ratios across feature maps for bounding-box regression and classification. Although effective for natural image detection, this strategy has major drawbacks in SAR ship detection, where targets are sparse. First, sparsity leads to a large number of redundant anchors, increasing computational cost, inference latency, and class imbalance between positive and negative samples, which hinders convergence. Second, fixed anchor sizes, shapes, and positions cannot adapt to the large scale variations and irregular spatial distribution of ships, limiting matching accuracy. Finally, the dependence on non-maximum suppression to remove redundant proposals increases the likelihood of missed detections and reduces robustness.
To overcome these limitations, Sparse R-CNN [59] (shown in Figure 1) introduces a query-based detection paradigm that replaces dense anchors with a small, fixed number of learnable object queries. Each query dynamically interacts with image features through attention mechanisms to localize and classify potential targets. This design eliminates the need for predefined anchors and NMS, naturally addressing the sparsity and imbalance problems inherent in SAR ship detection. Moreover, Sparse R-CNN achieves high accuracy and efficiency by focusing computation on a compact set of informative regions rather than exhaustive candidate enumeration. These characteristics align well with the sparse and small-object nature of ships in SAR imagery, making Sparse R-CNN an appropriate and efficient baseline framework for our study. Building upon this foundation, the proposed method integrates deformable convolution [60,61,62] and multi-head attention [63,64,65] to further enhance adaptive feature extraction and contextual modeling for more robust ship detection. We evaluate the proposed method on two public SAR ship detection datasets. Comparisons with mainstream detectors and ablation studies confirm the effectiveness of our approach in handling complex backgrounds and detecting multi-scale targets. The main contributions of this work are summarized as follows:
  • We propose a novel wavelet deformable convolution (WDC) module that extracts wavelet-domain information while capturing geometric transformations of multi-scale targets. Specifically, the WDC module applies discrete wavelet transform (DWT) to project input data into the wavelet domain, performs subband-based deformable convolution, and reconstructs features in the spatial domain using inverse DWT (IDWT).
  • We introduce a position-encoded multi-head attention (PEMA) mechanism to replace the original dynamic convolution module. PEMA enables the model to focus more accurately on regions with spatial and semantic relevance to target areas, thereby improving discrimination of sparse targets.
  • Extensive experiments on two public datasets demonstrate that our method significantly outperforms baseline approaches. In particular, it achieves higher detection accuracy in challenging scenarios involving multi-scale targets, complex backgrounds, and sparse ship distributions.
The remainder of this paper is organized as follows. Section 2 reviews Sparse R-CNN. Section 3 presents the proposed method in detail. Section 4 reports experimental results and analysis. Finally, Section 5 concludes the paper.

2. Overview of the Sparse R-CNN Framework

In object detection, dense anchor-based methods such as Faster R-CNN [24], RetinaNet [31,32], and YOLO [33,34,35,36,37,38,39,40,41] have long dominated the field. These approaches rely on a large number of predefined anchor boxes or densely sampled regions across the feature map to predict object locations and categories. While effective for natural image detection, such methods introduce significant redundancy and computational overhead. They also require complex post-processing (e.g., non-maximum suppression) and rely on carefully designed hyperparameters for anchor size, aspect ratio, and placement. Furthermore, the assignment of positive and negative samples depends on hand-crafted rules, which can limit generalization across datasets and imaging modalities.
To overcome these limitations, Sparse R-CNN [59] reformulates object detection as a sparse set prediction problem, inspired by the transformer-based DETR architecture. The workflow of Sparse R-CNN is shown in Figure 2. Instead of generating dense anchors, it uses a small, fixed number of learnable proposal boxes (typically 100–300) that directly represent potential objects in the image. Each proposal box, parameterized by its center (x, y), width w, and height h, is optimized jointly with the network weights during training. This anchor-free and proposal-learnable design eliminates hand-crafted heuristics and dramatically reduces the number of candidate regions, thereby improving both computational efficiency and training stability.
The architecture of Sparse R-CNN consists of three major components:
  • Backbone and Feature Pyramid Network: The backbone (usually a ResNet) extracts hierarchical feature maps from the input image, while the FPN fuses multi-scale features through lateral connections and upsampling operations. This combination enables effective detection of objects with large scale variations.
  • Dynamic Instance Interaction Head: Each learnable proposal is associated with a proposal feature vector that encodes instance-specific information such as appearance, shape, and context. During training and inference, the proposal feature interacts dynamically with the region of interest (ROI) features through attention mechanisms. This dynamic interaction allows the network to refine both classification and bounding-box regression results iteratively.
  • Set-Based Matching and Loss Function: Sparse R-CNN replaces the conventional dense assignment of anchors with Hungarian matching, which establishes a one-to-one correspondence between predicted and ground-truth boxes. This set-based supervision avoids duplicate predictions and simplifies label assignment.
Sparse R-CNN achieves a balance between accuracy and sparsity by directly optimizing a small set of object proposals through learnable interactions. This design eliminates the need for dense anchors, non-maximum suppression, and heuristic matching rules, making it well suited for SAR ship detection, where targets are sparse, small, and often located in complex backgrounds. The proposed method in this paper builds upon this efficient sparse detection framework by introducing WDC and PEMA to further enhance feature adaptability and contextual modeling in SAR imagery.
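To make the set-based supervision concrete, the following is a minimal sketch of Hungarian one-to-one matching between predictions and ground truth using SciPy's assignment solver. The cost terms (a classification score term plus an L1 box distance) are an illustrative simplification; the actual Sparse R-CNN matcher typically also includes a generalized IoU term and a focal-style classification cost, and the function name here is hypothetical.

```python
# Minimal sketch of set-based one-to-one (Hungarian) matching; costs are illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_scores, pred_boxes, gt_boxes):
    """pred_scores: (P,) ship probabilities; pred_boxes: (P, 4); gt_boxes: (G, 4)."""
    cls_cost = -pred_scores[:, None]                                           # reward confident predictions
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # L1 distance between boxes
    cost = cls_cost + box_cost                                                 # (P, G) cost matrix
    pred_idx, gt_idx = linear_sum_assignment(cost)                             # one-to-one assignment
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))
```

Because each ground-truth box is matched to exactly one prediction, duplicate detections are penalized during training and no non-maximum suppression is needed at inference time.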

3. Proposed Method

Figure 3 illustrates the overall workflow of the proposed method. First, the SAR image is preprocessed and fed into a ResNet-101 backbone for feature extraction, following [59]. In the residual blocks of ResNet-101, the standard 3 × 3 convolutions in the original bottleneck blocks are replaced with WDC modules, denoted as WDC-Res2, WDC-Res3, WDC-Res4, and WDC-Res5.
The residual blocks generate feature maps $C_2$, $C_3$, $C_4$, and $C_5$, which are processed by an FPN. To unify channel dimensions, each feature map is first passed through a 1 × 1 lateral convolution. The maps are then iteratively fused in a top-down pathway, where lower-resolution maps are upsampled and merged with higher-resolution ones. A subsequent 3 × 3 deformable convolution refines the fused maps, producing multi-scale outputs $P_2$, $P_3$, $P_4$, and $P_5$.
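As a reference for the fusion just described, the snippet below is a simplified PyTorch sketch of the top-down pathway (1 × 1 lateral convolutions, upsample-and-add, 3 × 3 refinement). Plain convolutions stand in for the deformable refinement used in the paper, the class name is illustrative, and the channel counts assume the usual ResNet stage outputs.

```python
import torch.nn.functional as F
from torch import nn

class SimpleFPN(nn.Module):
    """Top-down feature fusion: 1x1 lateral convs, upsample-and-add, 3x3 refinement."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.refine = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])  # plain conv stands in for Dconv

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        for i in range(len(feats) - 2, -1, -1):              # top-down pathway
            feats[i] = feats[i] + F.interpolate(feats[i + 1], scale_factor=2.0,
                                                mode="nearest")
        return [ref(f) for ref, f in zip(self.refine, feats)]  # P2, P3, P4, P5
```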
The multi-scale features are then pooled by a ROI pooling layer to obtain ROI features. To enhance spatial awareness, position encoding is applied before passing the ROI features to the multi-head attention module. Here, the learnable proposal features from Sparse R-CNN act as query vectors and compute attention with the ROI features, yielding target-specific representations. These target features are subsequently fed into the classification and regression branches to produce the final detection results.
Relative to the original Sparse R-CNN, the key improvements of our method are the introduction of WDC and PEMA. By projecting features into the wavelet domain, WDC captures frequency-specific information and suppresses noise, while the deformable convolution adaptively models geometric variations in ship structures. This enables the network to represent multi-scale ships more effectively and to distinguish them from cluttered backgrounds. Meanwhile, PEMA incorporates spatial context into the attention mechanism, allowing the network to focus on regions that are both semantically and spatially correlated with the target. This enhances ship discrimination in complex sea or coastal environments and improves robustness to variations in scale and orientation.

3.1. Wavelet Deformable Convolution

Figure 4 illustrates the structure of the proposed WDC module. The input feature map $V_{\mathrm{in}}$ first passes through a deformable convolution layer, which may also perform downsampling or upsampling, and is then transformed into the wavelet domain using the DWT. In this work, the Haar transform is employed due to its simplicity, efficiency, and widespread use. Compared with higher-order wavelets (e.g., the Daubechies, Symlet, or Coiflet families), the Haar transform provides compact support and minimal filter length, allowing efficient decomposition and reconstruction while preserving the sharp intensity transitions that are characteristic of ship edges and wakes in SAR images. Moreover, its non-overlapping basis functions effectively reduce redundancy in the feature representation, which facilitates stable gradient propagation and faster convergence. The DWT produces four subbands: approximation coefficients $F_1$, horizontal detail coefficients $F_2$, vertical detail coefficients $F_3$, and diagonal detail coefficients $F_4$:
$F_1, F_2, F_3, F_4 = \mathrm{DWT}(\mathrm{Dconv}(V_{\mathrm{in}})),$
where $\mathrm{Dconv}$ denotes deformable convolution. Next, the low-frequency subband $F_1$ is processed by a set of deformable convolutions, while the high-frequency subbands $F_2$, $F_3$, and $F_4$ are concatenated and filtered by another set of deformable convolutions:
$F_7 = \mathrm{concat}(\mathrm{Dconv}(F_1), \mathrm{Dconv}(\mathrm{concat}(F_2, F_3, F_4))),$
where $F_7$ represents the intermediate feature maps. These feature maps are then reconstructed in the spatial domain using the IDWT, with a residual connection applied in the style of ResNet:
$V_{\mathrm{out}} = \mathrm{IDWT}(F_7) + \mathrm{Dconv}(V_{\mathrm{in}}).$
The WDC module transforms input features into the wavelet domain via DWT, applies subband-based deformable convolution, and reconstructs them in the spatial domain using IDWT. As highlighted in [62], this design allows WDC to be integrated as a standalone layer in CNNs without significantly disturbing feature distributions, a drawback observed in prior works that constructed entire CNNs in the wavelet domain [66,67].
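The following is a minimal PyTorch sketch of this decompose-convolve-reconstruct pipeline with a hand-written single-level Haar DWT/IDWT. Plain 3 × 3 convolutions stand in for the deformable convolutions, and the class and function names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def haar_dwt(x):
    """Single-level 2D Haar DWT on a feature map (B, C, H, W) with even H and W."""
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    f1 = (a + b + c + d) / 2  # approximation
    f2 = (a + b - c - d) / 2  # horizontal detail
    f3 = (a - b + c - d) / 2  # vertical detail
    f4 = (a - b - c + d) / 2  # diagonal detail
    return f1, f2, f3, f4

def haar_idwt(f1, f2, f3, f4):
    """Exact inverse of haar_dwt, restoring the full-resolution map."""
    B, C, H, W = f1.shape
    out = f1.new_zeros(B, C, 2 * H, 2 * W)
    out[..., 0::2, 0::2] = (f1 + f2 + f3 + f4) / 2
    out[..., 0::2, 1::2] = (f1 + f2 - f3 - f4) / 2
    out[..., 1::2, 0::2] = (f1 - f2 + f3 - f4) / 2
    out[..., 1::2, 1::2] = (f1 - f2 - f3 + f4) / 2
    return out

class WDCBlock(nn.Module):
    """DWT -> per-subband convolution -> IDWT with a residual connection,
    mirroring the three equations above; plain 3x3 convs stand in for Dconv."""
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)           # Dconv(V_in)
        self.low = nn.Conv2d(channels, channels, 3, padding=1)           # Dconv(F_1)
        self.high = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)  # Dconv on concat(F_2, F_3, F_4)

    def forward(self, v_in):
        pre = self.pre(v_in)
        f1, f2, f3, f4 = haar_dwt(pre)
        f7 = torch.cat([self.low(f1),
                        self.high(torch.cat([f2, f3, f4], dim=1))], dim=1)
        g1, g2, g3, g4 = torch.chunk(f7, 4, dim=1)   # split F_7 back into four subbands
        return haar_idwt(g1, g2, g3, g4) + pre       # V_out = IDWT(F_7) + Dconv(V_in)
```

Because the Haar pair is an exact orthonormal transform, the block reduces to a plain residual unit when the subband convolutions are identity mappings, which is one reason it can be dropped into a pretrained backbone without disturbing the feature distribution.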
Another key component of the proposed WDC is the deformable convolution itself, implemented here by removing the softmax normalization from deformable convolutional network v3 (DCNv3) [68]. For each reference pixel $p_0$ in the input feature $V_{\mathrm{in}}$, the operation is defined as:
$\mathrm{Dconv}(p_0) = \sum_{g=1}^{G} \sum_{n=1}^{N} w_g \, m_{gn} \, x_g\!\left(p_0 + p_n + \Delta p_{gn}\right),$
where $N$ is the number of sampling points, $n$ indexes each sampling location, and $G$ is the number of aggregation groups. For the $g$-th group, $w_g$ is the location-independent projection weight, $m_{gn}$ is the modulation scalar for the $n$-th sampling point, $x_g$ is the sliced input feature map, and $\Delta p_{gn}$ is the learned offset relative to the grid location $p_n$. Further implementation details can be found in [68,69].
In this work, a deformable kernel refers to a convolution kernel whose sampling positions are not fixed but are adaptively learned during training to better align with object structures. Unlike a standard 3 × 3 kernel that samples on a uniform grid, the deformable kernel dynamically offsets each sampling point according to local feature variations. This enables the kernel to deform its shape and orientation to fit elongated or irregular ship targets in SAR images. Such adaptive sampling enhances the model’s ability to capture geometric variations caused by different ship sizes, rotations, and imaging angles. As illustrated in Figure 5, this flexibility allows the receptive field to concentrate on the most informative regions, thereby improving feature representation and detection robustness in complex maritime environments.
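For readers unfamiliar with the operation, the sketch below shows how learned offsets and modulation scalars reshape the sampling grid, using torchvision's deform_conv2d (a DCNv2-style modulated deformable convolution). It illustrates the sampling idea of the equation above rather than the exact grouped DCNv3 variant adopted in this work, and the class name is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SimpleDeformConv2d(nn.Module):
    """Illustrative 3x3 modulated deformable convolution (single offset group)."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.padding = padding
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # offsets (2 values per sampling point) and modulation scalars predicted from the input
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        self.mask_pred = nn.Conv2d(in_ch, k * k, k, padding=padding)
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)   # start from the regular grid

    def forward(self, x):
        offset = self.offset_pred(x)             # learned offsets (Delta p_gn)
        mask = torch.sigmoid(self.mask_pred(x))  # modulation scalars in (0, 1), no softmax over points
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.padding, mask=mask)
```

Zero-initializing the offset branch means the layer starts as an ordinary convolution and only gradually learns to shift its sampling points toward elongated or rotated ship structures.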

3.2. Position-Encoded Multi-Head Attention

The original Sparse R-CNN achieves efficient and accurate end-to-end detection by introducing a set of learnable sparse region proposals combined with a one-to-one dynamic convolution mechanism. A key component of this framework is the dynamic head, which fuses proposal features with ROI features through a unique feature interaction structure, thereby extracting increasingly precise object features.
In the original design, the dynamic head employs feature interaction based on batch matrix multiplication, where proposal features interact with corresponding ROI features to capture semantic information. However, this operation is essentially a global linear transformation and lacks the ability to fine-tune relationships across different feature subspaces such as semantics, location, and context. Moreover, the computation process is fixed and unified, making it difficult to adapt dynamically to the diverse feature representations of different ship targets. This limitation becomes critical when representing small or ambiguous targets in complex backgrounds, noisy conditions, or near-shore scenarios.
To overcome these limitations, we replace the original batch matrix multiplication with a more flexible and expressive multi-head attention mechanism. In this approach, proposal features serve as queries, while ROI features act as keys and values. Multi-head attention enables the model to capture associations across multiple subspaces in parallel, aggregating semantic and spatial information at finer granularity and enhancing object discrimination. This improves the flexibility of feature interactions and strengthens the model’s ability to perceive objects in complex SAR scenes.
However, in the original Sparse R-CNN, ROI features are obtained through operations such as RoI Align. Although they are semantically rich, they lack explicit spatial position encoding. Without positional information, attention calculation struggles to accurately capture the spatial distribution of candidate regions, leading to confusion when objects are spatially close yet semantically distinct. To address this, we introduce explicit positional encoding into ROI features, thereby enhancing spatial awareness and improving discrimination of small or edge targets.

3.2.1. Position Encoding Integrated into ROI Features

ROI features are extracted by sampling candidate regions from backbone feature maps. While these features contain strong semantic information, they discard geometric attributes of the original bounding boxes (e.g., center, width, height, and area), resulting in a lack of explicit spatial representation. This omission limits the model’s ability to differentiate objects in complex scenes, particularly under occlusion or when detecting small targets.
To enhance spatial modeling, we design a lightweight position encoding module that explicitly encodes the spatial information of candidate bounding boxes and fuses it with ROI features. The process includes three steps:
(1) Position information normalization: Each candidate region is represented by its upper-left and lower-right coordinates $(x_1, y_1, x_2, y_2)$. To ensure scale invariance, the coordinates and area are normalized to $[0, 1]$ by dividing by the image width $W$, height $H$, or their product:
$x_1^{\mathrm{norm}} = \frac{x_1}{W}, \quad y_1^{\mathrm{norm}} = \frac{y_1}{H},$
$x_2^{\mathrm{norm}} = \frac{x_2}{W}, \quad y_2^{\mathrm{norm}} = \frac{y_2}{H},$
$S^{\mathrm{norm}} = \frac{(x_2 - x_1)(y_2 - y_1)}{W \cdot H}.$
(2) Learnable position embeddings: The normalized coordinates and area are concatenated into a position vector P, which is mapped to the same dimension as ROI features via a lightweight perception module consisting of a linear layer, layer normalization, and a ReLU activation:
$E_{\mathrm{pos}} = \mathrm{ReLU}\big(\mathrm{LayerNorm}\big(\mathrm{Linear}(P)\big)\big).$
(3) Fusion with ROI features: The position encoding is fused with the ROI feature through weighted addition:
$F_{\mathrm{roi\_pos}} = F_{\mathrm{roi}} + E_{\mathrm{pos}},$
where $F_{\mathrm{roi}}$ denotes the ROI feature. This fusion introduces minimal computational overhead while significantly improving spatial modeling, enabling the subsequent attention to more accurately distinguish objects by position.
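A minimal sketch of this three-step module is given below, assuming the usual Sparse R-CNN layout in which each proposal yields a flattened grid of ROI tokens; the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class BoxPositionEncoding(nn.Module):
    """Normalize box geometry, embed it, and add it to the ROI features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(5, feat_dim),      # (x1, y1, x2, y2, area), all normalized to [0, 1]
            nn.LayerNorm(feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, roi_feats, boxes, img_w, img_h):
        # roi_feats: (N, S, feat_dim) ROI tokens per proposal; boxes: (N, 4) in pixels
        x1, y1, x2, y2 = boxes.unbind(dim=-1)
        p = torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         (x2 - x1) * (y2 - y1) / (img_w * img_h)], dim=-1)
        e_pos = self.embed(p)                    # E_pos
        return roi_feats + e_pos.unsqueeze(1)    # F_roi_pos = F_roi + E_pos
```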

3.2.2. Multi-Head Attention-Based Feature Fusion

The dynamic head establishes one-to-one interactions between proposal and ROI features to enhance feature representation and detection accuracy. In the original Sparse R-CNN, this relies on batch matrix multiplication, which, while effective, essentially reduces to a fixed-weight linear combination. As a result, it struggles to model complex feature relationships and adapt to the diversity of SAR ship targets and their spatial contexts.
To address this, we propose replacing the dynamic head with a multi-head attention mechanism. Compared with the original interaction scheme, multi-head attention offers stronger representation and relationship modeling. It captures correlations across scales, shapes, and spatial contexts in parallel, thereby improving detection of sparse and small ships in cluttered backgrounds. Moreover, its dynamic and interpretable attention weights are well-suited to handling strong scattering and background noise in SAR imagery, leading to improved robustness and accuracy.
Specifically, proposal features are linearly mapped to queries Q, while position-encoded ROI features are mapped to keys K and values V. The cross-attention mechanism is then applied. For the i-th proposal feature and ROI feature, the interaction is defined as:
$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i,$
where $d_k$ is the feature dimension. By extending this operation to multiple heads, the model learns diverse relationships across different subspaces:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}.$
The implementation of this multi-head attention mechanism for feature interaction between proposal and ROI features is illustrated in Figure 6.
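The interaction itself maps directly onto PyTorch's built-in multi-head attention, as in the short sketch below. The toy dimensions (100 proposals, 256-dimensional features, a 7 × 7 ROI grid, 8 heads) are illustrative assumptions, and the real head additionally applies normalization and feed-forward layers.

```python
import torch
import torch.nn as nn

num_proposals, feat_dim, roi_tokens, heads = 100, 256, 49, 8
attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

proposal_feats = torch.randn(num_proposals, 1, feat_dim)          # one query per learnable proposal
roi_feats_pos = torch.randn(num_proposals, roi_tokens, feat_dim)  # 7x7 ROI grid + position encoding

# Proposal features act as queries; position-encoded ROI features act as keys and values.
obj_feats, attn_weights = attn(query=proposal_feats,
                               key=roi_feats_pos,
                               value=roi_feats_pos)
# obj_feats: (num_proposals, 1, feat_dim) target-specific features fed to the
# classification and regression branches; attn_weights show where each proposal attends.
```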

4. Experiments and Results

To evaluate the effectiveness and robustness of the proposed method in practical SAR ship detection, experiments are conducted on two widely used public datasets: the SAR Ship Detection Dataset (SSDD) and the High-Resolution SAR Images Dataset (HRSID). The experimental study is organized as follows: Section 4.1 introduces SSDD and HRSID and describes the evaluation metrics; Section 4.2 outlines the software, hardware, and training configurations; Section 4.3 compares the proposed method with mainstream detection baselines, demonstrating its superior accuracy and robustness; Section 4.4 presents ablation experiments to analyze the contribution of each module; and Section 4.5 provides a visual comparison of detection results in sparse and multi-scale target scenes.

4.1. Datasets and Evaluation Metrics

4.1.1. SAR Ship Detection Datasets

(1) SSDD: The SAR Ship Detection Dataset, introduced by Li et al. [21], is the first open-source dataset for SAR ship detection. It contains 1160 SAR images with 2456 ship instances, averaging 2.12 ships per image. The images are collected from multiple sensors, including RadarSat-2, TerraSAR-X, and Sentinel-1, with polarimetric modes HH, VV, VH, and HV, and resolutions ranging from 1 to 15 m.
SSDD provides diverse imaging conditions and target scenarios, covering nearshore, offshore, and open-sea scenes, as shown in Figure 7a. Each ship is annotated with horizontal bounding boxes (center coordinates, width, and height), and even very small targets with only a few pixels are labeled, which benefits detection accuracy. The dataset poses challenges such as small-target detection in low-resolution images and strong background clutter, demanding high robustness from detection algorithms. Following the standard protocol, SSDD is randomly split into training and testing sets with a 4:1 ratio, making it suitable for fair performance comparison across methods.
(2) HRSID: The High-Resolution SAR Images Dataset, proposed by Wei et al. [22], was constructed for both ship detection and instance segmentation. It is derived from 136 high-resolution panoramic SAR images (1–5 m resolution), cropped into 5604 sub-images of size 800 × 800 pixels, containing a total of 16,951 ship targets. Annotations are provided in MS COCO format, including bounding boxes and instance segmentation masks, making HRSID applicable to both detection and segmentation tasks.
HRSID offers substantial diversity in imaging modes and scenes, including harbors, offshore regions, and cluttered coastal backgrounds, as shown in Figure 7b. Data are collected from multiple imaging modes of Sentinel-1B and TerraSAR-X. To ensure annotation quality and reduce labeling errors, optical imagery from Google Earth is used as a reference during labeling.

4.1.2. Evaluation Metrics

The average precision (AP) metric is widely used to measure the performance of detection algorithms across object sizes and matching thresholds, and it offers strong generality and representativeness. AP is computed from the precision-recall (PR) curve, which plots precision against recall, and corresponds to the area enclosed by the PR curve and the two axes. In the COCO evaluation protocol, AP is the core metric for measuring detector accuracy at different IoU thresholds, and mAP denotes the AP averaged over categories; since the SAR ship datasets contain only one category, AP is equivalent to mAP here. The standard AP [IoU = 0.50:0.95] averages the AP over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. AP$_{50}$ is the most commonly used AP value, computed at IoU = 0.5, and AP$_{75}$ is the AP at IoU = 0.75. AP$_S$, AP$_M$, and AP$_L$ denote the AP for small targets (smaller than 32 × 32 pixels), medium targets (between 32 × 32 and 96 × 96 pixels), and large targets (larger than 96 × 96 pixels), respectively.
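As a concrete reference, the snippet below sketches how a single AP value is obtained from a PR curve using the standard all-points interpolation; the official COCO evaluator instead samples precision at 101 fixed recall levels and then averages over the ten IoU thresholds, but the principle is the same, and the function name is illustrative.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with monotone (all-points) interpolation.
    recall/precision: curve points for detections sorted by decreasing confidence."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision non-increasing
    changed = np.where(r[1:] != r[:-1])[0]          # recall levels where the curve moves
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))

# AP[0.50:0.95] repeats this for IoU thresholds 0.50, 0.55, ..., 0.95 and averages the results.
```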

4.2. Experimental Settings

The experimental environment is configured as follows: the operating system is Ubuntu 18.04, the CPU is an Intel Xeon Gold 6226R (2.90 GHz), and four NVIDIA GeForce RTX 3090 GPUs are used. The programming language is Python 3.9, and the main deep learning frameworks are PyTorch 1.11.0 (CUDA 11.3) and Detectron2 (version 0.3). Other auxiliary libraries include torchvision 0.12.0, numpy 1.24.0, and fvcore 0.1.5.
The model is based on the Sparse R-CNN architecture with a ResNet-101 backbone, into which the WDC module is introduced to enhance feature extraction. The pre-trained weights come from a ResNet-101 model trained on ImageNet. During training, the AdamW optimizer is used with a base learning rate of $2.5 \times 10^{-5}$, a weight decay of $1 \times 10^{-4}$, and a linear warm-up over the first 1000 iterations. The total number of training iterations is 100,000, with the learning rate decaying at iterations 80,000 and 90,000. Data augmentation uses only random cropping, which is enabled by default, and the training batch size is 8.
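For reference, the schedule above maps roughly onto standard PyTorch components as sketched below; the experiments themselves are run through Detectron2's solver, the decay factor of 0.1 at the two milestones is an assumption, and the placeholder model stands in for the detector.

```python
import torch

model = torch.nn.Linear(8, 2)  # placeholder standing in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5, weight_decay=1e-4)

# Linear warm-up over the first 1000 iterations, then step decay at 80k and 90k iterations.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=1000)
decay = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80_000, 90_000],
                                             gamma=0.1)  # assumed decay factor
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, decay],
                                                  milestones=[1000])

for iteration in range(100_000):
    # ... compute the loss on a batch of 8 randomly cropped SAR images and backpropagate ...
    optimizer.step()
    scheduler.step()
```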

4.3. Performance Comparison with Reference Methods

To verify the effectiveness and robustness of the proposed model, this section presents comparative experiments on the SSDD and HRSID against other advanced detection methods. Considering the popularity and availability of reference methods, we select the two-stage methods Faster R-CNN [25], Cascade R-CNN [26], and Mask R-CNN [27]; the one-stage method RetinaNet [32] and YOLO-series variants (including YOLOv5s, YOLOv8s, YOLOv9s, YOLOv11s, YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, CSS-YOLO [38], SHIP-YOLO [39], Enhanced YOLOv8 [40], and LHSDNet [41]); and the anchor-free methods Sparse R-CNN [59], CenterNet [70], and the fully convolutional one-stage object detector (FCOS) [71] for performance comparison.

4.3.1. Comparative Experimental Results on SSDD

We evaluate the proposed method on the SSDD and compare it against representative two-stage, one-stage, and anchor-free methods. The two-stage methods include Faster R-CNN [25], Cascade R-CNN [26], and Mask R-CNN [27]. The one-stage methods include RetinaNet [32], YOLOv5s, YOLOv8s, YOLOv9s, YOLOv11s, and CSS-YOLO [38]. The anchor-free method is the baseline model Sparse R-CNN [59].
Table 1 shows the results of the comparative experiments. Our method achieves an overall AP of 74.5%, ranking second among all methods and demonstrating competitive performance with state-of-the-art YOLO variants. Notably, it achieves the best results on AP$_{75}$ (89.9%), AP$_{50}$ (98.7%), AP$_S$ (73.4%), and AP$_M$ (80.5%), significantly outperforming existing approaches in precise localization and in detecting small and medium-sized objects, which are particularly challenging in SAR imagery. Although its performance on large objects (AP$_L$ = 70.8%) is slightly lower than that of YOLOv11s and YOLOv8s, the proposed method exhibits superior robustness and accuracy across most metrics, validating its effectiveness for SAR ship detection.
To more intuitively demonstrate the detection performance of the proposed model, we select Cascade R-CNN, CSS-YOLO, and Sparse R-CNN as representative two-stage, single-stage, and anchor-free detectors, respectively, based on the evaluation results in Table 1. Figure 8 shows the visual detection results across three scenarios. In the first row (shore-approach scene), the ship target is heavily obscured by shoreline buildings and surrounded by complex background clutter. Both RetinaNet and the baseline model fail to detect the target. Cascade R-CNN detects it but also produces a false positive by misclassifying background regions as ships. Only the proposed model correctly identifies the target, demonstrating that the WDC module and multi-head attention mechanism improve robustness under severe background interference. In the second row (nearshore scene) and the third row (offshore scene), all four methods detect the ship target, but Cascade R-CNN and CSS-YOLO produce redundant bounding boxes. By contrast, the baseline and the proposed model yield accurate detections without redundancy. These results highlight two points: first, compared with dense anchor-based detectors, Sparse R-CNN benefits from an end-to-end detection pipeline without complex post-processing; second, dense anchor-based methods are less suitable for SAR ship detection tasks with sparse targets.

4.3.2. Comparative Experimental Results on HRSID

We further evaluate our method on the HRSID by comparing it with reference methods, with the results summarized in Table 2. Considering popularity and availability, the reference methods include Faster R-CNN [25], Cascade R-CNN [26], Mask R-CNN [27], RetinaNet [32], LHSDNet [41], YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, SHIP-YOLO [39], Enhanced YOLOv8 [40], Sparse R-CNN [59], CenterNet [70], and the fully convolutional one-stage object detector (FCOS) [71].
The proposed approach achieves the best overall AP of 68.7%, surpassing all competing methods, including the strong Sparse R-CNN (66.5%) and Cascade R-CNN (66.6%) baselines. In terms of localization accuracy, our method consistently outperforms the others, achieving the highest AP$_{50}$ (90.5%) and AP$_{75}$ (79.7%), demonstrating superior robustness across different IoU thresholds. Moreover, the proposed model attains the best performance on small- and large-scale objects, with AP$_S$ = 69.9% and AP$_L$ = 55.2%, showing clear advantages in handling both fine-grained targets and large vessels that often challenge conventional detectors. While Enhanced YOLOv8 slightly outperforms our method on medium objects (AP$_M$ = 72.4% vs. 68.8%), the overall improvements across all other metrics highlight the balanced and reliable detection capability of our method on the complex HRSID.
Figure 9 presents a visual comparison of detection results obtained by the proposed model, Cascade R-CNN, Enhanced YOLOv8, and the baseline Sparse R-CNN. In the first row (shore-to-shore scene), the ship target occupies only a few pixels and is heavily obscured by background interference. While Enhanced YOLOv8 and the proposed model successfully detect the target, the other models either produce false positives or fail to detect it. In the second row (nearshore scene), the ship targets are sparse and extremely small, further complicated by numerous sea-surface noise points. All models miss some targets, and Cascade R-CNN even produces false positives by misclassifying noise points as ships. Despite these challenges, the proposed model misses only one target, achieving the most reliable detection among the compared methods. In the third row (offshore scene), two ships are sparsely distributed and located close to each other. Cascade R-CNN detects both targets but introduces multiple redundant bounding boxes, lowering precision despite achieving higher recall. By contrast, the proposed model and the baseline produce more accurate bounding-box localization, avoiding redundancy while maintaining robust detection performance.

4.4. Ablation Study

To verify the effectiveness of the proposed components, ablation experiments are conducted on the two core modules introduced in this work—WDC and PEMA. All experiments are performed under identical backbone architectures and training configurations, with key modules removed or replaced to isolate their effects on detection performance.
First, we evaluate the impact of integrating WDC into the ResNet-101 backbone. WDC enhances the geometric modeling capacity and improves the adaptability of receptive fields to local structural variations. Second, we investigate the effect of replacing the original dynamic interaction head in Sparse R-CNN with the proposed PEMA module. This enhanced feature interaction module strengthens the fusion of semantic and spatial information, particularly improving the detection of small and sparsely distributed ships.
The ablation study is conducted on the SSDD, where model variants are compared using the COCO AP metric. Specifically, we evaluate the baseline Sparse R-CNN, the model with WDC only, the model with PEMA only, and the full model incorporating both modules. This setup enables a systematic examination of the independent and combined effects of WDC and PEMA on overall detection accuracy and scale-specific performance. The experimental settings and results are summarized in Table 3. Here, WDC denotes the baseline model augmented with the WDC module, while PEMA refers to the baseline with its dynamic interaction head replaced by the position-encoded multi-head attention module.
Table 3 demonstrates that each component of the proposed method contributes positively to the ablation results on the SSDD. Compared with the baseline Sparse R-CNN, the full model achieves consistent and significant improvements across all detection metrics. Specifically, the overall AP increases by 3.7% (from 70.8% to 74.5%), while AP$_{50}$ and AP$_{75}$ improve by 2.8% (from 95.9% to 98.7%) and 3.3% (from 86.6% to 89.9%), respectively. These results indicate that the proposed modules notably enhance localization quality and bounding-box regression accuracy. For multi-scale performance, AP$_S$, AP$_M$, and AP$_L$ increase by 4.3%, 3.7%, and 4.1%, respectively, reaching 73.4%, 80.5%, and 70.8%. This confirms that the proposed model effectively improves detection robustness across objects of different scales, particularly benefiting small and large targets in complex SAR scenes.
Further analysis of individual modules shows that when WDC is used alone, AP, AP$_{50}$, and AP$_{75}$ improve by 3.1%, 1.5%, and 3.2%, respectively, demonstrating its strong contribution to feature extraction and localization. When PEMA is applied independently, AP, AP$_{50}$, and AP$_{75}$ improve by 1.4%, 0.7%, and 1.2%, respectively, highlighting its role in enhancing semantic–spatial feature interactions. The best results are achieved when both modules are combined, confirming their complementary nature and synergistic effect in boosting overall detection performance.
Table 4 presents the ablation results of the proposed components on the HRSID. The baseline Sparse R-CNN achieves an AP of 66.5%, which provides a solid foundation but leaves room for improvement in complex maritime environments. Incorporating the WDC module yields a performance gain of 1.3 percentage points in overall AP, confirming that the integration of wavelet-domain representation and deformable sampling effectively enhances feature adaptability to multi-scale and geometrically diverse ship targets. When only the PEMA mechanism is added, the AP increases by 0.9 points compared to the baseline, indicating that explicit spatial encoding and attention-guided feature refinement improve discrimination of sparse ship targets against cluttered sea backgrounds.
When both modules are integrated, the proposed method achieves the best overall performance, with 68.7% AP, 90.5% AP$_{50}$, and 79.7% AP$_{75}$. This demonstrates that WDC and PEMA are complementary: WDC strengthens geometric and multi-scale feature extraction, while PEMA enhances contextual correlation and localization precision. The performance gains on small and medium targets (AP$_S$ = 69.9% and AP$_M$ = 68.8%) further highlight the proposed method's robustness in detecting small ships under varying sea clutter conditions.
To further illustrate the effectiveness of WDC, we visualize feature maps of the ResNet-101 backbone at the Res4 and Res5 stages before and after its integration. Figure 10 and Figure 11 present two representative scenarios: docking and offshore. In each figure, the first row shows results from the baseline model, while the second row shows those after introducing WDC. Within each row, the left side depicts ground-truth annotations and the right side shows detection outputs.
The visual comparisons demonstrate that WDC significantly strengthens the model’s ability to capture ship features of varying shapes, reducing both false alarms and missed detections. Taking the docking scenario in Figure 10 as an example, the baseline model struggles with closely spaced ships and interference from port structures, leading to frequent errors. After applying WDC, the model better distinguishes adjacent ships and effectively suppresses background interference. From a feature representation perspective, the Res4 stage primarily models fine-grained shape and structural details, whereas the Res5 stage emphasizes global context and classification cues. The WDC-enhanced features show clear separation of targets from clutter compared to the baseline’s entangled responses. Similarly, in the offshore scenario (Figure 11), ships are sparsely distributed and mostly small, making them highly susceptible to background noise. The baseline model often misses these small targets. With WDC, the feature maps exhibit stronger responses for small ships, leading to more accurate detections and demonstrating improved robustness against noise and sparsity.

4.5. Discussion on Sparse and Multi-Scale Ship Detection

To further illustrate the advantages of the proposed model over conventional dense anchor-based detectors in sparse and multi-scale SAR ship detection scenarios, qualitative comparisons are presented in Figure 12 and Figure 13. Specifically, Figure 12 shows visual detection results for sparse ship targets on the SSDD, while Figure 13 presents results for multi-scale targets on the HRSID.
As shown in Figure 12, detection methods based on dense anchor mechanisms frequently suffer from false alarms or missed detections when dealing with sparsely distributed targets. These errors primarily occur due to the use of redundant predefined anchors and post-processing heuristics that cannot effectively adapt to arbitrary or irregular ship distributions. In contrast, the proposed sparse detection framework eliminates the need for dense anchors and handcrafted matching rules, demonstrating superior robustness and adaptability in sparse maritime environments.
Figure 13 highlights the effectiveness of the proposed model in detecting ships of varying scales. The results show that Cascade R-CNN and Enhanced YOLOv8 struggle with redundant detections or missed small targets, especially under complex background conditions and scale variations. For instance, in the first-row example, both Cascade R-CNN and the proposed model successfully detect all ship targets, whereas Enhanced YOLOv8 fails to capture one. In the third-row scene, where strong clutter interference and large scale differences exist, all models exhibit some performance degradation. However, the proposed method misses only three small targets, while the comparison models show a significantly higher number of false and missed detections. These observations confirm that the proposed WDC and PEMA modules effectively enhance the model’s multi-scale adaptability and robustness against clutter and noise.

5. Conclusions

This paper proposed a SAR ship detection method that enhances Sparse R-CNN with the WDC and PEMA modules. WDC improves feature representation by jointly leveraging spatial and frequency-domain information while adaptively modeling multi-scale geometric variations, and PEMA strengthens spatial perception and semantic interaction by integrating positional encoding into the attention mechanism. Experiments on SSDD and HRSID demonstrated that the proposed approach consistently outperforms anchor-based and anchor-free baselines, significantly improving detection accuracy for sparse, small, and complex targets while reducing false alarms in cluttered backgrounds. Ablation studies further verified that WDC and PEMA contribute complementary benefits, with their combination achieving the best overall performance. These results highlight the effectiveness of the proposed framework in addressing key challenges of SAR ship detection, and future work will explore real-time adaptation, lightweight deployment, and cross-domain generalization for broader maritime applications.

Author Contributions

Conceptualization, Z.Z. and Z.C.; methodology, H.L., Z.C. and J.Y.; software, H.L.; validation, Z.Z. and Z.C.; formal analysis, Z.C.; investigation, Z.Z. and H.L.; resources, H.L.; data curation, J.Y.; writing—original draft preparation, Z.C.; writing—review and editing, Z.Z. and H.L.; visualization, Z.Z.; supervision, H.L. and J.Y.; project administration, H.L.; funding acquisition, H.L. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Chongqing Natural Science Foundation under Grant CSTB2025NSCQ-GPX0743, in part by the National Natural Science Foundation of China under Grants 62301164, 62222102, and 62171023, in part by the National Key Research and Development Program of China under Grant 2024YFB3909800, and in part by the Fundamental Research Funds for the Central Universities under Project No. 0214005203001.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would also like to thank the reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, C.; Zhang, X.; Gao, G.; Lang, H.; Liu, G.; Cao, C.; Song, Y.; Guan, Y.; Dai, Y. Development and application of ship detection and classification datasets: A review. IEEE Geosci. Remote Sens. Mag. 2024, 12, 12–45.
  2. Liu, T.; Zhang, J.; Gao, G.; Yang, J.; Marino, A. CFAR ship detection in polarimetric synthetic aperture radar images based on whitening filter. IEEE Trans. Geosci. Remote Sens. 2019, 58, 58–81.
  3. Qin, X.; Zhou, S.; Zou, H.; Gao, G. A CFAR detection algorithm for generalized gamma distributed background in high-resolution SAR images. IEEE Geosci. Remote Sens. Lett. 2012, 10, 806–810.
  4. Zeng, T.; Zhang, T.; Shao, Z.; Xu, X.; Zhang, W.; Shi, J.; Wei, S.; Zhang, X. CFAR-DP-FW: A CFAR-guided dual-polarization fusion framework for large-scene SAR ship detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7242–7259.
  5. Li, M.D.; Cui, X.C.; Chen, S.W. Adaptive superpixel-level CFAR detector for SAR inshore dense ship detection. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4010405.
  6. Zhang, T.; Ji, J.; Li, X.; Yu, W.; Xiong, H. Ship detection from PolSAR imagery using the complete polarimetric covariance difference matrix. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2824–2839.
  7. Deng, J.; Wang, W.; Zhang, H.; Zhang, T.; Zhang, J. PolSAR Ship Detection Based on Superpixel-Level Contrast Enhancement. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4008805.
  8. Wang, J.; Quan, S.; Xing, S.; Li, Y.; Wu, H.; Meng, W. PSO-based fine polarimetric decomposition for ship scattering characterization. ISPRS J. Photogramm. Remote Sens. 2025, 220, 18–31.
  9. Xing, X.; Ji, K.; Zou, H.; Chen, W.; Sun, J. Ship classification in TerraSAR-X images with feature space based sparse representation. IEEE Geosci. Remote Sens. Lett. 2013, 10, 1562–1566.
  10. Lin, H.; Song, S.; Yang, J. Ship classification based on MSHOG feature and task-driven dictionary learning with structured incoherent constraints in SAR images. Remote Sens. 2018, 10, 190.
  11. Lin, H.; Chen, H.; Wang, H.; Yin, J.; Yang, J. Ship detection for PolSAR images via task-driven discriminative dictionary learning. Remote Sens. 2019, 11, 769.
  12. Wang, Y.; Chen, L.; Shi, H.; Zhang, B. Ship detection in synthetic aperture radar imagery based on discriminative dictionary learning. In Proceedings of the 2019 6th Asia-Pacific Conference on Synthetic Aperture Radar (APSAR), Xiamen, China, 26–29 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
  13. Lin, H.; Chen, H.; Jin, K.; Zeng, L.; Yang, J. Ship detection with superpixel-level Fisher vector in high-resolution SAR images. IEEE Geosci. Remote Sens. Lett. 2019, 17, 247–251.
  14. Wang, X.; Li, G.; Plaza, A.; He, Y. Ship detection in SAR images via enhanced nonnegative sparse locality-representation of Fisher vectors. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9424–9438.
  15. Jin, K.; Chen, Y.; Xu, B.; Yin, J.; Wang, X.; Yang, J. A patch-to-pixel convolutional neural network for small ship detection with PolSAR images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6623–6638.
  16. Zhou, L.; Yu, H.; Lan, Y.; Gong, S.; Xing, M. CANet: An unsupervised deep convolutional neural network for efficient cluster-analysis-based multibaseline InSAR phase unwrapping. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5212315.
  17. Zhou, L.; Yu, H.; Lan, Y.; Xing, M. Deep learning-based branch-cut method for InSAR two-dimensional phase unwrapping. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5209615.
  18. Zhang, Z.; Mei, S.; Ma, M.; Han, Z. Adaptive composite feature generation for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631716.
  19. Xie, N.; Zhang, T.; Zhang, L.; Chen, J.; Wei, F.; Yu, W. VLF-SAR: A Novel Vision-Language Framework for Few-shot SAR Target Recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9530–9544.
  20. Tian, Z.; Wang, W.; Zhou, K.; Song, X.; Shen, Y.; Liu, S. Weighted pseudo-labels and bounding boxes for semisupervised SAR target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5193–5203.
  21. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. In Proceedings of the 2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, China, 13–14 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6.
  22. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254.
  23. Zhang, S.; Wu, R.; Xu, K.; Wang, J.; Sun, W. R-CNN-based ship detection from high resolution remote sensing imagery. Remote Sens. 2019, 11, 631.
  24. Xu, C.; Yin, C.; Wang, D.; Han, W. Fast ship detection combining visual saliency and a cascade CNN in SAR images. IET Radar Sonar Navig. 2020, 14, 1879–1887.
  25. Li, Y.; Zhang, S.; Wang, W.Q. A lightweight faster R-CNN for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2020, 19, 4006105.
  26. Chai, B.; Nie, X.; Zhou, Q.; Zhou, X. Enhanced cascade R-CNN for multiscale object detection in dense scenes from SAR images. IEEE Sens. J. 2024, 24, 20143–20153.
  27. Qian, Y.; Liu, Q.; Zhu, H.; Fan, H.; Du, B.; Liu, S. Mask R-CNN for object detection in multitemporal SAR images. In Proceedings of the 2019 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), Shanghai, China, 5–7 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
  28. Zhang, T.; Zhang, X.; Ke, X. Quad-FPN: A novel quad feature pyramid network for SAR ship detection. Remote Sens. 2021, 13, 2771.
  29. Zhang, Z.T.; Zhang, X.; Shao, Z. Deform-FPN: A novel FPN with deformable convolution for multi-scale SAR ship detection. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5273–5276.
  30. Han, L.; Ye, W.; Li, J.; Ran, D. Small ship detection in SAR images based on modified SSD. In Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 11–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
  31. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. Automatic ship detection based on RetinaNet using multi-resolution Gaofen-3 imagery. Remote Sens. 2019, 11, 531.
  32. Miao, T.; Zeng, H.; Yang, W.; Chu, B.; Zou, F.; Ren, W.; Chen, J. An improved lightweight RetinaNet for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4667–4679.
  33. Jiang, S.; Zhu, M.; He, Y.; Zheng, Z.; Zhou, F.; Zhou, G. Ship detection with SAR based on YOLO. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1647–1650.
  34. Khan, H.M.; Yunze, C. Ship detection in SAR Image using YOLOv2. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 9495–9499.
  35. Hong, Z.; Yang, T.; Tong, X.; Zhang, Y.; Jiang, S.; Zhou, R.; Han, Y.; Wang, J.; Yang, S.; Liu, S. Multi-scale ship detection from SAR and optical imagery via a more accurate YOLOv3. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6083–6101.
  36. Jiang, J.; Fu, X.; Qin, R.; Wang, X.; Ma, Z. High-speed lightweight ship detection algorithm based on YOLO-v4 for three-channels RGB SAR image. Remote Sens. 2021, 13, 1909.
  37. Yu, C.; Shin, Y. SAR ship detection based on improved YOLOv5 and BiFPN. ICT Express 2024, 10, 28–33.
  38. Zhou, L.; Zhang, G.; Yang, J.; Xie, Y.; Liu, C.; Liu, Y. CSS-YOLO: A SAR Image Ship Detection Method for Complex Scenes. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20636–20654.
  39. Luo, Y.; Li, M.; Wen, G.; Tan, Y.; Shi, C. SHIP-YOLO: A lightweight synthetic aperture radar ship detection model based on YOLOv8n algorithm. IEEE Access 2024, 12, 37030–37041.
  40. Guan, T.; Chang, S.; Wang, C.; Jia, X. SAR Small Ship Detection Based on Enhanced YOLO Network. Remote Sens. 2025, 17, 839.
  41. Dai, D.; Wu, H.; Wang, Y.; Ji, P. LHSDNet: A Lightweight and High-Accuracy SAR Ship Object Detection Algorithm. Remote Sens. 2024, 16, 4527.
  42. Pan, X.; Han, M.; Liao, G.; Yang, L.; Shao, R.; Li, Y. SFFNet: A ship detection method using scattering feature fusion for sea surface SAR images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4018305.
  43. Wang, H.; Liu, S.; Lv, Y.; Li, S. Scattering information fusion network for oriented ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4013105.
  44. Gao, G.; Zhang, C.; Zhang, L.; Duan, D. Scattering characteristic-aware fully polarized SAR ship detection network based on a four-component decomposition model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5222722.
  45. Wang, J.; Guo, L.; Wei, Y.; Chai, S. Study on ship Kelvin wake detection in numerically simulated SAR images. Remote Sens. 2023, 15, 1089.
  46. Xu, C.; Qi, R.; Wang, X.; Sun, Z. Identifiability of Kelvin wakes in SAR imageries: The role of time-varying characteristics and decoherence effect of wake. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20197–20213.
  47. Ding, K.; Yang, J.; Lin, H.; Wang, Z.; Wang, D.; Wang, X.; Ni, K.; Zhou, Q. Towards real-time detection of ships and wakes with lightweight deep learning model in Gaofen-3 SAR images. Remote Sens. Environ. 2023, 284, 113345.
  48. Lang, P.; Fu, X.; Dong, J.; Yang, H.; Yin, J.; Yang, J.; Martorella, M. Recent Advances in Deep Learning Based SAR Image Targets Detection and Recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6884–6915. [Google Scholar] [CrossRef]
  49. Wu, F.; Zhou, Z.; Wang, B.; Ma, J. Inshore ship detection based on convolutional neural network in optical satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4005–4015. [Google Scholar] [CrossRef]
  50. Sun, Y.; Sun, X.; Wang, Z.; Fu, K. Oriented ship detection based on strong scattering points network in large-scale SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5218018. [Google Scholar] [CrossRef]
  51. Fan, Q.; Chen, F.; Cheng, M.; Lou, S.; Xiao, R.; Zhang, B.; Wang, C.; Li, J. Ship detection using a fully convolutional network with compact polarimetric SAR images. Remote Sens. 2019, 11, 2171. [Google Scholar] [CrossRef]
  52. Jiao, J.; Zhang, Y.; Sun, H.; Yang, X.; Gao, X.; Hong, W.; Fu, K.; Sun, X. A densely connected end-to-end neural network for multiscale and multiscene SAR ship detection. IEEE Access 2018, 6, 20881–20892. [Google Scholar] [CrossRef]
  53. Cui, Z.; Li, Q.; Cao, Z.; Liu, N. Dense attention pyramid networks for multi-scale ship detection in SAR images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8983–8997. [Google Scholar] [CrossRef]
  54. Zhou, Y.; Liu, H.; Ma, F.; Pan, Z.; Zhang, F. A sidelobe-aware small ship detection network for synthetic aperture radar imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5205516. [Google Scholar] [CrossRef]
  55. Li, Q.; Min, R.; Cui, Z.; Pi, Y.; Xu, Z. Multiscale ship detection based on dense attention pyramid network in SAR images. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar]
  56. Wan, H.; Chen, J.; Huang, Z.; Xia, R.; Wu, B.; Sun, L.; Yao, B.; Liu, X.; Xing, M. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5219514. [Google Scholar] [CrossRef]
  57. Yang, S.; An, W.; Li, S.; Wei, G.; Zou, B. An improved FCOS method for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8910–8927. [Google Scholar] [CrossRef]
  58. Gao, G.; Wang, Y.; Chen, Y.; Yang, G.; Yao, L.; Zhang, X.; Li, H.; Li, G. An oriented ship detection method of remote sensing image with contextual global attention mechanism and lightweight task-specific context decoupling. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4200918. [Google Scholar] [CrossRef]
  59. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  60. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  61. Chen, F.; Wu, F.; Xu, J.; Gao, G.; Ge, Q.; Jing, X.Y. Adaptive deformable convolutional network. Neurocomputing 2021, 453, 853–864. [Google Scholar] [CrossRef]
  62. Fu, H.; Liang, J.; Fang, Z.; Han, J.; Liang, F.; Zhang, G. Weconvene: Learned image compression with wavelet-domain convolution and entropy model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 37–53. [Google Scholar]
  63. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998. [Google Scholar]
  64. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship detection in large-scale SAR images via spatial shuffle-group enhance attention. IEEE Trans. Geosci. Remote Sens. 2020, 59, 379–391. [Google Scholar] [CrossRef]
  65. Zhao, Y.; Zhao, L.; Xiong, B.; Kuang, G. Attention receptive pyramid network for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2738–2756. [Google Scholar] [CrossRef]
  66. Akyazi, P.; Ebrahimi, T. Learning-Based Image Compression using Convolutional Autoencoder and Wavelet Decomposition. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  67. Iliopoulou, S.; Tsinganos, P.; Ampeliotis, D.; Skodras, A. Learned Image Compression with Wavelet Preprocessing for Low Bit Rates. In Proceedings of the 2023 24th International Conference on Digital Signal Processing (DSP), Rhodes, Greece, 11–13 June 2023; pp. 1–5. [Google Scholar]
  68. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  69. Dong, W.; Zhou, H.; Wang, R.; Liu, X.; Zhai, G.; Chen, J. Dehazedct: Towards effective non-homogeneous dehazing via deformable convolutional transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6405–6414. [Google Scholar]
  70. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6569–6578. [Google Scholar]
  71. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
Figure 1. Sparse R-CNN framework. The input consists of an image, a set of proposal boxes, and proposal features, where the latter two are learnable parameters. The backbone extracts a feature map, and each proposal box together with its corresponding proposal feature is fed into a dedicated dynamic head to generate object features, which are then used to produce the final classification and localization outputs [59].
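To make the role of the learnable inputs concrete, the following minimal PyTorch sketch shows how a fixed set of proposal boxes and proposal features can be registered as learnable parameters and paired one-to-one with pooled ROI features in a simplified, single-iteration dynamic interaction. All module and variable names (SparseProposals, DynamicHeadSketch, dummy_rois) are illustrative; the full Sparse R-CNN head [59] additionally uses iterative refinement and self-attention among proposal features.

```python
import torch
import torch.nn as nn

class SparseProposals(nn.Module):
    """Learnable proposal boxes and proposal features (illustrative sketch only)."""

    def __init__(self, num_proposals: int = 100, feat_dim: int = 256):
        super().__init__()
        # Normalized (cx, cy, w, h); every box starts as the whole image.
        init_boxes = torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(num_proposals, 1)
        self.proposal_boxes = nn.Parameter(init_boxes)
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, feat_dim))

    def forward(self):
        return self.proposal_boxes, self.proposal_feats


class DynamicHeadSketch(nn.Module):
    """Single interaction step: each proposal feature generates a per-box dynamic kernel."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.kernel_gen = nn.Linear(feat_dim, feat_dim * feat_dim)
        self.cls = nn.Linear(feat_dim, 2)   # ship / background
        self.reg = nn.Linear(feat_dim, 4)   # box deltas

    def forward(self, roi_feats: torch.Tensor, proposal_feats: torch.Tensor):
        # roi_feats: (N, C, S, S) pooled from the feature map inside each proposal box.
        n, c, s, _ = roi_feats.shape
        kernels = self.kernel_gen(proposal_feats).view(n, c, c)   # one CxC kernel per proposal
        x = roi_feats.flatten(2).transpose(1, 2)                  # (N, S*S, C)
        obj = torch.bmm(x, kernels).mean(dim=1)                   # (N, C) object feature
        return self.cls(obj), self.reg(obj)


proposals = SparseProposals()
head = DynamicHeadSketch()
boxes, feats = proposals()
dummy_rois = torch.randn(100, 256, 7, 7)      # stand-in for ROIAlign output
scores, deltas = head(dummy_rois, feats)
print(scores.shape, deltas.shape)             # torch.Size([100, 2]) torch.Size([100, 4])
```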
Figure 2. Workflow of the Sparse R-CNN framework.
Figure 3. Flowchart of the proposed method. The input SAR image is first processed by a ResNet-101 backbone in which the standard convolutions are replaced with WDC modules (WDC-Res2 to WDC-Res5). The resulting feature maps are fused through an FPN to generate multi-scale features (P2 to P5). ROI pooling and position encoding are then applied, followed by a multi-head attention module that interacts with the learnable proposal features from Sparse R-CNN. The resulting target representations are finally used for classification and bounding-box regression to produce the detection results.
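The data flow of Figure 3 can be summarized by the schematic script below, in which every component is replaced by a small stub so that only the order of operations is shown. The stub names (wdc_backbone_stub, pema_stub, and so on) are placeholders introduced here for illustration and do not correspond to the authors' implementation.

```python
import torch
import torch.nn as nn

# Placeholder components: each stub only mimics the tensor shapes of the stage it replaces.
def wdc_backbone_stub(img):            # ResNet-101 with WDC blocks (WDC-Res2 to WDC-Res5)
    return torch.randn(1, 256, 50, 50)

def fpn_stub(feats):                    # FPN producing the multi-scale features P2 to P5
    return feats

def roi_pool_stub(feats, boxes):        # ROIAlign over the learnable proposal boxes
    return torch.randn(boxes.shape[0], 256)

def pos_encode_stub(boxes):             # position encoding of normalized (cx, cy, w, h)
    return torch.zeros(boxes.shape[0], 256)

def pema_stub(rois, proposal_feats):    # position-encoded multi-head attention
    return rois + proposal_feats

proposal_boxes = torch.rand(100, 4)     # learnable in the full model
proposal_feats = torch.randn(100, 256)  # learnable in the full model

image = torch.randn(1, 1, 800, 800)     # single-channel SAR chip
p_feats = fpn_stub(wdc_backbone_stub(image))
rois = roi_pool_stub(p_feats, proposal_boxes) + pos_encode_stub(proposal_boxes)
obj_feats = pema_stub(rois, proposal_feats)
cls_scores = nn.Linear(256, 2)(obj_feats)    # ship / background
box_deltas = nn.Linear(256, 4)(obj_feats)
print(cls_scores.shape, box_deltas.shape)    # torch.Size([100, 2]) torch.Size([100, 4])
```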
Figure 4. The architecture of the proposed WDC block. Dconv denotes the deformable convolution operation.
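As a minimal illustration of the wavelet front end of the WDC block, the snippet below implements a one-level 2-D Haar decomposition and its inverse directly on a feature map; the deformable convolution that the block applies to the resulting sub-bands is sketched separately after Figure 5. The choice of the Haar basis and the pure-PyTorch implementation are assumptions made for this example only.

```python
import torch

def haar_dwt2(x: torch.Tensor):
    """One-level 2-D Haar DWT of a feature map x of shape (N, C, H, W), H and W even.

    Returns the low-frequency band LL and the three high-frequency bands (LH, HL, HH),
    each of shape (N, C, H/2, W/2). A minimal stand-in for the WDC wavelet front end.
    """
    a = x[..., 0::2, 0::2]  # top-left samples of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2   # differences between rows (horizontal-edge response)
    hl = (a - b + c - d) / 2   # differences between columns (vertical-edge response)
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, (lh, hl, hh)

def haar_idwt2(ll, highs):
    """Inverse of haar_dwt2, reassembling the full-resolution map."""
    lh, hl, hh = highs
    a = (ll + lh + hl + hh) / 2
    b = (ll + lh - hl - hh) / 2
    c = (ll - lh + hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    n, ch, h, w = ll.shape
    out = ll.new_zeros(n, ch, 2 * h, 2 * w)
    out[..., 0::2, 0::2] = a
    out[..., 0::2, 1::2] = b
    out[..., 1::2, 0::2] = c
    out[..., 1::2, 1::2] = d
    return out

feat = torch.randn(2, 64, 32, 32)
ll, highs = haar_dwt2(feat)
rec = haar_idwt2(ll, highs)
print(torch.allclose(rec, feat, atol=1e-5))  # True: the decomposition is invertible
```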
Figure 5. Deformable convolution kernel. (a) Standard 3 × 3 convolution kernel, where light blue points denote standard sampling locations. (b) Deformable convolution kernel obtained by adding offsets to the standard kernel, where dark blue points indicate deformed sampling locations and arrows represent offset directions.
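The offset mechanism depicted in Figure 5 can be reproduced with torchvision's deform_conv2d, where a small convolution predicts two offset components for each of the nine sampling points of a 3 × 3 kernel. Channel sizes, the zero initialization of the offset branch, and the layer names are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

in_ch, out_ch, k = 64, 64, 3
x = torch.randn(1, in_ch, 32, 32)

# Offset branch: predicts 2 * k * k offset components per spatial location.
offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=3, padding=1)
nn.init.zeros_(offset_pred.weight)   # zero offsets -> behaves like a standard conv at first
nn.init.zeros_(offset_pred.bias)

weight = torch.randn(out_ch, in_ch, k, k)     # regular convolution weights
offsets = offset_pred(x)                      # (1, 18, 32, 32): offsets per sampling point
y = deform_conv2d(x, offsets, weight, padding=1)
print(y.shape)                                # torch.Size([1, 64, 32, 32])
```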
Figure 6. Multi-head attention-based feature fusion.
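A hedged sketch of the fusion in Figure 6 is given below: normalized proposal-box coordinates are mapped to a sinusoidal position encoding, added to the pooled ROI features, and fused with the proposal features through nn.MultiheadAttention. The specific encoding formula, the use of eight heads, and a single attention layer are assumptions made for this example; the paper's PEMA module may differ in these details.

```python
import math
import torch
import torch.nn as nn

def box_position_encoding(boxes: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal encoding of normalized (cx, cy, w, h) boxes into a dim-d vector."""
    d_per_coord = dim // 4
    freqs = torch.exp(torch.arange(0, d_per_coord, 2) * (-math.log(10000.0) / d_per_coord))
    enc = []
    for i in range(4):                                    # cx, cy, w, h
        angles = boxes[:, i : i + 1] * freqs              # (N, d_per_coord / 2)
        enc.append(torch.sin(angles))
        enc.append(torch.cos(angles))
    return torch.cat(enc, dim=1)                           # (N, dim)

num_proposals, dim = 100, 256
roi_feats = torch.randn(num_proposals, dim)                # pooled per-proposal features
proposal_feats = torch.randn(num_proposals, dim)           # learnable in the full model
boxes = torch.rand(num_proposals, 4)                        # normalized (cx, cy, w, h)

# Inject spatial context before the attention step.
roi_feats = roi_feats + box_position_encoding(boxes, dim)

mha = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
query = proposal_feats.unsqueeze(0)                          # (1, N, dim)
key = value = roi_feats.unsqueeze(0)
obj_feats, attn_weights = mha(query, key, value)
print(obj_feats.shape, attn_weights.shape)                   # (1, 100, 256) (1, 100, 100)
```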
Figure 7. Image samples of the experimental datasets. (a) SSDD. (b) HRSID.
Figure 8. Visualization of detection results of different models on SSDD: (a) Cascade R-CNN, representative of two-stage ship detection methods; (b) CSS-YOLO, representative of single-stage methods; (c) Sparse R-CNN, representative of anchor-free methods; (d) the proposed method; (e) ground truth. From top to bottom: at-shore, near-shore, and open-sea scenes. Green boxes mark correctly detected targets, and red boxes mark false alarms.
Figure 9. Visualization of detection results of different models on HRSID: (a) Cascade R-CNN, representative of two-stage ship detection methods; (b) Enhanced YOLOv8, representative of single-stage methods; (c) Sparse R-CNN, representative of anchor-free methods; (d) the proposed method; (e) ground truth. From top to bottom: at-shore, near-shore, and open-sea scenes. Green boxes mark correctly detected targets, and red boxes mark false alarms.
Figure 10. Comparison of feature maps for two typical docking-scene samples before and after introducing WDC. The first row presents the baseline results, and the second row shows the results after introducing WDC. Green boxes mark the ground truth or correctly detected targets, and red boxes mark false alarms. From left to right: ground truth, feature maps of the res4 block, feature maps of the res5 block, and detection results.
Figure 11. Comparison of feature maps for two typical offshore-scene samples before and after introducing WDC. The first row presents the baseline results, and the second row shows the results after introducing WDC. Green boxes mark the ground truth or correctly detected targets. From left to right: ground truth, feature maps of the res4 block, feature maps of the res5 block, and detection results.
Figure 12. Visualization of detection results for three typical sparse ship target scenes on SSDD, from left to right: Cascade R-CNN, CSS-YOLO, our model, and the ground truth. Green boxes mark correctly detected targets, and red boxes mark false alarms.
Figure 13. Visualization of detection results for three typical ship target scenes on HRSID, from left to right: Cascade R-CNN, Enhanced YOLOv8, our model, and the ground truth. Green boxes mark correctly detected targets, and red boxes mark false alarms.
Table 1. Comparative experimental results on the SSDD dataset (%). The best results are shown in boldface and the second-best results are underlined.
Test Methods       AP     AP_50   AP_75   AP_S    AP_M    AP_L
Faster R-CNN       59.7   94.5    68.7    55.9    66.7    48.2
Cascade R-CNN      60.4   93.8    68.3    55.5    67.8    58.5
Mask R-CNN         59.6   94.1    67.3    56.0    65.6    50.3
RetinaNet          56.3   91.0    62.9    51.9    63.3    46.1
YOLOv5s            73.5   98.3    80.0    66.8    74.9    67.6
YOLOv8s            74.7   98.2    82.6    66.9    75.7    73.1
YOLOv9s            74.2   98.4    86.7    66.7    76.3    70.8
YOLOv11s           74.0   98.3    85.5    66.6    74.6    73.8
CSS-YOLO           73.0   98.6    87.2    65.9    73.6    65.5
Sparse R-CNN       70.8   95.9    86.6    69.1    76.8    66.7
Proposed           74.5   98.7    89.9    73.4    80.5    70.8
Table 2. Comparative experimental results on the HRSID dataset (%). The best results are shown in boldface and the second-best results are underlined.
Test Methods       AP     AP_50   AP_75   AP_S    AP_M    AP_L
Faster R-CNN       63.5   86.8    73.3    64.4    65.1    16.4
Cascade R-CNN      66.6   87.9    76.4    67.6    67.7    28.8
Mask R-CNN         65.0   88.0    75.2    66.1    66.1    17.3
LHSDNet            60.7   87.0    70.3    60.9    69.1    12.1
RetinaNet          60.0   84.8    67.2    60.9    60.9    26.8
YOLOv5n            61.7   86.3    71.6    61.3    69.1    8.3
YOLOv8n            62.7   87.7    73.0    62.1    71.3    11.5
YOLOv10n           58.8   83.7    66.8    59.2    61.7    7.4
YOLOv11n           61.7   86.3    67.9    60.9    69.9    9.0
SHIP-YOLO          61.3   86.0    71.5    62.5    70.2    7.7
Enhanced YOLOv8    63.4   88.4    72.7    62.9    72.4    15.4
CenterNet          56.8   85.7    64.1    57.7    35.2    14.4
FCOS               41.5   69.9    50.2    43.0    7.6     2.8
Sparse R-CNN       66.5   88.6    77.4    67.5    67.7    49.1
Proposed           68.7   90.5    79.7    69.9    68.8    55.2
Table 3. Ablation experiment results on SSDD (%). The best results are emphasized in boldface.
Test Methods            WDC   PEMA   AP     AP_50   AP_75   AP_S    AP_M    AP_L
Sparse R-CNN            -     -      70.8   95.9    86.6    69.1    76.8    66.7
Sparse R-CNN + WDC      ✓     -      73.9   97.4    89.8    72.1    80.3    69.9
Sparse R-CNN + PEMA     -     ✓      72.2   96.6    87.8    70.4    78.6    67.6
Proposed                ✓     ✓      74.5   98.7    89.9    73.4    80.5    70.8
Table 4. Ablation experiment results on HRSID (%). The best results are emphasized in boldface.
Test Methods            WDC   PEMA   AP     AP_50   AP_75   AP_S    AP_M    AP_L
Sparse R-CNN            -     -      66.5   88.6    77.4    67.5    67.7    49.1
Sparse R-CNN + WDC      ✓     -      67.8   89.7    78.9    68.9    68.2    51.9
Sparse R-CNN + PEMA     -     ✓      67.4   89.3    78.5    68.6    68.0    52.3
Proposed                ✓     ✓      68.7   90.5    79.7    69.9    68.8    55.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
