1. Introduction
Synthetic aperture radar (SAR) is an active microwave imaging sensor that offers significant advantages over optical sensors. Firstly, SAR operates independently of weather conditions and ambient illumination, enabling continuous all-day and all-weather observations of the Earth. Additionally, SAR possesses the capability to detect concealed targets and gather critical information in complex environments. Due to these benefits, SAR has become an indispensable information acquisition platform in remote sensing [1,2,3,4]. Its applications span various domains, including topographic mapping, agricultural monitoring, disaster detection, and ocean monitoring [5,6,7]. Notably, the utilization of SAR imagery for ship detection has been applied across various fields, including marine surveillance, fishery management, disaster rescue, and others [8].
Traditional SAR ship detection methods mainly encompass constant false alarm rate (CFAR)-based methods [9], visual saliency-based methods [10], global threshold-based methods [11], polarimetry decomposition-based methods [12], and wavelet transform-based methods [13]. These traditional methods heavily rely on manually crafted features and limited shallow learning representations, rendering them highly vulnerable to variations in background statistics. In complex scenarios, such as coastal areas characterized by small islands or rocky outcrops, traditional SAR ship detection algorithms struggle to identify effective features necessary for achieving optimal results and exhibit restricted generalization capabilities.
With the rapid development of deep learning technology [14], researchers have increasingly integrated this technology into the study of ship detection tasks in SAR images. Initially, convolutional neural networks (CNNs) were employed to enhance the detection performance of traditional SAR image ship detection methods, such as CFAR [15,16]. Subsequently, as more SAR ship detection training datasets became available, researchers shifted their focus towards end-to-end SAR image ship detection methods that are entirely based on CNNs. These methods demonstrated high levels of efficiency and accuracy in detection. Ship detection methods based on CNNs can be categorized into two types: two-stage detection networks, exemplified by Region-CNN (R-CNN) [17], and one-stage detection networks, represented by You Only Look Once (YOLO) and its subsequent iterations [18,19,20,21,22]. The two-stage detection network employs selective search techniques to generate candidate regions for further analysis. In contrast, the one-stage detection network eliminates the region proposal generation phase, thereby simplifying the object detection problem into a regression problem and achieving faster detection than two-stage detectors.
Due to the development of CNN-based object detection methods, numerous advanced SAR ship detection methods have been proposed in recent years. Cui et al. [
23] introduced a novel CenterNet that incorporates a spatial shuffle-group enhanced attention module, enabling the extraction of more refined semantic features while effectively suppressing noise to mitigate false alarms caused by coastal and inland interferences. Shan et al. [
24] proposed a multi-layer deep dense network that includes an improved DenseNet block along with a SimAM attention mechanism, which adeptly adapts to the specific speckle noise distribution present in SAR images and dynamically enhances spatial features, thereby allowing the entire network to exhibit robust generalization capabilities and high accuracy. Hu et al. [
25] presented an anchor-free framework utilizing a balanced attention network aimed at enhancing ship detection across various scales. This framework incorporates both local and non-local attention modules to achieve an equilibrium between local and non-local feature representations throughout the entire network’s operation. With ongoing optimizations and widespread applications of YOLO-series algorithms, an increasing number of researchers are employing YOLO algorithms to optimize ship detection methods. Sun et al. [
26] proposed a novel YOLO-based SAR ship detector using bi-directional feature fusion and angular classification (BiFA-YOLO). The novel bi-directional feature fusion module (Bi-DFFM) can efficiently aggregate multi-scale features through bi-directional (top-down and bottom-up) information interaction, which is helpful for detecting multi-scale ships. Zhao et al. [
27] integrated a fused attention C2fSE module along with a DenseASPP module into the baseline YOLOv8n architecture to bolster both feature extraction and fusion capabilities. Luo et al. [
28] introduced SHIP-YOLO, a lightweight SAR ship detection model that is built upon the YOLOv8n framework. The model incorporates GhostConv, a lightweight convolutional layer designed to replace conventional convolutions, and integrates the re-parameterized RepGhost bottleneck structure within the C2f module. This innovative approach effectively reduces both the parameter count and computational complexity of the model. Additionally, they implemented WIoU and attention mechanisms to enhance detection accuracy. Dai et al. [
29] proposed LHSDNet, a lightweight yet high-accuracy SAR ship image detection network. This architecture employs GhostHGNetV2 as its feature extraction backbone and includes a lightweight feature fusion module to minimize overall computational load. Furthermore, parameter sharing is utilized in the feature extraction component, while the design of the detection head prioritizes lightness to conserve computing resources further.
SAR images often contain a significant number of small targets, which can lead to a considerable decline in detection accuracy. Nevertheless, the existing methods for identifying small targets exhibit limited detection capabilities. The challenges associated with detecting small ships in maritime contexts have garnered considerable attention from researchers. Some scholars have validated the detectability of small boats within SAR images through studies focused on specific types of vessels. Lanz et al. [
30,
31] prepared test vessels, including thin inflated rubber ships and simulated fully occupied refugee vessels, in an effort to identify the most effective detectors for various satellite platforms. The study presented in [
32] highlights the advantages of utilizing very-high-resolution TerraSAR-X data acquired in staring spotlight mode for the detection of small boats. Shin et al. [
33] proposed a method to improve the detection capability of small ships through the integration of multiple polarization images and the application of an adaptive threshold technique. In the domain of Convolutional Neural Networks (CNNs), researchers have conducted numerous explorations. Wang et al. [
34] redesigned the feature extraction network and proposed a path augmentation fusion network aimed at integrating spatial and semantic information across different scales through both bottom-up and top-down approaches. Yang et al. [
35] introduced an anchor-free enhanced FCOS method that employs a multi-level feature attention mechanism along with a feature refinement and reuse module to extract effective features while refining those related to small ships, thereby facilitating the effective detection of small ships near shorelines. Zhou et al. [
36] integrated the transformer self-attention mechanism into their backbone network, utilizing channel attention and spatial attention to enhance feature space integration. Hu et al. [
37] proposed a transformer-based dynamic sparse attention module and a small target-friendly detection head to improve the focus and extraction of small ship features.
Research on CNN-based ship detection has made progress in both accuracy and efficiency, and researchers have begun to focus on difficulties in ship detection such as small-sized ship detection, but there still exist several intractable challenges that require solutions. First, ships exhibit rich diversity in real SAR images, and variations in scale continue to pose challenges for detection. In particular, small ships occupy a limited number of pixels within SAR images, making them susceptible to being overlooked, which can lead to missed detections. Second, due to the intrinsic imaging mechanism of SAR, speckle noise is an unavoidable characteristic in SAR images. In complex backgrounds, small targets are more susceptible to being submerged by sea clutter and speckle noise, causing a decline in detection accuracy. Third, there is an inherent contradiction between detection accuracy and detection speed. To effectively identify small targets, it is necessary to increase the complexity of the model, which poses challenges in meeting the requirements for real-time ship detection.
Based on the analysis presented above, this paper proposes an enhanced YOLO network with the Shuffle Re-parameterization (SR) module, space-to-depth (SPD) module, Hybrid Attention (HA) module, and shape-NWD loss. The network replaces the C2f module with the SR module, which facilitates the propagation of target information throughout the feature hierarchy in both the backbone and neck networks. The SPD module performs the down-sampling operations so that more of the available information is retained. The HA module enhances ship-related features overall, and the shape-NWD loss is advantageous for detecting small-sized ships. The main contributions of this article can be summarized as follows:
- We propose a Shuffle Re-parameterization block (SRB) that integrates channel shuffle with a re-parameterized convolution block (RepConv). This approach enhances the capabilities of feature extraction during the training phase, while concurrently reducing memory consumption and inference time in the inference phase. To further improve the detection capabilities of the model, we substitute the original C2f module with the Shuffle Re-parameterization module (SR) constructed using the proposed RepConv and SRB. 
- We employ a Space-to-Depth module (SPD) for down-sampling operations to mitigate information loss associated with small targets, thereby enhancing the detection accuracy of small-sized ships. 
- An attention module called Hybrid Attention module (HA) is proposed, which leverages both frequency-domain and spatial-domain information. This module enhances the focus on ship-related features while effectively minimizing the influence of irrelevant background interference. 
- We add the shape-NWD loss to enhance detection accuracy further, since it exhibits insensitivity to targets of varying scales, and is particularly well suited for measuring the similarity between small targets. 
The subsequent sections of this paper are structured as follows. Section 2 elaborates on the proposed methodology; Section 3 presents the experimental outcomes; Section 4 discusses the results; and Section 5 concludes the paper.
  2. Methodology
  2.1. The Overall Network Architecture
YOLOv8 is a well-established and widely adopted one-stage anchor-free detection model that aims to balance accuracy and speed. It has demonstrated strong performance across various object detection tasks, including SAR ship detection. Compared with the anchor-based detection model YOLOv5, the anchor-free design of YOLOv8 is better suited to the multi-scale characteristics of ship targets. Moreover, although YOLOv11 demonstrates superior performance in general object detection tasks, YOLOv8 exhibited greater flexibility in complex scenes and achieved better results in ship detection within SAR images. Given the computational complexity and efficiency requirements associated with SAR ship detection, we selected YOLOv8n as the baseline model for constructing our network. The overall architecture of the network is illustrated in Figure 1; it consists of three components: a backbone network for feature extraction, a neck for multi-scale feature fusion, and three decoupled prediction heads dedicated to classification and regression. The detailed structure of the retained YOLOv8 modules, specifically the SPPF module and the detection heads, is also illustrated in Figure 1. The modules that deviate from the baseline network are highlighted in red and are discussed in greater detail in the following sections.
First, we introduce a re-parameterized convolution block (RepConv) and use it to build the Shuffle Re-parameterization module (SR). The SR module replaces the C2f module in YOLOv8, aiming to enhance the feature representation capabilities of the detection network and extract texture features more effectively. Second, we incorporate the Space-to-Depth module (SPD) to replace the down-sampling convolution layer in the backbone network, intending to reduce the information loss associated with small targets and preserve more valuable features. The output of the backbone network is subsequently fed to the neck network, which comprises a top-down Feature Pyramid Network (FPN) path and a bottom-up Path Aggregation Network (PAN) path, to extract contextual information. We propose a Hybrid Attention (HA) module, positioned between the FPN and the PAN. The HA module is designed to enhance ship-related features and suppress surrounding interference by applying frequency-domain and spatial-domain attention mechanisms. After the neck network, the decoupled head employs an anchor-free structure for category and position prediction, which aims to mitigate the prediction inaccuracies arising from the multi-scale characteristics of ships. Finally, we incorporate the shape-NWD loss into the bounding box regression loss. The shape-NWD loss integrates the concept of shape-IoU with the Normalized Gaussian Wasserstein Distance (NWD) to enhance detection performance for small targets. The proposed method is applied to small ship detection in SAR imagery and improves detection accuracy.
The rest of Section 2 is organized as follows. Section 2.2 introduces the RepConv block, the SRB, and the SR module that is built from them. The SPD module is presented in Section 2.3. In Section 2.4, the HA module is described in detail. Section 2.5 introduces the shape-NWD loss.
  2.2. The Shuffle Re-Parameterization Module
In Convolutional Neural Networks, increasing the depth of the network can improve its feature representation capability. However, as the depth increases, there is a corresponding risk of losing feature information related to the target. Unlike in optical images, the distinctive features of ship targets in SAR images are often less pronounced, which complicates effective feature extraction. To enhance the feature representation capabilities and extract ship-related features more effectively, we propose the re-parameterized convolution block (RepConv), which is subsequently used to construct the SRB and the SR module.
It is well established that integrating branches of varying scales and complexities can enrich the feature space, thereby enhancing the representational capacity of individual convolutions. In this way, the capability for feature representation can be improved without necessitating an increase in the depth of the network. We adhere to the six equivalent transformations proposed in DBB [
38] to construct a novel re-parameterized convolution block (RepConv), as illustrated in 
Figure 2. The RepConv block comprises three branches: the 1 × 1 convolution branch, the sequential 1 × 1 − 3 × 3 convolution branch, and the 1 × 1 convolution-average pooling branch. In the sequential 1 × 1 − 3 × 3 convolution branch, we set the internal channels equal to the input. A Batch Normalization (BN) layer is applied after each convolutional or average pooling layer to introduce nonlinearity during the training phase. The RepConv block employs various transformations, including BN, average pooling, sequences of convolutions, and branch addition. This multi-branch topology architecture exhibits diverse receptive fields and varying complexities, facilitating the acquisition of rich feature information during the training phase. In the inference stage, the RepConv block is converted into a 3 × 3 convolution through structural re-parameterization to enable rapid inference.
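The collapse of a multi-branch block into a single 3 × 3 convolution rests on the linearity of convolution. The following is a minimal single-channel sketch of two of the transformations involved (folding BN into a bias-free convolution, and merging a parallel 1 × 1 branch into a 3 × 3 kernel by zero-padding). It is not the RepConv implementation described here, which also contains a sequential 1 × 1 − 3 × 3 branch and an average pooling branch; the helper names and all numeric values are illustrative assumptions.

```python
import math

def conv2d(image, kernel, bias=0.0):
    """Single-channel cross-correlation with zero padding (odd square kernel)."""
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out

def batch_norm(feature, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN with fixed statistics (a per-channel affine map)."""
    scale = gamma / math.sqrt(var + eps)
    return [[(v - mean) * scale + beta for v in row] for row in feature]

def fuse_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free conv; returns (kernel, bias)."""
    scale = gamma / math.sqrt(var + eps)
    return [[v * scale for v in row] for row in kernel], beta - mean * scale

def pad_1x1_to_3x3(kernel_1x1):
    """Place the 1 x 1 weight at the centre of an otherwise zero 3 x 3 kernel."""
    w = kernel_1x1[0][0]
    return [[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]]

def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Illustrative input, kernels, and BN statistics (all hypothetical values).
image = [[1.0, 2.0, 0.0, -1.0], [0.5, 3.0, 1.0, 2.0], [4.0, -2.0, 0.0, 1.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = [[0.7]]
bn3 = dict(gamma=1.2, beta=0.1, mean=0.3, var=0.8)
bn1 = dict(gamma=0.9, beta=-0.2, mean=0.1, var=1.5)

# Multi-branch view: each branch is conv followed by BN, then the branches are summed.
train_out = add(batch_norm(conv2d(image, k3), **bn3),
                batch_norm(conv2d(image, k1), **bn1))

# Re-parameterized view: one 3 x 3 kernel and one bias reproduce the same output.
f3, b3 = fuse_bn(k3, **bn3)
f1, b1 = fuse_bn(k1, **bn1)
merged_kernel = add(f3, pad_1x1_to_3x3(f1))
infer_out = conv2d(image, merged_kernel, bias=b3 + b1)

max_diff = max(abs(a - b)
               for ra, rb in zip(train_out, infer_out)
               for a, b in zip(ra, rb))
```

The two outputs agree up to floating-point rounding, which is why the extra branches add no cost once the block has been converted for inference.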
Inspired by ShuffleNet [
39], we construct a Shuffle Re-parameterization block (SRB) that integrates the RepConv block and the channel shuffle operation, as shown in 
Figure 3a. The input tensor is partitioned into two distinct channel-wise tensors with equal dimensions following the channel split operator. After conducting multi-branch training on one tensor, it is subsequently concatenated with the other tensor in a channel-wise manner. The channel shuffle mechanism operates on stacked group convolutions to enhance information fusion and facilitate informative feature representation.
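The split, concatenate, and shuffle steps can be illustrated on channel labels alone. This is a minimal sketch assuming the ShuffleNet-style shuffle (view the channels as groups, transpose, flatten) with two groups; which half passes through the RepConv branch is an assumption made here for illustration, and the prime mark merely stands in for that processing.

```python
def channel_split(channels):
    """Split a list of channels into two halves of equal size."""
    if len(channels) % 2 != 0:
        raise ValueError("channel count must be even")
    half = len(channels) // 2
    return channels[:half], channels[half:]

def channel_shuffle(channels, groups):
    """View channels as (groups, n), transpose to (n, groups), then flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]

channels = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
kept, processed = channel_split(channels)
processed = [name + "'" for name in processed]  # stand-in for the RepConv branch
shuffled = channel_shuffle(kept + processed, groups=2)
# shuffled == ["c0", "c4'", "c1", "c5'", "c2", "c6'", "c3", "c7'"]
```

After the shuffle, untouched and processed channels alternate, so the next split no longer separates them cleanly and information from both halves mixes in subsequent blocks.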
The Shuffle Re-parameterization module (SR) is constructed by stacking RepConv and Shuffle Re-parameterization block (SRB), as illustrated in 
Figure 3b. Drawing inspiration from the One-Shot Aggregation (OSA) [
40] module, we develop the SR module by aggregating features with varying receptive fields only once at the final feature maps. The SR module enhances the network’s sensitivity to targets of different scales through channel shuffling and feature stacking, thereby improving the model’s detection capability for multi-scale targets. Due to the incorporation of the RepConv block and the channel shuffle operation, the SR module can extract abundant target features. We substituted the C2f module in the YOLOv8 network with the SR module. This modification enhances the feature extraction capabilities of the detection network, thereby improving its modeling proficiency for detection targets.
  2.3. The Space-to-Depth Module
In SAR images, most ship targets appear as small entities with limited pixel size and restricted information content. Consequently, ship targets are prone to experiencing feature loss in the backbone network. Specifically, the down-sampling operation by convolution and pooling contributes significantly to information degradation and insufficient feature learning, which complicates the detection of small targets. To address this issue, we employ the SPD module [
41] instead of the down-sampling convolution layer. The SPD module is composed of a 3 × 3 convolution layer, an SPD layer, and a 1 × 1 convolution layer, as shown in 
Figure 4.
The SPD layer partitions the input feature map into four distinct feature blocks, reducing both the length and width of the original input feature map by half while simultaneously increasing the number of channels by a factor of four. It plays a crucial role in reducing each spatial dimension of the input feature map while simultaneously enhancing the channel dimension, all without compromising the information contained within each channel. This is achieved by mapping features from the input feature map directly to channels. Consequently, as the spatial dimensions decrease, the size of the channel dimension increases. This transition layer effectively augments the depth of the feature map while preserving information, thereby mitigating issues related to information loss that are often encountered with traditional down-sampling layers. Following the SPD layer, a 1 × 1 convolution is applied as a standard convolution operation, which does not reduce the spatial dimensions of the feature map. Thus, it facilitates further processing of features through learnable parameters while retaining more fine-grained details. This combination endows the SPD module with enhanced performance when addressing low-resolution images and small targets. By minimizing information loss associated with down-sampling, the SPD module significantly enhances detection accuracy for small targets.
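The rearrangement performed by the SPD layer can be sketched on nested lists. The example below assumes a down-sampling factor of 2 and one particular ordering of the four sub-blocks along the channel axis (the ordering is an assumption; any fixed ordering preserves the same information).

```python
def space_to_depth(feature, scale=2):
    """Rearrange a C x H x W nested list into (C*scale^2) x (H/scale) x (W/scale)."""
    c = len(feature)
    h, w = len(feature[0]), len(feature[0][0])
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by scale")
    out = []
    for dy in range(scale):        # row offset inside each scale x scale cell
        for dx in range(scale):    # column offset inside each cell
            for ch in range(c):
                out.append([row[dx::scale] for row in feature[ch][dy::scale]])
    return out

feature = [[[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]]     # shape 1 x 4 x 4

blocks = space_to_depth(feature)   # shape 4 x 2 x 2
# blocks[0] == [[1, 3], [9, 11]]; blocks[3] == [[6, 8], [14, 16]]
```

Every input value appears exactly once in the output, in contrast to a strided convolution or pooling layer, which discards or aggregates values before any learnable layer sees them.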
  2.4. The Hybrid Attention Module
Deep learning-based object detection algorithms rely on the neck network for effective feature fusion. In YOLOv8, the PAN-FPN architecture generates feature maps that contain rich contextual and semantic information through bi-directional path fusion and horizontal connections. However, the long propagation path of PAN-FPN may result in the loss of ship texture details, which is detrimental to the detection of small ships. Furthermore, the widespread presence of clutter and noise increases the probability that small ship targets become entangled with surrounding interference or are even submerged by it, which makes accurate feature representation more difficult. Consequently, we propose a Hybrid Attention (HA) module positioned between the top-down and bottom-up pathways. This design aims to enhance the effective features of ship targets while mitigating the information loss associated with long propagation paths.
Attention mechanisms have been widely used in computer vision tasks and have demonstrated remarkable effectiveness in producing discriminative feature representations. In addition to channel and spatial attention mechanisms, frequency-domain information has been shown to help the model capture context more fully. The Multi-axis External Weights module (MEW) [42] is a 2D-DFT-based attention mechanism that extracts and fuses features to obtain more comprehensive global and local information. 
Figure 5a indicates that the feature map is divided into four branches in accordance with the channel dimension. Regarding the first three branches, features are converted to the frequency domain by means of 2D-DFT along three axes (Height–Width, Channel–Width, and Channel–Height). Subsequently, the corresponding learnable weights are employed to multiply the frequency domain maps, and the map is transformed back to the spatial domain through the application of 2D-IDFT. For the fourth branch, DW convolution is used to acquire local information, and the feature map is concatenated along the channel dimension to restore the same size as the input. Eventually, the residual connection of the input is adopted to obtain the output. Given a feature map 
 as input, the MEW mechanism can be expressed by the equations below.
        where ⊙ is the element-wise product. Split and Concat represent the split and concatenation operation along the channel dimension.
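The four-branch computation described for MEW can be written compactly as follows. This is a sketch derived only from the verbal description above; the symbols $X$, $x_i$, $\hat{x}_i$, $W_i$, and $\mathcal{F}$ are assumed names and may differ from the notation of the original equations.

```latex
\begin{aligned}
x_1, x_2, x_3, x_4 &= \mathrm{Split}(X), \qquad X \in \mathbb{R}^{C \times H \times W},\\
\hat{x}_1 &= \mathcal{F}^{-1}_{HW}\big(W_1 \odot \mathcal{F}_{HW}(x_1)\big),\\
\hat{x}_2 &= \mathcal{F}^{-1}_{CW}\big(W_2 \odot \mathcal{F}_{CW}(x_2)\big),\\
\hat{x}_3 &= \mathcal{F}^{-1}_{CH}\big(W_3 \odot \mathcal{F}_{CH}(x_3)\big),\\
\hat{x}_4 &= \mathrm{DWConv}(x_4),\\
Y &= \mathrm{Concat}(\hat{x}_1, \hat{x}_2, \hat{x}_3, \hat{x}_4) + X .
\end{aligned}
```

Here $\mathcal{F}_{HW}$, $\mathcal{F}_{CW}$, and $\mathcal{F}_{CH}$ denote the 2D-DFT taken over the Height–Width, Channel–Width, and Channel–Height axes, $\mathcal{F}^{-1}$ the corresponding 2D-IDFT, and $W_1$ to $W_3$ the learnable frequency-domain weights.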
The Spatial Attention module (SA) [
43] is illustrated in 
Figure 5b. SA is designed to emphasize or suppress different features within the feature map, enabling the network to focus on significant regions of the image. The spatial feature information is extracted by global maximum pooling and global average pooling, while convolution operations are employed to generate spatial attention weights. Given an input feature map 
, the SA can be expressed by the equation below.
        where Concat represents the concatenation operation. 
 and 
 are 1D global maximum pooling operation and 1D global average pooling operation, respectively. 
 includes a 3 × 3 convolution operation and a SiLU activation function. 
 denotes the 2D conventional convolution with a kernel size of 1.
Combining SA and MEW, we introduce the hybrid attention (HA) module. The HA module employs MEW to transform feature maps into the frequency domain, thereby facilitating the acquisition of global information. The incorporation of frequency domain information addresses the limitation of CNNs in acquiring global information. Concurrently, SA is utilized to emphasize target-related features, thereby enhancing feature representation capabilities. The detailed architecture of the HA module is illustrated in 
Figure 5c. HA sequentially infers a frequency attention map and a spatial attention map, subsequently performing element-wise multiplication of the output features obtained from MEW and SA. The process can be expressed as follows:
        where ⊙ denotes element-wise multiplication. Resize represents resizing the size of the attention map to 
.
  2.5. The Loss Function
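The loss functions used in the original YOLOv8 model include classification loss and regression loss. We concentrate on the regression loss, which is composed of distribution focal loss (DFL) and Complete-IoU (CIoU) loss. The CIoU loss adds an aspect ratio criterion to the IoU so that the predicted box fits the target box better in terms of three geometric factors: overlapping area, center point distance, and aspect ratio. The definition is as follows:
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$
        where $\rho(b, b^{gt})$ is the distance between the center points of the predicted and ground-truth bounding boxes, $c$ is the diagonal length of the minimum rectangular box containing the two bounding boxes, $v$ measures the consistency of the aspect ratios of the two boxes, and $\alpha$ is a balance factor that increases with IoU.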
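As a concrete illustration, the standard CIoU loss for a single pair of axis-aligned boxes can be computed as in the sketch below. The box format `(x1, y1, x2, y2)`, the function name, and the small `eps` guard are assumptions made for this example; the two test cases at the end show why a fixed positional error costs a small box far more than a large one.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Distance term: squared centre distance over squared enclosing diagonal.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect ratio term and its balance factor.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# The same 2-pixel horizontal shift applied to a 4 x 4 box and a 40 x 40 box.
small_box_loss = ciou_loss((0, 0, 4, 4), (2, 0, 6, 4))
large_box_loss = ciou_loss((0, 0, 40, 40), (2, 0, 42, 40))
```

With these inputs the small box incurs a loss of roughly 0.74 while the large box incurs roughly 0.10, even though the positional error is identical in pixels.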
Zhang et al. [
44] pointed out that both the shape and the scale of the bounding box regression sample influence the regression results. Notably, small-scale bounding boxes are more sensitive to variations in IoU values, and deviations along the short edge of a bounding box cause more pronounced changes in IoU values. Shape-IoU computes the loss by concentrating on the shape and scale of the bounding box itself, thus making the bounding box regression more precise. Given the large number of small-sized ship targets in SAR images, we use the shape-NWD loss instead of the CIoU loss. The shape-NWD integrates shape-IoU into NWD to avoid the sensitivity of IoU to the position deviation of small targets and to improve detection performance. The shape-NWD loss is defined as follows:
        where 
C is the constant associated with the dataset; scale is the scale factor, which is related to the scale of the target in the dataset; 
ww and 
hh are the weight coefficients in the horizontal and vertical directions, respectively, whose values are related to the shape of the GT box.
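For orientation, the plain Normalized Gaussian Wasserstein Distance on which the shape-NWD builds can be sketched as below. This is the unweighted NWD only: it does not include the shape weights ww and hh or the scale factor of the shape-NWD loss, and the value of the constant `c` used here is an arbitrary illustrative choice rather than the dataset-specific constant C.

```python
import math

def nwd(box_a, box_b, c=10.0):
    """Plain NWD between two boxes given as (cx, cy, w, h).

    Each box is modelled as a 2D Gaussian with mean (cx, cy) and standard
    deviations (w / 2, h / 2); the squared 2-Wasserstein distance between two
    such Gaussians reduces to the squared Euclidean distance below.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    w2_squared = (ax - bx) ** 2 + (ay - by) ** 2 \
               + ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2
    return math.exp(-math.sqrt(w2_squared) / c)

# Two 4 x 4 boxes whose centres are 5 pixels apart do not overlap (IoU = 0),
# yet NWD still decreases smoothly with the distance between them.
near = nwd((10, 10, 4, 4), (15, 10, 4, 4))
far = nwd((10, 10, 4, 4), (30, 10, 4, 4))
loss_near = 1 - near
```

Because the similarity stays informative after the boxes stop overlapping, a loss of the form `1 - nwd(...)` continues to provide a useful signal for small targets where IoU is already zero.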
  3. Experiments
  3.1. Datasets and Experimental Settings
For object detection utilizing deep learning techniques, we adhere to the definition of object size established in the COCO dataset [
45], which is based on the proportion of pixels that the object occupies in the image rather than on its physical size. In this study, “small ship” therefore refers to a ship that occupies only a small number of pixels in the image, regardless of the real-world dimensions (i.e., length and width) of the actual ship. For an image size of 800 × 800 pixels, targets whose rectangular bounding box area is less than 2342 pixels are classified as small objects; this threshold corresponds to approximately 0.37% of the total 640,000 pixels.
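The arithmetic behind the small-object threshold can be checked directly. The sketch below uses only the numbers stated in the text; the helper `is_small` is a hypothetical name introduced for illustration.

```python
IMAGE_WIDTH, IMAGE_HEIGHT = 800, 800
SMALL_AREA_THRESHOLD = 2342  # bounding box area in pixels, as stated in the text

total_pixels = IMAGE_WIDTH * IMAGE_HEIGHT
threshold_percent = 100 * SMALL_AREA_THRESHOLD / total_pixels

# Side length of a square box whose area equals the threshold.
equivalent_side = SMALL_AREA_THRESHOLD ** 0.5

def is_small(box_width, box_height):
    """Return True if the bounding box area falls below the small-object threshold."""
    return box_width * box_height < SMALL_AREA_THRESHOLD
```

A square box at the threshold is therefore roughly 48 pixels on a side, so a 40 × 50 box counts as small while a 50 × 50 box does not.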
The three publicly available remote sensing datasets used in this study are the LS-SSDD [
46] dataset, the HRSID [
47] dataset, and the iVision-MRSSD [
48] dataset. The detailed information of the three datasets is shown in 
Table 1. HRSID and iVision-MRSSD provide only a single best resolution, whereas the resolution of LS-SSDD is given as R. × A., where R. denotes the range direction and A. denotes the azimuth direction.
The LS-SSDD dataset is a small ship detection dataset with large-scale backgrounds acquired by Sentinel-1. It contains 15 large-scale SAR images of 24,000 × 16,000 pixels, which are divided into 9000 sub-images of 800 × 800 pixels; these 9000 images are further split into a training set and a test set in a 2:1 ratio. Because it covers large scenes, the dataset supports the practical transfer of ship detection to large-scene space-borne SAR images in engineering applications.
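The LS-SSDD figures are mutually consistent, as the short check below shows. It assumes non-overlapping tiling, which is what the stated counts imply; the function name is hypothetical.

```python
def tile_count(width, height, tile):
    """Number of non-overlapping tile x tile sub-images in a width x height image."""
    if width % tile or height % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return (width // tile) * (height // tile)

tiles_per_image = tile_count(24000, 16000, 800)  # 30 columns x 20 rows
total_tiles = 15 * tiles_per_image               # 15 large-scale images

# A 2:1 split between training and test sets.
train_tiles = total_tiles * 2 // 3
test_tiles = total_tiles - train_tiles
```

This yields 600 sub-images per scene, 9000 in total, and a 6000/3000 split between training and test sets.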
HRSID is a high-resolution SAR image dataset sourced from the Sentinel-1B, TerraSAR-X, and TanDEM-X satellites. A total of 136 panoramic SAR images are cropped into 800 × 800 pixel sub-images with an overlap ratio of 25%. The resulting 5604 images are divided into a training set (65% of the images) and a test set (35% of the images), with annotations in the MS COCO format.
The iVision-MRSSD dataset comprises 11,590 image tiles of size 512 × 512 pixels containing 27,885 ship examples. The dataset is produced by employing images from six distinct satellite sensors that cover a broad spectrum of the electromagnetic range, including C, L, and X band radar imaging frequencies. All of these sensors possess varying resolutions and imaging patterns. The dataset is randomly divided into training, validation, and test sets at a ratio of 70:20:10. The diverse circumstances allow the dataset to bring about a comprehensive understanding of the ship detection task in SAR satellite images.
In the ablation studies, SGD served as the optimizer with an initial learning rate of 0.01. LS-SSDD was trained for 160 epochs, while the other two datasets were trained for 120 epochs. A batch size of 16 and an input size of 800 × 800 were used. For the comparative experiments, the training duration was set to 60 epochs, with a learning rate of 0.01, a weight decay of 0.0001, and a momentum of 0.9. The batch size and input size remained consistent with the ablation studies. All experiments were executed on a machine equipped with an RTX 3090 Ti (24 GB) GPU and an Intel Core i7-12400 CPU, running Windows 10, Python 3.11, and CUDA 11.7. The deep learning models were constructed using PyTorch 1.13. Additionally, comparisons were made using the MMDetection-3.1.0 framework [
49] and the Ultralytics framework.
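The stated training settings can be collected in one place, for example as plain dictionaries. This is only an organizational sketch: the key names follow the argument naming used by the Ultralytics framework (`lr0`, `imgsz`, `batch`, and so on), which is an assumption about how the settings would be passed, and values that the text does not state (such as the optimizer of the comparative experiments) are deliberately left out.

```python
# Ablation settings as stated in the text.
ABLATION = {
    "optimizer": "SGD",
    "lr0": 0.01,      # initial learning rate
    "batch": 16,
    "imgsz": 800,
    "epochs": {"LS-SSDD": 160, "HRSID": 120, "iVision-MRSSD": 120},
}

# Comparative-experiment settings as stated in the text.
COMPARISON = {
    "lr0": 0.01,
    "weight_decay": 0.0001,
    "momentum": 0.9,
    "batch": 16,
    "imgsz": 800,
    "epochs": 60,
}

def train_kwargs(dataset, comparative=False):
    """Return a flat dictionary of settings for one training run."""
    if comparative:
        return dict(COMPARISON)
    settings = dict(ABLATION)
    settings["epochs"] = ABLATION["epochs"][dataset]
    return settings

ls_ssdd_run = train_kwargs("LS-SSDD")
hrsid_comparison_run = train_kwargs("HRSID", comparative=True)
```

Keeping the per-dataset epoch counts in a single mapping makes the difference between the ablation and comparative schedules explicit.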
  3.2. Evaluation Metrics
To quantitatively evaluate the accuracy of the proposed method, we use precision (P), recall (R), average precision (AP), AP50, and the average precision for different target sizes: APs (small objects), APm (medium-sized objects), and APl (large objects).
Precision measures the ratio of correctly detected ship samples to all predicted ship samples, whereas recall quantifies the ratio of correctly detected ship samples to all annotated ship samples. The formulas for both are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
        where TP (true positive) is the number of ships that are correctly detected, FP (false positive) is the number of detections that do not correspond to an annotated ship, and FN (false negative) is the number of annotated ships that are missed.
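A minimal sketch of the two metrics follows; the counts in the example are hypothetical and serve only to show the calculation.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical example: 80 ships detected correctly, 20 false alarms, 10 ships missed.
precision, recall = precision_recall(tp=80, fp=20, fn=10)
```

In this example the precision is 0.8 (80 of 100 detections are real ships) and the recall is about 0.889 (80 of 90 annotated ships are found).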
Average precision (AP) is the area under the precision–recall curve, that is, the area enclosed by the curve and the coordinate axes. The formula for AP is as follows:
$$AP = \int_{0}^{1} P(R)\, dR$$
        where $P$ represents precision and $R$ represents recall. AP50 corresponds to the average precision at an IoU threshold of 0.5; AP denotes the average precision averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05.
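The area under the precision–recall curve can be approximated from a ranked list of detections. The sketch below uses all-point interpolation, which is one common convention and not necessarily the exact procedure of the evaluation toolkits used in the experiments; the detection list is hypothetical.

```python
def average_precision(detections, num_gt):
    """Area under the precision-recall curve (all-point interpolation).

    detections: list of (confidence, is_true_positive) at one IoU threshold.
    num_gt: number of annotated ships.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap

# Hypothetical detections for 4 annotated ships at a single IoU threshold.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
ap_single_threshold = average_precision(detections, num_gt=4)

# IoU thresholds over which AP is averaged: 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

For AP50 the true-positive flags are assigned at an IoU threshold of 0.5; for AP the same computation is repeated with flags assigned at each of the ten thresholds and the results are averaged.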
  3.3. Ablation Experiments
To assess the performance of each module within the proposed method, we incrementally incorporated these modules into the baseline network. The baseline network utilized in this study is YOLOv8n, which adopts the same training strategy as that of the proposed method. We conducted ablation experiments on the LS-SSDD, HRSID, and iVision-MRSSD datasets to evaluate the effectiveness of the SR module, SPD module, HA module, and the enhanced loss function. For specific details, please refer to 
Table 2, 
Table 3 and 
Table 4.
Based on the results presented in 
Table 2, it is evident that on the LS-SSDD dataset, the proposed method achieves a 4.1% improvement in AP50 and a 1.7% improvement in AP compared to the baseline network. Incorporating any one of the proposed modules individually increases AP50 by 1.3% to 1.9% and AP by nearly 1% compared to the baseline network. This demonstrates that each module contributes to the performance of the proposed detection network. Figure 6 presents the precision–recall curves obtained through the incremental integration of the proposed modules. This representation more intuitively highlights the characteristics of each module and demonstrates the overall effectiveness of the proposed method.
Figure 7 shows the detection results of the baseline and the proposed method in several small-target scenarios from LS-SSDD. The first scenario is a noisy far-sea scene, where the proposed method suppresses false alarms generated by the baseline network. In the second scenario, which involves nearshore small ships, the baseline network misses more targets, whereas the proposed method significantly reduces these missed detections. In the complex scenario of the third row, the proposed method not only suppresses interference from rocks and structures, thereby reducing false alarms, but also avoids missed detections. These results indicate that the proposed method detects small ship targets more reliably, with fewer detection errors and higher overall accuracy.
Table 3 and Table 4 show the results of the ablation experiments conducted on the HRSID dataset and the iVision-MRSSD dataset. Given our emphasis on improving the detection rate of small targets, we have incorporated the additional evaluation metrics APs, APm, and APl, which measure the average precision for small, medium, and large targets, respectively, building upon the indicators established in Table 3. On the HRSID dataset, the proposed method improves both AP50 and AP by 2.4% and APs by 1.6%. On the iVision-MRSSD dataset, it yields a 1.7% increase in AP and a 2.2% improvement in APs. These experimental results highlight the effectiveness of the optimizations introduced in this study for enhancing SAR ship detection models.
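To make the size-specific metrics concrete, the following sketch assigns a ground-truth box to a size category by its pixel area. The thresholds of 32² and 96² pixels are the COCO convention and are an assumption here, since this section does not state the exact boundaries used; the function and constant names are likewise hypothetical.

```python
# Assumed COCO-style area thresholds (in pixels); not stated in the text.
SMALL_MAX_AREA = 32 ** 2
MEDIUM_MAX_AREA = 96 ** 2


def size_bucket(box):
    """Return 'small', 'medium', or 'large' for a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def bucket_shares(boxes):
    """Fraction of boxes that fall into each size category."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for box in boxes:
        counts[size_bucket(box)] += 1
    total = len(boxes)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

APs, APm, and APl are then the average precision values computed only over the targets in the corresponding category, and `bucket_shares` gives the kind of per-dataset size distribution that determines how much each category influences the overall AP.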
To demonstrate the generality of the proposed method, detection results from a variety of SAR satellite images are presented in Figure 8. The first row depicts small ship targets against a noisy background. The baseline network produces both missed and false detections, whereas the proposed method mitigates these issues. As shown in the second row, the proposed method also detects ships that the baseline network misses in scenes featuring small targets amid interference from man-made structures near the shoreline. The third and fourth rows illustrate medium-sized target scenarios within similar nearshore environments, while the final row shows large target scenes. Here, the proposed method corrects the false alarms and inaccurate bounding boxes produced by the baseline network under interference. In general, scenarios involving small targets exhibit a higher likelihood of missed detections; conversely, for multi-scale ship targets in nearshore contexts, false alarms tend to be more prevalent because of interference from coastal infrastructure. Overall, the proposed method outperforms the baseline network across these diverse scenarios.
  3.4. Comparison Experiments
To further evaluate the effectiveness of the proposed method, this section compares it with several other commonly used target detection methods on the LS-SSDD, HRSID, and iVision-MRSSD datasets. We employed eight general object detection algorithms, namely Faster R-CNN [17], CenterNet [50], FCOS [51], ATSS [52], YOLOv5n, YOLOv8n, YOLOv10n [22], and YOLOv11n. Additionally, we included two algorithms specifically designed for ship detection in SAR images, SHIP-YOLO [28] and LHSDNet [29], in our comparative experiments. A quantitative analysis was conducted on the three datasets used in this study. The detailed detection results are presented in Table 5.
Notably, the average precision values for the latest YOLOv10n and YOLOv11n in SAR image ship detection were found to be lower than those of our baseline network, YOLOv8n. This discrepancy may stem from the influence of data characteristics and application scenarios on model performance. While both YOLOv10 and YOLOv11 demonstrate superior performance in general object detection tasks, it appears that YOLOv8 exhibits greater flexibility in complex scenes and achieves better results specifically for SAR image ship detection. Furthermore, it is crucial to highlight that the two networks tailored for SAR image ship detection demonstrate good performance on only one or two datasets but fail to achieve satisfactory results across all three datasets simultaneously. In contrast, the method proposed in this paper exhibits outstanding performance across all three datasets.
Small targets account for over 90% of the HRSID dataset, while the LS-SSDD dataset consists almost entirely of small targets. Only the YOLO series algorithms perform effectively on these two datasets, whereas the other algorithms exhibit low AP values. This observation further underscores the advantages of YOLO series algorithms in detecting small targets. The proposed method builds on YOLOv8n, achieving a 2% improvement on the LS-SSDD dataset and a 1% improvement on the HRSID dataset. The APl values on HRSID are low and irregular because fewer than 1% of the targets in the dataset are large. On the comprehensive iVision-MRSSD dataset, both Faster R-CNN and CenterNet demonstrate strong performance. However, for small target detection, our proposed method achieves an additional 2% increase in AP value. Whether evaluated by average precision at an IoU threshold of 0.5 (AP50) or by the small-target metric APs, our method consistently attains the highest values. This outcome underscores the effectiveness and feasibility of our approach.
Our comparative experiments encompass eight general-purpose target detection networks and two SAR ship detection networks. For qualitative analysis, we selected various scenes from the LS-SSDD dataset, including maritime environments, nearshore settings, and complex background scenarios. The results are illustrated in Figure 9. The first column presents sparse small ship targets located in the far sea. Faster R-CNN, the YOLO series methods, and the SAR image ship detection algorithms all perform well in this context. The second column depicts a nearshore scene featuring small targets. Conventional target detection algorithms exhibit significant missed detections. Although CenterNet successfully identifies most small targets, it also generates numerous false alarms. In contrast, LHSDNet and the proposed method show superior performance. In the complex scene of the third column, where multiple ships coexist with islands, reefs, offshore structures, and other disturbances, the methods display varying degrees of missed detections and false alarms. Notably, our proposed method outperforms the others by detecting a greater number of targets, which underscores its effectiveness in identifying sparse small targets within intricate environments. However, in the fourth column, where background noise is further intensified, all methods experience substantial degradation in performance. Even the specialized algorithms for SAR image ship detection struggle to identify ship targets under these conditions; nevertheless, our proposed method continues to outperform the alternative approaches by detecting over half of the visible ship targets in the figure. This series of detection outcomes validates the efficacy of our improved method and highlights its robust capability for detecting small ship targets across diverse scenarios.
  4. Discussion
Extensive experimental results on three public datasets have demonstrated the superior performance of the proposed method for small-sized ship detection compared to other object detection techniques. The LS-SSDD dataset is specifically designed for detecting small-sized ships within large scenes, where small targets account for over 99.8% of the total. The proposed method achieves the highest average precision at an IoU threshold of 0.5 (AP50) and the highest overall average precision (AP), surpassing the baseline YOLOv8n by 0.021 and 0.018, respectively. For the HRSID and iVision-MRSSD datasets, in addition to AP50 and AP, we focus on the APs metric, which characterizes the average precision of small targets. Compared to the baseline YOLOv8n, our proposed method improves APs by 0.022 on the iVision-MRSSD dataset. Given the diversity present in the iVision-MRSSD dataset, we can conclude that our proposed method is effective in improving detection performance for small-sized ships.
The enhancement of the proposed method primarily stems from improvements made to the backbone, neck, and loss function, tailored to the characteristics of ships within the datasets. The backbone incorporates both SR and SPD modules, which effectively mitigate the issue of feature disappearance for small vessels while ensuring comprehensive feature extraction. An attention module (HA) has been integrated into the neck, enabling effective suppression of background noise and facilitating feature fusion across different levels. Furthermore, the loss function is designed to account for the sensitivity of small ships to variations in Intersection over Union (IoU), as well as incorporating shape and scale factors related to bounding boxes. This approach significantly enhances detection accuracy. Experimental results indicate that enhancements for small ships are considerably more pronounced compared to those for larger vessels; given that small ships constitute a substantial proportion of the datasets, these strategies collectively contribute to an overall improvement in performance.
In addition to detection accuracy, the computational efficiency of the detection network constitutes a critical factor. Table 6 presents a comparison of the computational efficiency and inference time between the baseline network YOLOv8n and the proposed method. To enhance the detection accuracy for small ship targets, we have increased both the model complexity and the resource requirements for training the network. Nevertheless, owing to the implementation of the re-parameterization module, the time consumption during the inference stage remains within a manageable range.
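The reason re-parameterization can keep inference cost low is that parallel linear branches used during training can be folded into a single convolution afterwards. The internals of the SR module are not described in this section, so the sketch below only illustrates the general principle on an assumed toy case: a single-channel 3 × 3 branch and a 1 × 1 branch, without bias or normalization, merged into one 3 × 3 kernel. All function names are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding.

    image: list of rows; kernel: odd-sized square list of rows.
    The output has the same height and width as the input.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def two_branch_output(image, k3, k1):
    """Training-time form: sum of a 3x3 branch and a 1x1 branch."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, k1)
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_branches(k3, k1):
    """Inference-time form: add the 1x1 weight to the centre of the 3x3 kernel."""
    merged = [row[:] for row in k3]  # copy so the original kernel is untouched
    merged[1][1] += k1[0][0]
    return merged
```

Because convolution is linear, `conv2d_same(image, merge_branches(k3, k1))` gives the same result as `two_branch_output(image, k3, k1)` up to floating-point rounding, while requiring only one convolution at inference time.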
When running inference on real SAR images, the first step is to split the large image into sub-images that match the size of the training data. Ship detection is then performed on these sub-images, after which the sub-images are reassembled to reconstruct the complete SAR image as required. This approach is necessary because, in SAR image ship detection, ship size is defined by the proportion of pixels occupied by each target. To ensure optimal performance of the detection network, test images entering the network must have pixel sizes consistent with those of the training images. As an illustrative example, we consider inference on the 11th image from the LS-SSDD dataset. The original image, measuring 24,000 × 16,000 pixels, was divided into 600 sub-images of 800 × 800 pixels each. The inference time for each sub-image was 5.4 ms, while the total processing time for the whole image was 5 s, including data loading, image segmentation, and subsequent splicing operations.
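The tiling arithmetic in this example can be checked with a short sketch. It assumes non-overlapping tiles and an image whose sides are exact multiples of the tile size, as is the case for 24,000 × 16,000 pixels and 800 × 800 tiles; the function names are hypothetical and the detector itself is not modelled.

```python
def tile_origins(width, height, tile=800):
    """Top-left corners (x0, y0) of non-overlapping tiles, row by row.

    Assumes width and height are multiples of `tile`; otherwise the last
    tile in each row or column would extend past the image border.
    """
    return [(x, y) for y in range(0, height, tile) for x in range(0, width, tile)]


def to_full_image(box, origin):
    """Shift a box (x1, y1, x2, y2) from tile coordinates to full-image coordinates."""
    x0, y0 = origin
    x1, y1, x2, y2 = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0)


origins = tile_origins(24000, 16000, tile=800)
num_tiles = len(origins)  # 30 columns x 20 rows = 600 tiles

# Network time alone at 5.4 ms per tile (about 3.24 s); the reported 5 s
# also covers data loading, segmentation, and splicing.
network_time_s = num_tiles * 5.4e-3
```

Detections from each sub-image are mapped back with `to_full_image` before the full-scene result is assembled. The gap between roughly 3.24 s of network time and the reported 5 s is consistent with the stated overhead for loading, segmentation, and splicing.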
For future work, we have the following considerations. Firstly, the datasets utilized in this study consist solely of horizontal bounding boxes; thus, there is potential to enhance the detection head for application in ship detection datasets derived from SAR images that utilize rotated bounding boxes. Secondly, given the real-time requirements associated with ship detection in SAR imagery, it is imperative to take into account both network complexity and computational resource consumption. In our future endeavors, we will explore lightweight modifications to the network architecture while ensuring that detection accuracy is preserved. Finally, we aim to investigate the integration of the proposed module into other network structures to enhance detection accuracy across a broader range of target detection tasks.