Article

Small Ship Detection Based on Improved Neural Network Algorithm and SAR Images

1 State Key Laboratory of Spatial Datum, Xi’an 710054, China
2 Xi’an Research Institute of Surveying and Mapping, Xi’an 710054, China
3 College of Earth Sciences and Resources, China University of Geosciences, Beijing 100083, China
4 College of Architecture and Civil Engineering, Beijing University of Technology, Beijing 100124, China
5 Institute of Geospatial Information, Information Engineering University, Zhengzhou 450001, China
6 School of Information Mechanics and Sensing Engineering, Xidian University, Xi’an 710071, China
7 School of Geographic and Environmental Sciences, Tianjin Normal University, Tianjin 300387, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(15), 2586; https://doi.org/10.3390/rs17152586
Submission received: 23 April 2025 / Revised: 10 June 2025 / Accepted: 13 June 2025 / Published: 24 July 2025

Abstract

Synthetic aperture radar (SAR) images can be used for ship target detection. However, ship outlines are often unclear in SAR images, and noise and land background factors increase the difficulty and reduce the accuracy of ship detection, especially for small target ships. Therefore, this paper improves the backbone network and feature fusion network of the YOLOv5s model to increase the accuracy of ship target detection. First, the LSKModule is used to improve the backbone network of YOLOv5s; by adaptively aggregating the features extracted by large-size convolution kernels, it fully captures contextual information while enhancing key features and suppressing noise interference. Secondly, multiple Depthwise Separable Convolution layers are added to the SPPF (Spatial Pyramid Pooling-Fast) structure; although a small number of parameters and computations are introduced, features of different receptive fields can be extracted. Third, the feature fusion network of YOLOv5s is improved based on BIFPN, and shallow feature maps are used to optimize small target detection performance. Finally, a CoordConv module is added before the detection head of YOLOv5s, and two coordinate channels are added during the convolution operation to further improve the accuracy of target detection. The mAP50 of this method reaches 97.6% on the SSDD dataset and 91.7% on the HRSID dataset, and the method is compared with a variety of advanced target detection models. The results show that its detection accuracy is higher than that of other similar target detection algorithms.

1. Introduction

Maritime activities and the development of marine resources have become more frequent, related marine technologies and research have advanced accordingly, and the demand for ship monitoring and identification has become increasingly important. Accurate monitoring of ship targets supports the planning of maritime transportation, port construction, and marine resource development. Ship target detection is also important for locating and tracking ships in distress in marine monitoring and maritime rescue. Therefore, ship target detection can help monitor ship information at sea and improve maritime safety and management efficiency, and it is of great significance in the economic and military fields.
Synthetic aperture radar (SAR) images are widely used for all-weather ship detection because they are independent of lighting conditions and largely unaffected by weather. Current methods for ship target detection in SAR images fall into traditional methods and deep learning methods. Traditional methods, such as those based on the constant false alarm rate (CFAR) [1], usually rely on manually designed features, which makes it difficult to fully capture the high-level semantic features of the target [2]. Deep learning performs well in analyzing and processing large volumes of data and has been widely applied to target detection. Deep learning-based ship detection methods are divided into two-stage and one-stage methods according to whether candidate regions are generated. Two-stage methods generally generate region proposals first and then classify and localize them with a convolutional neural network; examples include R-CNN [3], SPPNet [4], Fast R-CNN [5], Faster R-CNN [6], and Mask R-CNN [7]. Liu et al. [8] proposed RR-CNN based on R-CNN, which accurately extracts the features of rotated regions and realizes rotated-bounding-box localization of ship targets. Lin et al. [9] added a squeeze-and-excitation (SE) module [10] to the Faster R-CNN structure to further improve ship detection performance. Zhang et al. [11] proposed a C-SE Mask R-CNN model for ship instance segmentation in SAR images, designing a contextual squeeze-and-excitation (C-SE) module embedded in Mask R-CNN to capture contextual information. To improve detection speed, one-stage methods such as YOLO [12] and SSD [13] eliminate the candidate region generation step. The YOLO series has now developed to YOLOv12 [14]. YOLO casts target detection as a regression problem: it divides the input image into S × S grids, and if the center of a target falls within a grid, that grid is responsible for predicting the target. Li et al. [15] proposed a lightweight ship detection model, LSDM, based on YOLOv3, using DenseNet as the backbone network and shallow features to improve detection accuracy, and replacing the conventional convolution in FPN with spatially separable convolution. Tang et al. [16] proposed the N-YOLO algorithm, which includes a noise level classifier (NLC) for classifying SAR images according to noise level, a SAR target area extraction module (STPAE) for extracting potential target areas, and a detection module based on YOLOv5; this method effectively reduces the impact of noise but causes a partial loss of ship edge information. Yu et al. [17] added a small target detection layer to the YOLOv8 structure and used WIoU to adjust the bounding box loss function. Zhao et al. [18] proposed the ST-YOLOA model for SAR ship detection: Coordinate Attention was embedded in the Swin Transformer backbone to improve feature extraction, a Convolutional Block Attention Module was introduced during feature fusion, and the decoupled detection head of YOLOX was used to produce the final output. Compared with traditional methods, deep learning-based methods can handle the current massive volume of remote sensing data and extract high-level semantic information from images, achieving fast and high-precision target detection.
However, SAR images contain a large amount of speckle noise, ship outlines in SAR images are relatively blurred, and the detection of near-shore ships is also affected by the land background. As a result, existing deep learning target detection algorithms produce false detections and missed detections of ship targets (especially small targets) in SAR images, which reduces detection performance.
YOLOv5, a target detection model with good stability and robustness, has been widely used in various visual tasks. Within the YOLOv5 series, YOLOv5s achieves a good balance between accuracy and efficiency. Therefore, we propose an improved SAR small ship detection model based on YOLOv5s to extract small ship information. First, given that the LSKModule in LSKNet [19] can adaptively aggregate the information of large-size convolution kernels in the spatial dimension and applies the idea of the spatial attention mechanism to enhance the features of key areas and suppress the interference of noise areas, we combine the C3 module and the LSKModule to construct the LSKC3 module and add it to the backbone network. Second, considering that the SPPF structure of YOLOv5s uses only MaxPool to extract features of different receptive fields, which causes a certain loss of information, Depthwise Separable Convolutions [20] with different kernel sizes are added to the SPPF structure to further extract features of different receptive fields while adding only a small number of parameters and computations. Then, the feature fusion network of the model is improved on the basis of BIFPN [24]. Considering that shallow feature maps have a small receptive field but rich location information, we add shallow feature maps during feature fusion to optimize small target detection. Before feature fusion, the ParNet Block is used to further capture the local features of these feature maps and enhance their key channel information, and the connection scheme is redesigned to fully fuse feature maps of different scales. Finally, the detection head is improved by using CoordConv [22], which performs the convolution operation on additional coordinate channels, to further improve model performance. The effectiveness and accuracy of the improved method are evaluated and verified on the public SSDD and HRSID datasets.
The main objectives of this paper can be summarized as follows:
(1) Aiming at the problem of small ship target detection in SAR images, a SAR image ship detection method based on YOLOv5s is proposed. To capture a wide range of contextual information and improve the feature extraction ability of the model, the LSKC3 module is built based on the LSKModule and embedded into the backbone network.
(2) Depthwise Separable Convolutions with different kernel sizes are added to the SPPF structure to further extract features of different receptive fields with only a small increase in parameters and computational complexity.
(3) In addition, an Improved BIFPN is constructed to fully fuse feature maps of different scales, simultaneously improving the model’s detection ability for small targets.
(4) Since the traditional convolution operation does not consider the position information of pixels, a CoordConv layer is added before the detection head to further improve the detection performance of the model by introducing coordinate information.
(5) Ablation experiments are carried out on the SSDD dataset to verify the effectiveness of each structure, and then the proposed method is compared with other target detection algorithms on the SSDD dataset and the HRSID dataset. The experimental results show that the proposed method has superior performance.
This paper is organized as follows: Section 2 describes the research method, mainly introducing the improved YOLOv5s SAR small ship detection model in detail. Section 3 verifies and analyzes the experimental results. The effectiveness and advancement of the method are demonstrated through ablation experiments and comparative experiments. Section 4 is the conclusions and future prospects.

2. Methodology

The proposed method is shown in Figure 1 and consists of three parts: a backbone, a neck, and a head. First, to capture a wide range of contextual information, the LSKC3 module is constructed using the LSKModule and added to the backbone for feature extraction. Secondly, multiple parallel Depthwise Separable Convolution layers are added to the SPPF structure to further extract features of different receptive fields while adding only a small number of parameters and computations. Then, the feature maps extracted by the backbone network are passed through the Improved BIFPN to fully fuse feature maps of different scales, and a shallow feature map is used to improve the model's ability to detect small ships. Finally, a CoordConv layer is added to the detection head to further improve model performance by introducing coordinate information.

2.1. LSKC3

Large-size convolution kernels cover a larger receptive field, thereby capturing more global information and facilitating target detection. The LSKModule is the core module of LSKNet [19]; it decomposes a large-size convolution kernel into a sequence of large-size depthwise convolutions with different dilation rates, generating feature maps with different large receptive fields. For the same receptive field, this sequential decomposition greatly reduces the number of parameters compared with a single standard large convolution kernel. The idea of the spatial attention mechanism is then used to perform a weighted fusion of these feature maps with different large-receptive-field information, and the result is finally merged spatially with the input feature map. The structure of the LSKModule is shown in Figure 2.
The calculation process of LSKModule [19] is as follows.
First, the input feature $X$ passes through different depthwise convolution kernels to obtain feature maps with different receptive fields. These feature maps are each passed through a 1 × 1 convolutional layer to compress the number of channels, and the resulting feature maps are concatenated along the channel dimension to obtain the feature map $\tilde{U}$:
$\tilde{U} = [\tilde{U}_1; \cdots; \tilde{U}_i]$
Then, average pooling $P_{avg}(\cdot)$ and maximum pooling $P_{max}(\cdot)$ are applied to $\tilde{U}$ along the channel dimension to extract the spatial relationships of the feature map, yielding the pooled features $SA_{avg}$ and $SA_{max}$:
$SA_{avg} = P_{avg}(\tilde{U}), \quad SA_{max} = P_{max}(\tilde{U})$
The pooled features are concatenated along the channel dimension and converted into $N$ spatial attention maps ($N$ is the number of depthwise convolution kernels) using a convolutional layer $\mathcal{F}^{2 \rightarrow N}(\cdot)$:
$\widehat{SA} = \mathcal{F}^{2 \rightarrow N}([SA_{avg}; SA_{max}])$
For each spatial attention map $\widehat{SA}_i$, the sigmoid activation function is used to obtain the spatial selection mask of each feature map:
$\widetilde{SA}_i = \sigma(\widehat{SA}_i)$
Then, the different feature maps are spatially weighted to enhance the features of the key areas, summed element by element, and passed through the convolution layer $\mathcal{F}(\cdot)$ to obtain the feature map $S$:
$S = \mathcal{F}\left(\sum_{i=1}^{N} \widetilde{SA}_i \cdot \tilde{U}_i\right)$
The final output of the LSKModule is the element-wise product of the input feature $X$ and $S$:
$Y = X \cdot S$
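For illustration, the following minimal PyTorch sketch implements the spatial-selection process in the equations above, using the two decomposed depthwise kernels (5 × 5 with dilation 1 and 7 × 7 with dilation 3) adopted later for LSKC3. The channel widths and layer names are assumptions for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSKSelection(nn.Module):
    """Minimal sketch of the LSK spatial-selection step (assumed layer names/widths)."""
    def __init__(self, dim):
        super().__init__()
        # Decomposed large-kernel depthwise convolutions
        self.dw5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw7 = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # 1x1 convs compress channels before concatenation (U_i -> U~_i)
        self.squeeze5 = nn.Conv2d(dim, dim // 2, 1)
        self.squeeze7 = nn.Conv2d(dim, dim // 2, 1)
        # F^{2->N}: maps the [avg; max] pooled maps to N = 2 spatial attention maps
        self.attn = nn.Conv2d(2, 2, 7, padding=3)
        # F(.): fuses the weighted branches back to the original channel count
        self.fuse = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):
        a1 = self.dw5(x)                           # 5x5 depthwise branch
        a2 = self.dw7(a1)                          # enlarged receptive field on top of a1
        u1, u2 = self.squeeze5(a1), self.squeeze7(a2)
        u = torch.cat([u1, u2], dim=1)             # U~
        sa = torch.cat([u.mean(1, keepdim=True), u.amax(1, keepdim=True)], dim=1)
        masks = torch.sigmoid(self.attn(sa))       # spatial selection masks SA~_i
        s = self.fuse(u1 * masks[:, 0:1] + u2 * masks[:, 1:2])  # weighted sum, then F(.)
        return x * s                               # Y = X . S
```

A quick shape check: `LSKSelection(64)(torch.randn(1, 64, 32, 32))` returns a tensor with the same shape as its input.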
Through the above process, the LSKModule can adaptively aggregate the feature information extracted by large-size convolution kernels in the spatial dimension and enhance key features using the idea of the spatial attention mechanism. Considering that only 3 × 3 convolution kernels are used in the backbone network of YOLOv5s to extract local features, the model cannot fully capture large-scale contextual information. Therefore, to improve the feature extraction ability of the model while enhancing its attention to key areas and suppressing noise interference, we build the LSKC3 module and add it to the backbone network. LSKC3 is based on C3; the LSKModule is added after the CBS module on one branch to capture large-scale contextual information, and a residual connection is used to reduce the loss of information during transmission. The LSKC3 module is shown in Figure 3. Two depthwise convolution kernels with sizes of 5 × 5 and 7 × 7 are used in the LSKModule, with dilation rates of 1 and 3, respectively. The calculation process of the LSKC3 module is as follows.
First, the input feature $X$ passes through two CBS modules to obtain two feature maps $\hat{X}$ and $\tilde{X}$:
$\hat{X} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(X))), \quad \tilde{X} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(X)))$
Then, in the upper branch, the feature $\tilde{X}$ passes through the LSKModule to adaptively aggregate the feature information extracted by the large-size convolution kernels and is then added to the original feature to obtain the feature $M$:
$M = \tilde{X} + \mathrm{Conv2d}(\mathrm{LSKModule}(\mathrm{GeLU}(\mathrm{Conv2d}(\tilde{X}))))$
In the lower branch, the feature $\hat{X}$ passes through $n$ bottleneck modules with shortcut structures to obtain the feature map $W_n$:
$\hat{Y}_1 = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(\hat{X}))), \quad W_1 = \hat{Y}_1 + \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(\hat{Y}_1)))$
$\hat{Y}_2 = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(W_1))), \quad W_2 = \hat{Y}_2 + \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(\hat{Y}_2)))$
$\hat{Y}_n = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(W_{n-1}))), \quad W_n = \hat{Y}_n + \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}(\hat{Y}_n)))$
Finally, the feature map $M$ and the feature map $W_n$ are concatenated along the channel dimension and passed through a CBS module to obtain the final feature map $Y$:
$Y = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv2d}([W_n; M])))$
In this way, the model can fully capture the wide range of contextual information of the input feature map, enhance the model’s attention to key features, and effectively suppress the influence of noise.
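Under the same caveat, the sketch below shows one way the LSKC3 wiring in the equations above could be assembled; it reuses the LSKSelection class from the previous sketch, and the hidden channel width (half the output channels) and layer names are assumptions.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1, s=1):
    """Conv2d + BatchNorm + SiLU (the CBS block of YOLOv5)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Bottleneck(nn.Module):
    """YOLOv5-style bottleneck with a shortcut (the equations for W_1 ... W_n)."""
    def __init__(self, c):
        super().__init__()
        self.cv1, self.cv2 = cbs(c, c, 1), cbs(c, c, 3)
    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class LSKC3(nn.Module):
    """Sketch of the LSKC3 wiring; LSKSelection is the class from the previous sketch."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c = c_out // 2
        self.cv1, self.cv2 = cbs(c_in, c), cbs(c_in, c)        # produce X^ and X~
        self.bottlenecks = nn.Sequential(*[Bottleneck(c) for _ in range(n)])
        self.pre = nn.Conv2d(c, c, 1)
        self.lsk = LSKSelection(c)                             # defined in the previous sketch
        self.post = nn.Conv2d(c, c, 1)
        self.act = nn.GELU()
        self.cv3 = cbs(2 * c, c_out)

    def forward(self, x):
        x_tilde, x_hat = self.cv2(x), self.cv1(x)
        m = x_tilde + self.post(self.lsk(self.act(self.pre(x_tilde))))  # equation for M
        w = self.bottlenecks(x_hat)                                     # W_n
        return self.cv3(torch.cat([w, m], dim=1))                       # final CBS
```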

2.2. Improved SPPF

The multi-receptive field design can extract contextual information of different spatial ranges at the same scale, which can enhance the model’s ability to distinguish between targets and similar backgrounds. The SPPF module in YOLOv5s uses multiple maximum pooling (MaxPool2d) layers in series to achieve multi-receptive field feature extraction. However, using only MaxPool operations will cause certain information loss. Therefore, multiple parallel Depthwise Separable Convolution [20] layers are added to fully extract the features of different receptive fields. Depthwise Separable Convolution decomposes a complete convolution operation into Depthwise Convolution and Pointwise Convolution. The number of convolution kernels in Depthwise Convolution is the same as the number of channels in the input feature map, and one convolution kernel only acts on one channel of the feature map. Depthwise Convolution performs convolution operations on each channel of the input layer independently, without using the feature information of different channels at the same spatial position. Pointwise Convolution uses a 1 × 1 conventional convolution kernel to process the output of Depthwise Convolution to use the feature information of different channels at the same spatial position and adjust the number of channels in the feature map. Compared with traditional convolution, this decomposition method greatly reduces the parameters and computational complexity of convolution operations. Figure 4 shows the difference between Standard Convolution and Depthwise Separable Convolution.
A feature map with an input of H × W × M is convolved to obtain a feature map of H × W × N, where H and W are the height and width of the feature map, M is the number of channels of the input feature map, and N is the number of channels of the output feature map.
The standard convolution operation requires $N$ convolution kernels of size $k \times k \times M$, where $k$ is the kernel size. The computational cost of standard convolution is as follows:
$H \times W \times M \times N \times k \times k$
Depthwise Separable Convolution first applies Depthwise Convolution to the input feature map, which requires $M$ convolution kernels of size $k \times k \times 1$. Each convolution kernel is responsible for only one channel of the input feature map. The computational cost of Depthwise Convolution is as follows:
$H \times W \times M \times k \times k$
Then, the outputs of Depthwise Convolution are integrated through Pointwise Convolution, which requires $N$ convolution kernels of size $1 \times 1 \times M$. The computational cost of Pointwise Convolution is as follows:
$H \times W \times M \times N$
The total computational cost of Depthwise Separable Convolution is as follows:
$H \times W \times M \times k \times k + H \times W \times M \times N$
Compared with standard convolution, the computational cost of Depthwise Separable Convolution is reduced by the following ratio:
$\dfrac{H \times W \times M \times k \times k + H \times W \times M \times N}{H \times W \times M \times N \times k \times k} = \dfrac{1}{N} + \dfrac{1}{k^2}$
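As a concrete check of this ratio, the short PyTorch snippet below compares the parameter count of a standard convolution with its depthwise-separable factorization; the channel sizes are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

# A standard k x k convolution versus its depthwise-separable factorization.
M, N, k = 64, 128, 3  # illustrative channel counts and kernel size

standard = nn.Conv2d(M, N, k, padding=k // 2, bias=False)
separable = nn.Sequential(
    nn.Conv2d(M, M, k, padding=k // 2, groups=M, bias=False),  # depthwise: one kernel per channel
    nn.Conv2d(M, N, 1, bias=False),                            # pointwise: mixes channels
)

p_std = sum(p.numel() for p in standard.parameters())   # N*M*k*k = 73,728
p_sep = sum(p.numel() for p in separable.parameters())  # M*k*k + M*N = 8,768
print(p_sep / p_std)                  # = 1/N + 1/k^2 ≈ 0.119

x = torch.randn(1, M, 32, 32)
assert standard(x).shape == separable(x).shape          # same output shape
```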
In this method, Depthwise Convolution uses convolution kernels of sizes 3 × 3, 5 × 5, 9 × 9, and 13 × 13, and Pointwise Convolution uses a 1 × 1 conventional convolution kernel to process the feature maps generated by Depthwise Convolution. Finally, these feature maps are concatenated with the feature maps from the convolution and MaxPool branches to obtain feature maps with different receptive field information, which further improves the model's ability to distinguish between targets and backgrounds. The Improved SPPF structure is shown in Figure 5.
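One possible arrangement of the Improved SPPF described above is sketched below: the serial 5 × 5 MaxPool branches of the original SPPF run in parallel with depthwise separable branches using 3 × 3, 5 × 5, 9 × 9, and 13 × 13 kernels, and all outputs are concatenated. The channel widths, the 1 × 1 fusion convolution, and the exact branch wiring are assumptions; Figure 5 shows the authors' actual structure.

```python
import torch
import torch.nn as nn

def ds_conv(c, k):
    """Depthwise separable convolution: depthwise k x k followed by pointwise 1 x 1."""
    return nn.Sequential(
        nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
    )

class ImprovedSPPF(nn.Module):
    """Sketch of the Improved SPPF; widths and the fusion conv are assumptions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c, 1, bias=False)
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)           # serial pooling, as in SPPF
        self.branches = nn.ModuleList(ds_conv(c, k) for k in (3, 5, 9, 13))
        self.cv2 = nn.Conv2d(c * 8, c_out, 1, bias=False)          # fuse all 8 branches

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        feats = [x, p1, p2, p3] + [b(x) for b in self.branches]    # pooling + DSConv branches
        return self.cv2(torch.cat(feats, dim=1))
```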

2.3. Improved BIFPN

After the input image is extracted by the backbone network, a series of feature maps of different scales will be obtained. The feature fusion network in the neck fuses these feature maps to obtain more contextual information, which is helpful for multi-scale object detection. Feature pyramid network (FPN) [21] proposes a top-down multi-scale feature fusion method to transfer deep semantic features from high-level feature maps to low-level feature maps. PAN [23] proposes a bottom-up feature fusion method based on FPN to transfer shallow features with rich location information from low-level feature maps to high-level feature maps. The neck part of YOLOv5s adopts the structure of FPN and PAN at the same time to fuse the P3, P4, and P5 feature maps extracted by the backbone network. The BiFPN [24] feature fusion network introduces cross-layer connections and adopts feature weighting to fuse feature maps of more levels.
The specific steps for improving BIFPN are as follows. First, before feature fusion, the feature maps extracted by the backbone network are further processed by the ParNet Block structure, which extracts local features while focusing on the important channel information of each feature map. The ParNet Block consists of a 1 × 1 convolution layer, a 3 × 3 convolution layer, and an SSE (skip-squeeze-excitation) layer designed based on the SE (squeeze-and-excitation) module [10]. Secondly, considering that shallow feature maps have a small receptive field but rich position information, the shallow feature map P2 is included in feature fusion to improve the detection performance of the model for small targets. In the Improved BIFPN structure, cross-layer connections are used to fuse shallow and deep features of the same scale, and two bottom-up connections are added to the top-down fusion path to fully aggregate multi-scale features. A fast normalized fusion method assigns a learnable weight parameter to each input feature map to achieve the weighted fusion of feature maps of different scales. The output of this process is used for the final prediction. The Improved BIFPN and ParNet Block structures are shown in Figure 6a and Figure 6b, respectively.
The Improved BIFPN uses P2, P3, P4, and P5 extracted by the backbone network as input features, with sizes of 1/4, 1/8, 1/16, and 1/32 of the input image, respectively.
First, to further extract local features and focus on the important channel information of the feature map, the four feature maps are respectively passed through the ParNet Block structure to obtain feature maps P2′, P3′, P4′, and P5′ for subsequent feature fusion.
Then, the four feature maps P2′, P3′, P4′, and P5′ are fused along a top-down path to obtain the four feature maps C2, C3, C4, and C5. Among these, C5 is obtained by processing P5′ through the Improved SPPF module and therefore contains information from multiple receptive fields. The three feature maps C2, C3, and C4 are obtained by concatenating feature maps of multiple levels along the channel dimension. The top-down feature fusion process is as follows:
$C_2 = \mathrm{Concat}\left(\dfrac{W_2^1}{W_2^1 + W_2^2 + \varepsilon} \times P_2',\ \dfrac{W_2^2}{W_2^1 + W_2^2 + \varepsilon} \times \mathrm{UpS}(C_3)\right)$
$C_3 = \mathrm{Concat}\left(\dfrac{W_3^1}{W_3^1 + W_3^2 + W_3^3 + \varepsilon} \times \mathrm{DownS}(P_2'),\ \dfrac{W_3^2}{W_3^1 + W_3^2 + W_3^3 + \varepsilon} \times P_3',\ \dfrac{W_3^3}{W_3^1 + W_3^2 + W_3^3 + \varepsilon} \times \mathrm{UpS}(C_4)\right)$
$C_4 = \mathrm{Concat}\left(\dfrac{W_4^1}{W_4^1 + W_4^2 + W_4^3 + \varepsilon} \times \mathrm{DownS}(P_3'),\ \dfrac{W_4^2}{W_4^1 + W_4^2 + W_4^3 + \varepsilon} \times P_4',\ \dfrac{W_4^3}{W_4^1 + W_4^2 + W_4^3 + \varepsilon} \times \mathrm{UpS}(C_5)\right)$
Finally, in the bottom-up path, the D2 feature map is obtained by passing C2 through the C3 module. The three feature maps D3, D4, and D5 are obtained by concatenating feature maps of multiple levels along the channel dimension. The bottom-up feature fusion process is as follows:
$D_3 = \mathrm{Concat}\left(\dfrac{W_3^4}{W_3^4 + W_3^5 + W_3^6 + \varepsilon} \times P_3',\ \dfrac{W_3^5}{W_3^4 + W_3^5 + W_3^6 + \varepsilon} \times C_3,\ \dfrac{W_3^6}{W_3^4 + W_3^5 + W_3^6 + \varepsilon} \times \mathrm{DownS}(D_2)\right)$
$D_4 = \mathrm{Concat}\left(\dfrac{W_4^4}{W_4^4 + W_4^5 + W_4^6 + \varepsilon} \times P_4',\ \dfrac{W_4^5}{W_4^4 + W_4^5 + W_4^6 + \varepsilon} \times C_4,\ \dfrac{W_4^6}{W_4^4 + W_4^5 + W_4^6 + \varepsilon} \times \mathrm{DownS}(D_3)\right)$
$D_5 = \mathrm{Concat}\left(\dfrac{W_5^1}{W_5^1 + W_5^2 + \varepsilon} \times C_5,\ \dfrac{W_5^2}{W_5^1 + W_5^2 + \varepsilon} \times \mathrm{DownS}(D_4)\right)$
where $\mathrm{Concat}$ denotes concatenation along the channel dimension; $\mathrm{UpS}$ and $\mathrm{DownS}$ denote the upsampling and downsampling operations, respectively, with downsampling implemented by a 3 × 3 convolution with stride 2; $W_i^j$ ($i = 2, 3, 4, 5$; $j = 1, \ldots, 6$) are the trainable weight parameters; and $\varepsilon$ is set to 0.0001 to avoid division by zero.
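The fast normalized fusion used in each of the equations above can be written compactly as a small module: each input map receives a learnable non-negative weight, the weights are normalized by their sum plus ε, and the weighted maps are concatenated along the channel dimension. The sketch below shows only this weighting step; upsampling, downsampling, and the choice of inputs follow the equations, and the layer name is illustrative.

```python
import torch
import torch.nn as nn

class WeightedConcat(nn.Module):
    """Fast normalized fusion: learnable non-negative weights, normalized, then concat."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, *feats):
        w = torch.relu(self.w)           # keep the weights non-negative
        w = w / (w.sum() + self.eps)     # W_i^j / (sum_j W_i^j + eps)
        return torch.cat([wi * f for wi, f in zip(w, feats)], dim=1)

# e.g. C3 = WeightedConcat(3)(DownS(P2'), P3', UpS(C4)), with matching spatial sizes
```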

2.4. Detection Head Based on CoordConv

CoordConv [22] adds additional coordinate channels to the input feature map along the channel dimension so that the convolution operation can access the coordinate information of the feature map, allowing the network to learn either complete translation invariance or varying degrees of translation dependence according to the needs of the task. The difference between a traditional convolution layer and a CoordConv [22] layer is shown in Figure 7.
Assume that an input feature map $X$ of size $H \times W \times C$ is convolved to obtain an output feature map $Y$ of size $H \times W \times C'$. Traditional convolution requires $C'$ convolution kernels $W$ of size $K \times K \times C$. The calculation process of traditional convolution is as follows:
$Y_{i,j,f} = \sum_{c=0}^{C-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} W_{m,n,c,f} \cdot X_{i+m-\lfloor K/2 \rfloor,\; j+n-\lfloor K/2 \rfloor,\; c}$
CoordConv first adds an x-coordinate channel and a y-coordinate channel to the input feature map to obtain the feature map $\hat{X}$, and then applies $C'$ convolution kernels $\hat{W}$ of size $K \times K \times (C + 2)$. The CoordConv calculation process is as follows:
$Y_{i,j,f} = \sum_{c=0}^{C+1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \hat{W}_{m,n,c,f} \cdot \hat{X}_{i+m-\lfloor K/2 \rfloor,\; j+n-\lfloor K/2 \rfloor,\; c}$
where $i \in [0, H-1]$ and $j \in [0, W-1]$ are the spatial indices of the output feature map, $f \in [0, C'-1]$ is the channel index of the output feature map, and $\lfloor \cdot \rfloor$ denotes rounding down.
The detection head in YOLOv5s uses only a single 1 × 1 convolution layer for category prediction and bounding box regression, and traditional convolution operations do not use pixel location information. To improve the performance of the model, a CoordConv [22] layer is embedded before the 1 × 1 convolution. By introducing coordinate information, the model's perception of spatial location is enhanced, further improving the accuracy of target detection.
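A minimal sketch of a CoordConv layer consistent with the description above is given below; the normalization of the coordinate channels to [-1, 1] and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """Concatenate x- and y-coordinate channels to the input before a regular convolution."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in + 2, c_out, k, padding=k // 2)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))   # X^ = [X; x-coords; y-coords]
```

For example, `CoordConv(c, c, k=1)` could be placed in front of the existing 1 × 1 head convolution; the channel count `c` would follow the model configuration.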

3. Experimental Results

This section introduces the experimental datasets and implementation details and demonstrates the effectiveness and robustness of the proposed method through extensive experiments.

3.1. Dataset and Implementation Details

(1) SSDD
The SSDD dataset (SAR ship detection dataset) [25] was constructed by downloading public SAR images from the internet, cropping the target areas into patches of about 500 × 500 pixels, and manually annotating them. The data mainly come from three sensors, RadarSat-2, TerraSAR-X, and Sentinel-1, and include four polarization modes: HH, HV, VV, and VH. The resolution is between 1 m and 15 m. There are 1160 images and 2456 ships in total, including ship targets in open sea and nearshore areas.
(2) HRSID
The HRSID dataset (high-resolution SAR images dataset) [26] is mainly derived from the Sentinel-1 and TerraSAR-X satellites and is mainly used for ship detection and instance segmentation tasks in high-resolution SAR images. The dataset contains 5604 images and 16,965 objects. Each image is 800 × 800 pixels, and the resolution ranges from 1 m to 5 m.
All experiments were implemented on a PC with an Intel Core(TM) i7-13700H CPU @ 2.40 GHz (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA), using the PyTorch framework. The ratio of the training set, validation set, and test set was 7:1:2, and the number of training epochs was set to 300.

3.2. Results and Discussion

In the experiments, precision, recall, and mean average precision (mAP) are used as evaluation indicators of the detection performance of the model.
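For reference, these indicators follow their standard definitions (not restated in the original text), where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives at a given IoU threshold:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN}$

$\mathrm{AP} = \int_0^1 P(R)\, \mathrm{d}R, \qquad \mathrm{mAP} = \dfrac{1}{K} \sum_{k=1}^{K} \mathrm{AP}_k$

mAP50 evaluates AP at an IoU threshold of 0.5, and mAP50:95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05; with a single ship class, mAP reduces to the AP of that class.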

3.2.1. Ablation Experiment Results and Discussion

To verify the effectiveness of the LSKC3 module, the Improved SPPF, the Improved BIFPN, and the CoordConv-based detection head, five ablation experiments were conducted on the SSDD dataset. In the first experiment, YOLOv5s was used as the baseline model for comparison in the subsequent experiments. In the second experiment, the LSKC3 module replaced the C3 module in the YOLOv5s backbone network to verify its effect on target detection in SAR images. In the third experiment, the original SPPF structure was replaced with the Improved SPPF structure on top of the second experiment to verify the effectiveness of the structure. In the fourth experiment, the Improved BIFPN replaced the original FPN + PAN structure of YOLOv5s on top of the third experiment to fuse feature maps of different scales and verify its impact on the detection of ship targets (especially small targets) in complex scenes. In the fifth experiment, the CoordConv layer was added before the detection head on top of the fourth experiment. The training set, validation set, test set, and hyperparameter settings were kept unchanged across all experiments. The target detection performance indicators of the five experiments are shown in Table 1. The results show that with each improvement added to YOLOv5s, the mAP of the model increases. The precision, recall, mAP50, and mAP50:95 of the final method are 96.4%, 94.4%, 97.6%, and 62.9%, respectively. Compared with the original YOLOv5s, precision increases by 2.8%, recall by 0.8%, mAP50 by 1.6%, and mAP50:95 by 3.5%.
A visualization of some ablation results is shown in Figure 8, which presents ship detection results in three scenes. The first row in the figure shows the ground truth labels, and the second to sixth rows show the detection results of the five experiments, respectively. Although YOLOv5s has good detection performance overall, the second row shows that its backbone network cannot fully extract a wide range of contextual information; coupled with the influence of coherent speckle noise, this causes the model to identify a single target as multiple targets, as shown in Figure 8(b1). Secondly, the features of shore facilities in SAR images are extremely similar to those of ships, and YOLOv5s cannot fully extract multi-receptive-field features, so its ability to distinguish targets from similar backgrounds is weak and false alarms occur under such interference, as shown in Figure 8(b2). In addition, YOLOv5s still misses small ships, as shown in Figure 8(b3). Replacing the C3 module of the backbone network with the LSKC3 module to fully extract a wide range of contextual information allows the model to learn the complete feature information of the target and suppress noise interference, effectively alleviating the problem of identifying a single target as multiple targets, as shown in Figure 8(c1); however, it also leads to more false alarms, as shown in Figure 8(c2), possibly because the large convolution kernels used in the LSKC3 module introduce a large amount of background information that is extremely similar to the target. After the Improved SPPF is used to further strengthen the multi-receptive-field feature extraction capability, the model's ability to distinguish between targets and similar backgrounds is enhanced, the false alarm phenomenon is significantly reduced, and the small target detection capability is also improved, as shown in Figure 8(d2,d3). When the Improved BIFPN is used for multi-scale feature fusion, the confidence scores show that the detection capability for small ship targets is further improved, but slight false alarms appear, as shown in Figure 8(e2), possibly because adding shallow feature maps to the fusion introduces some interfering information. Finally, CoordConv is added before the detection head; introducing coordinate position information during the convolution operation improves the model's spatial position perception and further mitigates the false alarm problem, as shown in Figure 8(f2).

3.2.2. Comparative Experimental Results and Discussion

To evaluate the effectiveness of the proposed method, comparative experiments were conducted on the SSDD and HRSID datasets, comparing the proposed method with several advanced target detection methods, including Faster R-CNN [6], SSD [13], CenterNet [27], EfficientDet [24], DEIM [28], YOLOv5s, YOLOv8s, YOLOv11s, and YOLOv12s [14,29]. All methods used the same training, validation, and test sets and the same number of training epochs. Table 2 shows the mAP results of the different methods on the SSDD and HRSID datasets. The detection accuracy of the YOLO series is significantly better than that of the other models, and our model achieves the highest mAP values on both SSDD and HRSID. Its mAP50 on the SSDD dataset is 97.6%, 1.0 percentage point higher than the second-best model (YOLOv11s), and its mAP50:95 is 62.9%, 2.6 percentage points higher than the second-best model (YOLOv11s). Its mAP50 on the HRSID dataset is 91.7%, 2.0 percentage points higher than the second-best model (YOLOv5s), and its mAP50:95 is 67.8%, also 2.0 percentage points higher than the second-best model (YOLOv5s).
However, it can be seen from the parameters and FLOPs that the complexity of the improved model in this paper is significantly increased compared with the original YOLOv5s. Nevertheless, compared with YOLOv8s, it can be found that the parameters and FLOPs of the two are not much different, but the accuracy of the improved model is higher than that of YOLOv8s. Its mAP50 for the SSDD dataset is 1.1 percentage points higher than that of YOLOv8s, and the mAP50:95 is 3.8 percentage points higher than that of YOLOv8s. The mAP50 for the HRSID dataset is 2.3 percentage points higher than that of YOLOv8s, and the mAP50:95 is 3.9 percentage points higher than that of YOLOv8s.
To intuitively demonstrate the effectiveness of the proposed method, Figure 9 shows the detection results of several methods in different scenarios from the SSDD dataset, covering ship targets of different scales in speckle noise scenes, pure ocean backgrounds, and near-shore scenes. The first row in the figure shows the ground truth labels, and the second to eleventh rows show the detection results of Faster R-CNN, SSD, CenterNet, EfficientDet, DEIM, YOLOv5s, YOLOv8s, YOLOv11s, YOLOv12s, and the proposed method. As can be seen in the figure, the false alarm problem of Faster R-CNN is the most serious, possibly because its Region Proposal Network generates a large number of low-quality candidate boxes under interference from noise and similar backgrounds; Faster R-CNN also clearly misses small ships. SSD suffers from serious false alarms and missed detections of small ships, possibly because, although SSD uses multi-scale feature maps to detect targets of different scales, it has no multi-scale feature fusion process, so information transfer between these feature maps is insufficient. CenterNet produces almost no false alarms, but its missed detections are more obvious, especially for small ships. EfficientDet has relatively few false alarms, but its missed detection problem is the most serious, possibly because its backbone network is too simple to fully extract the features of ship targets. DEIM, YOLOv5s, YOLOv8s, YOLOv11s, and YOLOv12s also produce some false alarms and missed detections, but these are relatively mild compared with the above cases. Compared with the other models, our improved model can still accurately detect ship targets under complex conditions, its detection capability for small ships is significantly better, and it produces almost no false alarms or missed detections, making it suitable for SAR ship target detection tasks in different scenarios.
Figure 10 shows the detection results of several methods in different scenarios from the HRSID dataset, covering multi-scale ship targets in pure ocean backgrounds and near-shore scenes. The first row in the figure shows the ground truth labels, and the second to eleventh rows show the detection results of Faster R-CNN, SSD, CenterNet, EfficientDet, DEIM, YOLOv5s, YOLOv8s, YOLOv11s, YOLOv12s, and the proposed method. As can be seen in the figure, our improved model achieves the best detection results, with the fewest false alarms and missed detections.

4. Conclusions and Outlook

Ship outlines are often unclear in SAR images, and noise, land, and other factors increase the difficulty of ship detection, especially for small target ships. This paper proposes an improved target detection algorithm based on YOLOv5s for ship target detection in SAR images. First, to obtain extensive contextual information, the backbone network is improved with the LSKModule to strengthen its feature extraction ability. Secondly, the SPPF module is redesigned using Depthwise Separable Convolution, and features of different receptive fields are further extracted at the cost of only a small number of additional parameters and computations. Then, the Improved BIFPN feature fusion network is designed to fuse feature maps of different scales, which improves the model's ability to detect small targets. Finally, a CoordConv layer is added before the detection head to further improve performance by introducing coordinate information. The experimental results on the SSDD and HRSID datasets show that the method performs well: it can accurately detect ship targets despite blurred ship outlines and interference from noise, land, and other factors, it has strong detection capabilities for small targets, and its detection results are better than those of other target detection algorithms.
However, although the improved YOLOv5s SAR small ship detection model achieves good results, there is still room for improvement. Future research can proceed along the following directions: (1) The improved YOLOv5s SAR small ship detection model is relatively complex, which may strain the limited computing resources of edge devices. Subsequent research may consider model compression and acceleration techniques to reduce resource consumption while maintaining detection accuracy. (2) The generalization ability of the model under extreme weather conditions needs further verification. A diverse dataset containing different extreme weather scenes could be constructed to improve the environmental adaptability of the model. (3) Fusing multimodal remote sensing data (such as visible light and infrared images) can provide richer information for small ship detection and further reduce the false alarm rate and missed detection rate.

Author Contributions

Conceptualization, H.H.; methodology, J.L. and H.H.; validation, J.L., L.G. and D.Z.; formal analysis, J.L., W.F., Y.L. and L.H.; investigation, J.L. and D.Z.; data curation, J.L. and L.H.; writing—original draft preparation, J.L. and H.H.; writing—review and editing, H.H. and D.Z.; visualization, J.L. and L.G.; supervision, D.Z., L.G. and H.H.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Key Laboratory of Spatial Datum, ‘Research on sample data of key functional elements for intelligent terrain analysis’, grant number SKLGIE2023-ZZ-6. The APC was funded by De Zhang.

Data Availability Statement

All satellite remote sensing and field measured data used in this study are openly and freely available.

Acknowledgments

We are grateful for the careful review and valuable comments provided by the anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Leng, X.; Ji, K.; Yang, K.; Zou, H. A bilateral CFAR algorithm for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1536–1540. [Google Scholar] [CrossRef]
  2. Gong, B.; Wang, Y.; Cui, L.; Xu, L.; Tao, M.; Wang, H.; Hou, Y. On the Ship Wake Simulation for Multi-Frequency and Multi-Polarization SAR Imaging. In Proceedings of the 2018 China International SAR Symposium (CISS), Shanghai, China, 10–12 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  7. He, K.; Georgia, G.; Piotr, D.; Ross, G. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  8. Liu, Z.; Hu, J.; Weng, L.; Yang, Y. Rotated region-based CNN for ship detection. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 900–904. [Google Scholar]
  9. Lin, Z.; Ji, K.; Leng, X.; Kuang, G. Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 751–755. [Google Scholar] [CrossRef]
  10. Hu, J.; Li, S.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  11. Zhang, T.; Zhang, X.; Li, J.; Shi, J. Contextual squeeze-and-excitation mask r-cnn for sar ship instance segmentation. In Proceedings of the 2022 IEEE Radar Conference (RadarConf22), New York, NY, USA, 21–25 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  14. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  15. Li, Z.; Zhao, L.; Han, X.; Pan, M. Lightweight ship detection methods based on YOLOv3 and DenseNet. Math. Probl. Eng. 2020, 2020, 4813183. [Google Scholar] [CrossRef]
  16. Tang, G.; Zhuge, Y.; Claramunt, C.; Men, S. N-YOLO: A SAR ship detection using noise-classifying and complete-target extraction. Remote Sens. 2021, 13, 871. [Google Scholar] [CrossRef]
  17. Yu, C.; Shin, Y. Ship detection in synthetic aperture radar images with improved YOLOv8. In Proceedings of the 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 11–13 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 308–311. [Google Scholar]
  18. Zhao, K.; Lu, R.; Wang, S.; Yang, X.; Li, Q.; Fan, J. ST-YOLOA: A Swin-transformer-based YOLO model with an attention mechanism for SAR ship detection under complex background. Front. Neurorobot. 2023, 17, 1170163. [Google Scholar] [CrossRef] [PubMed]
  19. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
  20. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  21. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Liu, R.; Lehman, J.; Molino, P.; Such, F.P.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. Adv. Neural Inf. Process. Syst. 2018, 31, 9628–9639. [Google Scholar]
  23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  24. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  25. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  26. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  27. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  28. Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with improved matching for fast convergence. In Proceedings of the Computer Vision and Pattern Recognition Conference, Seattle, WA, USA, 17–26 June 2025; pp. 15162–15171. [Google Scholar]
  29. Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-deep networks. Adv. Neural Inf. Process. Syst. 2022, 35, 6789–6801. [Google Scholar]
Figure 1. The overall structure of the proposed method.
Figure 2. The structure of LSKModule.
Figure 3. The structure of LSKC3.
Figure 4. The difference between Standard Convolution (a) and Depthwise Separable Convolution (b).
Figure 5. The structure of the Improved SPPF.
Figure 6. (a) The structure of the Improved BIFPN. (b) The structure of ParNet Block.
Figure 7. Comparison of convolutional and CoordConv layers.
Figure 8. Visualization of the ablation experiment results, where the first row, (a1–a3), shows the ground truths, and the second to sixth rows, (b1–b3), (c1–c3), (d1–d3), (e1–e3), and (f1–f3), show the detection results of the five experiments in turn.
Figure 9. Detection results of different methods for SSDD datasets.
Figure 10. Detection results of different methods for HRSID datasets.
Table 1. The results of the ablation experiment.

| Number | Base Model | LSKC3 | Improved SPPF | Improved BIFPN | CoordConv | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) |
|--------|------------|-------|---------------|----------------|-----------|---------------|------------|-----------|--------------|
| No. 1  | ✓          |       |               |                |           | 93.6          | 93.6       | 96.0      | 59.4         |
| No. 2  | ✓          | ✓     |               |                |           | 91.8          | 95.5       | 96.2      | 59.8         |
| No. 3  | ✓          | ✓     | ✓             |                |           | 96.0          | 92.6       | 96.4      | 60.4         |
| No. 4  | ✓          | ✓     | ✓             | ✓              |           | 97.4          | 92.6       | 97.1      | 62.1         |
| No. 5  | ✓          | ✓     | ✓             | ✓              | ✓         | 96.4          | 94.4       | 97.6      | 62.9         |
Table 2. The evaluation indexes of different methods.

| Method       | Parameters (M) | FLOPs (G) | SSDD mAP50 (%) | SSDD mAP50:95 (%) | HRSID mAP50 (%) | HRSID mAP50:95 (%) |
|--------------|----------------|-----------|----------------|-------------------|-----------------|--------------------|
| Faster R-CNN | 136.7          | 200.8     | 79.5           | 35.4              | 41.5            | 19.8               |
| SSD          | 23.6           | 136.6     | 84.3           | 43.1              | 75.2            | 43.6               |
| CenterNet    | 191.2          | 457.1     | 72.3           | 35.4              | 81.6            | 47.1               |
| EfficientDet | 3.8            | 3.7       | 56.6           | 24.7              | 49.5            | 30.2               |
| DEIM         | 4              | 7         | 85.4           | 53.8              | 67.6            | 46.5               |
| YOLOv5s      | 7.0            | 15.8      | 96.0           | 59.4              | 89.7            | 65.8               |
| YOLOv8s      | 11.1           | 28.4      | 96.5           | 59.1              | 89.4            | 63.9               |
| YOLOv11s     | 9.4            | 21.3      | 96.6           | 60.3              | 83.2            | 57.2               |
| YOLOv12s     | 9.1            | 19.3      | 95.9           | 58.0              | 87.2            | 60.7               |
| Ours         | 12.9           | 31.0      | 97.6           | 62.9              | 91.7            | 67.8               |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
