1. Introduction
Maritime activities and the development of marine resources have become increasingly frequent, related marine technologies and research have advanced accordingly, and the demand for ship monitoring and identification has grown. Accurate monitoring of ship targets supports the planning of maritime transportation, port construction, and marine resource development. Ship target detection is also important for locating and tracking ships in distress in marine monitoring and maritime rescue. Therefore, ship target detection helps monitor ship activity in a sea area, improves maritime safety and management efficiency, and is of great significance in the economic and military fields.
Synthetic aperture radar (SAR) images are used for all-weather ship detection tasks due to their advantages of being independent of lighting conditions and unaffected by weather. Currently, the methods used for ship target detection in SAR images are divided into traditional methods and deep learning methods. Traditional methods, such as the method based on constant false alarm rate (CFAR) [
1], usually use manually designed features, which are difficult to fully capture the high-level semantic features of the target [
2]. Deep learning offers strong capabilities for analyzing and processing large amounts of data and has been widely used in the field of target detection. Deep learning-based ship detection methods are divided into two-stage methods and one-stage methods according to whether candidate regions are generated. The two-stage method generally generates region proposals first and then classifies and locates them through a convolutional neural network; examples include R-CNN [
3], SPPNet [
4], Fast R-CNN [
5], Faster R-CNN [
6], and Mask R-CNN [
7]. Liu et al. [
8] proposed RR-CNN based on R-CNN, which can accurately extract the features of the rotated region and realize the rotation bounding box positioning of the ship target. Lin et al. [
9] added a squeeze-and-excitation (SE) module [
10] to the Faster R-CNN network structure to further improve the performance of ship detection. Zhang et al. [
11] proposed a C-SE Mask R-CNN model for ship instance segmentation in SAR images and designed a contextual squeeze-and-excitation (C-SE) module embedded in Mask R-CNN to capture contextual information. To improve the detection speed, one-stage detection methods eliminate the candidate region generation process; examples include YOLO [
12] and SSD [
13]. The YOLO series algorithm has now developed to YOLOV12 [
14]. YOLO transforms the target detection task into a regression problem. It divides the input image into S × S grids. If the center point of the target falls within a grid, then this grid is responsible for predicting the target. Li et al. [
15] proposed a lightweight ship detection model, LSDM, based on YOLOV3, which uses DenseNet as the backbone network and shallow features to improve detection accuracy and replaces the conventional convolution in the FPN with spatially separable convolution. Tang et al. [
16] proposed an N-YOLO algorithm, including a noise level classifier (NLC) for classifying SAR images according to noise levels, a SAR target area extraction module (STPAE) for extracting potential target areas, and a detection module based on YOLOV5. This method effectively reduces the impact of noise, but causes a partial loss of ship edge information. Yu et al. [
17] added a small target detection layer to the YOLOV8 structure and used WIOU to adjust the bounding box loss function. Zhao et al. [
18] proposed an ST-YOLOA model for SAR ship detection. First, Coordinate Attention was embedded in the Swin Transformer backbone network to improve feature extraction capabilities. Second, a Convolutional Block Attention Module was introduced during feature fusion. Finally, the decoupled detection head of YOLOX was used to produce the final output. Compared with traditional methods, deep learning-based methods can handle the current massive volume of remote sensing data and extract high-level semantic information from images, achieving fast and high-precision target detection. However, SAR images contain a large amount of speckle noise, ship outlines in them are relatively blurred, and the detection of near-shore ships is further affected by the land background. As a result, existing deep learning detection algorithms still produce false detections and missed detections for ship targets (especially small targets) in SAR images, which degrades detection performance.
YOLOv5, a target detection model with good stability and robustness, has been widely used in various visual tasks. In the YOLOv5 series, YOLOv5s has achieved a good balance between accuracy and efficiency. Therefore, we propose an improved SAR small ship detection model based on YOLOv5s to extract small ship information. First, given that the LSKModule in LSKNet [
19] can adaptively aggregate large-size convolution kernel information in the spatial dimension and apply the idea of the spatial attention mechanism to enhance the features of key areas and suppress interference from noise areas, we combine the C3 module and the LSKModule to construct the LSKC3 module and add it to the backbone network. Second, considering that the SPPF structure of YOLOv5s uses only MaxPool to extract features of different receptive fields, which causes a certain loss of information, Depthwise Separable Convolutions [
20] with different kernel sizes are added to the SPPF structure to further extract features of different receptive fields while adding only a small number of parameters and a small amount of computation. Then, the feature fusion network of the model is improved on the basis of BIFPN [
21]. Considering that shallow feature maps have a small receptive field but rich location information, we add shallow feature maps during feature fusion to optimize the performance of small target detection. Before feature fusion, ParNet Block is used to further capture the local features of the above feature maps and enhance the key channel information of these feature maps. The connection method is redesigned to fully fuse feature maps of different scales. Finally, the detection head is improved by using CoordConv [
22] to perform convolution operations on the additional coordinate channels to further improve the performance of the model. The effectiveness and accuracy of the improved method are evaluated and verified based on the public SSDD dataset and HRSID dataset.
The main objectives of this paper can be summarized as follows:
(1) Aiming at the problem of small ship target detection in SAR images, a SAR image ship detection method based on YOLOv5s is proposed. To capture a wide range of contextual information and improve the feature extraction ability of the model, the LSKC3 module is built based on the LSKModule and embedded into the backbone network.
(2) Depthwise Separable Convolutions with different kernel sizes are added to the SPPF structure to further extract features of different receptive fields with only a small increase in parameters and computational complexity.
(3) In addition, an Improved BIFPN is constructed to fully fuse feature maps of different scales, simultaneously improving the model’s detection ability for small targets.
(4) Since the traditional convolution operation does not consider the position information of pixels, a CoordConv layer is added before the detection head to further improve the detection performance of the model by introducing coordinate information.
(5) Ablation experiments are carried out on the SSDD dataset to verify the effectiveness of each structure, and then the proposed method is compared with other target detection algorithms on the SSDD dataset and the HRSID dataset. The experimental results show that the proposed method has superior performance.
This paper is organized as follows:
Section 2 describes the research method, mainly introducing the improved YOLOv5s SAR small ship detection model in detail.
Section 3 verifies and analyzes the experimental results. The effectiveness and advancement of the method are demonstrated through ablation experiments and comparative experiments.
Section 4 presents the conclusions and future prospects.
2. Methodology
The proposed method is shown in
Figure 1 and consists of three parts: a backbone, neck, and head. First, to capture a wide range of contextual information, the LSKC3 module is constructed using the LSKModule and added to the backbone for feature extraction. Second, multiple parallel Depthwise Separable Convolution layers are added to the SPPF structure to further extract features of different receptive fields while adding only a small number of parameters and a small amount of computation. Then, the feature maps extracted by the backbone network are passed through the Improved BIFPN to fully fuse feature maps of different scales, and the shallow feature map is used to improve the model's detection ability for small ships. Finally, the CoordConv layer is added to the detection head to further improve the model performance by introducing coordinate information.
2.1. LSKC3
Large-size convolution kernels can cover a larger receptive field, thereby capturing more global information and facilitating target detection. LSKModule is the core module in LSKNet [
19], which decomposes a large-size convolution kernel into a sequence of large-size depthwise convolutions with different dilation rates, thereby generating feature maps with different large receptive fields. Under the same receptive field, this sequential decomposition greatly reduces the number of parameters compared with a single standard large convolution kernel. Then, the idea of the spatial attention mechanism is applied to perform a weighted fusion of these feature maps with different large-receptive-field information, and the result is finally merged with the input feature map. The structure of LSKModule is shown in
Figure 2.
The calculation process of LSKModule [
19] is as follows.
First, the input feature $X$ passes through a sequence of depthwise convolution kernels of different sizes to obtain feature maps $U_1, \ldots, U_N$ with different receptive fields. These feature maps are respectively passed through a 1 × 1 convolutional layer to compress the number of channels, and the obtained feature maps $\widetilde{U}_1, \ldots, \widetilde{U}_N$ are concatenated in the channel dimension to obtain the feature map $\widetilde{U}$:

$$\widetilde{U} = \left[\widetilde{U}_1; \widetilde{U}_2; \ldots; \widetilde{U}_N\right]$$

Then, average pooling $P_{avg}(\cdot)$ and maximum pooling $P_{max}(\cdot)$ are performed on $\widetilde{U}$ in the channel dimension to extract the spatial relationship of the feature map and obtain the pooled features $SA_{avg}$ and $SA_{max}$:

$$SA_{avg} = P_{avg}\left(\widetilde{U}\right), \quad SA_{max} = P_{max}\left(\widetilde{U}\right)$$

The pooled features are concatenated in the channel dimension and converted into $N$ spatial attention maps ($N$ is the number of depthwise convolution kernels) using a convolutional layer $\mathcal{F}^{2 \rightarrow N}(\cdot)$:

$$\widehat{SA} = \mathcal{F}^{2 \rightarrow N}\left(\left[SA_{avg}; SA_{max}\right]\right)$$

For each spatial attention map $\widehat{SA}_i$, the sigmoid activation function $\sigma(\cdot)$ is used to obtain the spatial selection mask of each feature map:

$$\widetilde{SA}_i = \sigma\left(\widehat{SA}_i\right), \quad i = 1, \ldots, N$$

Then, the different feature maps are weighted spatially to enhance the features of the key areas, element-wise addition is performed, and the feature map $S$ is obtained through the convolution layer $\mathcal{F}(\cdot)$:

$$S = \mathcal{F}\left(\sum_{i=1}^{N} \widetilde{SA}_i \cdot \widetilde{U}_i\right)$$

The final output $Y$ of LSKModule is the element-wise product of the input feature $X$ and $S$:

$$Y = X \cdot S$$
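For concreteness, the following is a minimal PyTorch sketch of this computation, using the 5 × 5 and 7 × 7 depthwise kernels (dilation rates 1 and 3) adopted later in the LSKC3 module. The channel-compression ratio, the 7 × 7 kernel of the spatial-attention convolution, and the layer names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LSKModule(nn.Module):
    """Minimal sketch of the large selective kernel computation described above."""
    def __init__(self, dim):
        super().__init__()
        # Sequence of depthwise convolutions building up a large receptive field
        self.dw5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw7 = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # 1x1 convolutions compressing the channel dimension (ratio is an assumption)
        self.squeeze1 = nn.Conv2d(dim, dim // 2, 1)
        self.squeeze2 = nn.Conv2d(dim, dim // 2, 1)
        # Convolution turning the 2 pooled maps into N = 2 spatial attention maps
        self.spatial = nn.Conv2d(2, 2, 7, padding=3)
        # Final 1x1 convolution restoring the original channel count
        self.proj = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):
        u1 = self.dw5(x)                    # 5x5 receptive field, dilation 1
        u2 = self.dw7(u1)                   # enlarged receptive field, dilation 3
        u1, u2 = self.squeeze1(u1), self.squeeze2(u2)
        u = torch.cat([u1, u2], dim=1)      # concatenate along channels
        avg = u.mean(dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = u.max(dim=1, keepdim=True)  # channel-wise maximum pooling
        sa = torch.sigmoid(self.spatial(torch.cat([avg, mx], dim=1)))
        # Weight each branch by its spatial selection mask, sum, and project
        s = self.proj(u1 * sa[:, 0:1] + u2 * sa[:, 1:2])
        return x * s                        # element-wise product with the input
```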
Through the above process, LSKModule can adaptively aggregate the feature information extracted by large-size convolution kernels in the spatial dimension and enhance the key features of the feature map spatially through the idea of the spatial attention mechanism. Considering that the backbone network of YOLOv5s uses only 3 × 3 convolution kernels to extract local features, the model cannot fully extract large-scale context information. Therefore, to improve the feature extraction ability of the model while enhancing the model's attention to key areas and suppressing noise interference, we built the LSKC3 module and added it to the backbone network. LSKC3 is based on C3: the LSKModule is added after the CBS module in one branch to capture large-scale context information, and a residual connection is used to reduce the loss of information during transmission. The LSKC3 module is shown in
Figure 3. Two depthwise convolution kernels with sizes of 5 × 5 and 7 × 7 are used in the LSKModule, with dilation rates set to 1 and 3, respectively. The calculation process of the LSKC3 module is as follows.
First, the input feature $X$ passes through two CBS modules to obtain two feature maps $X_1$ and $X_2$:

$$X_1 = \mathrm{CBS}(X), \quad X_2 = \mathrm{CBS}(X)$$

Then, in the upper branch, the $X_1$ feature passes through the LSKModule to adaptively aggregate the feature information extracted by the large-size convolution kernels, and the result is added to the original feature to obtain the feature $Y_1$:

$$Y_1 = \mathrm{LSKModule}(X_1) + X_1$$

In the lower branch, the $X_2$ feature passes through $n$ bottleneck modules with shortcut structures to obtain the feature map $Y_2$:

$$Y_2 = \mathrm{Bottleneck}_n(X_2)$$

Finally, the feature map $Y_1$ and the feature map $Y_2$ are concatenated in the channel dimension and passed through a CBS module to obtain the final feature map $Y$:

$$Y = \mathrm{CBS}\left(\left[Y_1; Y_2\right]\right)$$
In this way, the model can fully capture the wide range of contextual information of the input feature map, enhance the model’s attention to key features, and effectively suppress the influence of noise.
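To make the LSKC3 structure concrete, the sketch below follows the process above and reuses the LSKModule sketch given earlier in this section. The CBS and Bottleneck helpers mirror their YOLOv5 counterparts, and the hidden-channel ratio is an illustrative assumption rather than the authors' exact setting.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, as used throughout YOLOv5."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Standard YOLOv5-style bottleneck with a shortcut connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = CBS(c, c, 1)
        self.cv2 = CBS(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class LSKC3(nn.Module):
    """Sketch of LSKC3: one branch applies the LSKModule with a residual add,
    the other passes through n bottlenecks; the two branches are concatenated."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2                     # hidden channels (assumed ratio)
        self.cv1 = CBS(c1, c_, 1)        # upper branch
        self.cv2 = CBS(c1, c_, 1)        # lower branch
        self.lsk = LSKModule(c_)         # LSKModule from the earlier sketch
        self.m = nn.Sequential(*[Bottleneck(c_) for _ in range(n)])
        self.cv3 = CBS(2 * c_, c2, 1)    # fuse the two branches

    def forward(self, x):
        y1 = self.cv1(x)
        y1 = self.lsk(y1) + y1           # aggregate large-kernel context, residual add
        y2 = self.m(self.cv2(x))         # n bottleneck modules
        return self.cv3(torch.cat((y1, y2), dim=1))
```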
2.2. Improved SPPF
The multi-receptive field design can extract contextual information of different spatial ranges at the same scale, which can enhance the model’s ability to distinguish between targets and similar backgrounds. The SPPF module in YOLOv5s uses multiple maximum pooling (MaxPool2d) layers in series to achieve multi-receptive field feature extraction. However, using only MaxPool operations will cause certain information loss. Therefore, multiple parallel Depthwise Separable Convolution [
20] layers are added to fully extract the features of different receptive fields. Depthwise Separable Convolution decomposes a complete convolution operation into Depthwise Convolution and Pointwise Convolution. The number of convolution kernels in Depthwise Convolution is the same as the number of channels in the input feature map, and one convolution kernel only acts on one channel of the feature map. Depthwise Convolution performs convolution operations on each channel of the input layer independently, without using the feature information of different channels at the same spatial position. Pointwise Convolution uses a 1 × 1 conventional convolution kernel to process the output of Depthwise Convolution to use the feature information of different channels at the same spatial position and adjust the number of channels in the feature map. Compared with traditional convolution, this decomposition method greatly reduces the parameters and computational complexity of convolution operations.
Figure 4 shows the difference between Standard Convolution and Depthwise Separable Convolution.
An input feature map of size H × W × M is convolved to obtain a feature map of size H × W × N, where H and W are the height and width of the feature map, M is the number of channels of the input feature map, and N is the number of channels of the output feature map.
The standard convolution operation requires $N$ convolution kernels of size $k \times k \times M$, where $k$ is the size of the convolution kernel. The computational cost of standard convolution is as follows:

$$C_{std} = H \times W \times M \times N \times k \times k$$

Depthwise Separable Convolution first performs a convolution operation on the input feature map through Depthwise Convolution, which requires $M$ convolution kernels of size $k \times k \times 1$. Each convolution kernel is only responsible for one channel of the input feature map. The computational cost of Depthwise Convolution is as follows:

$$C_{dw} = H \times W \times M \times k \times k$$

Then, the output of Depthwise Convolution is integrated through Pointwise Convolution, which requires $N$ convolution kernels of size $1 \times 1 \times M$. The computational cost of Pointwise Convolution is as follows:

$$C_{pw} = H \times W \times M \times N$$

The total computational cost of Depthwise Separable Convolution is as follows:

$$C_{dsc} = H \times W \times M \times k \times k + H \times W \times M \times N$$

Compared with standard convolution, the computational cost of Depthwise Separable Convolution is reduced to the following ratio:

$$\frac{C_{dsc}}{C_{std}} = \frac{H \times W \times M \times k \times k + H \times W \times M \times N}{H \times W \times M \times N \times k \times k} = \frac{1}{N} + \frac{1}{k^2}$$
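As an illustrative worked example with assumed values (not figures reported in this paper), take $N = 256$ output channels and a 5 × 5 kernel:

$$\frac{1}{N} + \frac{1}{k^2} = \frac{1}{256} + \frac{1}{25} \approx 0.044$$

That is, the Depthwise Separable Convolution needs roughly 1/23 of the multiply-accumulate operations of the corresponding standard convolution.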
In this method, Depthwise Convolution uses convolution kernels of sizes 3 × 3, 5 × 5, 9 × 9, and 13 × 13, and Pointwise Convolution uses a 1 × 1 conventional convolution kernel to process the feature maps generated by Depthwise Convolution. Finally, these feature maps are concatenated with the feature maps from the convolution and MaxPool branches to obtain feature maps containing different receptive-field information, which further improves the model's ability to distinguish between targets and backgrounds. The Improved SPPF structure is shown in
Figure 5.
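The following PyTorch sketch illustrates one plausible arrangement of the Improved SPPF described above. The serial 5 × 5 MaxPool layers follow the standard YOLOv5 SPPF, while the channel widths and the exact fusion layout are assumptions rather than the authors' verified configuration.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c, k):
        super().__init__()
        self.dw = nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
        self.pw = nn.Conv2d(c, c, 1, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class ImprovedSPPF(nn.Module):
    """Sketch of the Improved SPPF: serial MaxPool branches plus parallel
    depthwise separable branches with 3x3, 5x5, 9x9, and 13x13 kernels."""
    def __init__(self, c1, c2):
        super().__init__()
        c_ = c1 // 2                           # reduced width (assumption)
        self.cv1 = nn.Sequential(nn.Conv2d(c1, c_, 1, bias=False),
                                 nn.BatchNorm2d(c_), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.dws = nn.ModuleList([DWSeparableConv(c_, k) for k in (3, 5, 9, 13)])
        # 8 branches are concatenated: x, three pooled maps, four DSC maps
        self.cv2 = nn.Sequential(nn.Conv2d(c_ * 8, c2, 1, bias=False),
                                 nn.BatchNorm2d(c2), nn.SiLU())

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        d = [m(x) for m in self.dws]           # parallel multi-receptive-field branches
        return self.cv2(torch.cat([x, p1, p2, p3, *d], dim=1))
```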
2.3. Improved BIFPN
After the input image is extracted by the backbone network, a series of feature maps of different scales will be obtained. The feature fusion network in the neck fuses these feature maps to obtain more contextual information, which is helpful for multi-scale object detection. Feature pyramid network (FPN) [
21] proposes a top-down multi-scale feature fusion method to transfer deep semantic features from high-level feature maps to low-level feature maps. PAN [
23] proposes a bottom-up feature fusion method based on FPN to transfer shallow features with rich location information from low-level feature maps to high-level feature maps. The neck part of YOLOv5s adopts the structure of FPN and PAN at the same time to fuse the P3, P4, and P5 feature maps extracted by the backbone network. The BiFPN [
24] feature fusion network introduces cross-layer connections and adopts feature weighting to fuse feature maps of more levels.
The specific improvements to BIFPN are as follows: First, before feature fusion, the feature maps extracted by the backbone network are further processed by the ParNet Block structure, which extracts local features while focusing on the important channel information of each feature map. ParNet Block consists of a 1 × 1 convolution layer, a 3 × 3 convolution layer, and an SSE (skip-squeeze-excitation) layer designed based on the SE (squeeze-and-excitation) module [
10]. Second, considering that shallow feature maps have a small receptive field and rich position information, the shallow feature map P2 is used in feature fusion to improve the detection performance of the model for small targets. In the Improved BIFPN structure, cross-layer connections are used for feature maps of the same scale to fuse shallow and deep features of the same scale, and two bottom-up connections are added to the top-down fusion path to fully aggregate multi-scale features. A fast normalized fusion method sets a learnable weight parameter for each input feature map to achieve the weighted fusion of feature maps of different scales. The output of this process is used for the final prediction. The Improved BIFPN and ParNet Block structures are shown in
Figure 6a and
Figure 6b, respectively.
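Below is a minimal PyTorch sketch of a ParNet-style block as described above, with parallel 1 × 1 and 3 × 3 convolution branches plus an SSE branch. The placement of normalization and the SiLU activation are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SSE(nn.Module):
    """Skip-Squeeze-Excitation branch: a skip connection reweighted by
    channel attention computed from globally pooled features."""
    def __init__(self, c):
        super().__init__()
        self.bn = nn.BatchNorm2d(c)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(c, c, 1)

    def forward(self, x):
        x = self.bn(x)
        w = torch.sigmoid(self.fc(self.pool(x)))  # per-channel weights
        return x * w

class ParNetBlock(nn.Module):
    """Sketch of the ParNet Block applied before feature fusion: parallel
    1x1 and 3x3 convolution branches plus an SSE branch, summed and activated."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c))
        self.conv3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c))
        self.sse = SSE(c)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv1(x) + self.conv3(x) + self.sse(x))
```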
The Improved BIFPN uses P2, P3, P4, and P5 extracted by the backbone network as input features, with sizes of 1/4, 1/8, 1/16, and 1/32 of the input image, respectively.
First, to further extract local features and focus on the important channel information of the feature map, the four feature maps are respectively passed through the ParNet Block structure to obtain feature maps P2′, P3′, P4′, and P5′ for subsequent feature fusion.
Then, the four feature maps P2′, P3′, P4′, and P5′ are fused through a top-down path to obtain the four feature maps C2, C3, C4, and C5. Among these, the C5 feature map is obtained by processing the P5′ feature map through the Improved SPPF module, so it contains information from multiple different receptive fields. The three feature maps C2, C3, and C4 are obtained by concatenating feature maps of multiple levels in the channel dimension. The top-down feature fusion process is as follows:
Finally, in the bottom-up path, the D2 feature map is obtained by passing the C2 feature map through the C3 module. The three feature maps D3, D4, and D5 are obtained by concatenating feature maps of multiple levels in the channel dimension. The bottom-up feature fusion process is as follows:
where $\mathrm{Concat}(\cdot)$ represents the concatenation operation in the channel dimension; $\mathrm{Up}(\cdot)$ and $\mathrm{Down}(\cdot)$ represent the upsampling and downsampling operations, respectively, with downsampling implemented by a convolution of size 3 × 3 and stride 2; $w_i$ ($i = 1, 2, \ldots$) represents a trainable weight parameter; and $\varepsilon$ is set to 0.0001 to avoid division by 0.
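As a concrete illustration of the weighted fusion step, the sketch below scales each same-resolution input by a normalized learnable weight and then concatenates the results along the channel dimension. This is one plausible reading of the description above (fast normalized weighting combined with channel concatenation), not the authors' verified implementation; the module name and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

class WeightedConcatFusion(nn.Module):
    """Fast normalized fusion sketch: each same-size input feature map is scaled
    by a learnable non-negative weight (weights normalized to sum to one) and the
    weighted maps are concatenated along the channel dimension."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps                         # 0.0001, as given in the text

    def forward(self, feats):
        w = torch.relu(self.w)                 # keep the weights non-negative
        w = w / (w.sum() + self.eps)           # fast normalized fusion
        return torch.cat([wi * fi for wi, fi in zip(w, feats)], dim=1)

# Illustrative usage: fusing two same-scale feature maps before the next C3 block
# (tensor shapes are placeholders, not the model's actual sizes).
fuse = WeightedConcatFusion(n_inputs=2)
out = fuse([torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80)])  # -> [1, 256, 80, 80]
```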
2.4. Detection Head Based on CoordConv
CoordConv [
22] adds additional coordinate channels to the input feature map in the channel dimension so that the convolution operation acts on the coordinate information of the feature map, allowing the network to learn either complete translation invariance or varying degrees of translation dependence according to the needs of the task. The difference between the traditional convolution layer and the CoordConv [
22] layer is shown in
Figure 7.
Assume that an input feature map $X$ of size $H \times W \times M$ is convolved to obtain a feature map $Y$ of size $H \times W \times N$. Traditional convolution requires $N$ convolution kernels $K$ of size $k \times k \times M$. The calculation process of traditional convolution is as follows:

$$Y(i, j, c) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{d=1}^{M} K_c(m, n, d) \cdot X\!\left(i + m - \left\lfloor \frac{k}{2} \right\rfloor,\; j + n - \left\lfloor \frac{k}{2} \right\rfloor,\; d\right)$$

CoordConv first adds an x-coordinate position channel and a y-coordinate position channel to the input feature map to obtain the feature map $X'$ of size $H \times W \times (M + 2)$, and then uses $N$ convolution kernels $K'$ of size $k \times k \times (M + 2)$ to perform the convolution operation. The coordinate convolution calculation process is as follows:

$$Y(i, j, c) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{d=1}^{M+2} K'_c(m, n, d) \cdot X'\!\left(i + m - \left\lfloor \frac{k}{2} \right\rfloor,\; j + n - \left\lfloor \frac{k}{2} \right\rfloor,\; d\right)$$

where $i$ and $j$ are the position indexes of the output feature map, $c$ is the channel index of the output feature map, and $\lfloor \cdot \rfloor$ means rounding down.
The detection head in YOLOv5s uses only one 1 × 1 convolution layer to perform category prediction and bounding box regression. Traditional convolution operations do not use pixel location information. To improve the performance of the model, the CoordConv [
22] layer is embedded before the 1 × 1 convolution. By introducing coordinate information, the model’s perception of spatial location is enhanced, further improving the accuracy of target detection.
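A minimal PyTorch sketch of a CoordConv layer placed before the 1 × 1 prediction convolution is given below. Normalizing the coordinate channels to [-1, 1] and the channel counts in the usage line are assumptions for illustration, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """CoordConv sketch: two coordinate channels are appended to the input
    before a standard convolution, so the kernel sees pixel positions."""
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c1 + 2, c2, k, padding=k // 2)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))  # append x- and y-coordinate channels

# Illustrative usage in a detection head: CoordConv followed by the 1x1 prediction
# convolution (channel counts and anchor/class numbers are placeholders).
head = nn.Sequential(CoordConv(256, 256, k=3), nn.Conv2d(256, 3 * (5 + 1), 1))
```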
4. Conclusions and Outlook
In SAR images, unclear ship outlines and interference from noise, land, and other factors increase the difficulty of ship detection, especially for small ships. This paper proposes an improved target detection algorithm based on YOLOv5s for ship target detection in SAR images. First, to obtain extensive context information, the backbone network is improved with the LSKModule to strengthen the feature extraction ability of the backbone. Second, the SPPF module is redesigned using Depthwise Separable Convolution, further extracting features of different receptive fields while introducing only a small number of parameters and calculations. Then, the Improved BIFPN feature fusion network is designed to fuse feature maps of different scales, which improves the model's ability to detect small targets. Finally, the CoordConv layer is added before the detection head to further improve the performance of the model by introducing coordinate information. The experimental results on the SSDD and HRSID datasets show that the method performs well: it can still accurately detect ship targets when ship outlines are blurred and when interference from factors such as noise and land is present, it has strong detection capabilities for small targets, and its detection performance is better than that of the other target detection algorithms compared.
However, although our improved YOLOv5s SAR small ship detection model has achieved good results, there is still room for improvement. Future research can proceed along the following lines: (1) The improved YOLOv5s SAR small ship detection model is relatively complex, which may strain the limited computing resources of edge devices. Subsequent research may consider using model compression and acceleration techniques to reduce computing resource consumption while maintaining detection accuracy. (2) The generalization ability of the model under extreme weather conditions needs to be further verified. In subsequent research, a diverse dataset containing different extreme weather scenes can be constructed to further improve the environmental adaptability of the model. (3) Fusing multimodal remote sensing data (such as visible light and infrared remote sensing images) can provide richer information for small ship detection and further reduce the false alarm and missed detection rates.