R-CenterNet+: Anchor-Free Detector for Ship Detection in SAR Images

In recent years, the rapid development of Deep Learning (DL) has provided a new method for ship detection in Synthetic Aperture Radar (SAR) images. However, there are still four challenges in this task. (1) The ship targets in SAR images are very sparse. A large number of unnecessary anchor boxes may be generated on the feature map when using traditional anchor-based detection models, which could greatly increase the amount of computation and make it difficult to achieve real-time rapid detection. (2) The size of the ship targets in SAR images is relatively small. Most of the detection methods have poor performance on small ships in large scenes. (3) The terrestrial background in SAR images is very complicated. Ship targets are susceptible to interference from complex backgrounds, and there are serious false detections and missed detections. (4) The ship targets in SAR images are characterized by a large aspect ratio, arbitrary direction and dense arrangement. Traditional horizontal box detection can cause non-target areas to interfere with the extraction of ship features, and it is difficult to accurately express the length, width and axial information of ship targets. To solve these problems, we propose an effective lightweight anchor-free detector called R-Centernet+ in the paper. Its features are as follows: the Convolutional Block Attention Module (CBAM) is introduced to the backbone network to improve the focusing ability on small ships; the Foreground Enhance Module (FEM) is used to introduce foreground information to reduce the interference of the complex background; the detection head that can output the ship angle map is designed to realize the rotation detection of ship targets. To verify the validity of the proposed model in this paper, experiments are performed on two public SAR image datasets, i.e., SAR Ship Detection Dataset (SSDD) and AIR-SARShip. The results show that the proposed R-Centernet+ detector can detect both inshore and offshore ships with higher accuracy than traditional models with an average precision of 95.11% on SSDD and 84.89% on AIR-SARShip, and the detection speed is quite fast with 33 frames per second.


Introduction
Synthetic aperture radar (SAR) is an active microwave sensor with high resolution and wide swath [1]. Because of the ability to provide all-day and all-weather images of the ocean environment, it has been widely used in marine surveillance and deformation monitoring [2,3]. As an important task of marine surveillance, ship detection in SAR images is attracting increasing attention.
SAR images are different from optical images due to the complex imaging mechanisms and speckle noises. Moreover, phase errors caused by topography variations lead to the degradation of the focusing quality and geometric distortion of High-Resolution SAR images [4]. The interpretation and understanding of SAR images are very difficult, so it is necessary to establish the automatic target recognition (ATR) system of SAR images [5]. Numerous SAR ship detection methods have been proposed in recent years. There are three traditional methods for ship detection, including threshold methods [6], statistics methods [7], and transformation methods [8]. Among these methods, the constant false alarm rate (CFAR) and its improved versions [9][10][11][12][13][14][15] have been extensively studied and applied. However, the CFAR performs poorly in small object detection and complex scene detection. It cannot meet the requirements of high-precision ship detection.
Due to the powerful ability of feature extraction and expression, Convolution Neural Network (CNN) has become the mainstream approach for object detection [16]. Currently, object detection methods based on of Deep Learning (DL) can be roughly divided into two categories. The first is anchor-based methods such as Faster Region-CNN (R-CNN) [17], Mask R-CNN [18], Single Shot MultiBox Detector (SSD) [19] and RetinaNet [20]. Anchorbased methods introduce a Region Proposal Network (RPN) to generate a large number of candidate regions where targets may be included. Then, the candidate regions are classified and turned by CNN to obtain bounding boxes. This kind of method has relatively high accuracy with huge time consumption. The second is anchor-free methods such as You Only Look Once v1 (YOLOv1) [21], CornerNet [22], CenterNet [23] and Fully Convolutional One-Stage (FCOS) [24]. Anchor-free methods do not generate candidate regions in advance, and directly regress the category and location of bounding boxes. This kind of method is relatively fast and easy to train. This means they are more suitable for real-time processing and mobile deployment.
The number of available SAR images has increased greatly in recent years, which makes it possible to use the DL method for SAR ship detection. More and more scholars use DL methods to detect ships in SAR images. Most of the studies are based on anchor-based detection methods. For example, Li et al. [25] proposed an improved faster R-CNN method and released an SAR Ship Detection Dataset (SSDD). Kang et al. [26] came up with a contextual region-based CNN with multilayer fusion to rule out false alarms. Jiao et al. [27] improved faster R-CNN with Feature Pyramid Networks (FPN) to detect multiscale ships. A dense attention pyramid network was invented to detect dense small ships [28]. Zhao et al. [29] proposed a cascade coupled convolutional network with an attention mechanism to alleviate the problem of missing detections for small and densely clustered ships. Wang et al. [30] applied an object detector based on RetinaNet in multi-resolution SAR images. Wang et al. [31] combined SSD with transfer learning to address ship detection in complex scenes. Mao et al. [32] put forward an efficient SAR ship detector based on simplified U-Net.
In order to realize real-time detection, more and more methods focus on the highspeed processing of SAR ship detection. What is significantly different from other images is that ship targets are very sparse in SAR images. Considering this characteristic of SAR images, some scholars use anchor-free methods for SAR ship detection to accelerate the SAR image processing speed. Gao et al. [33] designed an anchor-free detector with lightweight backbone MobileNetV2. Guo et al. [34] improved CenterNet with a feature refinement module, feature pyramids fusion module and head enhancement module.
Although the above studies show that the DL methods can be applied in SAR ship detection, there are still four challenges affecting its performance: Heavy time cost, small ship detection, complex background and lack of accurate ship size and axial information. In order to solve the above problems, we propose an effective anchor-free detector for SAR ship detection called R-CenterNet+. First, to improve the focusing ability on small ships, we introduce an attention mechanism to the backbone network. Second, to distinguish between ships and complex background, FEM is employed to refine the feature map. Finally, we adopt the rotating bounding box to annotate ships and add the angle regression branch to the original detection head to predict ships' rotation angle.
The main contributions of our work in this paper are as follows: (1) For small ship detection, the feature extraction network Convolutional Block Attention Module Deep Layer Aggregation 34 (CBAM-DLA34) is established, which improves the performance of small ship detection; (2) For the complex background, we adopt FEM to improve the robustness of ship detection in a complex background; The SSDD dataset, established in 2017, is the first published dataset specifically for ship target detection in SAR images at home and abroad [25]. The SSDD dataset contains various images of different sensors, resolutions, polarizations, and scenes, etc. Using the production process of the PASCAL VOC dataset for reference, the SSDD dataset can be used to train and test ship detection algorithms, which enables researchers to compare algorithm performance under the same conditions. The detailed descriptions of the SSDD dataset are shown in Table 1. The distribution of ship pixel size on the SSDD dataset is shown in Figure 1. In Figure 1, we measure ship size from the pixel level (i.e., the number of pixels) rather than the physical level. The reasons are as follows: (1) SAR images in different datasets have inconsistent resolutions, and these datasets' publishers only provided a rough resolution range, so we cannot perform a strict comparison using the physical size; (2) in the deep learning community, it is common sense to use pixels to measure the object size in a relative pixel proportion among the whole image [35]. As can be seen from the Figure 1, the SSDD dataset is mainly composed of small ships. Moreover, the ship size varies greatly, with the smallest being about 7 × 4 pixels and the largest being about 381 × 69 pixels, which increases the diversity of ships on the dataset.
In order to improve the detection performance of the proposed model for multi-scale ships, we classify the ship targets on the SSDD dataset by ship pixel size. We classify ship objects with an area less than 32 × 32 pixels as small ships, with an area from 32 × 32 pixels to 64 × 64 pixels as medium-sized ships, and with an area larger than 64 × 64 pixels as large ships. The statistical results of ship targets are shown in Table 2. In order to improve the detection performance of the proposed model for multi-scale ships, we classify the ship targets on the SSDD dataset by ship pixel size. We classify ship objects with an area less than 32 × 32 pixels as small ships, with an area from 32 × 32 pixels to 64 × 64 pixels as medium-sized ships, and with an area larger than 64 × 64 pixels as large ships. The statistical results of ship targets are shown in Table 2. (2) Inshore and offshore dataset classification In general, it is difficult to detect inshore ships due to the interference of complex backgrounds [28]. In order to effectively reduce the interference of complex backgrounds and better detect inshore ships, we regard images containing land as inshore samples, and images not containing land as offshore samples, and then establish SSDD-inshore and SSDD-offshore datasets, which are shown in Table 3. Table 3. SSDD-inshore and SSDD-offshore.

Datasets
Images Ships SSDD-inshore  227  668  SSDD-offshore  933  1910  SSDD  1160  2578 (3) Oversample of local image data The SSDD dataset contains some densely arranged small ships, which are relatively few compared to larger-sized ships. This imbalance of data distribution will cause the model to pay more attention to larger-sized ships, which will confuse multiple densely  (2) Inshore and offshore dataset classification In general, it is difficult to detect inshore ships due to the interference of complex backgrounds [28]. In order to effectively reduce the interference of complex backgrounds and better detect inshore ships, we regard images containing land as inshore samples, and images not containing land as offshore samples, and then establish SSDD-inshore and SSDD-offshore datasets, which are shown in Table 3. Table 3. SSDD-inshore and SSDD-offshore.

Datasets
Images Ships SSDD-inshore  227  668  SSDD-offshore  933  1910  SSDD  1160  2578 (3) Oversample of local image data The SSDD dataset contains some densely arranged small ships, which are relatively few compared to larger-sized ships. This imbalance of data distribution will cause the model to pay more attention to larger-sized ships, which will confuse multiple densely arranged small ship targets with one larger ship target, resulting in serious missed detections. In order to improve the detection performance, during the training process, we perform oversample data enhancement on images containing densely arranged small ships to encourage the model to pay more attention to them. The method of the oversample is shown in Figure 2.
arranged small ship targets with one larger ship target, resulting in serious missed detections. In order to improve the detection performance, during the training process, we perform oversample data enhancement on images containing densely arranged small ships to encourage the model to pay more attention to them. The method of the oversample is shown in Figure 2.

Dataset Introduction
The AIR-SARShip dataset, established in 2019, is a SAR ship detection dataset of high resolution and large-size scene [36]. The dataset contains 31 large-scale SAR images of Gaofen-3 (GF-3). The scene types include ports, islands and reefs, sea surface with different levels of sea state, etc. The background covers various scenes such as inshore and offshore scenes. The AIR-SARShip dataset uses the production process of PASCAL VOC dataset for reference. The detailed descriptions of AIR-SARShip are shown in Table 4. (1) Image cropping into sub-images Taking into account the limitation of Graphics Processing Unit (GPU) memory, we crop 31 large images into sub-images of 512 × 512 pixels in steps of 256 pixels. In order to reduce the imbalance between the foreground and background of the dataset, only reserve sub-images containing ship targets were used for training and testing. In the end, the cropped AIR-SARShip dataset includes a total of 719 sub-images and 1585 ship targets. This kind of image cropping is common sense in the DL community, and the original large SAR images with 24,000 × 16,000 pixels in the LS-SSDD-v1.0 dataset [37] are cropped into small sub-images with 800 × 800 pixels for the neural network training.

Dataset Introduction
The AIR-SARShip dataset, established in 2019, is a SAR ship detection dataset of high resolution and large-size scene [36]. The dataset contains 31 large-scale SAR images of Gaofen-3 (GF-3). The scene types include ports, islands and reefs, sea surface with different levels of sea state, etc. The background covers various scenes such as inshore and offshore scenes. The AIR-SARShip dataset uses the production process of PASCAL VOC dataset for reference. The detailed descriptions of AIR-SARShip are shown in Table 4. Taking into account the limitation of Graphics Processing Unit (GPU) memory, we crop 31 large images into sub-images of 512 × 512 pixels in steps of 256 pixels. In order to reduce the imbalance between the foreground and background of the dataset, only reserve sub-images containing ship targets were used for training and testing. In the end, the cropped AIR-SARShip dataset includes a total of 719 sub-images and 1585 ship targets. This kind of image cropping is common sense in the DL community, and the original large SAR images with 24,000 × 16,000 pixels in the LS-SSDD-v1.0 dataset [37] are cropped into small sub-images with 800 × 800 pixels for the neural network training.
(2) Ship target classification The distribution of ship pixel size on the cropped AIR-SARShip dataset is shown in Figure 3. As can be seen from the figure, the AIR-SARShip dataset is mainly composed of small-and medium-sized ships. Ship size varies greatly, with the smallest being about 18 × 4 pixels and the largest being about 344 × 72 pixels. (2) Ship target classification The distribution of ship pixel size on the cropped AIR-SARShip dataset is shown in Figure 3. As can be seen from the figure, the AIR-SARShip dataset is mainly composed of small-and medium-sized ships. Ship size varies greatly, with the smallest being about 18 × 4 pixels and the largest being about 344 × 72 pixels. Similar to the processing of the SSDD dataset, we classify ship targets of the cropped AIR-SARShip dataset by size. The results are shown in Table 5. (3) Inshore and offshore dataset classification We establish AIR-SARShip-inshore and AIR-SARShip-offshore datasets, as shown in Table 6. Table 6. AIR-SARShip-inshore and AIR-SARShip-offshore.

Materials and Methods
The proposed model is an anchor-free detector based on Centernet, and its overall framework is shown in Figure 4. The target detection process is as follows: first, use the Similar to the processing of the SSDD dataset, we classify ship targets of the cropped AIR-SARShip dataset by size. The results are shown in Table 5. (3) Inshore and offshore dataset classification We establish AIR-SARShip-inshore and AIR-SARShip-offshore datasets, as shown in Table 6. Table 6. AIR-SARShip-inshore and AIR-SARShip-offshore.

Datasets
Images Ships

Materials and Methods
The proposed model is an anchor-free detector based on Centernet, and its overall framework is shown in Figure 4. The target detection process is as follows: first, use the backbone network of CBAM-DLA34 to extract ship features; then, input the original feature map into FEM and output the foreground enhancement feature map; finally, conduct regression for the center point coordinates, size, offset, and rotation angle of the ship, so as to realize the rotation detection of ship targets.

CBAM-DLA34
The imaging mechanism of SAR images is quite different to that of ordinary optical images. There are a lot of speckle noises, which makes it difficult to extract and merge features of SAR images, and leads to a weak focus on small ship targets. Therefore, we introduce the lightweight attention module CBAM [38] into the backbone network to build CBAM-DLA34. CBAM is an attention module that combines channel and space. It infers attention weights along two independent dimensions of the channel and space in turn, and then multiplies the attention weights with the input feature map for adaptive feature optimization. The attention module improves the feature expression ability of the feature extraction network for ship targets in SAR images, and improves the focus on small-scale ship targets.
The specific implementation of the channel attention module includes three steps.
(1) Perform global average pooling and maximum pooling of the spatial dimension on the input feature of size × × , respectively, and obtain two feature maps of 1 × 1 × ; (2) Through multilayer perceptron (MLP), add the features output by the perceptron element by element; and (3) Obtain the channel attention weight through Sigmoid activation function. It is shown in formula (1): In the formula, represents Sigmoid activation function, and represent the weight of , and represent the features output by global average pooling and maximum pooling, respectively.
The specific implementation of the spatial attention module is as follows.
(1) Perform global average pooling and maximum pooling of the channel dimension on the input feature ', respectively, and obtain two feature maps of size × × 1.
(2) Splice the feature maps together according to the channels, and perform the convolution operation through a 7 × 7 convolution layer; and

CBAM-DLA34
The imaging mechanism of SAR images is quite different to that of ordinary optical images. There are a lot of speckle noises, which makes it difficult to extract and merge features of SAR images, and leads to a weak focus on small ship targets. Therefore, we introduce the lightweight attention module CBAM [38] into the backbone network to build CBAM-DLA34. CBAM is an attention module that combines channel and space. It infers attention weights along two independent dimensions of the channel and space in turn, and then multiplies the attention weights with the input feature map for adaptive feature optimization. The attention module improves the feature expression ability of the feature extraction network for ship targets in SAR images, and improves the focus on small-scale ship targets.
The specific implementation of the channel attention module includes three steps.
(1) Perform global average pooling and maximum pooling of the spatial dimension on the input feature F of size H × W × C, respectively, and obtain two feature maps of 1 × 1 × C; (2) Through multilayer perceptron (MLP), add the features output by the perceptron element by element; and (3) Obtain the channel attention weight M c through Sigmoid activation function. It is shown in Formula (1): In the formula, σ represents Sigmoid activation function, W 0 and W 1 represent the weight of MLP, F c avg and F c max represent the features output by global average pooling and maximum pooling, respectively.
The specific implementation of the spatial attention module is as follows.
(1) Perform global average pooling and maximum pooling of the channel dimension on the input feature F , respectively, and obtain two feature maps of size H × W × 1. (2) Splice the feature maps together according to the channels, and perform the convolution operation through a 7 × 7 convolution layer; and (3) Obtain the spatial attention weight M s through Sigmoid activation function. It is shown in Formula (2): In the formula, σ represents Sigmoid activation function, f 7×7 represents a convolution operation with the filter size of 7 × 7, F s avg and F s max represent the features output by global average pooling and maximum pooling, respectively.
The method of adding CBAM to ResBlock [39] is shown in Figure 5. First, perform a convolution calculation on the feature map generated by the previous residual block to generate an input feature map F. Second, Input F into the channel attention module to obtain the channel attention feature M c , and then obtain the feature map F by multiplying F and M c element by element. Third, input F into the spatial attention module to obtain the spatial attention feature M s , and then obtain the feature map F by multiplying F and M s element by element. Finally, perform a shortcut connection to obtain the final feature map F , which is used as the input of the next ResBlock. (3) Obtain the spatial attention weight through Sigmoid activation function. It is shown in formula (2): In the formula, represents Sigmoid activation function, × represents a convolution operation with the filter size of 7 × 7, and represent the features output by global average pooling and maximum pooling, respectively.
The method of adding CBAM to ResBlock [39] is shown in Figure 5. First, perform a convolution calculation on the feature map generated by the previous residual block to generate an input feature map . Second, Input into the channel attention module to obtain the channel attention feature , and then obtain the feature map ' by multiplying F and element by element. Third, input ' into the spatial attention module to obtain the spatial attention feature , and then obtain the feature map '' by multiplying ' and element by element. Finally, perform a shortcut connection to obtain the final feature map ''', which is used as the input of the next ResBlock. DLA is a network with hierarchical skip connections [40]. It designs two structures: Iterative Deep Aggregation (IDA) and Hierarchical Deep Aggregation (HDA). According to the basic network structure, IDA refines the resolution and aggregation scale step by step to integrate semantic information. HDA aggregates various levels into representations of different grades through its own tree-like connection structure to integrate spatial information.
We improve the DLA network, and propose CBAM-DLA34, which uses CBAM-ResBlock as the basic module. The specific structure of CBAM-DLA34 is shown in Figure  6.  DLA is a network with hierarchical skip connections [40]. It designs two structures: Iterative Deep Aggregation (IDA) and Hierarchical Deep Aggregation (HDA). According to the basic network structure, IDA refines the resolution and aggregation scale step by step to integrate semantic information. HDA aggregates various levels into representations of different grades through its own tree-like connection structure to integrate spatial information.
We improve the DLA network, and propose CBAM-DLA34, which uses CBAM-ResBlock as the basic module. The specific structure of CBAM-DLA34 is shown in Figure 6.
to the basic network structure, IDA refines the resolution and aggregation scale step by step to integrate semantic information. HDA aggregates various levels into representations of different grades through its own tree-like connection structure to integrate spatial information.
We improve the DLA network, and propose CBAM-DLA34, which uses CBAM-ResBlock as the basic module. The specific structure of CBAM-DLA34 is shown in Figure  6.

FEM
Due to the special imaging environment, the background of the SAR image is very complicated. Ship targets are easily affected by islands, reefs and speckle noise, especially in the detection of inshore ships. The traditional DL model is easy to confuse the ship target with the background of the similar shape, resulting in false detections and missed detections. Studies [41][42][43] show that semantic segmentation can assist object detection. The R-Centernet+ detector model proposed in this paper adopts an FEM based on semantic segmentation to predict the foreground area in advance to reduce the interference of complex backgrounds.
The implementation of the FEM is shown in Figure 7. We perform continuous convolution on the original feature map extracted by the backbone network to obtain the foreground image F. Element-wise multiplication on the foreground image F and the original feature map FM generates a foreground enhanced feature map FEFM.

FEM
Due to the special imaging environment, the background of the SAR image is very complicated. Ship targets are easily affected by islands, reefs and speckle noise, especially in the detection of inshore ships. The traditional DL model is easy to confuse the ship target with the background of the similar shape, resulting in false detections and missed detections. Studies [41][42][43] show that semantic segmentation can assist object detection. The R-Centernet+ detector model proposed in this paper adopts an FEM based on semantic segmentation to predict the foreground area in advance to reduce the interference of complex backgrounds.The implementation of the FEM is shown in Figure 7. We perform continuous convolution on the original feature map extracted by the backbone network to obtain the foreground image F. Element-wise multiplication on the foreground image F and the original feature map FM generates a foreground enhanced feature map FEFM. In order to mark the foreground accurately, the foreground labels based on semantic segmentation are used to train FEM. The generation method of foreground labels is shown in Figure 8.  In order to mark the foreground accurately, the foreground labels based on semantic segmentation are used to train FEM. The generation method of foreground labels is shown in Figure 8. First, project the groundtruth bounding box of an SAR image with a width (W) and a height (H) to the corresponding position under the output stride of R. Then create a label of size W R × H R × 1. If the pixel is within the projected bounding box, set the value to 1, otherwise set the value to 0.
In order to mark the foreground accurately, the foreground labels based on semantic segmentation are used to train FEM. The generation method of foreground labels is shown in Figure 8. First, project the groundtruth bounding box of an SAR image with a width (W) and a height (H) to the corresponding position under the output stride of R. Then create a label of size × × 1. If the pixel is within the projected bounding box, set the value to 1, otherwise set the value to 0.

Improved Detection Head
Ship targets in SAR images are characterized by a large aspect ratio, arbitrary direction and dense arrangement. As shown in the left figure of Figure 9, the traditional horizontal box detection is difficult to accurately express the length, width and axial information of ship targets. More importantly, in complex scenes, the horizontal box contains a lot of non-target area information. Compared with the horizontal box, the rotating box can reflect the ship information more accurately, and avoid the interference of the non-target area on the ship feature extraction, which meets the actual requirements of ship target feature extraction in SAR images. In order to achieve the fine detection of ship targets, the detection head structure of the Centernet network is improved, and the output of the angle map is added on the original basis, which realizes the rotation detection of ship targets in SAR images.
x FOR PEER REVIEW 10 of 22 Ship targets in SAR images are characterized by a large aspect ratio, arbitrary direction and dense arrangement. As shown in the left figure of Figure 9, the traditional horizontal box detection is difficult to accurately express the length, width and axial information of ship targets. More importantly, in complex scenes, the horizontal box contains a lot of non-target area information. Compared with the horizontal box, the rotating box can reflect the ship information more accurately, and avoid the interference of the nontarget area on the ship feature extraction, which meets the actual requirements of ship target feature extraction in SAR images. In order to achieve the fine detection of ship targets, the detection head structure of the Centernet network is improved, and the output of the angle map is added on the original basis, which realizes the rotation detection of ship targets in SAR images. An SAR image with a width of and a height of can generate four regression maps: (1) In the branch of position regression, the center point heat map ∈ 0, 1 × × is generated, where is the scaling ratio of output size, and is the number of object types in the ship detection ( = 1). When P = 1, it means the center point is detected. Gauss-

Horizontal bounding box
Rotating bounding box Figure 9. The difference between two kinds of labels. (1) In the branch of position regression, the center point heat map P ∈ [0, 1] W R × H R ×C is generated, where R is the scaling ratio of output size, and C is the number of object types in the ship detection (C = 1). When P xyc = 1, it means the center point is detected. Gaussian p is used to map each center point to the center point heat map, where p =| p R | is the center point corresponding to low resolution, and σ p is the standard deviation related to the target size. In the inference, the first 100 peaks in the heat map whose values are not less than 8 connected neighbors are used as the center points for prediction; (2) In the branch of size regression, the size prediction map For a center point p k , there is a size prediction S k = (w k , h k ), where w k and h k correspond to the width and height of the ship target, respectively; (3) In the branch of center point offset regression, the center point offset prediction is generated. All the ship targets share the same offset prediction value O k = (δ k , δ k ) to recover the discretization error caused by the output step size; and (4) In the branch of angle regression, the angle prediction map ∧ s ∈ R W R × H R ×1 is generated. The angle prediction value A k = θ k corresponds to the rotation angle of a ship target.
In order to realize the regression of the ship rotation angle, we use the rotating rectangular box to mark the ship target. In this paper, the method of the rotating rectangular box is defined as (x, y, w, h, θ), where (x, y) is the center point coordinate of the rotating rectangle, w is the long side of the rectangle, and h is the short side of the rectangle. As shown in Figure 10, θ represents the angle that the positive direction of y-axis rotates clockwise to the long side of the rectangle, and its range is [0, 180).

Loss Function
The loss function can be divided into five parts: ① is the loss of foreground prediction; ② is the loss of center point prediction; ③ is the loss of size prediction; ④ is the loss of center point offset prediction; ⑤ is the loss of angle prediction. and use the modified , as respectively shown in Formulas (4) and (5).
In the formula, is the predicted value of the foreground point, is its groundtruth, is the number of foreground points in the picture, and and are hyperparameters. According to [22] and our experiments before, we set = 2, = 4 in the experiments that result in the best outcomes.

Loss Function
The loss function can be divided into five parts: 1 L f is the loss of foreground prediction; 2 L p is the loss of center point prediction; 3 L s is the loss of size prediction; 4 L o is the loss of center point offset prediction; 5 L a is the loss of angle prediction. L f and L p use the modified f ocal loss, as respectively shown in Formulas (4) and (5).
In the formula, F xyc is the predicted value of the foreground point, G f xyc is its groundtruth, N is the number of foreground points in the picture, and α and β are hyperparameters. According to [22] and our experiments before, we set α = 2, β = 4 in the experiments that result in the best outcomes.
In the formula, P xyc is the predicted value of the center point, G p xyc is its groundtruth, N is the number of center points in the picture, α and β are hyperparameters. We also set α = 2, β = 4. L s , L o and L a use the modified smooth L1 loss, as shown in Formulas (6)-(8), respectively.
In the formula, S k is the predicted value of the target size at the center point P k , and G sk is its groundtruth.
In the formula, O k is the predicted value of the center point offset at the center point P k , and G ok is its groundtruth.
In the formula, A k is the predicted value of the target rotation angle at the center point P k , and G ak is its groundtruth.
The overall loss function is: In the formula, λ f , λ p , λ s , λ o , λ a are the weight coefficients of five parts of loss value, respectively.

Experiment and Analysis
We conduct experiments on two SAR image ship datasets SSDD and AIR-SARShip. The experiments are implemented in the Pytorch DL environment built on Windows10 system, with NVIDIA 2070 Super, CUDA v10.0, cuDNN v7.4.2.

Evaluation Metric
We use Precision (P), Recall (R), Average Precision (AP), Frame Per Second (FPS) and FLoating-point Operations (FLOPs) to evaluate the effect of ship detection. The calculation formulas of P and R are as follows [44]: In the formula, TP means true positive, FP means false positive, and FN means false negative. Generally speaking, P and R are a pair of values that affect each other, so it is difficult to evaluate the overall performance. Therefore, this paper uses AP to evaluate the detection model more objectively, which is expressed as: Detection speed is an important indicator of whether the model can be applied to actual detection tasks, and it should be paid attention to as much as detection precision. In order to comprehensively evaluate the performance of the proposed model, FPS is introduced to evaluate the speed of model detection.

FPS =
Frames 1s (13) In the formula, Frames represents the number of pictures processed by the detection model per second.
The time complexity is an important evaluation metric of the deep learning model. It determines the training and prediction time of the model. If the complexity is too high, it will lead to a lot of time for model training and prediction, which cannot quickly verify the idea and improve the model, nor can it achieve rapid prediction. We use FLOPs to measure the time complexity of the deep learning model. FLOPs of convolutional layers and fully connected layers are expressed as: In the formula, C i represents input channel, K represents kernel size, H and W represent output feature map size, C o represents output channel, I represents input neuron numbers and O represents output neuron numbers.

Experiments on SSDD
For SSDD, 80% of the entire dataset is randomly selected as the training set, and 20% of the entire dataset as the test set. During the training process of the SSDD dataset, the input SAR image is transformed to 512 × 512 pixels. The Adam optimizer with 0.9 momentum and 1 × 10 −4 weight decay is used for training. The number in the batch size is 8. The initial learning rate is set to 1.25 × 10 −4 , and the training epoch is 120. After 60 epochs of training, the learning rate is further attenuated to 1.25 × 10 −5 , and after 90 epochs, it is further attenuated to 1.25 × 10 −6 . We set λ p = 1.0, λ s = 0.1, and λ o = 1.0 according to the reference of [23] (Zhou X etc., 2019). We performed four experiments on λ a and λ f to determine the optimal weight assignments. The experimental results on different weight assignments are shown in Table 7. Finally, we set λ p = 1.0, λ s = 0.1, λ o = 1.0, λ a = 0.1, and λ f = 1 in all our experiments.  In the training process, in addition to using an oversample, data enhancement methods such as random horizontal flipping and random cropping are also used. The loss changes during the training on the SSDD dataset are shown in Figure 11. As the number of training increases, the loss decreases constantly and the model converges. In the training process, in addition to using an oversample, data enhancement methods such as random horizontal flipping and random cropping are also used. The loss changes during the training on the SSDD dataset are shown in Figure 11. As the number of training increases, the loss decreases constantly and the model converges. The satisfactory detection results on the SSDD dataset are shown in Figure 12. The first row of pictures shows the results of small ship detection, which is a challenging problem for ship detection. Obviously, the R-Centernet+ detector can effectively detect small ships. This shows that CBAM can improve the feature expression ability of the feature extraction network for ship targets, and enhance the focus on small ships. The second row of pictures shows the results of detecting ships in complex backgrounds. Obviously, R-Centernet+ detector can effectively distinguish ships from complex backgrounds. This The satisfactory detection results on the SSDD dataset are shown in Figure 12. The first row of pictures shows the results of small ship detection, which is a challenging problem for ship detection. Obviously, the R-Centernet+ detector can effectively detect small ships. This shows that CBAM can improve the feature expression ability of the feature extraction network for ship targets, and enhance the focus on small ships. The second row of pictures shows the results of detecting ships in complex backgrounds. Obviously, R-Centernet+ detector can effectively distinguish ships from complex backgrounds. This shows that FEM is beneficial to distinguish the foreground and the background, and improve the performance of the detection model in complex scenes.

021, 21, x FOR PEER REVIEW 14 of 22
shows that FEM is beneficial to distinguish the foreground and the background, and improve the performance of the detection model in complex scenes. However, there are still some false detections and missed detections in the SSDD dataset. As shown in Figure 13a, small ships close to each other are prone to missed detections. As shown in Figure 13b, a small number of small offshore islands are easily confused with ship targets. However, there are still some false detections and missed detections in the SSDD dataset. As shown in Figure 13a, small ships close to each other are prone to missed detections. As shown in Figure 13b, a small number of small offshore islands are easily confused with ship targets. However, there are still some false detections and missed detections in the SSDD d taset. As shown in Figure 13a, small ships close to each other are prone to missed dete tions. As shown in Figure 13b, a small number of small offshore islands are easily co fused with ship targets. The P-R curves of R-Centernet+, Centernet and Faster-RCNN on SSDD, SSDD-i shore and SSDD-offshore datasets are shown in Figure 14. It can be seen that both in i shore and offshore scenes, the P-R curves of the proposed method are relatively smoot and the detection performance is better than the other two methods. The detailed expe mental results on the SSDD dataset are shown in Table 8. The AP of the proposed metho on the SSDD dataset reaches 95.11%, and the detection performance is quite great on th whole. The proposed R-Centernet+ detector can detect offshore ships of various scal well, and the AP of the proposed method on SSDD-offshore dataset reaches 97.84%. Mo importantly, this model can better detect inshore ships in complex backgrounds, and th AP value reaches 93.72%, which is superior to the Faster-RCNN and Centernet. The P-R curves of R-Centernet+, Centernet and Faster-RCNN on SSDD, SSDD-inshore and SSDD-offshore datasets are shown in Figure 14. It can be seen that both in inshore and offshore scenes, the P-R curves of the proposed method are relatively smooth, and the detection performance is better than the other two methods. The detailed experimental results on the SSDD dataset are shown in Table 8. The AP of the proposed method on the SSDD dataset reaches 95.11%, and the detection performance is quite great on the whole. The proposed R-Centernet+ detector can detect offshore ships of various scales well, and the AP of the proposed method on SSDD-offshore dataset reaches 97.84%. More importantly, this model can better detect inshore ships in complex backgrounds, and the AP value reaches 93.72%, which is superior to the Faster-RCNN and Centernet.

Experiments on AIR-SARShip
For the AIR-SARShip dataset, 21 of the 31 SAR images are used as the training set, and the remaining 10 are used as the test set. After image cropping, there are 474 subimages in the training set and 245 sub-images in the test set. In the training process of the AIR-SARShip dataset, the input SAR image is transformed to 512 × 512 pixels. We use an Adam optimizer with 0.9 momentum and 1 × 10 −4 weight decay for training. The number of the batch size is 8. The initial learning rate is set to 1 × 10 −4 , and the training epoch is 100. After 60 epochs of training, the learning rate further decays to 1 × 10 −5 , and after 80 epochs of training, it further decays to 1 × 10 −6 .
In the training process, the data enhancement methods of random horizontal flipping and random cropping are used.
The loss changes during the training on the AIR-SARShip dataset are shown in Figure 15. As the number of training increases, the loss decreases constantly and the model converges. The satisfactory detection results on the Air-SARShip dataset are shown in Figure 1 The first row of pictures shows the results of detecting small ships, and the second row pictures shows the results of detecting ships in complex backgrounds. The satisfactory detection results on the Air-SARShip dataset are shown in Figure 16. The first row of pictures shows the results of detecting small ships, and the second row of pictures shows the results of detecting ships in complex backgrounds. The satisfactory detection results on the Air-SARShip dataset are shown in Figure 16.
The first row of pictures shows the results of detecting small ships, and the second row of pictures shows the results of detecting ships in complex backgrounds. The P-R curves of R-Centernet+, Centernet and Faster-RCNN on AIR-SARShip, AIR-SARShip-inshore and AIR-SARShip-offshore datasets are shown in Figure 18. As shown in the figure, the detection performance of the proposed method is superior to the Faster-RCNN and Centernet. The detailed experimental results of the AIR-SARShip dataset are shown in Table 9. The AP of the proposed method on the AIR-SARShip dataset reaches 84.89%, which can be applied in actual detection task. The AP of the proposed method on the AIR-SARShip-offshore dataset reaches 87.43%, and on the AIR-SARShip-inshore dataset reaches 68.45%. On the AIR-SARShip dataset, the detection performance is lower than that on the SSDD dataset due to the great differences between the training set and the test set of the AIR-SARShip dataset. It is a challenging research topic to increase the migration ability of models in large detection scenes. The P-R curves of R-Centernet+, Centernet and Faster-RCNN on AIR-SARShip, AIR-SARShip-inshore and AIR-SARShip-offshore datasets are shown in Figure 18. As shown in the figure, the detection performance of the proposed method is superior to the Faster-RCNN and Centernet. The detailed experimental results of the AIR-SARShip dataset are shown in Table 9. The AP of the proposed method on the AIR-SARShip dataset reaches 84.89%, which can be applied in actual detection task. The AP of the proposed method on the AIR-SARShip-offshore dataset reaches 87.43%, and on the AIR-SARShip-inshore dataset reaches 68.45%. On the AIR-SARShip dataset, the detection performance is lower than that on the SSDD dataset due to the great differences between the training set and the test set of the AIR-SARShip dataset. It is a challenging research topic to increase the migration ability of models in large detection scenes. 84.89%, which can be applied in actual detection task. The AP of the proposed method the AIR-SARShip-offshore dataset reaches 87.43%, and on the AIR-SARShip-inshore d taset reaches 68.45%. On the AIR-SARShip dataset, the detection performance is low than that on the SSDD dataset due to the great differences between the training set a the test set of the AIR-SARShip dataset. It is a challenging research topic to increase t migration ability of models in large detection scenes.

Comparisons with the State-of-the-Arts
We compare the R-Centernet+ detector with the state-of-the-art detection models on SSDD and AIR-SARShip datasets under the same conditions. The results of the comparisons are shown in Table 10. On the whole, the proposed R-Centernet+ detector is a stable and efficient SAR image ship detection model with high accuracy and fast speed. First, we compare the proposed R-Centernet+ detector with the classic anchor-based detection model Faster-RCNN [14]. On the SSDD dataset, the AP of the proposed model is 95.11%, which is 6.51% higher than that of Faster-RCNN. On the AIR-SARShip dataset, the AP of the proposed model is 84.89%, which is 5.71% higher than that of Faster-RCNN. The main reason is that the center point can well represent the features of the ships, which has a great advantage over using the anchor box. And because there is no anchor and no Non-Maximum Suppression (NMS), the speed of the proposed method has been greatly improved, from 14 to 33 FPS. It indicates that the proposed R-Centernet+ detector shows great application value in real-time SAR ship detection.
The proposed R-Centernet+ detector is compared with the classic anchor-free detection model Centernet [20]. On the SSDD dataset, the AP of the proposed model is 1.29% higher than that of Centernet. On the AIR-SARShip dataset, the AP of the proposed model is 1.18% higher than that of Centernet. The main reason is that R-Centernet+ detector model adds the CBAM and FEM, which improves the detection performance of small ships and ships in complex scenes. In addition, the R-Centernet+ detector realizes rotation detection, reduces the interference of non-target areas on ship features extraction, and also increases the robustness of the detection model.
We also compare the number of the parameters, FLOPs and the model size of these methods. The results of the comparisons are shown in Table 11. It can be found from Table 11 that the addition of CBAM and FEM have little effect on the model parameters, FLOPs and model size. The number of parameters of the proposed method is low, which indicates it is a lightweight network. The model size of the proposed method is only 77.92 MB, and such a lightweight model is convenient for the future FPGA or DSP porting.

Discussion
CBAM and FEM are added to R-Centernet+ detector. In order to verify the effectiveness of each module, ablation experiments are performed on the SSDD dataset. The results of the ablation experiment are shown in Table 12.

Effectiveness of CBAM
In order to verify the effectiveness of the CBAM, a contrast experiment is performed on the SSDD dataset. The detection results of the selected typical scenes in which small ships assemble are shown in Figure 19. It can be seen from Figure 19 that the introduction of the attention mechanism can make it better to extract the features of small ships, thereby increasing the focus on small ships and reducing the missed detection rate of ships. Adding CBAM makes the performance of the R-Centernet+ detector better, and the AP increases by 0.80%. This illustrates the advantages of the CBAM method for ship detection in SAR images, especially for the small ships that are not easy to detect. of the attention mechanism can make it better to extract the features of small increasing the focus on small ships and reducing the missed detection rate ing CBAM makes the performance of the R-Centernet+ detector better, a creases by 0.80%. This illustrates the advantages of the CBAM method for in SAR images, especially for the small ships that are not easy to detect.

Effectiveness of FEM
In order to verify the effectiveness of FEM, a contrast experiment is performed on the SSDD dataset. The detection results of the selected typical complex scenes are shown in Figure 20. It can be seen from Figure 20 that FEM shields islands and reefs that are very similar to ships, and reduces the false detection rate of ships. FEM improves AP by 1.09%. The detection results show that FEM enhances the foreground features and reduces the interference of complex backgrounds on ships.
In order to verify the effectiveness of FEM, a contrast experiment is performed on the SSDD dataset. The detection results of the selected typical complex scenes are shown in Figure 20. It can be seen from Figure 20 that FEM shields islands and reefs that are very similar to ships, and reduces the false detection rate of ships. FEM improves AP by 1.09%. The detection results show that FEM enhances the foreground features and reduces the interference of complex backgrounds on ships.

Conclusions
The paper proposes a lightweight anchor-free detector for ship detection called R-Centernet+, which can balance precision and speed well. CBAM-DLA34 is used to extract ship features with attention. FEM is adopted to introduce foreground features in advance to obtain a feature map with enhanced foreground. The detection head is designed to conduct regression for the center point coordinates, size, offset and rotation angle of the ships, and realize the rotation detection of ship targets. We conduct the ship detection experiment on SSDD and AIR-SARShip datasets, the AP is 95.11% and 84.89%, respectively, and we achieve a high detection speed with a value of 33 FPS. Through comparison with Faster-RCNN and Centernet, the performance of the proposed model is better than these

Conclusions
The paper proposes a lightweight anchor-free detector for ship detection called R-Centernet+, which can balance precision and speed well. CBAM-DLA34 is used to extract ship features with attention. FEM is adopted to introduce foreground features in advance to obtain a feature map with enhanced foreground. The detection head is designed to conduct regression for the center point coordinates, size, offset and rotation angle of the ships, and realize the rotation detection of ship targets. We conduct the ship detection experiment on SSDD and AIR-SARShip datasets, the AP is 95.11% and 84.89%, respectively, and we achieve a high detection speed with a value of 33 FPS. Through comparison with Faster-RCNN and Centernet, the performance of the proposed model is better than these state-of-the-art detection models, which proves the robustness and practicability of the model.
Inshore ship datasets and offshore ship datasets are established based on SSDD and AIR-SARShip datasets, respectively, to evaluate the effectiveness of reducing background interference of the proposed model. For the inshore ships, the AP of the proposed model on SSDD-inshore and AIR-SARShip-inshore reach 93.72% and 68.45%, respectively. For the offshore ships, the AP of the proposed model on SSDD-offshore and AIR-SARShip-offshore reach 97.84% and 87.43%, respectively. The detection performance of ships of various sizes is satisfactory. The results show that the model can detect ships of multi-scale in inshore and offshore scenes.
In future work, we will continue to improve FEM. We will use the instance segmentation method to mark the foreground area of each ship target, respectively, to achieve a more refined foreground enhancement. On the basis of distinguishing ship targets and complex backgrounds, we will further distinguish different ship targets to better detect densely arranged ships.