A Ship Detection Method via Redesigned FCOS in Large-Scale SAR Images

Abstract: Ship detection in large-scale synthetic aperture radar (SAR) images has achieved breakthroughs as a result of the improvement of SAR imaging technology. However, some issues remain due to scattering interference, the sparsity of ships, and dim and small ships. To address these issues, an anchor-free method is proposed for dim and small ship detection in large-scale SAR images. First, fully convolutional one-stage object detection (FCOS) is applied as the baseline to detect ships pixel by pixel, which eliminates the effect of anchors and avoids the missed detection of small ships. Then, considering the particularity of SAR ships, the sample definition is redesigned based on the statistical characteristics of ships. Next, the feature extraction is redesigned to improve the feature representation of dim and small ships. Finally, the classification and regression are redesigned by introducing an improved focal loss and regression refinement with complete intersection over union (CIoU) loss. Experimental results show that the proposed R-FCOS method can detect dim and small ships in large-scale SAR images with higher accuracy than other methods.


Introduction
Due to the continuous improvements in the quantity and quality of synthetic aperture radar (SAR) images [1][2][3], ship detection has been widely applied in maritime management and surveillance. Ship detection has become an important task in both theoretical research and practical applications, and it has therefore attracted increasing attention from scholars [4][5][6][7][8]. For example, Li et al. [9] introduced superpixels into the constant false alarm rate (CFAR) ship detection method. Salembier et al. [10] applied graph signal processing based on the Maxtree representation for ship detection. Lin et al. [11] proposed a ship detection method via superpixels and Fisher vectors. Wang et al. [12] proposed the local contrast of Fisher vectors (LCFVs) for detecting ships. However, these traditional methods still have limitations. On the one hand, it is difficult to design appropriate handcrafted features for ships, resulting in weak generalization ability. On the other hand, the time cost is high due to the complex detection process.
Benefiting from the excellent performance of convolutional neural networks (CNNs) in object classification, research on CNN-based object detection is booming. Object detection methods proposed in the past ten years can be roughly divided into two categories. The first is the two-stage detectors, such as the faster region-based CNN (Faster R-CNN) [13]. The second is the one-stage detectors, such as you only look once (YOLO) [14], the single shot multi-box detector (SSD) [15], and RetinaNet [16]. Two-stage detectors can achieve relatively good detection performance but have high time costs, whereas one-stage detectors have lower computational costs but weaker detection performance.
As a result of the emergence of large-scale SAR datasets [17][18][19][20][21], CNN-based detectors have been introduced for ship detection [22][23][24][25][26]. For example, Deng et al. [27] proposed a method, trained from scratch, to detect small and densely clustered ships. Chen et al. [28] proposed a ship detection method via an adaptive recalibration mechanism. Jin et al. [29] proposed P2P-CNN, which uses the ship and its surroundings to deal with the small ship problem. Gao et al. [30] proposed SAR-Net to achieve a balance between speed and accuracy. These methods are anchor-based detectors, whose detection accuracies are related to the quality of the pre-defined anchors. Although the anchor-based methods work well in detecting ships, they still have certain shortcomings. Firstly, anchor boxes introduce additional hyperparameters, which are set based on prior knowledge. When the detection object changes, the hyperparameters need to be reset, and inappropriate anchor boxes will result in poor detection performance, thereby reducing the generalization ability of the network. Secondly, considering the sparsity of ships, most anchor boxes contain only empty background, and only a few contain ships. This results in far more negative samples than positive samples, i.e., a severe sample imbalance. Finally, a large number of anchor boxes are redundant because of the sparsity of ships, thereby bringing additional computational costs.
Anchor-free methods directly predict class and location instead of using the anchor box. Anchor-free methods consist of key-point-based methods [31][32][33] and center-based methods [34][35][36][37][38]. The key-point-based methods, such as CenterNet [33] and representative points (RepPoints) [31], first detect the key points and then combine the key points for object detection. Center-based methods, such as fully convolutional one-stage object detection (FCOS) [35], FoveaBox [36], and feature selective anchor-free (FSAF) [37], directly detect objects by the center point and bounding box. Although the anchor-free methods [39][40][41] for ship detection are still under development, they have shown a better performance potential and a trade-off between speed and accuracy. However, the performance of the anchor-free method still needs to be improved due to the following issues. First, there is a lot of scattering interference from islands or the sea in SAR images. Secondly, the sparsity of ships cannot be ignored. Finally, some ships are small and dim, that is, the proportion of ships is small, and the object scattering is weak.
Therefore, a novel method called redesigned FCOS (R-FCOS) is proposed for dim and small ship detection. Specifically, we redesign the anchor-free detector FCOS to address the issues described above, and the architecture of R-FCOS is shown in Figure 1.


The main contributions of this paper are summarized as follows:

1. It is shown that R-FCOS can eliminate the effect of anchors and avoid the missed detection of small ships.

2. Considering the particularity of SAR ships, the sample definition was redesigned based on the statistical characteristics of these ships.

3. The feature extraction was redesigned to improve the feature representation of dim and small ships.

4. The classification and regression stages were redesigned by introducing an improved focal loss and bounding box refinement with complete intersection over union (CIoU) loss.
The rest of the paper is organized as follows. Section 2 introduces the proposed method based on FCOS. In Section 3, the details of experiments and the analysis of the results are exhibited. Section 4 describes the discussion. Finally, the conclusion is presented in Section 5.

Materials and Methods
In this section, the FCOS network, the baseline of the proposed method, is introduced. Then, the R-FCOS method, including the redesigned sample definition, feature extraction, and classification and regression, is described in detail. Finally, the loss function is given.

Baseline
The core idea of FCOS is to detect objects pixel by pixel, that is, to directly predict the distance (l, t, r, b) from the center point to the four sides of the bounding box, as shown in Figure 2. Figure 3 shows the architecture of the FCOS network, which includes a backbone network, a feature pyramid network (FPN), a classification branch, a regression branch, and a center-ness branch. The backbone network is used to extract feature maps of the input images. In the FPN, the different feature pyramid levels are used to detect multi-scale objects. The classification branch and regression branch achieve object classification and localization, respectively. The center-ness branch evaluates the "center-ness" of a location, which represents the degree of coincidence of the position with the center of the ground-truth bounding box and is defined as:

centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )

where (l*, t*, r*, b*) are the regression targets of the location.
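The center-ness definition above (from the FCOS baseline) can be sketched in a few lines; this is an illustrative helper, not the paper's implementation:

```python
import math

def centerness(l, t, r, b):
    """Center-ness of a location given its distances (l, t, r, b)
    to the four sides of the ground-truth box (FCOS definition)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

A location at the exact center (l == r and t == b) scores 1.0, and the score decays toward 0 as the location approaches an edge of the box, which is how FCOS down-weights low-quality boxes predicted far from object centers.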

The total training loss of FCOS is as follows:

L({p_x,y}, {t_x,y}) = (1/N_pos) Σ_x,y L_cls(p_x,y, c*_x,y) + (λ/N_pos) Σ_x,y 1{c*_x,y > 0} L_reg(t_x,y, t*_x,y) + (1/N_pos) Σ_x,y 1{c*_x,y > 0} L_CE(s_x,y, s*_x,y)

where p is the classification score, t = (l, t, r, b) is the regression prediction, s is the predicted center-ness, and N_pos is the number of positive samples. L_cls, L_reg, and L_CE denote focal loss [16], generalized intersection over union (GIoU) loss [42], and binary cross-entropy loss, respectively, and 1{c*_x,y > 0} denotes the indicator function, which equals 1 if c*_x,y > 0 and 0 otherwise.

Sample Definition Redesign
SAR images and natural images have obvious differences in imaging conditions and object categories. Therefore, the previous sample definition methods based on the intersection over union (IoU) threshold or the scale range are not suitable for SAR ship detection. We therefore introduce a new sample definition method, which defines the sample threshold according to the statistical characteristics of SAR ships. The specific process is as follows:

Step 1: For a ground-truth box g ∈ G on each pyramid level, the L2 distances between the predicted boxes D_i ⊆ D on the ith pyramid level and the ground truth g are computed.

Step 2: The L2 distances are sorted from small to large, and the first k corresponding predicted boxes A_i are selected.

Step 3: Over the different pyramid levels i, the total set of candidate positive samples is the union A_g = ∪_i A_i.

Step 4: The IoU between each candidate in A_g and g is computed, yielding the set I_g = {IoU(a, g) | a ∈ A_g}.

Step 5: The mean m_g and standard deviation v_g of I_g are computed.

Step 6: The IoU threshold for g is T_g = m_g + v_g.

Step 7: For each candidate a ∈ A_g, if IoU(a, g) ≥ T_g and the center of a lies in g, then a is taken as a positive sample for g.

Step 8: Finally, the set of all positive samples P is the union of the positive samples of all ground-truth boxes.

Step 9: The remaining predicted boxes N = D \ P are negative samples.
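Steps 4–7 can be sketched as follows. This is a minimal illustration of the mean-plus-standard-deviation thresholding only; the distance-based candidate selection and the center-in-box check are omitted, and whether the population or sample standard deviation is used is our assumption:

```python
import math

def select_positives(ious):
    """Given the IoUs between the candidates A_g and a ground-truth box g,
    return the indices of candidates whose IoU reaches T_g = m_g + v_g."""
    m = sum(ious) / len(ious)                                     # mean m_g
    v = math.sqrt(sum((x - m) ** 2 for x in ious) / len(ious))    # std v_g (population std assumed)
    t = m + v                                                     # adaptive threshold T_g
    return [i for i, iou in enumerate(ious) if iou >= t]
```

For example, candidates with IoUs [0.1, 0.2, 0.6, 0.7] give m_g = 0.4 and v_g ≈ 0.255, so only the last candidate (IoU 0.7 ≥ 0.655) is kept as positive; the threshold adapts per ground-truth box rather than being fixed globally.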

Feature Extraction Redesign
As the feature level increases, the resolution of the feature map decreases while the semantic information contained increases. For dim and small ship detection, high-resolution feature representation is essential. This is because low-resolution feature maps contain very little object information, as small ships have fewer pixels and weak scattering. Previous methods usually restore the obtained low-resolution features to high resolution, such as Hourglass [43], SegNet [44], and the deconvolution network (DeconvNet) [45]. Different from these methods, we no longer add additional feature-resolution recovery operations. In other words, the high-resolution feature representation persists throughout the entire computing stage. In addition, the multi-resolution features are repeatedly fused to obtain rich semantic information and high-resolution feature representation. In summary, we redesign feature extraction via the same-resolution feature convolution (SFC) module, the multi-resolution feature fusion (MFF) module, and the feature pyramid (FP) module, as shown in Figure 1.
First, the input image I is down-sampled via two 3 × 3 stride-2 convolutions: F_0 = Conv_3×3,s=2(Conv_3×3,s=2(I)).
where F_0 denotes the resulting feature map, and Conv_3×3,s=2 denotes a 3 × 3 stride-2 convolution. Two types of SFC modules are used to extract feature maps; their main components are two residual blocks, i.e., Basicblock and Bottleneck, as shown in Figure 4, where Conv_1×1 and Conv_3×3 denote 1 × 1 stride-1 and 3 × 3 stride-1 convolutions, respectively, and f(·) and h(·) denote the operations of Bottleneck and Basicblock, respectively.

Figure 4. The architectures of the two SFC modules. X denotes the channel dimension; Y denotes the number of Basicblocks.

The MFF module achieves information fusion between features through convolution and up-sampling. Figure 5 shows the fusion process of three-resolution features. Given the three-resolution features F_31, F_32, and F_33, the output features {F_41, F_42, F_43, F_44} are obtained by fusion, where Upsample_s=2 and Upsample_s=4 denote bilinear up-sampling with stride 2 and stride 4, respectively.
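The identity-shortcut structure that Basicblock relies on can be illustrated with a 1-D toy. This is only a conceptual sketch: real Basicblocks use 2-D convolutions with learned weights and batch normalization, whereas here a hand-rolled 3-tap 1-D convolution stands in:

```python
def conv3(x, k):
    """1-D convolution with a 3-tap kernel k, zero padding, stride 1."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(k[j] * padded[i + j] for j in range(3)) for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def basicblock(x, k1, k2):
    """Residual Basicblock sketch: two convolutions plus an identity shortcut,
    i.e. output = ReLU(F(x) + x)."""
    out = conv3(relu(conv3(x, k1)), k2)
    return relu([o + xi for o, xi in zip(out, x)])
```

With identity kernels [0, 1, 0] the convolutions pass the signal through unchanged, so the block outputs 2x; the point is that the shortcut lets gradients and low-level detail bypass the convolutions, which helps preserve the weak responses of dim, small ships through deep stacks.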
The outputs of the last MFF module are aggregated as follows:

F_out = Concat(F_41, Upsample_s=2(F_42), Upsample_s=4(F_43), Upsample_s=8(F_44)) (19)

where Upsample_s=8 denotes bilinear up-sampling with stride 8, and Concat denotes the concatenation operation.
The dimension of P_in is set to 256 by a 1 × 1 convolution.
The FP module is constructed through a set of average pooling operations, where AvgPool_s=2^ϕ denotes average pooling with stride 2^ϕ and ϕ is the pyramid level. Finally, a 3 × 3 convolution is appended at each level to obtain the final feature maps {P_0, P_1, P_2, P_3, P_4}.
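The strided average pooling that builds the pyramid can be sketched as below. This is a minimal pure-Python illustration (window size equal to stride, which we assume; the per-level 3 × 3 convolution is omitted):

```python
def avgpool(fmap, stride):
    """Average pooling of a 2-D feature map with window == stride,
    e.g. stride 2**phi for pyramid level phi."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[i + di][j + dj] for di in range(stride) for dj in range(stride))
            / (stride * stride)
            for j in range(0, w, stride)
        ]
        for i in range(0, h, stride)
    ]

def build_pyramid(p_in, levels):
    """FP-module sketch: level phi is p_in pooled with stride 2**phi."""
    return [p_in] + [avgpool(p_in, 2 ** phi) for phi in range(1, levels)]
```

Each successive level halves the spatial resolution again, so a single high-resolution map P_in yields the multi-scale set {P_0, P_1, ...} without any further convolutional backbone stages.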

Classification and Regression Redesign
To utilize the geometric features and contextual information of dim and small ships, we implement a new feature representation for the bounding box via the features of nine fixed points, as shown in Figure 6. Deformable convolution (Dconv) is introduced to achieve this feature representation. Specifically, the offsets between the other eight points and (x, y) are used as the offsets of the Dconv.

The purpose of object detection is to identify the categories of all objects and give their localizations. Object detectors generally use bounding boxes with category labels and classification scores to represent the detection results. Some bounding boxes with accurate localization may be eliminated due to low classification scores, which indicates that the classification score is not suitable for estimating detection accuracy. Therefore, we use an IoU score (IS) to simultaneously express classification and localization accuracy. Due to the sparsity of ships, most of the regions in SAR images are negative samples, and sample imbalance is unavoidable. To solve this issue, an improved focal loss function is proposed to predict the IS. Specifically, an asymmetric training-sample weighting method is applied to focal loss. Focal loss is defined as:

FL(p) = −α(1 − p)^γ log(p) if z = +1, and FL(p) = −(1 − α)p^γ log(1 − p) if z = −1

where z ∈ {−1, +1} is the class label, p ∈ [0, 1] is the predicted probability, and α and γ are the weighting factor and tunable focusing parameter, respectively. Next, we apply an asymmetric sample weighting to improve the focal loss function.
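The baseline focal loss that the improved version builds on can be written directly from its definition. Note this is the standard form from RetinaNet [16], not the paper's asymmetric IS-weighted variant:

```python
import math

def focal_loss(p, z, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction.
    p: predicted probability of the positive class; z: label in {-1, +1}."""
    if z == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)
```

The (1 − p)^γ factor shrinks the loss of well-classified examples toward zero, so the abundant, easy negative regions of a sparse SAR scene contribute little and training focuses on the hard, ship-like samples.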
where IS and q denote the predicted and target IS, respectively. There is a lot of scattering interference from islands and the sea in SAR images. Therefore, we introduce a regression refinement branch to improve localization accuracy through a refinement factor (∆l, ∆t, ∆r, ∆b); the final output of the regression branch is (∆l · l, ∆t · t, ∆r · r, ∆b · b). To improve the speed and accuracy of loss convergence, a CIoU loss is introduced into the regression branch:

L_CIoU = 1 − IoU + ρ²(d, g)/∆² + βv

where d and g denote the center points of the predicted and ground-truth boxes, respectively; ρ(·) is the Euclidean distance; ∆ denotes the diagonal length of the smallest rectangle containing the predicted and ground-truth boxes; v measures the consistency of the aspect ratios of the two boxes; and β is an impact factor.
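A sketch of the CIoU loss for axis-aligned boxes follows, assuming the standard published definitions of v and β (β = v / (1 − IoU + v)); the paper may tune these differently:

```python
import math

def ciou_loss(box_p, box_g, eps=1e-9):
    """CIoU loss sketch for boxes given as (x1, y1, x2, y2)."""
    # Intersection and union for the IoU term.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared distance between center points: rho^2(d, g).
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    # Squared diagonal of the smallest enclosing rectangle: Delta^2.
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    delta2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    # Aspect-ratio consistency v and impact factor beta.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    beta = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / delta2 + beta * v
```

Unlike a plain IoU loss, the distance and aspect-ratio terms still provide a gradient when boxes barely overlap, which speeds up convergence for small ships whose predicted boxes often start far from the target.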

Loss Function
The entire network is trained with a multi-task loss function, where d_j denotes the initial box at location j on the feature map, and d and g are the refined and ground-truth boxes, respectively.

Results
All experiments in this paper were performed on a computer equipped with an RTX 2080Ti GPU and an Intel i9-9820X CPU. The operating system used is Ubuntu 16.04, and the basic framework is PyTorch. The average precision (AP) and frames per second (FPS) are used as the evaluation metrics.

Dataset
The Large-Scale SAR Ship Detection Dataset-v1.0 (LS-SSDD-v1.0) [20] was constructed from Sentinel-1 satellite data and includes 15 images of 24,000 × 16,000 pixels; a sample image is shown in Figure 7. To facilitate network training and testing, the original images are split into 9000 sub-images of 800 × 800 pixels, with 6000 for training and 3000 for testing. According to the bounding-box scale division of Microsoft Common Objects in Context (MS COCO), a ship with an area below 1024 pixels is small, one with an area from 1024 to 9216 pixels is medium, and one with an area above 9216 pixels is large. The distributions of ship shape and number are shown in Figure 8.
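The MS COCO-style scale split above can be expressed as a small helper; whether the 1024- and 9216-pixel boundaries are inclusive for the medium class is our assumption:

```python
def ship_scale(area):
    """MS COCO-style scale category for a ship bounding box.
    area: box area in pixels (1024 = 32**2, 9216 = 96**2)."""
    if area < 1024:
        return "small"
    if area <= 9216:       # boundary inclusivity assumed
        return "medium"
    return "large"
```

Bucketing every ground-truth box this way is how per-scale APs (AP_small, AP_medium, AP_large) are computed, which matters here because most LS-SSDD-v1.0 ships fall into the small category.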

Ablation Study
Since the proposed method is similar in structure to FCOS, FCOS is used as the baseline. Figure 10 shows the detection results of FCOS, FCOS with sample definition redesign (FCOS+SDR), FCOS with sample definition redesign and feature extraction redesign (FCOS+SDR+FER), and FCOS with sample definition redesign, feature extraction redesign, and classification and regression redesign (FCOS+SDR+FER+CRR).


Analysis of Sample Definition Redesign
As shown in Figure 10, the AP of FCOS+SDR is 1.3% higher than that of FCOS. This suggests that sample definition redesign is effective for improving detection accuracy by redefining the sample threshold according to the statistical characteristics of SAR ships.

Analysis of Feature Extraction Redesign
As shown in Figure 10, the APs of FCOS+SDR+FER are 2.3% and 1.0% higher than those of FCOS and FCOS+SDR, respectively, suggesting that feature extraction redesign is effective for improving detection accuracy. The main reason is that feature extraction redesign can obtain rich semantic information and high-resolution feature representation for dim and small ships.

Analysis of Classification and Regression Redesign
As shown in Figure 10, the APs of R-FCOS are 3.2% and 0.9% higher than those of FCOS and FCOS+SDR+FER, respectively. The main reason is that the classification and regression redesign is used to deal with these issues, i.e., complex surroundings, sparsity of ships, and dim and small ships. Specifically, the nine-point feature representation is used to extract geometric features and contextual information for dim and small ships. The improved focal loss function is used to predict IS. Finally, the regression refinement branch with CIoU loss is used to improve localization accuracy.


2. The AP of SSD is the worst, 17.7% lower than that of our method. Although SSD uses high-resolution features to detect small objects, these features contain less semantic information, resulting in unsatisfactory detection results. In addition, SSD reduces the input image size to 300 × 300, which destroys object information in the image.

3. The APs of the anchor-free methods, such as RepPoints, FSAF, FoveaBox, and R-FCOS, are generally better than those of the anchor-based methods, except for RetinaNet. This shows that anchor-free methods are more suitable for ship detection.

4. The FPS of SSD is the highest, and that of Faster RCNN is the lowest. Although the FPS of our method is only 52.0, it already meets real-time requirements.

Remote Sens. 2022, 14, 1153

To visually demonstrate the detection performance of R-FCOS, Figure 12 shows the comparative results of the different methods on LS-SSDD-v1.0. In the first column of Figure 11, some missed ships occur for all methods. FSAF and R-FCOS miss the fewest ships, but our method has higher classification scores. In the second column of Figure 11, most of the ships are missed by SSD, YOLOv3, and FoveaBox, and several false alarms are produced by Faster RCNN. In addition, false alarms and missed ships exist for RetinaNet, RepPoints, and FSAF. However, our method has only one missed ship. In the third column of Figure 11, false alarms or missed ships exist for all methods except YOLOv3 and R-FCOS. However, the classification scores of R-FCOS are higher than those of YOLOv3. Overall, R-FCOS outperforms the other methods.

Conclusions
In this paper, an anchor-free detector, R-FCOS, was proposed for dim and small ship detection. We redesigned FCOS to address the issues of complex surroundings, the sparsity of ships, and dim and small ships. The sample definition redesign deals with the particularity of SAR ships. The feature extraction redesign improves the feature representation of dim and small ships. The classification and regression redesign introduces an improved focal loss and regression refinement with CIoU loss. In the experimental part, we verified the effectiveness of the sample definition redesign, the feature extraction redesign, and the classification and regression redesign. Experimental results on LS-SSDD-v1.0 showed that the proposed method achieves competitive detection performance in comparison with Faster RCNN, SSD, RetinaNet, YOLOv3, RepPoints, FSAF, and FoveaBox. In addition, we verified the model migration ability of the proposed method on SSDD. However, it should be noted that, although the proposed method has better detection performance, it cannot completely eliminate all false alarms and missed detections, which requires further analysis and research.


Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: