A Novel Anchor-Free Method Based on FCOS + ATSS for Ship Detection in SAR Images

Ship detection in synthetic aperture radar (SAR) images has been widely applied in maritime management and surveillance. However, some issues still exist in SAR ship detection due to the complex surroundings, scattering interferences, and diversity of the scales. To address these issues, an improved anchor-free method based on FCOS + ATSS is proposed for ship detection in SAR images. First, FCOS + ATSS is applied as the baseline to detect ships pixel by pixel, which can eliminate the effect of anchors and avoid missing detections. Then, an improved residual module (IRM) and a deformable convolution (Dconv) are embedded into the feature extraction network (FEN) to improve accuracy. Next, a joint representation of the classification score and localization quality is used to address the inconsistent classification and localization of the FCOS + ATSS network. Finally, the detection head is redesigned to improve positioning performance. Experimental results show that the proposed method achieves 68.5% average precision (AP), which outperforms other methods, such as single shot multibox detector (SSD), faster region CNN (Faster R-CNN), RetinaNet, representative points (RepPoints), and FoveaBox. In addition, the proposed method achieves 60.8 frames per second (FPS), which meets the real-time requirement.


Introduction
Synthetic aperture radar (SAR) images have played an important role in the military and civilian fields as a result of the development of sensors. At present, SAR ship detection has been widely applied in maritime management and surveillance and has attracted more and more scholars' attention [1][2][3][4][5][6]. For example, Pappas et al. [7] introduced superpixels (SPs) to improve the constant false alarm rate (CFAR) detector. Wang et al. [8] proposed a ship detection method based on hierarchical saliency filtering. He et al. [9] applied SP-level local information measurement to polarimetric SAR ship detection. However, these traditional ship detection methods are limited by weak generalization ability and high computational cost.
In the past ten years, machine learning [10,11] and deep learning have continued to evolve. Object detectors based on convolutional neural networks (CNNs) are booming and can be roughly divided into two categories. The first comprises two-stage detectors, such as faster region CNN (Faster R-CNN) [12]. The other comprises one-stage detectors, such as you only look once (YOLO) [13], single shot multibox detector (SSD) [14], and RetinaNet [15]. Two-stage detectors provide relatively good detection performance but accrue high time costs. In contrast, one-stage detectors are characterized by lower computational cost and higher real-time application value.
The main contributions of this paper are summarized as follows:

1. An improved anchor-free detector based on the FCOS + ATSS network is proposed for ship detection in SAR images, which can eliminate the effect of anchors and improve detection performance.

2. To improve accuracy, an improved residual module (IRM) and a deformable convolution (Dconv) are embedded into the feature extraction network (FEN).

3. Considering the inconsistency of classification and localization of the FCOS + ATSS network, we propose a joint representation of the classification score and localization quality.

4. Considering the blurred borders caused by scattering interferences, we redesign the detection head to improve positioning performance.

Materials and Methods
In this section, the FCOS + ATSS network is first introduced as the baseline of the proposed method. Second, the overall scheme of our method is presented. Then, the FEN redesign and the detection head redesign are described in detail. Finally, the loss function is given.

FCOS + ATSS
The FCOS network includes an FEN, a feature pyramid network (FPN), and a detection head, as shown in Figure 1. The FEN is responsible for computing a convolutional feature map over the entire input image. Specifically, for ResNets we use the feature activations output by each stage's last residual block. We denote the outputs of these last residual blocks as {C3, C4, C5} for the conv3, conv4, and conv5 outputs, which have strides of {8, 16, 32} pixels with respect to the input image. The FPN constructs feature pyramid levels P3 to P7, where P3 to P5 are computed from the outputs of the corresponding C3 to C5 using top-down and lateral connections, and P6 and P7 are obtained via a 3 × 3 convolutional layer with stride 2 applied on P5 and P6, respectively. Thus, the feature levels P3, P4, P5, P6, and P7 have strides of 8, 16, 32, 64, and 128, respectively. The different pyramid levels of the FPN are used to detect objects of different sizes. The detection head consists of a classification branch, a regression branch, and a center-ness branch. The classification branch and regression branch achieve object classification and localization, respectively: the classification branch predicts a vector p for classification, and the regression branch predicts a real vector t = (l, t, r, b) encoding the bounding-box coordinates, where l, t, r, b are the distances from the location to the four sides of the bounding box. The center-ness branch predicts the "center-ness" of a location, which represents the normalized distance from the location to the center of the object that the location is responsible for. Given the regression targets (l*, t*, r*, b*) for a location, the center-ness target is defined as:

centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )

where sqrt is used to slow down the decay of the center-ness. The center-ness ranges from 0 to 1.
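As an illustrative sketch, the center-ness target defined above can be computed per location as follows (the function name is ours, not from the paper):

```python
import math

def centerness_target(l, t, r, b):
    """Center-ness of a location given its regression targets
    (distances to the left, top, right, and bottom box sides)."""
    return math.sqrt(
        (min(l, r) / max(l, r)) * (min(t, b) / max(t, b))
    )

# A location at the exact box center scores 1.0 ...
print(centerness_target(50, 30, 50, 30))  # 1.0
# ... and the score decays as the location moves toward the border.
print(round(centerness_target(10, 30, 90, 30), 3))
```

Locations near the border of a box thus receive low targets, which is what lets center-ness suppress low-quality predictions at test time.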
When testing, the final score is the square root of the product of the predicted center-ness and the corresponding classification score. Consequently, center-ness can down-weight the scores of bounding boxes far from the center of an object. In this paper, we adopt the ATSS version of FCOS (FCOS + ATSS) as the baseline model for the proposed method. According to the characteristics of an object, ATSS is proposed to define positive and negative samples, which hardly increases the network's hyperparameters.
To apply FCOS + ATSS to SAR ship detection, we perform experiments on a high-resolution SAR image dataset (HRSID) [18]. The total training loss function is expressed as follows:

L = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (1/N_pos) Σ_{x,y} 1{c*_{x,y} ≥ 1} L_reg(t_{x,y}, t*_{x,y}) + (1/N_pos) Σ_{x,y} 1{c*_{x,y} ≥ 1} L_CE(centerness_{x,y}, centerness*_{x,y})

where (x, y) denotes each location on the feature maps; p_{x,y} denotes the predicted score and c*_{x,y} the ground-truth score; t_{x,y} denotes the predicted regression and t*_{x,y} the ground-truth regression; centerness denotes the predicted center-ness and centerness* the ground-truth center-ness; and N_pos denotes the number of positive samples. L_cls, L_reg, and L_CE denote the focal loss [15], the generalized intersection over union (GIoU) loss [42], and the binary cross-entropy loss, respectively. 1{c*_{x,y} ≥ 1} denotes the Iverson bracket indicator function, which is 1 if c*_{x,y} ≥ 1 and 0 otherwise.
As can be seen from Figure 2, FCOS + ATSS can accurately detect most of the ships. However, there is a severe problem of missing ships. In addition, the strong scattering objects are mistakenly detected as ship targets. To reduce false alarms and missing detections, the FEN and detection head of the FCOS + ATSS network need to be further improved.
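The GIoU regression loss, L_reg, used in the training objective above can be sketched for a single pair of axis-aligned boxes as follows; this is a minimal illustration of the standard GIoU formulation, not the paper's implementation:

```python
def giou_loss(box_p, box_g):
    """1 - GIoU for two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle (empty if the boxes are disjoint)
    x1, y1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    x2, y2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    # Smallest enclosing box, used by the GIoU penalty term
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

print(giou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 (perfect overlap)
```

Unlike plain 1 − IoU, this loss still produces a useful gradient for disjoint boxes, because the enclosing-box penalty grows as the boxes move apart.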

Overall Scheme of the Proposed Method
In this paper, an improved FCOS + ATSS method is proposed for ship detection. In brief, the IRM and Dconv are embedded into FEN to improve the ability of feature representation. In addition, a joint representation and a general distribution are used to improve the detection head. To clearly illustrate the proposed method, Figure 3 presents its flowchart. First, SAR images are fed into the improved FEN, which is utilized to extract the feature maps (C1 to C5) of SAR images by a bottom-up pathway. Second, the FPN is established to construct multiscale feature pyramid levels (P3 to P7) with C = 256 channels. Specifically, P3 to P5 are computed from C3 to C5 via lateral connections and a top-down pathway. P6 is computed from C5 via a 3 × 3 convolution. P7 is computed from P6 by applying a ReLU function and a 3 × 3 convolution. Finally, the improved detection heads output detection results that include classification and localization.
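To make the stride arithmetic concrete, a small sketch of the spatial sizes of P3 to P7 for an 800 × 800 input (the HRSID image size) is given below. It assumes ceil rounding at each downsampling step, which is an implementation detail not stated in the paper:

```python
import math

def pyramid_sizes(h, w):
    """Spatial sizes of P3..P7 for an h x w input; level Pk has stride 2**k,
    and each map is assumed to be ceil(input / stride) on each side."""
    return {f"P{k}": (math.ceil(h / 2**k), math.ceil(w / 2**k)) for k in range(3, 8)}

for level, size in pyramid_sizes(800, 800).items():
    print(level, size)
```

The coarse levels (P6, P7) therefore cover the image with only a handful of locations each, which is why they are reserved for the largest objects.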

Feature Extraction Network Redesign
Increasing the depth or width of a network is usually used to improve accuracy. However, as the number of hyperparameters increases, so does the complexity and computational cost of a network. Therefore, how to improve accuracy without increasing parameters is a difficult trade-off problem. The inception model can achieve a high level of accuracy while maintaining low model complexity. The reason is that the inception model follows a split-transform-merge strategy. However, the hyperparameter setting of the inception model is complex, so the model's scalability is moderate. For better accuracy, this paper introduces the strategy of split-transform-merge into the residual module, as shown in Figure 4.

In Figure 4a, the parameters of the original residual module are 256 × 64 + 3 × 3 × 64 × 64 + 64 × 256 = 69,632. In Figure 4b, the parameters of the IRM are 32 × (256 × 4 + 3 × 3 × 4 × 4 + 4 × 256) = 70,144. Therefore, the parameters of the network hardly change after the improvement in the residual module. To reduce the model's parameters, we introduce a group convolution to further optimize the IRM, as shown in Figure 4c. In the group convolution, the input and output channels are divided into 32 groups, and convolutions are performed separately within each group. To further improve the network's ability to adapt to SAR ships, we use the Dconv [43] to replace the ordinary convolution.
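The parameter counts above can be verified with a short script. Folding the 32 four-channel paths into a single 128-channel, 32-group convolution mirrors the ResNeXt equivalence and is our reading of Figure 4c, not a statement from the paper:

```python
def conv_params(c_in, c_out, k=1, groups=1):
    """Weight count of a convolution layer (biases and BN omitted)."""
    return (c_in // groups) * k * k * c_out

# Original bottleneck (Figure 4a): 256 -> 64 -> 3x3 -> 64 -> 256
original = conv_params(256, 64) + conv_params(64, 64, k=3) + conv_params(64, 256)

# IRM (Figure 4b): 32 parallel paths of 256 -> 4 -> 3x3 -> 4 -> 256
irm = 32 * (conv_params(256, 4) + conv_params(4, 4, k=3) + conv_params(4, 256))

# Figure 4c (assumed): the 32 paths folded into one 128-channel,
# 32-group 3x3 convolution inside a 256 -> 128 -> 256 bottleneck
grouped = (conv_params(256, 128)
           + conv_params(128, 128, k=3, groups=32)
           + conv_params(128, 256))

print(original, irm, grouped)  # 69632 70144 70144
```

The grouped form matches the 32-path IRM exactly; the saving the paper refers to is relative to the same 128-channel bottleneck without grouping, whose 3 × 3 layer would need 32 times as many weights.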

Detection Head Redesign
Among all bounding boxes output by the FCOS + ATSS network, some bounding boxes with accurate localization may be eliminated due to their low classification scores. In addition, some bounding boxes containing background pixels may be preserved due to their high classification scores. This is because of the inconsistency of classification and localization. To address this issue, we propose a joint representation of the classification score and localization quality.
Focal loss (FL) is adopted by FCOS + ATSS to address class imbalance during training. A typical form of FL is as follows:

FL(p) = −α (1 − p)^γ log(p) if y = 1, and −(1 − α) p^γ log(1 − p) if y = 0

where y ∈ {0, 1} denotes the ground-truth class and p ∈ [0, 1] is the predicted probability. α and γ are the weighting factor and tunable focusing parameter, respectively. As shown in Figure 5, the proposed method softens the standard one-hot category label and uses an IoU ∈ [0, 1] label on the corresponding category (see the classification branch in Figure 5), where IoU is the IoU score between the predicted bounding box and its corresponding ground-truth bounding box during training. Specifically, IoU = 0 denotes the negative samples with 0 quality scores, and 0 < IoU ≤ 1 stands for the positive samples with target IoU scores. Because FL only supports discrete labels {0, 1}, the original FL needs to be improved. We therefore extend FL to a continuous-label form:

QFL(σ) = −|y − σ|^β ((1 − y) log(1 − σ) + y log(σ))

where σ denotes the sigmoid operator and β denotes the modulating factor.
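A hedged sketch of this continuous-label focal loss is given below. It follows the quality focal loss formulation from the generalized focal loss literature, with β = 2 as in our experimental setting; the paper's exact variant may differ:

```python
import math

def quality_focal_loss(sigma, y, beta=2.0):
    """Quality focal loss for one prediction: sigma is the sigmoid output,
    y in [0, 1] is the IoU quality label (0 for negative samples)."""
    ce = -((1 - y) * math.log(1 - sigma) + y * math.log(sigma))
    return abs(y - sigma) ** beta * ce

# The loss vanishes when the prediction matches the quality label ...
print(round(quality_focal_loss(0.7, 0.7), 6))  # 0.0
# ... and grows for confident but poorly localized predictions.
print(round(quality_focal_loss(0.9, 0.2), 3))
```

The modulating term |y − σ|^β plays the role of (1 − p)^γ in standard FL: well-estimated samples are down-weighted and hard samples dominate the gradient.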

Due to scattering interferences from land or sea surface in SAR images, the borders of ships are relatively unclear. As stated in Section 2.1, we adopt the relative offsets from the location to the four sides of a bounding box as the regression targets (see the regression branch in Figure 5). As shown in Figure 5a, in the FCOS + ATSS model the regressed label y is expressed as a single Dirac delta distribution, δ(x − y), which cannot cope with the ambiguity and uncertainty of the data. To improve positioning performance, we model the regressed label, y, as a general distribution, P(x), as shown in Figure 5b. The predicted value, ŷ, is presented as:

ŷ = ∫_{y0}^{yn} x P(x) dx

where y0 and yn denote the minimum and maximum values of y. To facilitate implementation with neural networks, we transform the above equation into a discrete form. Specifically, we divide the range [y0, yn] into equal intervals with Δ = 1, so the predicted value, ŷ, can be expressed as:

ŷ = Σ_{i=0}^{n} P(y_i) y_i

where Σ_{i=0}^{n} P(y_i) = 1. The shape of P(x) is optimized to improve the efficiency of network learning with the following loss function:

L_P(P(y_i), P(y_{i+1})) = −((y_{i+1} − y) log(P(y_i)) + (y − y_i) log(P(y_{i+1})))

where y_i and y_{i+1} are the two values closest to y.
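A minimal sketch of this discrete general distribution and its loss L_P follows; the bin count matches our setting n = 16, while the probabilities and target are illustrative values of our own:

```python
import math

def expected_offset(probs):
    """Predicted offset as the expectation of a discrete distribution
    over the bins y_i = 0, 1, ..., n (interval width Delta = 1)."""
    assert abs(sum(probs) - 1.0) < 1e-6, "P(y_i) must sum to 1"
    return sum(i * p for i, p in enumerate(probs))

def distribution_loss(probs, y):
    """L_P: pushes probability mass toward the two bins closest to the
    continuous regression target y."""
    i = int(y)  # so that y_i <= y < y_{i+1}
    return -((i + 1 - y) * math.log(probs[i]) + (y - i) * math.log(probs[i + 1]))

# n = 16 bins; all mass on the two bins around the target y = 4.3
probs = [0.0] * 17
probs[4], probs[5] = 0.7, 0.3
print(expected_offset(probs))                    # approximately 4.3
print(round(distribution_loss(probs, 4.3), 4))
```

Concentrating the mass on the two neighboring bins, weighted by their distance to y, is exactly the configuration that minimizes L_P while keeping the expectation on target.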

Loss Function
The entire network is trained with a multi-task loss function that combines the quality focal loss of the classification branch, the GIoU loss function L_GIoU [38] of the regression branch, and the distribution loss L_P defined above.

Dataset and Evaluation Metrics
The HRSID [18] is used to evaluate the detection performance of the proposed method. The HRSID, proposed by Wei et al., is constructed using original SAR images from the Sentinel-1B, TerraSAR-X, and TanDEM-X satellites. There are 5604 images with 800 × 800 pixels and 16,951 multiscale ships in the HRSID. These images have various polarizations, imaging modes, imaging conditions, etc. The entire dataset is divided into a training set with 1821 images, a validation set with 1821 images, and a test set with 1962 images. Some samples and shape distributions of the HRSID are shown in Figure 6. The average precision (AP), AP50, and frames per second (FPS) [40] are used as the evaluation metrics.

Network Training
All the experiments were run on a personal computer with the Ubuntu 16.04 operating system. The software configuration consisted of the Python programming language, PyTorch, CUDA, and cuDNN. The hardware included an RTX 2080Ti GPU, an Intel i9-9820X CPU, and 128 GB of RAM. To keep the hyperparameters of the detectors identical, we chose MMDetection for training and testing. All the detectors were trained with the GPU for 30 epochs. The momentum and weight decay were set to 0.9 and 0.0001, respectively. The IoU threshold was set to 0.6 for training and testing. We chose SGD with an initial learning rate of 0.005 as the optimizer, and the other hyperparameters were set to the default values in MMDetection. We set β = 2 and n = 16 in the proposed method.

Ablation Study
Since the proposed method is similar in structure to the FCOS + ATSS network, FCOS + ATSS was used as the baseline. The detection results of ablation studies are shown in Table 1, where S1 and S2 denote the FEN redesign and detection head redesign, respectively.

Analysis on FEN Redesign
As shown in Table 1, the AP and AP50 of the baseline + S1 method are 4.8% and 3.7% higher than those of the baseline, respectively. These results suggest that the FEN redesign, i.e., the introduction of the IRM and the Dconv module, is effective for improving detection accuracy by strengthening the network's feature representation ability. Although the FPS of the baseline + S1 method decreases (from 73.9 to 61.0), it remains acceptable.

Analysis of Detection Head Redesign
As shown in Table 1, the AP and AP50 of the baseline + S2 method are 5.5% and 3.3% higher than those of the baseline, respectively, suggesting that the detection head redesign is effective for improving detection accuracy. On one hand, the baseline + S2 method achieves the joint representation of classification score and localization quality. On the other hand, the general distribution improves localization accuracy. Although the FPS of the baseline + S2 method decreases (from 73.9 to 73.7), the reduction is negligible.

Analysis on FEN Redesign and Detection Head Redesign
As shown in Table 1, the AP and AP50 of the baseline + S1 + S2 (Ours) method are 8.3% and 4.7% higher than those of the baseline, respectively. In addition, the AP and AP50 of our method are the highest among all methods, because the advantages of the two modules are combined. Although the detection speed of our method is the slowest, it still meets the real-time requirement. Figure 7 shows some detection results of FCOS + ATSS and the proposed method. It can be seen that the proposed method reduces missing detections and false alarms, indicating that it improves the detection performance of FCOS + ATSS.

Comparison with Other Methods
Based on the same experimental environment, the detection performance of the proposed method is compared with those of other methods, such as Faster RCNN [12], SSD [14], RetinaNet [15], RepPoints [32], and FoveaBox [37]. From Table 2, we can draw the following conclusions:

1. The AP and AP50 of SSD are the worst. This is because SSD uses high-resolution features to detect small ships, resulting in unsatisfactory detection results. In addition, SSD reduces the input image size to 300 × 300, which destroys image information.

3. The AP and AP50 of anchor-free methods such as RepPoints and FoveaBox are generally better than those of anchor-based methods, except for RetinaNet. This shows that the anchor-free method is more suitable for SAR ship detection.

4. The FPS of SSD is the highest, and that of Faster RCNN is the lowest. Although the FPS of our method is only 60.8, it already meets the real-time requirement.

To visually demonstrate the detection performance of the different methods, Figure 8 shows their comparative results. In the first column of Figure 8, one false alarm exists for the other methods, but the proposed method avoids it. In the second and third columns, some ships are missed by all methods; however, Faster RCNN and our method miss fewer ships than the others. In the fourth column, most of the ships are missed by the other methods, while our method has the fewest missing detections. Compared with the other methods, the proposed method obtains better detection performance.

Discussion
The detection results on the validation set are shown in Figure 9. As shown in Figure 9, the AP and AP 50 of the validation set gradually increased with the network training, and the AP and AP 50 finally stabilized to 68% and 90%, respectively. The AP and AP 50 of our method on the test set are 68.5% and 89.8%, respectively. Therefore, there is no significant difference between the detection accuracy of the training set and the test set, which verifies the effectiveness of the method in this paper.

The simulation experiments were carried out on the SSDD [20] to verify the model migration ability. AP and FPS were used to evaluate the detection performance of different methods on SSDD, as shown in Table 3. It can be seen that the AP of the proposed method is 98.4%, which is 6.4%, 4.5%, 2.1%, 1.9%, and 2.8% higher than SSD, Faster RCNN, RetinaNet, RepPoints, and FoveaBox, respectively. Although the FPS of R-FCOS is only 19.2, it is acceptable. In order to visually demonstrate the ship detection performance of the proposed method, some detection examples are given in Figure 10. It can be seen that the proposed method can accurately detect all ships.

Conclusions
In this paper, an improved anchor-free detector based on FCOS + ATSS is proposed for ship detection. We redesigned FCOS + ATSS with the aim to address the issues of the complex surroundings, scattering interference, and diversity of the scales. The IRM and Dconv were embedded into FEN to improve the feature representation ability. The joint representation of classification score and localization quality was used to address the inconsistency of classification and localization. The bounding box regression method was redesigned to improve object positioning performance. The experimental results on HRSID show that the proposed method achieves a competitive detection performance in comparison with SSD, Faster RCNN, RetinaNet, RepPoints, and FoveaBox. In addition, we verify the model migration ability of the proposed method on SSDD. However, it also needs to be noted that, although the proposed method has better detection accuracy, its detection speed is not the fastest, which requires further analysis and research. Therefore, lightweight networks are the focus of our future work.