A Spatial Cross-Scale Attention Network and Global Average Accuracy Loss for SAR Ship Detection

Abstract: Neural network-based object detection algorithms offer high accuracy and end-to-end processing, and they have been widely used in synthetic aperture radar (SAR) ship detection. However, the multi-scale variation of ship targets, the complex background of near-shore scenes, and the dense arrangement of some ships make it difficult to improve detection accuracy. To address these problems, this paper proposes a spatial cross-scale attention network (SCSA-Net) for SAR image ship detection, which includes a novel spatial cross-scale attention (SCSA) module for eliminating the interference of land background. The SCSA module uses the features at each scale output by the backbone to calculate where the network needs to attend in space, and enhances the features output by the feature pyramid network (FPN) to eliminate interference from noise and complex land backgrounds. In addition, this paper analyzes the reasons for the "score shift" problem caused by average precision loss (AP loss) and proposes the global average precision loss (GAP loss) to solve it. GAP loss enables the network to distinguish positive and negative samples faster than focal loss and AP loss, and to achieve higher accuracy. Finally, we validate the effectiveness of the proposed method on the SAR Ship Detection Dataset (SSDD), SAR-Ship-Dataset, and High-Resolution SAR Images Dataset (HRSID). The experimental results show that the proposed method can significantly reduce the interference of background noise on the ship detection results, improve the detection accuracy, and achieve results superior to those of existing methods.


Introduction
Synthetic aperture radar (SAR) is a microwave sensor based on the scattering characteristics of electromagnetic waves for imaging, which has a certain cloud and ground penetration capability to detect hidden targets. This characteristic makes it suited for marine monitoring, mapping, and military applications. With the continuous exploitation of marine resources, the monitoring of marine vessels based on SAR images has received increasing attention. SAR image object detection aims to automatically locate and identify specific targets from images and has important application prospects in defense and civil fields such as target identification, target detection, marine development, and terrain classification [1][2][3][4].
The development of SAR image ship detection methods can be divided into two stages [5]: traditional detection methods represented by constant false alarm rate (CFAR) algorithms and deep learning methods represented by convolutional neural networks. The CFAR algorithm adaptively adjusts the threshold value using statistical models and sample selection strategies. It has been widely used due to its constant false alarm rate and low complexity. For example, the bilateral CFAR algorithm [6] takes into account the intensity distribution and spatial distribution when selecting thresholds to extreme aspect ratio problem and embedded a soft threshold attention module (STA) in the network to suppress the effects of noise and complex backgrounds. Tian et al. [36] employed an object characteristic-driven image enhancement (OCIE) module to enhance the variety of datasets and explored the dense feature reuse (DFR) module and receptive field expansion (RFE) module to increase the network receptive field and strengthen the transmission of information flow. Wei et al. [37] adopted a high-resolution feature pyramid network (HRFPN) to make full use of the information in the high-resolution and low-resolution features and proposed a high-resolution ship detection network (HR-SDNet) on this basis.
In order to further improve the detection accuracy of the network, in this paper, a new anchor-free spatial cross-scale attention network (SCSA-Net) is proposed, which contains a novel spatial cross-scale attention (SCSA) module that enhances features at different scales while mitigating the interference generated by complex backgrounds such as land.
In addition, to solve the "score shift" problem of average precision loss (AP loss), this paper improves AP loss and proposes global average precision loss (GAP loss). Three widely used datasets are used to experimentally validate the proposed methods and confirm their effectiveness.
The main contributions of our work are as follows:

1. A novel spatial cross-scale attention (SCSA) module is proposed, which consists of a cross-scale attention module and a spatial attention redistribution module. The former dynamically adjusts the position of network attention by combining information from different scales. The latter redistributes spatial attention to mitigate the influence of complex backgrounds and make ships more distinctive.
2. We analyze why AP loss causes the "score shift" problem and propose a global average precision loss (GAP loss) to solve it. Compared with traditional methods that use focal loss as the classification loss, training with GAP loss allows the network to optimize directly with the average precision (AP) as the target and to distinguish between positive and negative samples more quickly, achieving better detection results.
3. We propose an anchor-free spatial cross-scale attention network (SCSA-Net) for ship detection in SAR images, which reaches 98.7% AP on SSDD, 97.9% AP on SAR-Ship-Dataset, and 95.4% AP on HRSID, achieving state-of-the-art performance.
The remainder of this paper is organized as follows: the second section introduces the proposed method; the third section describes the experiments and analysis of the results; the fourth section discusses and suggests future work; finally, the fifth section summarizes this paper.

Methods
This section describes SCSA-Net for SAR ship detection, which contains the overall architecture of SCSA-Net, the spatial cross-scale attention module, and GAP loss.

Overall Architecture of SCSA-Net
The main framework of SCSA-Net is depicted in Figure 1. FCOS [25], one of the most representative one-stage anchor-free detectors, is chosen as the baseline. Following FCOS, ResNet-50 is used as the backbone of SCSA-Net for feature extraction, and the FPN is employed to handle multi-scale target detection. Additionally, a new spatial cross-scale attention module is added to the network, which dynamically enhances the FPN output features based on the backbone output features. The enhanced features are then forwarded to the class-box subnet for classification and regression.

Spatial Cross-Scale Attention Module
In the offshore scenes of SAR images, the network usually performs well because of the large distinction between sea clutter and ship targets. However, in near-shore scenes, the land portion of the background is complex, and the network easily confuses ship targets with small islands and land structures that have ship-like shapes. This leads to false alarms and missed detections. The spatial cross-scale attention module is proposed to reduce the interference of land with ship targets. It consists of two parts: the cross-scale attention module and the spatial attention redistribution module. The cross-scale attention module determines the regions requiring attention at each scale, while the spatial attention redistribution module redistributes the attention regions at each scale spatially to locate more accurate attention regions.

Cross-Scale Attention Module
The structure of the cross-scale attention module is shown in Figure 2. The multi-head attention module is used to extract the attention regions of each scale more accurately. For clarity, the formula in [38] is quoted as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_n = \mathrm{Attention}(Q W_n^Q, K W_n^K, V W_n^V) \tag{1}$$

where $W_n^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_n^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_n^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$ are the projection matrices, $d_k$ and $d_v$ are the dimensions of $K$ and $V$, respectively, and $d_{model}$ is the dimension of the multi-head attention module. In this paper, the multi-head attention inputs $Q$, $K$, and $V$ are the same, so we simplify $\mathrm{MultiHead}(Q, K, V)$ to $\mathrm{MultiHead}(X)$ when $Q = K = V = X$.

The features from the last residual block of the $i$-th stage in ResNet-50 are $M_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, $i = 3, 4, 5$, where $C_i$, $H_i$, and $W_i$ indicate the channel number, spatial height, and width, respectively. First, a convolution of size 1 is applied to $M_i$ to obtain $X_i \in \mathbb{R}^{256 \times H_i \times W_i}$, $i = 3, 4, 5$, in order to reduce the channel dimension. Then, a max pooling layer down-samples $X_i$ to the spatial size $H_5 \times W_5$ to reduce the computing cost. The down-sampled feature is vectorized as $X_i^v = \mathcal{V}(X_i) \in \mathbb{R}^{256 \times (H_5 \cdot W_5)}$, $i = 3, 4, 5$, where $\mathcal{V}$ denotes the vectorization function and $\mathcal{V}^{-1}$ denotes its inverse, which maps $X_i^v$ back to $\mathbb{R}^{256 \times H_5 \times W_5}$. The enhanced features $A_i \in \mathbb{R}^{256 \times H_5 \times W_5}$, $i = 3, 4, 5$, are calculated as in Equation (2).
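To make Equation (1) concrete, the following is a minimal pure-Python sketch of multi-head self-attention over vectorized feature tokens, assuming Q = K = V = X as in the text. It is not the authors' implementation and does not reproduce Equation (2). The helper names (`matmul`, `attention`, `multi_head`, `vectorize`), the toy sizes (4 channels instead of 256, a 2 × 2 map, 2 heads), and the random weights are all assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def multi_head(x, head_params, w_o):
    """MultiHead(X): every head attends over the same tokens x (Q = K = V = X)."""
    head_outputs = []
    for w_q, w_k, w_v in head_params:
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
    # Concatenate the heads token by token, then apply the output projection W^O.
    concat = [sum((out[t] for out in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

def vectorize(feature):
    """Flatten a (C, H, W) nested list into H*W tokens with C channels each."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    return [[feature[c][r][col] for c in range(channels)]
            for r in range(height) for col in range(width)]

# Toy sizes (hypothetical): 4 channels instead of 256, a 2 x 2 map, 2 heads.
C, H5, W5, NUM_HEADS, D_K = 4, 2, 2, 2, 2
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for learned weights or features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

feature = [rand_matrix(H5, W5) for _ in range(C)]
tokens = vectorize(feature)
head_params = [(rand_matrix(C, D_K), rand_matrix(C, D_K), rand_matrix(C, D_K))
               for _ in range(NUM_HEADS)]
w_o = rand_matrix(NUM_HEADS * D_K, C)
enhanced = multi_head(tokens, head_params, w_o)
print(len(enhanced), len(enhanced[0]))  # 4 tokens, 4 channels each
```

Each of the H5·W5 spatial positions becomes one token, so the attention weights form an (H5·W5) × (H5·W5) matrix per head. This is why down-sampling every scale to H5 × W5 first keeps the cost manageable.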

Spatial Attention Redistribution Module
The structure of the spatial attention redistribution module is shown in Figure 3. First, a deformable convolution and concatenation are applied to $A_i$ to obtain $A^c \in \mathbb{R}^{(256 \cdot 3) \times H_5 \times W_5}$. Then $A^c$ is fed into a convolution layer and a SoftMax activation layer to obtain $A^s \in \mathbb{R}^{5 \times H_5 \times W_5}$. $A^s_j \in \mathbb{R}^{1 \times H_5 \times W_5}$ denotes the feature map of the $j$-th channel of $A^s$, where $A^s_j$, $j = 3, 4, 5$, are treated as attention weights for the $j$-th scale only. However, some regions may need to be emphasized or suppressed at all scales, such as common features of ship targets or land areas that do not contain targets. For this reason, $A^s_1$ and $A^s_2$ are used as the weights that require attention and suppression at all scales, respectively. The redistributed attention weight $S_i \in \mathbb{R}^{1 \times H_5 \times W_5}$ is then given by Equation (3).

Figure 3. The structure of the spatial attention redistribution module.

Finally, $S_i$ is reshaped by upsampling to size $1 \times H_i \times W_i$ and used to enhance the features $F_i$, $i = 3, 4, 5$, from the FPN. Following the idea of FCOS, to further improve the detection capability for large-scale targets, we apply a series of convolutions of stride 2 and size 3 to the enhanced features of $F_5$ to obtain feature maps $F_6$ and $F_7$ with larger receptive fields. In the corresponding formulas, BD() denotes the broadcast operation, and Conv() denotes a convolution with a kernel size of 3, a stride of 2, and a padding of 1.
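Two mechanical steps of this module can be illustrated with a short sketch: the channel-wise SoftMax that produces the five maps of $A^s$, and the spatial size of $F_6$ and $F_7$ after a convolution with kernel size 3, stride 2, and padding 1. The sketch does not reproduce Equation (3), and it is not the authors' code. The function names, the toy logits, and the assumption of a 25 × 25 $F_5$ (an 800 × 800 input at a backbone stride of 32) are illustrative only.

```python
import math

def channel_softmax(logits):
    """Softmax across channels: logits is a list of maps [H][W], one per channel."""
    channels = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for r in range(height):
        for c in range(width):
            vals = [logits[j][r][c] for j in range(channels)]
            peak = max(vals)
            exps = [math.exp(v - peak) for v in vals]
            total = sum(exps)
            for j in range(channels):
                out[j][r][c] = exps[j] / total
    return out

def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical logits for a 5-channel, 2 x 2 map.
logits = [[[float(j + r - c) for c in range(2)] for r in range(2)] for j in range(5)]
a_s = channel_softmax(logits)
shared_attention, shared_suppression = a_s[0], a_s[1]  # A^s_1 and A^s_2
per_scale = {3: a_s[2], 4: a_s[3], 5: a_s[4]}          # A^s_3, A^s_4, A^s_5

# Assumed F_5 size: 800 / 32 = 25 pixels per side.
f5 = 25
f6 = conv_out_size(f5)
f7 = conv_out_size(f6)
print(f5, f6, f7)  # 25 13 7
```

Because the SoftMax runs across the five channels, the weights at each pixel sum to 1, so increasing one scale's attention at a location necessarily reduces the others.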

Average Precision Loss
In the object detection task, average precision (AP) is an important criterion for the detection result. The range of AP is $[0, 1]$: the closer the AP is to 1, the more accurate the detection result, and the closer it is to 0, the worse the detection result. The main optimization goal of the object detection task is to make the AP as close to 1 as possible. Based on this goal, AP loss was proposed by Chen et al. [39]. With a large number of samples, AP loss can achieve better performance than focal loss because it allows the network to be optimized directly for the AP and is not affected by a large number of true negative samples [39]. The AP loss $L_{AP}$ follows the formulation in [39], where $s_i$ is the classification score of the box $b_i$ output by the network before the sigmoid function, and each $b_i$ is assigned a label $t_i \in \{-1, 0, 1\}$ (label $-1$ means the box is not counted in the AP loss). $P = \{i \mid t_i = 1\}$ and $N = \{i \mid t_i = 0\}$ are the sets of positive and negative samples, respectively, and $|P|$ is the size of set $P$. Further, Chen et al. [39] give a formula for the gradient $g_i$ of $s_i$ to address the problem that $L_{AP}$ is non-differentiable in backpropagation. In that formula, $y_{ij} = \mathbb{1}_{t_i = 1,\, t_j = 0}$ for all $i, j$, where $\mathbb{1}$ is an indicator function that equals 1 only if the subscript condition holds (i.e., $t_i = 1$, $t_j = 0$) and 0 otherwise.
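As an illustration of what optimizing "directly for the AP" means, the sketch below computes a ranking-based AP loss, 1 − AP, from raw scores and labels, using the label convention above (1 positive, 0 negative, −1 ignored). It is a simplified reading of the ranking formulation rather than the exact equations or the gradient rule of [39]. The function names and the example scores are assumptions.

```python
def heaviside(x):
    """Step function: 1 when x is strictly positive, else 0."""
    return 1.0 if x > 0 else 0.0

def ap_loss(scores, labels):
    """Ranking-based AP loss: 1 - mean over positives of rank+(i) / rank(i).

    labels: 1 = positive, 0 = negative, -1 = ignored.
    """
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos:
        return 0.0
    total = 0.0
    for idx, s_i in enumerate(pos):
        # Count the other positives and the negatives scored above sample i.
        pos_above = sum(heaviside(s_j - s_i) for j, s_j in enumerate(pos) if j != idx)
        neg_above = sum(heaviside(s_j - s_i) for s_j in neg)
        rank_pos = 1.0 + pos_above              # rank among positives
        rank_all = 1.0 + pos_above + neg_above  # rank among all counted samples
        total += rank_pos / rank_all
    return 1.0 - total / len(pos)

# One negative (score 1.0) outranks one positive (score 0.5).
print(round(ap_loss([2.0, 1.0, 0.5, -1.0], [1, 0, 1, 0]), 4))  # 0.1667
# Perfect separation gives zero loss.
print(ap_loss([2.0, 1.0, -1.0, -2.0], [1, 1, 0, 0]))  # 0.0
```

Only negatives ranked above a positive contribute to the loss. True negatives that already score below every positive have no effect, which matches the remark that AP loss is not affected by a large number of true negative samples.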

Global Average Precision Loss
Chen et al. [39] point out that training with AP loss suffers from the "score shift" problem. That is, although the AP of each of two images may be high, the AP may become lower when the two images are evaluated together, because the scores of the positive or negative samples of the two images may lie in different ranges, and the scores of negative samples in one image may be higher than the scores of positive samples in the other. Chen et al. [39] avoid this situation through minibatch training, but it still exists between different batches. The main reason for the "score shift" is that AP loss only acts on positive samples that score lower than some negative sample and on negative samples that score higher than some positive sample. If the score of a positive sample is larger than those of all the negative samples, its gradient $g_i$ will be 0. Similarly, if the score of a negative sample is smaller than those of all the positive samples, its gradient will also be 0. Consequently, once the score of a sample leaves the range $r_k = [\min_{j \in P_k} s_{j,k}, \max_{i \in N_k} s_{i,k}]$ in the $k$-th batch, its AP loss gradient becomes 0 unless the range changes so that the sample is included again, and when $\min_{j \in P_k} s_{j,k} > \max_{i \in N_k} s_{i,k}$, the AP loss gradient of all samples becomes 0. Thus, the greater the difference in $r_k$ between batches, the more serious the "score shift" problem. In addition, in the early stage of training, the scores of positive and negative samples are very close to each other, which results in a narrow $r_k$. Therefore, training with AP loss can only produce a very small gap between positive and negative sample scores, leading to poor results.
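The "score shift" effect can be reproduced with a toy example. The sketch below uses made-up scores for two images (or batches): each is perfectly ranked on its own, yet the pooled ranking is not, because a negative in one image scores higher than a positive in the other. The function names and all numbers are hypothetical.

```python
def average_precision(pos_scores, neg_scores):
    """Ranking-based AP: mean over positives of rank+(i) / rank(i)."""
    total = 0.0
    for idx, s_i in enumerate(pos_scores):
        pos_above = sum(1 for j, s_j in enumerate(pos_scores) if j != idx and s_j > s_i)
        neg_above = sum(1 for s_j in neg_scores if s_j > s_i)
        total += (1 + pos_above) / (1 + pos_above + neg_above)
    return total / len(pos_scores)

def score_range(pos_scores, neg_scores):
    """(lowest positive score, highest negative score) for one batch."""
    return min(pos_scores), max(neg_scores)

# Hypothetical raw scores: the two images live in different score ranges.
image_a = {"pos": [5.0], "neg": [3.0]}
image_b = {"pos": [2.0], "neg": [0.0]}

ap_a = average_precision(image_a["pos"], image_a["neg"])
ap_b = average_precision(image_b["pos"], image_b["neg"])
ap_joint = average_precision(image_a["pos"] + image_b["pos"],
                             image_a["neg"] + image_b["neg"])
print(ap_a, ap_b, round(ap_joint, 3))  # 1.0 1.0 0.833

# In both images the lowest positive already beats the highest negative,
# so a per-batch AP loss would give every sample a zero gradient.
print(score_range(image_a["pos"], image_a["neg"]))  # (5.0, 3.0)
print(score_range(image_b["pos"], image_b["neg"]))  # (2.0, 0.0)
```

Here the negative of image A (score 3.0) outranks the positive of image B (score 2.0) once the samples are pooled. No per-batch gradient can correct this, because each batch is already perfectly separated.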
To solve the above problem, we propose a global average precision loss (GAP loss). First, to make the range $r_k$ of each batch similar, we calculate the mean values of the positive and negative sample scores over multiple batches and take them into account separately when calculating $L_{ij}$, which gives Equation (8). Here, $P^{(k)}$ and $N^{(k)}$ are the sets of positive and negative samples in the $k$-th batch, respectively, $|P^{(k)}|$ and $|N^{(k)}|$ are the sizes of $P^{(k)}$ and $N^{(k)}$, and $s_P^{(k)}$ and $s_N^{(k)}$ are the mean scores of the positive and negative samples. Compared with Equation (7), the extra terms in Equation (8) amount to adding $s_P^{(k)}$ to the set of positive samples and $s_N^{(k)}$ to the set of negative samples. In this way, the $r_k$ of each batch contains both $s_N^{(k)}$ and $s_P^{(k)}$, thus alleviating the "score shift" problem. We then add extra terms to the gradient $g_i$ so that a sample still obtains a non-zero gradient even if it leaves $r_k$. Taking positive samples as an example, positive samples with small activation values and large $L^P_{ij}$ obtain a larger gradient than under AP loss when $L^P_{ij}$ is not zero, while when $L^P_{ij}$ is zero, positive samples obtain a non-zero gradient based on their activation values. This allows the network to focus on difficult positive samples with small activation values and large $L^P_{ij}$ without "ignoring" simple samples that fall outside $r_k$.

Experiment
In this section, we first introduce the experimental datasets, evaluation metrics, and implementation details. Then, we conduct an ablation study to analyze SCSA-Net. Finally, we compare SCSA-Net with other recent methods and show several examples.

1. Official-SSDD (SSDD): The SSDD dataset [26], published in 2017, is currently the most widely used dataset in the SAR ship detection field. In 2021, Zhang et al. published the updated Official-SSDD dataset [29], which corrects the wrong labels in SSDD and provides richer label formats. The Official-SSDD dataset contains complex backgrounds and multi-scale offshore and inshore targets. Most of the images are 500 pixels wide, and the dataset covers a variety of SAR image samples with resolutions ranging from 1 m to 15 m from the RadarSat-2, TerraSAR-X, and Sentinel-1 sensors. The average size of ships in SSDD is only about 35 × 35 pixels. In this paper, we refer to Official-SSDD as SSDD for convenience.
2. SAR-Ship-Dataset.
3. High-Resolution SAR Images Dataset (HRSID): The HRSID proposed by Wei et al. [28] is constructed from original SAR images from the Sentinel-1B, TerraSAR-X, and TanDEM-X satellites. The HRSID contains 5604 images of 800 × 800 pixels and 16,951 ship targets. These images cover various polarizations, imaging modes, imaging conditions, etc. As in the original report [28], the ratio of the training set to the test set is 13:7, following its default configuration files.

Evaluation Metrics
To evaluate the detection performance quantitatively, we use the precision rate, the recall rate, and the average precision (AP) as evaluation criteria. Precision and recall are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

For a predicted box, if the IoU between it and the corresponding ground truth box is greater than 0.5, it is defined as a true positive (TP); otherwise, it is a false positive (FP). For a ground truth box, if there is no predicted box whose IoU with it is greater than 0.5, it is defined as a false negative (FN). AP is defined as:

$$AP = \int_0^1 p(r)\, dr$$

where $p$ and $r$ represent precision and recall at an IoU threshold of 0.5, and $p(\cdot)$ is a function that takes $r$ as its parameter, so that AP equals the area under the precision-recall curve.
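The TP/FP/FN rules above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the greedy one-to-one matching of score-sorted predictions to ground truth boxes is an assumption (a common convention), and the function names and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy matching; pred_boxes must be sorted by descending score."""
    matched = set()
    tp = fp = 0
    for pred in pred_boxes:
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gt_boxes):
            if g in matched:
                continue  # each ground truth box can be matched only once
            value = iou(pred, gt)
            if value > best_iou:
                best_iou, best_gt = value, g
        if best_gt is not None and best_iou > iou_thr:
            matched.add(best_gt)
            tp += 1
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)
    return tp, fp, fn

gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10), (50, 50, 60, 60)]  # already sorted by score
tp, fp, fn = count_tp_fp_fn(pred_boxes, gt_boxes)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)  # 1 1 1 0.5 0.5
```

Sweeping the score threshold over the sorted predictions and recomputing precision and recall at each step traces out the curve $p(r)$ whose area is the AP.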

Training Details
A ResNet-50 pre-trained on ImageNet is used to initialize our backbone. The initial learning rate of the SGD optimizer is 0.01, and it is divided by 10 at each decay step. The number of image iterations per epoch is 16 K. All training is carried out on a server with a Tesla V100 GPU (32 GB).

Ablation Study
To visualize the effect of GAP loss, we recorded the mean curves of the classification scores of positive and negative samples predicted by the network during training, as shown in Figures 4-6. The curves correspond to networks trained with focal loss, AP loss, and GAP loss, respectively. For clarity, the classification scores are normalized by a sigmoid function to convert them into probabilities, which represent the probability that the network considers the sample to be a ship. For positive samples, this probability should be as close to 1 as possible, and for negative samples, as close to 0 as possible. In the initial stage of training, the predicted positive and negative sample scores are both around 0.5. Whichever loss is used for training, the negative sample scores drop to 0 quickly, but only the positive sample scores corresponding to GAP loss quickly increase to about 1. For the networks trained with focal loss and AP loss, the positive sample score first drops from 0.5 to approximately 0.17 and 0.04, respectively, before gradually increasing. Additionally, since the gradient of AP loss for samples outside $r_k$ is 0, the average positive sample score corresponding to AP loss can only reach about 0.1 at the end of training, so the gap between positive and negative sample scores is relatively small, leading to poorer detection results. Focal loss can gradually increase the positive sample score to about 0.9, but this is still lower than the positive sample score corresponding to GAP loss.

To evaluate the effectiveness of the SCSA module and GAP loss, we performed ablation studies on SSDD, SAR-Ship-Dataset, and HRSID with different settings of SCSA-Net. The results of the ablation study are shown in Table 1 and Figure 7. First, adding the SCSA module improves the performance of the network regardless of the loss used in training. Second, since AP loss results in a smaller gap between positive and negative sample scores, replacing focal loss with AP loss decreases AP, with or without the SCSA module. In contrast, directly replacing focal loss with GAP loss improves AP, with or without the SCSA module. Finally, using both GAP loss and the SCSA module achieves the best results on all three datasets: 98.7% AP on SSDD, 97.9% AP on SAR-Ship-Dataset, and 95.4% AP on HRSID.

Some visualization results of the ablation experiments on SSDD, SAR-Ship-Dataset, and HRSID are shown in Figures 8-10, where panels (a)-(d) of each figure show the ground truths, the detection results of FCOS trained with focal loss, the detection results of FCOS trained with GAP loss, and the detection results of FCOS with the SCSA module trained with GAP loss, respectively. The samples cover different background conditions, including offshore, deep sea, and moored in port.

Since some of the ships are affected by strong noise and complex nearshore backgrounds, the detection results of FCOS are easily disturbed, leading to missed detections and false alarms, as shown in Figures 8b, 9b, and 10b. Training FCOS with GAP loss effectively reduces the scores of negative samples and increases the scores of positive samples, resulting in better AP, as shown in Figures 8c, 9c, and 10c. However, for some difficult samples, even training with GAP loss cannot achieve good results due to the limited expressiveness of the network. Adding the SCSA module improves the feature extraction capability of the network and effectively solves the problem of missed detections and false alarms in some near-shore scenes, as shown in Figures 8d, 9d, and 10d.

To further confirm the effectiveness of the SCSA module, we use different configurations of the SCSA module in the network and train the network using GAP loss. The detection results on SSDD, SAR-Ship-Dataset, and HRSID are shown in Table 2. The settings differ in whether $A^s_1$ and $A^s_2$ are considered when calculating $S_i$ in Equation (3). Using either $A^s_1$ or $A^s_2$ improves AP on all three datasets. Additionally using $A^s_1$ in the calculation of $S_i$ improves AP by 0.096%, 0.187%, and 0.157% on SSDD, SAR-Ship-Dataset, and HRSID, respectively. Using $A^s_2$ in the calculation of $S_i$ improves AP by 0.116%, 0.195%, and 0.184% on SSDD, SAR-Ship-Dataset, and HRSID, respectively. The best detection results are achieved by using $A^s_1$ and $A^s_2$ together, with AP improvements of 0.270%, 0.311%, and 0.346% on SSDD, SAR-Ship-Dataset, and HRSID, respectively.

Comparison with the Latest SAR Ship Detection Methods
To further demonstrate the advantages of our proposed method, we compare it experimentally with the latest SAR ship detection methods on SSDD, SAR-Ship-Dataset, and HRSID, as shown in Tables 3-5. The proposed method achieves the best results on all three widely used datasets. On SSDD, ISASDNet+r101 [34] achieves the highest AP among the two-stage networks, 96.8%, while the proposed SCSA-Net achieves 98.7% AP, an improvement of 1.9% AP over ISASDNet+r101 [34] and of 0.3% AP over the anchor-free network proposed by Zhu et al. [33] (Table 3). On SAR-Ship-Dataset, SCSA-Net achieves 97.9% AP, an improvement of 2.1% AP over the highest AP among the two-stage networks, 95.8% (Table 4). SCSA-Net achieves 95.4% AP on HRSID (Table 5). As a one-stage network, SCSA-Net not only performs best among the one-stage networks but also exceeds the two-stage networks.

Table 3. Comparison with the latest SAR ship detection methods on SSDD.

Table 5. Comparison with the latest SAR ship detection methods on HRSID. The best result is bold.
It should be noted that some SAR ship detection algorithms compared in Tables 3-5 cannot be reproduced using the same experimental equipment, experimental environment, and experimental parameters because there is no open-source code, so we directly quote the relevant performance indicators in the corresponding papers. Our proposed method achieves higher AP detection performance than the latest SAR ship detection methods.
To visualize the detection performance of different methods, Figure 11 shows the comparison results of one-stage networks on SSDD (first three rows), SAR-Ship-Dataset (rows 4 to 6), and HRSID (rows 7 to 9), where Figure 11a-d show the ground truths, the detection results of HR-SDNet [37], the detection results of ResNet-50+Quad-FPN [32], and the detection results of the proposed SCSA-Net, respectively. Additionally, Figure 12 shows the corresponding precision-recall curves (PR curves). As shown in Figure 11a-c, in near-shore scenes, the networks assign low scores to targets due to background interference and easily generate false alarms and missed detections. Compared with the other methods, SCSA-Net alleviates this problem: it gives higher classification scores to targets in nearshore scenes, reduces false alarms and missed detections, and obtains better detection performance, as shown in Figure 11d. Meanwhile, HR-SDNet and ResNet-50+Quad-FPN cannot effectively handle densely arranged targets in nearshore scenes and are prone to missed detections and to treating multiple targets as one, as shown in rows 1, 2, and 5 of Figure 11b,c, while SCSA-Net handles such targets better, as shown in rows 1, 2, and 5 of Figure 11d.

Discussion
The experimental results on SSDD, SAR-Ship-Dataset, and HRSID validate the effectiveness of the proposed method in this paper. However, the horizontal rectangular box detection method used by the proposed method cannot obtain the angle information of the ship. The rotatable rectangular box can locate the ship target more accurately to reduce the background information contained in the box and help to obtain the ship heading and aspect ratio information. Therefore, the subsequent research direction is how to obtain the rotation angle information of the ship to obtain a more accurate rotatable box and further improve the detection performance. Additionally, as shown in Table 6, the SCSA module added to the SCSA-Net accounted for 23.1% of the overall parameters, resulting in a 26.6% drop in FPS. Although the SCSA module is effective in improving accuracy, its impact on the amount of network computation and test speed cannot be ignored. Further, although the one-stage network is less computationally intensive compared to the two-stage network, it is still computing-heavy and difficult to deploy for some embedded devices. Therefore, it is also the goal of our future work to accurately locate and remove useless parameters from the network to reduce the parameter size and computation of the model without affecting the accuracy.

Conclusions
In this work, a one-stage anchor-free SCSA-Net is developed to accurately detect ship targets in SAR images. To improve the feature extraction ability of the network and reduce the interference of nearshore background with ship targets, an SCSA module is proposed to dynamically enhance the features in space. In addition, a GAP loss is proposed, which enables the network to optimize directly with AP as the target and solves the "score shift" problem in AP loss, so that it can effectively improve the scores of the predicted targets and the detection accuracy of the network. From the experimental results on SSDD, SAR-Ship-Dataset, and HRSID, we conclude the following: (1) using the SCSA module improves the accuracy on all three datasets regardless of whether training is performed with focal loss, AP loss, or GAP loss, which confirms its effectiveness; (2) training with GAP loss effectively widens the gap between the average classification scores of positive and negative samples and improves the accuracy with or without the SCSA module; (3) compared with other methods, the SCSA-Net proposed in this paper has higher detection accuracy on all three datasets.
Future work: our future work will focus on reducing the computational cost of the network without affecting its accuracy, and on extending the method to arbitrarily oriented objects.