Quad-FPN: A Novel Quad Feature Pyramid Network for SAR Ship Detection

Abstract: Ship detection from synthetic aperture radar (SAR) imagery is a fundamental and significant marine mission. It plays an important role in marine traffic control, marine fishery management, and marine rescue. Nevertheless, there are still some challenges hindering accuracy improvements of SAR ship detection, e.g., complex background interferences, multi-scale ship feature differences, and indistinctive small ship features. Therefore, to address these problems, a novel quad feature pyramid network (Quad-FPN) is proposed for SAR ship detection in this paper. Quad-FPN consists of four unique FPNs, i.e., a DEformable COnvolutional FPN (DE-CO-FPN), a Content-Aware Feature Reassembly FPN (CA-FR-FPN), a Path Aggregation Space Attention FPN (PA-SA-FPN), and a Balance Scale Global Attention FPN (BS-GA-FPN). To confirm the effectiveness of each FPN, extensive ablation studies are conducted. We conduct experiments on five open SAR ship detection datasets, i.e., SAR ship detection dataset (SSDD), Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and high-resolution SAR images dataset (HRSID). Qualitative and quantitative experimental results jointly reveal Quad-FPN's optimal SAR ship detection performance compared with the other 12 competitive state-of-the-art convolutional neural network (CNN)-based SAR ship detectors. To confirm the excellent migration application capability of Quad-FPN, the actual ship detection in another two large-scene Sentinel-1 SAR images is conducted. Their satisfactory detection results indicate the practical application value of Quad-FPN in marine surveillance.


Introduction
Synthetic aperture radar (SAR) is an advanced active microwave sensor for high-resolution remote sensing observation of the Earth [1]. Its all-day and all-weather working capacity gives it an important role in marine surveillance [2]. As a fundamental marine mission, SAR ship detection is of great value in marine traffic control, fishery management, and emergency salvage at sea [3,4]. Thus, the topic of SAR ship detection has received continuous attention from an increasing number of scholars [5][6][7][8][9][10][11][12][13][14][15].
In earlier years, the standard solution was to design ship features manually, e.g., with constant false alarm rate (CFAR) [1], saliency [2], super-pixel [3], and transformation [4] methods. Yet, these traditional methods are algorithmically complex, weak in migration capacity, and cumbersome to design by hand, which limits their migration applications. Moreover, they often rely on a limited set of ship images for theoretical analysis to define ship features, so these features cannot reflect the characteristics of ships with various sizes under different backgrounds. This causes their poor multi-scale and multi-scene detection performance.
Fortunately, in recent years, with the rise of deep learning (DL) and convolutional neural networks (CNNs), current state-of-the-art DL-based/CNN-based SAR ship detectors have helped solve the above-mentioned problems to some degree. Compared with traditional methods, CNN-based ones have significant advantages, i.e., simplicity, high efficiency, and high accuracy, because they can enable computational models with multiple
The main contributions of this paper are as follows:
Quad-FPN offers higher detection accuracy than the other 12 competitive state-of-the-art CNN-based SAR ship detectors.
The rest of this paper is arranged as follows. Section 2 introduces Quad-FPN. Section 3 introduces our experiments. Results are shown in Section 4. Ablation studies are presented in Section 5. Finally, a summary of this paper is made in Section 6.

Quad-FPN
Quad-FPN is built on the basis of the classical Faster R-CNN [17] and FPN [22], which are both important solutions for mainstream detection tasks. Figure 2 shows Quad-FPN's overview. Four basic FPNs, i.e., DE-CO-FPN, CA-FR-FPN, PA-SA-FPN, and BS-GA-FPN, constitute its network architecture. Their implementation presents a pipeline that improves SAR ship detection performance progressively. The overall design idea of Quad-FPN is as follows.
(1) The overall structure of the first two FPNs (DE-CO-FPN and CA-FR-FPN) is the same as that of the raw FPN [22], including the sequence of DE-CO-FPN and CA-FR-FPN. In other words, the raw FPN also has two basic sub-FPNs, but they are replaced by our proposed DE-CO-FPN and CA-FR-FPN. The differences are that the first sub-FPN in the raw FPN uses the standard convolution, whereas DE-CO-FPN uses the deformable convolution, and that the second sub-FPN in the raw FPN uses simple up-sampling to achieve feature fusion, whereas CA-FR-FPN uses a CA-FR-Module. DE-CO-FPN's feature maps come from the backbone network, so it is located at the input-end of Quad-FPN. From Figure 2a, DE-CO-FPN realizes the information flow from the bottom to the top. According to the findings in [22], the pyramid top ($A_5$) has stronger semantic information than the lower levels, and this semantic information can improve detection performance. Therefore, a top-to-bottom branch is added in CA-FR-FPN to transmit semantic information downward. Finally, DE-CO-FPN and CA-FR-FPN form an information interaction loop in which spatial location information and semantic information complement each other.
(2) The design idea of the third FPN (PA-SA-FPN) is inspired by the work of PANet. Its authors found that the low-level location information of the pyramid bottom ($B_5$) was not transmitted to the top. This might lead to an inaccurate positioning of large objects, so the detection performance of large objects is reduced. Therefore, they added an extra bottom-to-top branch to address this problem. This branch is called PA-FPN in their original reports. In contrast, our proposed PA-SA-FPN adds a PA-SA-Module to the feature down-sampling so as to focus on more important spatial features. Finally, CA-FR-FPN and PA-SA-FPN form another information interaction loop in which spatial location information and semantic information complement each other again. Therefore, the overall sequence of DE-CO-FPN, CA-FR-FPN, and PA-SA-FPN is fixed.
(3) With the basic outline of Quad-FPN determined, BS-GA-FPN is designed to further refine features at each feature level to solve the feature level imbalance of different scale ships. Thus, it is arranged at the output-end of Quad-FPN.

DEformable COnvolutional FPN (DE-CO-FPN)
The core idea of DE-CO-FPN is that we use the deformable convolution [24] to extract ship features. The extracted features contain more useful ship shape information, while complex background interferences are alleviated. Previous work [5][6][7][8][9][10][11][12][13][14][15] mostly adopted the standard or dilated convolutions [25] to extract features. However, the two have limited geometric modeling ability due to their regular kernels. This means that their ability to extract the shape features of multi-scale ships is bound to be poor, causing poor multi-scale detection performance. For inshore ships, the standard and dilated convolutions cannot restrain interferences of port facilities; for ships parked side by side at ports, they also cannot eliminate interferences from the nearby ship hull. Thus, to solve this problem, the deformable convolution is used to establish DE-CO-FPN. Figure 3 shows their intuitive comparison. From Figure 3, the deformable convolution can extract ship shape features more effectively; it can suppress the interference of complex backgrounds, especially for more complex inshore scenes. Finally, ships are likely to be separated successfully from complex backgrounds. Thus, this deformable convolution process can be regarded as an extraction of salient objects in various scenes, which plays the role of spatial attention.
In the deformable convolution, the standard convolution kernel is augmented with offsets $\Delta p_n$ that are adaptively learned in training to model targets' shape features, i.e.,
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{1}$$
where $p_0$ denotes each location, $\mathcal{R}$ denotes the convolution region, $w$ denotes the weight parameters, $x$ denotes the input, $y$ denotes the output, and $\Delta p_n$ denotes the learned offsets at the $n$-th location. It should be noted that, compared with standard convolutions, training deformable ones is time-consuming and needs more GPU memory. This is because the learned offsets add extra network parameters, increasing the networks' complexity, and a reasonable fitting of these offsets takes time. Yet, in this paper, to obtain better accuracy for ships with various shapes, we have not studied this issue deeply for the time being. This problem will be considered with due attention in our future work.
In Equation (1), $\Delta p_n$ is typically fractional. Thus, we use the bilinear interpolation to ensure the smooth implementation of convolutions, i.e.,
$$x(p) = \sum_{q} G(q, p) \cdot x(q) \tag{2}$$
where $p$ denotes the fractional location to be interpolated, $q$ denotes all integral spatial locations in the feature map $x$, and $G(\cdot)$ denotes the bilinear interpolation kernel defined by
$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y), \quad g(a, b) = \max(0, 1 - |a - b|) \tag{3}$$
In experiments, we add one more convolution layer to learn the offsets $\Delta p_n$. Then, the standard convolution combined with $\Delta p_n$ is performed on the input feature maps. Finally, ship features with rich shape information ($A_1$, $A_2$, $A_3$, $A_4$, and $A_5$ in Figure 2a) are transferred to subsequent FPNs for more operations.
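As an illustration of the deformable sampling described above, a minimal pure-Python sketch is given below. It assumes a single-channel feature map, a 3 × 3 kernel, zero padding outside the map, and offsets that are supplied by hand rather than learned; the function names `bilinear_sample` and `deformable_conv_at` are hypothetical and do not come from our implementation.

```python
import math

def bilinear_sample(x, py, px):
    """Interpolate x at the fractional location (py, px); zero outside the map."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    # Only the four integer neighbours q of p can have a non-zero kernel G(q, p).
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value

def deformable_conv_at(x, weights, offsets, p0):
    """Output at p0: sum over the 3x3 grid of w(p_n) * x(p0 + p_n + dp_n)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular region
    out = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(grid, weights, offsets):
        out += w_n * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return out

# Usage: with all offsets equal to zero, the result is a standard 3x3 convolution.
feature = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
standard = deformable_conv_at(feature, [1.0] * 9, [(0.0, 0.0)] * 9, (1, 1))
```

In a real network the offsets would be predicted by the extra convolution layer mentioned above, one pair per kernel location and per output position.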

Content-Aware Feature Reassembly FPN (CA-FR-FPN)
The core idea of CA-FR-FPN is that we design a CA-FR-Module (marked by a circle in Figure 2b) to enhance feature transmission benefits when performing the up-sampling multi-level feature fusion. Previous work [5][6][7][8][9][10][11][12][13][14][15] added a feature fusion branch from top to bottom via feature up-sampling. This feature up-sampling is often completed by the nearest neighbor or bilinear interpolations, but the two merely consider sub-pixel neighborhoods, which cannot effectively capture the rich semantic information required by dense detection tasks [26], especially for densely distributed small ships. That is, features of small ships are easily diluted because of their poor conspicuousness, leading to feature loss. Thus, to solve this problem, we propose a CA-FR-Module in the top-to-bottom up-sampling feature fusion branch to achieve a feature reassembly. It can be aware of important contents in feature maps and attach importance to key small ship features, thereby improving feature transmission benefits. Figure 2b shows the network architecture of CA-FR-FPN. From Figure 2b, for five scale levels ($B_1$, $B_2$, $B_3$, $B_4$, and $B_5$), four CA-FR-Modules are used for feature reassembly. In practice, a CA-FR-Module completes a task that is, in essence, similar to the 2× up-sampling operation. Figure 4 shows the implementation process of CA-FR-Module. From Figure 4, there are two basic steps in CA-FR-Module: (1) kernel prediction, and (2) content-aware feature reassembly.
Step 1: Kernel Prediction
Figure 4a shows the implementation process of the kernel prediction. In Figure 4, the dimension of the feature maps $F$ is $L \times L \times C$, where $L$ denotes the size and $C$ denotes the channel width.
Overall, the process of the kernel prediction (denoted by $\psi$) is responsible for generating adaptive feature reassembly kernels $W_{l'}$ at the original location $l$, according to the $k \times k$ neighbors of feature maps $F_l$ in a content-aware manner, i.e.,
$$W_{l'} = \psi\left(N(F_l, k)\right) \tag{4}$$
where $N(\cdot)$ means the neighbors and $W_{l'}$ denotes the reassembly kernel.
To enhance the content-aware benefits of the kernel prediction, we first design a convolution layer to amplify the input feature maps $F$ by $\alpha$ times (from $C$ to $\alpha \cdot C$). This convolution layer's kernel number is set to $\alpha \cdot C$, where $\alpha$ is an experimental hyperparameter that will be studied in Section 5.2.2. Then, we adopt another convolution layer to encode the content of input features so as to obtain reassembly kernels. Here, we set the kernel width as $2^2 \times k \times k$, where 2 is from the requirement of the 2× up-sampling operation. The purpose is to enlarge the size of feature maps to $2L$. Moreover, $k \times k$ is from the $k \times k$ neighbors of feature maps $F_l$. Afterwards, the content encoded features are reshaped to a $2L \times 2L \times (k \times k)$ dimension via the pixel shuffle means [27]. Finally, each reassembly kernel is normalized by a soft-max function spatially to reflect the weight of each sub-content.
In summary, the above operations can be described by:
$$W_{l'} = \text{soft-max}\left\{\text{shuffle}\left[f_{encode}\left(f_{amplify}\left(N(F_l, k)\right)\right)\right]\right\} \tag{5}$$
where $f_{amplify}$ denotes the feature amplification operation, $f_{encode}$ denotes the content encode operation, shuffle denotes the pixel shuffle means, soft-max denotes the soft-max function defined by $e^{X_i} / \sum_j e^{X_j}$, and $W_{l'}$ denotes the generated reassembly kernel.
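The pixel shuffle and soft-max stages of the kernel prediction can be sketched as follows. This is only an illustration under stated assumptions: feature maps are stored channel-last as nested lists, the channel ordering follows the common pixel-shuffle convention, and the argument `encoded` stands in for the output of the two convolution layers, which are not modelled here. The helper names are hypothetical.

```python
import math

def pixel_shuffle(feat, r):
    """Rearrange an L x L x (r*r*K) map into an (r*L) x (r*L) x K map."""
    size, k_sq = len(feat), len(feat[0][0]) // (r * r)
    out = [[[0.0] * k_sq for _ in range(r * size)] for _ in range(r * size)]
    for h in range(size):
        for w in range(size):
            for c in range(k_sq):
                for i in range(r):
                    for j in range(r):
                        out[h * r + i][w * r + j][c] = feat[h][w][c * r * r + i * r + j]
    return out

def softmax(values):
    """Numerically stable soft-max over one reassembly kernel."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_kernels(encoded, r=2):
    """Pixel shuffle followed by a soft-max over the k*k entries of each kernel."""
    shuffled = pixel_shuffle(encoded, r)
    return [[softmax(cell) for cell in row] for row in shuffled]

# Usage: L = 2 and k = 3, so the encoded map has 2*2*3*3 = 36 channels.
encoded = [[[float(c) for c in range(36)] for _ in range(2)] for _ in range(2)]
kernels = predict_kernels(encoded)  # 4 x 4 locations, 9 weights each
```

After the soft-max, the weights of every kernel sum to one, so each up-sampled pixel is a convex combination of its source neighborhood.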
Step 2: Content-Aware Feature Reassembly
Figure 4b shows the implementation process of the content-aware feature reassembly. Overall, the process of the content-aware feature reassembly (denoted by $\varphi$) is responsible for generating the final up-sampled feature maps $F'_{l'}$, i.e.,
$$F'_{l'} = \varphi\left(N(F_l, k), W_{l'}\right) \tag{6}$$
where $k$ denotes the $k \times k$ neighbors and $W_{l'}$ denotes the reassembly kernel in Equation (4) that corresponds to the location $l'$ of feature maps after up-sampling from the original location $l$. For each reassembly kernel $W_{l'}$, this step reassembles the features within a local region via the function $\varphi$ in Equation (6). Similar to the standard convolution operation, $\varphi$ can be implemented by a weighted sum. Thus, for a target location $l'$ and the corresponding square region $N(F_l, k)$ centered at $l = (i, j)$, the reassembly output is described by
$$F'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot F(i + n, j + m) \tag{7}$$
where $r = \lfloor k/2 \rfloor$, so that the sum runs over the corresponding square region $N(F_l, k)$. Moreover, $k$ is set to 5 in our work, which is an optimal value following [26].
With the reassembly kernel $W_{l'}$, each pixel in the region of the original location $l$ contributes to the up-sampled pixel $l'$ differently, based on the content of features rather than location distance. Semantic features from the pyramid top are thus transferred to the bottom with better transmission benefits. Finally, the pyramid top's features are fused into the bottom to enhance the feature expression ability of small ships.
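The weighted-sum reassembly can be illustrated with the short sketch below. It is written under simplifying assumptions: a single-channel feature map, zero padding at the borders, and reassembly kernels supplied directly as an (rL) × (rL) × (k·k) nested list; the function name `reassemble` is hypothetical.

```python
def reassemble(feat, kernels, k=5, r=2):
    """Up-sample an L x L map by r using one k*k reassembly kernel per output pixel."""
    size, half = len(feat), k // 2
    out = [[0.0] * (r * size) for _ in range(r * size)]
    for yo in range(r * size):
        for xo in range(r * size):
            i, j = yo // r, xo // r  # source location l = (i, j) of the target l'
            acc = 0.0
            for n in range(-half, half + 1):
                for m in range(-half, half + 1):
                    yy, xx = i + n, j + m
                    if 0 <= yy < size and 0 <= xx < size:  # zero padding at borders
                        weight = kernels[yo][xo][(n + half) * k + (m + half)]
                        acc += weight * feat[yy][xx]
            out[yo][xo] = acc
    return out

# Usage: a kernel that puts all weight on the center reduces to nearest-neighbor
# up-sampling, which shows that interpolation is a special, content-blind case.
center = [0.0] * 4 + [1.0] + [0.0] * 4
kernels = [[center for _ in range(4)] for _ in range(4)]
upsampled = reassemble([[1.0, 2.0], [3.0, 4.0]], kernels, k=3)
```

With content-dependent kernels, the same loop weights neighbors by feature content instead of by distance.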

Path Aggregation Space Attention FPN (PA-SA-FPN)
The core idea of PA-SA-FPN is that we add an extra path aggregation branch with a space attention module (PA-SA-Module) (marked by a circle in Figure 2c) from the pyramid bottom to the top. Previous work [5][6][7][8][9][10][11][12][13][14][15] often transmitted high-level strong semantic features to the bottom to improve the whole pyramid's expressiveness. Yet, the low-level location information from the pyramid bottom was not transmitted to the top. This can lead to inaccurate positioning of large ship bounding boxes, so the detection performance of large ships is reduced. Thus, we add an extra path aggregation branch (bottom-to-top) to handle this problem. Moreover, to further improve path aggregation benefits, we design a PA-SA-Module to concentrate on important spatial information to avoid interferences of complex port facilities. Figure 2c shows PA-SA-FPN's architecture.
From Figure 2c, the location information of the pyramid bottom is transmitted to the top ($C_1 \rightarrow C_2 \rightarrow C_3 \rightarrow C_4 \rightarrow C_5$) by the feature down-sampling. In this way, the top semantic features are enriched with more ship spatial information. This can improve the feature expression ability of large ships. Moreover, before the down-sampling, the low-level feature maps are refined by a PA-SA-Module to improve path aggregation benefits [28]. Figure 5 shows the implementation process of PA-SA-Module. In Figure 5, the input feature maps are denoted by $Q$ and the output ones are denoted by $Q'$. First, a global average pooling (GAP) [29] is used to obtain the average response in space; a global max pooling (GMP) [29] is used to obtain the maximum response in space. Then, their results are concatenated as the synthetic feature maps, denoted by $S$. Unlike the previous convolutional block attention module [28], we design a space encoder $f_{space\text{-}encode}$ to encode the space information. It is used to represent the spatial correlation. This can improve spatial attention gains because features in the coding space are more concentrated. Then, the output of $f_{space\text{-}encode}$ is activated by a sigmoid function to represent each pixel's importance level in the original space, i.e., an importance-level weight matrix $W_S$. Finally, an elementwise multiplication is conducted between the original feature maps $Q$ and the importance-level weight matrix $W_S$ to obtain the output $Q'$. In short, the above can be described by
$$Q' = Q \odot W_S \tag{8}$$
where $Q$ denotes the input feature maps, $Q'$ denotes the output feature maps, $\odot$ denotes the elementwise multiplication, and $W_S$ denotes the importance-level weight matrix, i.e.,
$$W_S = \text{sigmoid}\left(f_{space\text{-}encode}\left(\text{GAP}(Q) \, © \, \text{GMP}(Q)\right)\right) \tag{9}$$
where GAP denotes the global average-pooling, GMP denotes the global max-pooling, $f_{space\text{-}encode}$ denotes the space encoder, © denotes the concatenation operation, and sigmoid is an activation function defined by $1/(1 + e^{-x})$.
Finally, the feature pyramid will be stronger when possessing both the top-to-bottom branch and bottom-to-top branch. Each level has rich spatial location information and abundant semantic information, which help improve large ships' detection performance.
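A rough sketch of the PA-SA-Module data flow is given below. Two points are assumptions made only for this illustration: the average and maximum responses are taken across channels at each pixel (as in common spatial attention designs), and the space encoder is replaced by a hypothetical per-pixel linear combination with parameters `w_avg`, `w_max`, and `bias`, because its internal structure is not modelled here.

```python
import math

def pa_sa_module(q, w_avg=1.0, w_max=1.0, bias=0.0):
    """Re-weight a C x H x W feature map by a per-pixel importance weight in (0, 1)."""
    channels, h, w = len(q), len(q[0]), len(q[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(channels)]
    for y in range(h):
        for x in range(w):
            column = [q[c][y][x] for c in range(channels)]
            avg, top = sum(column) / channels, max(column)  # average and max responses
            encoded = w_avg * avg + w_max * top + bias      # stand-in for the space encoder
            weight = 1.0 / (1.0 + math.exp(-encoded))       # sigmoid -> entry of W_S
            for c in range(channels):
                out[c][y][x] = q[c][y][x] * weight          # elementwise multiplication
    return out

# Usage: two channels, one row, two pixels.
refined = pa_sa_module([[[0.0, 2.0]], [[0.0, 4.0]]])
```

Pixels with a strong response receive a weight close to one and are passed on almost unchanged, while weak responses are damped before the down-sampling.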

Balance Scale Global Attention FPN (BS-GA-FPN)
The core idea of BS-GA-FPN is that we further refine features from each feature level in the pyramid to address the feature level imbalance of different scale ships. SAR ships often present different characteristics at different levels in the pyramid, i.e., there are multi-scale ship feature differences. Due to differences in resolution, satellite shooting distance, and slicing method, there are many scales of ships in the existing SAR ship datasets. For example, in SSDD, the smallest ship pixel size is 7 × 7 while the biggest one is 211 × 298. Such a huge size gap results in large ship feature differences, which makes detection very difficult. In the computer vision community, Pang et al. [30] found that such feature level imbalance may weaken the feature expression capacity of FPN, but previous work [5][6][7][8][9][10][11][12][13][14][15] in the SAR ship detection community was not aware of this problem. Thus, to handle this problem, we design a BS-GA-Module to further process pyramid features to recover a balanced BS-GA-FPN. The implementation process of BS-GA-Module consists of four steps: (1) feature pyramid resizing, (2) balanced multi-scale feature fusion, (3) global attention (GA) refinement, and (4) feature pyramid recovery, as in Figure 6.
Step 1: Feature Pyramid Resizing
Figure 6a shows the graphical description of the feature pyramid resizing. In Figure 6a, feature maps at different levels of PA-SA-FPN are denoted by $C_1$, $C_2$, $C_3$, $C_4$, and $C_5$. To facilitate the fusion of balanced features while preserving their semantic hierarchy, we resize each detection scale ($C_1$, $C_2$, $C_3$, $C_4$, and $C_5$) to a unified resolution by max-pooling or up-sampling. Here, $C_3$ is selected as this unified resolution level because it is located in the middle of the pyramid. It can maintain a trade-off between top semantic information and bottom spatial information. Since $C_1$ and $C_2$ have a higher resolution than $C_3$, while $C_4$ and $C_5$ have a lower one, the above can be described by
$$H_1 = \text{MaxPool}_{4\times}(C_1), \quad H_2 = \text{MaxPool}_{2\times}(C_2), \quad H_3 = C_3, \quad H_4 = \text{UpSampling}_{2\times}(C_4), \quad H_5 = \text{UpSampling}_{4\times}(C_5) \tag{10}$$
where $H_1$, $H_2$, $H_3$, $H_4$, and $H_5$ are the resized feature maps from the original ones, $\text{UpSampling}_{n\times}$ denotes the $n$ times up-sampling, and $\text{MaxPool}_{n\times}$ denotes the $n$ times max-pooling.
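The resizing step can be sketched in a few lines. The sketch assumes single-channel maps whose sizes halve from $C_1$ to $C_5$, and it uses nearest-neighbor up-sampling, which is an assumption of this illustration since the interpolation mode is not stated above; all function names are hypothetical.

```python
def max_pool(feat, n):
    """n-times max-pooling with an n x n window and stride n."""
    h, w = len(feat) // n, len(feat[0]) // n
    return [[max(feat[y * n + dy][x * n + dx] for dy in range(n) for dx in range(n))
             for x in range(w)] for y in range(h)]

def upsample_nearest(feat, n):
    """n-times nearest-neighbor up-sampling."""
    return [[feat[y // n][x // n] for x in range(len(feat[0]) * n)]
            for y in range(len(feat) * n)]

def resize_pyramid(c1, c2, c3, c4, c5):
    """Bring all five levels to the resolution of the middle level c3."""
    return [max_pool(c1, 4), max_pool(c2, 2), c3,
            upsample_nearest(c4, 2), upsample_nearest(c5, 4)]

def ramp(n):
    """Toy n x n map whose entries increase row by row."""
    return [[float(y * n + x) for x in range(n)] for y in range(n)]

# Usage: levels of size 16, 8, 4, 2, and 1 all become 4 x 4.
resized = resize_pyramid(ramp(16), ramp(8), ramp(4), ramp(2), ramp(1))
```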
Step 2: Balanced Multi-Scale Feature Fusion
Figure 6b shows the graphical description of the balanced multi-scale feature fusion. After obtaining feature maps with the same unified resolution, the balanced multi-scale feature fusion is executed by
$$I(i, j) = \frac{1}{5} \sum_{k=1}^{5} H_k(i, j) \tag{11}$$
where $k$ denotes the $k$-th detection level, $(i, j)$ denotes the spatial location of feature maps, and $I$ denotes the output integrated features. From Equation (11), the features from each scale ($H_1$, $H_2$, $H_3$, $H_4$, and $H_5$) are uniformly fused as the output $I$ (a mean operation). Here, the average operation fully reflects the balanced idea of SAR ship scale feature fusion. Finally, the output $I$ with condensed multi-scale information contains balanced semantic features of various resolutions. In this way, big ship features and small ones can complement each other to facilitate the information flow.
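As a small worked example of the mean fusion, the sketch below averages five equally sized single-channel maps location by location. The function name `balanced_fusion` is hypothetical.

```python
def balanced_fusion(levels):
    """Average equally sized maps so that every level contributes the same weight."""
    h, w = len(levels[0]), len(levels[0][0])
    count = len(levels)
    return [[sum(level[i][j] for level in levels) / count for j in range(w)]
            for i in range(h)]

# Usage: five constant 2 x 2 maps with values 1..5 fuse to a constant map of 3.0.
levels = [[[float(v)] * 2 for _ in range(2)] for v in range(1, 6)]
fused = balanced_fusion(levels)
```

Because no level is weighted more than another, features of large and small ships enter the fused map on equal terms.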
Step 3: GA Refinement To make features from different scales become more discriminative, we also propose a GA refinement mechanism to further refine balanced features in Equation (11). This can enhance their global response ability. That is, the network will pay more attention to important spatial global information (feature self-attention), as in Figure 6c.
The GA refinement can be described by
$$O_i = \frac{1}{\xi(I)} \sum_{\forall j} f(I_i, I_j) \, g(I_j) \tag{12}$$
where $I_i$ denotes the input at the $i$-th location, $O_i$ denotes the output at the $i$-th location, $f(\cdot)$ is a function used to calculate the similarity between the locations $I_i$ and $I_j$, $g(\cdot)$ is a function to characterize the feature representation at the $j$-th location, and $\xi(\cdot)$ denotes a normalized coefficient (the input overall response). The $i$-th location information denotes the current location's response, and the $j$-th location information denotes the global response. In Equation (12), $g(\cdot)$ can be regarded as a linear embedding,
$$g(I_j) = W_g I_j \tag{13}$$
where $W_g$ is a weight matrix to be learned, and we use a 1 × 1 convolutional layer to obtain this weight matrix during training. Furthermore, one simple extension of the Gaussian function is to compute the similarity $f(\cdot)$ in an embedding space,
$$f(I_i, I_j) = e^{\theta(I_i)^T \phi(I_j)} \tag{14}$$
where $\theta(I_i) = W_\theta I_i$ and $\phi(I_j) = W_\phi I_j$ are two embeddings. $W_\theta$ and $W_\phi$ are weight matrices to be learned, both obtained by two further 1 × 1 convolutional layers. As above, the normalized coefficient $\xi(\cdot)$ is set to
$$\xi(I) = \sum_{\forall j} f(I_i, I_j) \tag{15}$$
Finally, the whole GA refinement is instantiated as:
$$O_i = \sum_{\forall j} \frac{e^{\theta(I_i)^T \phi(I_j)}}{\sum_{\forall j} e^{\theta(I_i)^T \phi(I_j)}} \, W_g I_j \tag{16}$$
where the normalized exponential term can be achieved by a soft-max function. Figure 6c shows the graphical description of the above GA refinement. From Figure 6c, two 1 × 1 convolutional layers are used to compute $\phi$ and $\theta$. Then, by the matrix multiplication $\theta^T \phi$, the similarity $f$ is obtained. One 1 × 1 convolutional layer is used to characterize the representation of the features $g$. Then, $f$ with a soft-max function is multiplied by $g$ to obtain the feature self-attention output $O = \{O_i \mid i \in I\}$. Finally, the feature self-attention output $O$ is further processed by one 1 × 1 convolutional layer (marked in a dotted box). The purpose is to make $O$ match the dimension of the original input $I$ to facilitate the follow-up element-wise addition. This is similar to the residual/skip connections of ResNet.
Consequently, the refined features $I'$ combining the feature self-attention information are obtained, which will be further processed in the subsequent steps, i.e.,
$$I'_i = W_O O_i + I_i \tag{17}$$
where $W_O$ is also a weight matrix to be learned, and another 1 × 1 convolutional layer can be used to obtain it during training. In essence, the GA refinement can directly capture the long-range dependence of each location (global response) by calculating the interaction between two arbitrary positions. It is equivalent to constructing a convolutional kernel with the same size as the feature map $I$, to maintain more useful ship information, making feature maps more discriminative. More detailed theories about this global attention can be found in [31].
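The GA refinement can be made concrete with a small pure-Python sketch. It assumes that the feature map has been flattened into a list of location vectors and that each 1 × 1 convolution is written as a plain matrix applied at every location; the names `matvec` and `ga_refine` and the toy weights are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Apply a weight matrix to one location vector (a 1 x 1 convolution)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def ga_refine(feats, w_theta, w_phi, w_g, w_o):
    """Self-attention over all locations, followed by a residual addition."""
    theta = [matvec(w_theta, v) for v in feats]
    phi = [matvec(w_phi, v) for v in feats]
    g = [matvec(w_g, v) for v in feats]
    refined = []
    for i, v in enumerate(feats):
        scores = [sum(a * b for a, b in zip(theta[i], phi[j])) for j in range(len(feats))]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]  # soft-max over all locations j
        o_i = [sum(attn[j] * g[j][d] for j in range(len(feats))) for d in range(len(g[0]))]
        refined.append([a + b for a, b in zip(matvec(w_o, o_i), v)])  # add the input back
    return refined

# Usage: two locations with two channels each.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
uniform = ga_refine(feats, zeros, identity, identity, identity)
```

With a zero $W_\theta$ all similarities are equal, so every location receives the mean of the embedded features plus its own input; learned weights instead emphasize the locations that are most similar to the current one.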
Step 4: Feature Pyramid Recovery
Figure 6d shows the graphical description of the feature pyramid recovery. From Figure 6d, the refined features $I'$ are resized again, using the reverse procedure of Equation (10), to recover a balanced feature pyramid, i.e.,
$$D_1 = \text{UpSampling}_{4\times}(I'), \quad D_2 = \text{UpSampling}_{2\times}(I'), \quad D_3 = I', \quad D_4 = \text{MaxPool}_{2\times}(I'), \quad D_5 = \text{MaxPool}_{4\times}(I') \tag{18}$$
where $D_1$, $D_2$, $D_3$, $D_4$, and $D_5$ denote the recovered feature maps at different levels after the ship scale balance operations. They constitute the final network architecture of BS-GA-FPN. Ultimately, $D_1$, $D_2$, $D_3$, $D_4$, and $D_5$ in BS-GA-FPN possess more multi-scale balanced features, which are used for the final ship detection.

Experiments
Our experiments are run on a personal computer with an i9-9900K CPU and an RTX 2080Ti GPU, based on PyTorch. Quad-FPN and the other 12 competitive SAR ship detectors are implemented under the MMDetection toolbox [32] to ensure a fair comparison.

Experimental Datasets
(1) SSDD: SSDD is the first open SAR ship detection dataset, proposed by Li et al. [5] in 2017. There are 1160 SAR images with a 500 × 500 average image size in SSDD from Sentinel-1, TerraSAR-X, and RadarSat-2. SAR ships in SSDD are provided with various resolutions from 1 m to 10 m, and HH, HV, VV, and VH polarizations. We set the ratio of the training set and the test set to 8:2. Here, images whose names have an index suffix of 1 or 9 are selected as the test set, and the others as the training set.
(2) Gaofen-SSDD: Gaofen-SSDD was constituted in [6] to make up for the shortcoming of insufficient samples in SSDD. There are 20,000 images with a 160 × 160 image size in Gaofen-SSDD from Gaofen-3. SAR ships in Gaofen-SSDD are provided with various resolutions from 5 m to 10 m, and HH, HV, VV, and VH polarizations. Same as [6], the ratio of the training set, validation set, and test set is 7:2:1 by a random selection.
(3) Sentinel-SSDD: Sentinel-SSDD was constituted in [6] to make up for the shortcoming of insufficient samples in SSDD. There are 20,000 images with a 160 × 160 image size in Sentinel-SSDD from Sentinel-1. SAR ships in Sentinel-SSDD are provided with resolutions from 5 m to 20 m, and HH, HV, VV, and VH polarizations. Same as [6], the ratio of the training set, validation set, and test set is 7:2:1 by a random selection.
(4) SAR-Ship-Dataset: SAR-Ship-Dataset was released by Wang et al. [7] in 2019. There are 43,819 images with a 256 × 256 image size in SAR-Ship-Dataset from Sentinel-1 and Gaofen-3. SAR ships in SAR-Ship-Dataset are provided with resolutions from 5 m to 20 m, and HH, HV, VV, and VH polarizations. Same as the original reports in [7], the ratio of the training set, validation set, and test set is 7:2:1 by a random selection.
(5) HRSID: HRSID was released by Wei et al. [8] in 2020. There are 5604 images with an 800 × 800 image size in HRSID from Sentinel-1 and TerraSAR-X. SAR ships in HRSID are provided with resolutions from 0.1 m to 3 m, and HH, HV, and VV polarizations. Same as the original reports in [8], the ratio of the training set and the test set is 13:7 according to its default configuration files.

Experimental Details
ResNet-50 pretrained on ImageNet [33] serves as Quad-FPN's backbone network. Images in SSDD, Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and HRSID are resized to 512 × 512, 160 × 160, 160 × 160, 256 × 256, and 800 × 800, respectively, for training. We train Quad-FPN for 12 epochs with a batch size of 2, due to the limited GPU memory. Stochastic gradient descent (SGD) [34] serves as the optimizer with a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. Moreover, the learning rate is reduced by a factor of 10 per epoch from epoch 8 to epoch 11 to ensure an adequate loss reduction. Following Wei et al. [12], a soft non-maximum suppression (Soft-NMS) [35] algorithm is used to suppress duplicate detections with an intersection over union (IOU) threshold of 0.5.
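To illustrate how Soft-NMS differs from hard suppression, a minimal sketch is given below. It assumes the linear score-decay variant and a hypothetical `min_score` cut-off of 0.05; neither choice is specified above, and the function names are illustrative rather than the toolbox API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, iou_thr=0.5, min_score=0.05):
    """Linear Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda idx: remaining[idx][1])
        box, score = remaining.pop(best)
        if score < min_score:  # every remaining score is lower still
            break
        kept.append((box, score))
        remaining = [(b, s * (1.0 - iou(box, b)) if iou(box, b) > iou_thr else s)
                     for b, s in remaining]
    return kept

# Usage: the second box overlaps the first heavily, so its score is decayed.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
detections = soft_nms(boxes, [0.9, 0.8, 0.7])
```

Decaying rather than removing overlapping boxes helps to retain ships that are parked side by side, whose boxes legitimately overlap.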

Loss Function
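Following Cui et al. [13], the cross entropy (CE) serves as the classification loss $L_{cls}$,
$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_i^* \log(p_i) + (1 - p_i^*) \log(1 - p_i) \right]$$
where $p_i$ denotes the predictive class probability, $p_i^*$ denotes the ground truth class label, and $N$ denotes the prediction number. The smooth L1 serves as the regression loss $L_{reg}$,
$$L_{reg} = \frac{1}{N} \sum_{i=1}^{N} p_i^* \, \text{smooth}_{L1}(t_i - t_i^*), \quad \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
where $t_i$ denotes the predictive bounding box and $t_i^*$ denotes the ground truth box.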
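The two losses can be illustrated numerically as follows. The sketch assumes a binary ship/background label, the standard smooth L1 definition with a transition point at 1, and a regression term that is counted only for positive samples; the small constant `eps` and all function names are additions made for this illustration.

```python
import math

def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(probs)

def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def regression_loss(pred_boxes, gt_boxes, labels):
    """Mean smooth L1 over box coordinates, counted only for positive samples."""
    total = 0.0
    for t, t_star, y in zip(pred_boxes, gt_boxes, labels):
        total += y * sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return total / len(pred_boxes)

# Usage: one positive sample with coordinate errors of 0.5 and 2.0.
cls_loss = cross_entropy([0.5], [1])
reg_loss = regression_loss([(0.0, 0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0, 2.0)], [1])
```

The linear branch of the smooth L1 keeps a few badly localized boxes from dominating the gradient.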

Evaluation Indices
Evaluation indices from the PASCAL dataset [5] are adopted by this paper, including the recall ($r$), precision ($p$), and mean average precision (mAP) [36], i.e.,
$$r = \frac{TP}{TP + FN}, \quad p = \frac{TP}{TP + FP}, \quad mAP = \int_0^1 p(r) \, dr$$
where TP denotes the number of true positives, FN denotes that of false negatives, FP denotes that of false positives, and $p(r)$ denotes the precision-recall curve. In this paper, mAP measures the final detection accuracy because it considers both precision and recall. Moreover, the frames per second (FPS) is used to measure the detection speed, which is defined by $1/t$, where $t$ refers to the time to detect an image, whose unit is the second (s).
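A short numerical sketch of these indices is given below. The area under the precision-recall curve is approximated with the trapezoidal rule over a few sampled points, which is an assumption of this illustration and may differ from the interpolation used by a particular evaluation toolkit; the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Trapezoidal area under p(r); points must be sorted by increasing recall."""
    area = 0.0
    for k in range(1, len(recalls)):
        area += (recalls[k] - recalls[k - 1]) * (precisions[k] + precisions[k - 1]) / 2.0
    return area

def frames_per_second(seconds_per_image):
    """Detection speed as the reciprocal of the time to detect one image."""
    return 1.0 / seconds_per_image

# Usage: 8 correct detections, 2 false alarms, and 2 missed ships.
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
speed = frames_per_second(0.05)
```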

Quantitative Results on Five Datasets
Tables 1-5 show the quantitative comparison with the other 12 competitive state-of-the-art CNN-based SAR ship detectors, on SSDD, Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and HRSID. From Tables 1-5, one can clearly find that:

1. On SSDD, Quad-FPN offers the best accuracy (95.29% mAP on the entire scenes). The second-best one is 92.27% mAP in the entire scenes from DCN [24], but it is still lower than Quad-FPN by ~3% mAP, showing the best detection performance of Quad-FPN.
2. On Gaofen-SSDD, Quad-FPN offers the best accuracy (92.84% mAP on the entire scenes). The second-best one is 91.35% mAP in the entire scenes from Free-Anchor, but it is still lower than Quad-FPN by ~1.5% mAP, showing the best detection performance of Quad-FPN.
3. On Sentinel-SSDD, Quad-FPN offers the best accuracy (95.20% mAP on the entire scenes). The second-best one is 94.31% mAP in the entire scenes from Free-Anchor, but it is still lower than Quad-FPN by ~1% mAP, showing the best detection performance of Quad-FPN.
4. On SAR-Ship-Dataset, Quad-FPN offers the best accuracy (94.39% mAP on the entire scenes). The second-best one is 93.70% mAP in the entire scenes from Free-Anchor, but it is still lower than Quad-FPN by ~1% mAP, showing the best detection performance of Quad-FPN.
5. On HRSID, Quad-FPN offers the best detection accuracy (86.12% mAP on the entire scenes). The second-best one is 83.72% mAP in the entire scenes from Guided Anchoring, but it is still lower than Quad-FPN by ~3.5% mAP.
6. Furthermore, for Quad-FPN and the other 12 methods, the detection accuracies of inshore scenes are all lower than that of offshore scenes. This is in line with common sense because the former has more complex backgrounds than the latter.
7. For the more complex inshore scenes, the detection accuracy advantage of Quad-FPN is more obvious than the other 12 methods. Specifically, Quad-FPN offers an accuracy of 84.68% mAP on the SSDD's inshore scenes, superior to the second-best DCN [24] by ~10% mAP; it offers an accuracy of 85.68% mAP on the Gaofen-SSDD's inshore scenes, superior to the second-best Free-Anchor by ~4% mAP; it offers an accuracy of 84.68% mAP on the Sentinel-SSDD's inshore scenes, superior to the second-best Free-Anchor by ~5% mAP; it offers an accuracy of 83.93% mAP on the SAR-Ship-Dataset's inshore scenes, superior to the second-best Double-Head R-CNN by ~2% mAP; and it offers an accuracy of 70.80% mAP on the HRSID's inshore scenes, superior to the second-best Guided Anchoring by ~7% mAP. Thus, Quad-FPN seems to be robust for background interferences because the deformable convolution can suppress the interference of complex backgrounds, especially for inshore scenes.
8. The r values of the other 12 methods are lower than Quad-FPN, perhaps from their poor small ship detection performance. The p values of Quad-FPN are sometimes lower than others. Thus, an appropriate score threshold can be further considered in the future to make a trade-off between missed detections and false alarms.
9. To be honest, Quad-FPN sacrifices speed due to the network's high-complexity. Yet, it is also important to further improve the accuracy, e.g., the precision strike of military targets. In the future, we will make a trade-off between accuracy and speed.

In Tables 1-5, the best detector is bold and the second-best is underlined.

Qualitative Results on Five Datasets
Taking SSDD in Figure 7 as an example, we can draw the following conclusions:
1.
Quad-FPN can successfully detect various SAR ships with different sizes under various backgrounds. This shows its excellent detection performance with excellent scale-adaptation and scene-adaptation. Compared with the second-best CNN-based ship detector DCN [24], Quad-FPN can improve the detection confidence scores. For example, in the first detection sample of Figure 7, Quad-FPN increases the confidence score from 0.96 to 1.0. This can show Quad-FPN's higher credibility.

2. Quad-FPN can suppress some false alarms from complex inshore facilities. For example, in the second detection sample of Figure 7, one land false alarm is removed by Quad-FPN. This shows Quad-FPN's better scene-adaptability.

3. Quad-FPN can avoid some missed detections of densely arranged ships and small ships. For example, in the second sample of Figure 7, small ships densely moored at ports are detected by Quad-FPN. This is because the deformable convolution adopted in DE-CO-FPN can alleviate the negative influence of the hull of a nearby ship. In the fourth sample of Figure 7, many small ships are detected successfully by Quad-FPN, whereas DCN missed most of them. This is because CA-FR-FPN can transmit more abundant semantic information from the pyramid top to the bottom, which improves the expression capacity of small ship features. This shows Quad-FPN's better detection capacity for both inshore ships and small ones.

4. Moreover, in the third sample of Figure 7, ships with different scales in the same SAR image are detected at the same time. This is because the proposed BS-GA-FPN can balance the feature differences between ships of different sizes, showing Quad-FPN's excellent scale-adaptability.
Moreover, from the detection results of the second sample on Gaofen-SSDD in Figure 8, Quad-FPN can remove false alarms from ship-like man-made facilities while still detecting the ship moored at port, even under strong speckle noise, i.e., a rather low signal-to-noise ratio (SNR). This shows that Quad-FPN combines good discrimination with robust anti-noise performance. Similarly, the detection results of the first three samples on SAR-Ship-Dataset in Figure 10 also reveal its anti-noise performance. Finally, from the detection results of the third sample on SAR-Ship-Dataset in Figure 10, a large ship moored at port is detected by Quad-FPN rather than missed. This is because PA-SA-FPN can transmit the low-level location information from the pyramid bottom to the pyramid top, which yields more accurate positioning of large ship bounding boxes. Correspondingly, the feature learning of large ships is enhanced, thereby avoiding their missed detections. Given the above, Quad-FPN offers state-of-the-art SAR ship detection performance.

Large-Scene Application in Sentinel-1 SAR Images
We conduct actual ship detection in another two large-scene Sentinel-1 SAR images to confirm the good migration capability of Quad-FPN. Figure 12 shows the coverage areas of the two large-scene Sentinel-1 SAR images. Both areas lie on major world shipping routes, which is why they are selected. Table 6 shows their descriptions. From Table 6, the VV polarization SAR images are selected given that ships generally exhibit higher backscattering values in VV polarization [42]. In addition, the interferometric wide-swath (IW) mode of Sentinel-1 is selected because it is the main mode used to acquire data in areas of maritime surveillance interest [42]. The ship ground truths are annotated by SAR experts using the automatic identification system (AIS) and Google Earth, which provides a more reliable performance evaluation.
Each of the two SAR images is resized to 24,000 × 16,000 pixels. Then, following [43], they are cut directly into 800 × 800 sub-images for training and testing because of the limited GPU memory. Finally, the sub-images are input into Quad-FPN for the actual SAR ship detection. After that, the detection results of these sub-images are integrated back into the original large-scene SAR image.
Figure 13 shows the visualized SAR ship detection results of Quad-FPN on the two large-scene SAR images. From Figure 13, most ships are detected by Quad-FPN successfully, which shows its good migration capability in ocean surveillance.
In Tables 7 and 8, the GPU time (t GPU) is used to compare speed because modern CNN-based detectors are typically run on GPUs. From Tables 7 and 8, one can find that Quad-FPN achieves the best detection accuracy on the two large-scene SAR images, showing its good migration capability.
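The cut-and-merge procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes non-overlapping tiles, assumes the 24,000-pixel side is the image width, and uses hypothetical helper names (`tile_origins`, `to_scene_coords`) and a made-up detection box.

```python
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def tile_origins(width: int, height: int, tile: int = 800) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of each non-overlapping tile."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y


def to_scene_coords(boxes: List[Box], origin: Tuple[int, int]) -> List[Box]:
    """Shift tile-local boxes into the coordinate frame of the full scene."""
    ox, oy = origin
    return [(x1 + ox, y1 + oy, x2 + ox, y2 + oy) for x1, y1, x2, y2 in boxes]


# Assumed layout: 24,000 px wide, 16,000 px high, 800 x 800 tiles.
origins = list(tile_origins(24_000, 16_000, 800))
print(len(origins))  # 30 columns x 20 rows = 600 sub-images

# A hypothetical detection found inside the tile whose origin is (1600, 800).
merged = to_scene_coords([(10.0, 20.0, 50.0, 60.0)], (1600, 800))
print(merged)
```

Because the tiles do not overlap in this sketch, a ship lying across a tile border may be split into two boxes; whether and how the original pipeline handles that case is not stated in the text.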
On Image 1, Quad-FPN offers an accuracy of 83.96% mAP, superior to the second-best PANET [37] (83.96% mAP versus 80.51% mAP); on Image 2, Quad-FPN offers an accuracy of 87.03% mAP, superior to the second-best PANET [37] (87.03% mAP versus 84.33% mAP). To be honest, Quad-FPN's detection speed is relatively modest in contrast to the others; thus, further speed improvements can be pursued in the future.

Quantitative Comparison with CFAR
Finally, we perform an experiment to compare performance with a classical and commonly used two-parameter CFAR detector. Following the standard implementation process from Deng et al. [44], we obtain the CFAR's detection results in the Sentinel-1 toolbox [45]. Tables 9 and 10 show their quantitative detection results. The best detector is bold and the second-best is underlined.
In Tables 9 and 10, because the traditional CFAR is usually not evaluated with the mAP used in the DL community, F1 is used to represent accuracy. With p denoting precision and r denoting recall, it is defined by:
$$F1 = 2 \times \frac{r \times p}{r + p}$$
Moreover, CFAR detectors are usually run on CPUs, whereas modern DL-based methods are typically run on GPUs; to ensure a reasonable comparison, the CPU time (t CPU) is used for the speed comparison in Tables 9 and 10. From Tables 9 and 10, Quad-FPN is greatly superior to CFAR in terms of detection accuracy, i.e., 0.84 F1 for Quad-FPN versus 0.74 F1 for CFAR on Image 1, and 0.84 F1 for Quad-FPN versus 0.69 F1 for CFAR on Image 2. The detection speed of Quad-FPN is also greatly superior to that of CFAR, i.e., 223.15 s CPU time for Quad-FPN versus 884.00 s for CFAR on Image 1, and 226.08 s for Quad-FPN versus 735.00 s for CFAR on Image 2. Therefore, Quad-FPN might still meet the needs of practical applications.
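For reference, F1 is the harmonic mean of precision p and recall r, which makes it usable for detectors such as CFAR that do not output ranked confidence scores. The short sketch below shows how it could be computed from detection counts; the function names and the example counts are illustrative assumptions, not values from Tables 9 and 10.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false alarms (fp), and missed ships (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_score(precision, recall)


# Hypothetical counts: 8 ships found, 2 false alarms, 2 ships missed.
example_f1 = f1_from_counts(tp=8, fp=2, fn=2)
print(round(example_f1, 2))
```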

Ablation Study
In this section, ablation studies are conducted to verify the effectiveness of each FPN. We also discuss the advantages of each innovation. Owing to space limitations, we take the SSDD dataset as an example to show the results. Table 11 shows the effectiveness of the Quad-FPN pipeline (DE-CO-FPN→CA-FR-FPN→PA-SA-FPN→BS-GA-FPN). From Table 11, the detection accuracy improves step by step from left to right in the Quad-FPN pipeline architecture (89.92% mAP→93.61% mAP→94.58% mAP→95.29% mAP). This shows each FPN's effectiveness from the perspective of the overall structure. To be clear, the order of the four FPNs should be kept unchanged; otherwise, according to our experiments, the final accuracy cannot reach the best level. Some detailed analysis can be found in Section 2 (i.e., the overall design idea of Quad-FPN).
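The step-by-step improvement quoted from Table 11 can be turned into per-stage gains with a few lines of arithmetic. The sketch below assumes that the four mAP values are the cumulative accuracies reached as the pipeline is extended from left to right; that reading, and the variable names, are assumptions made for illustration.

```python
stages = ["DE-CO-FPN", "CA-FR-FPN", "PA-SA-FPN", "BS-GA-FPN"]
cumulative_map = [89.92, 93.61, 94.58, 95.29]  # % mAP on SSDD, as quoted in the text

# Gain contributed when each later stage is appended to the pipeline.
gains = [round(b - a, 2) for a, b in zip(cumulative_map, cumulative_map[1:])]
total_gain = round(cumulative_map[-1] - cumulative_map[0], 2)

for name, gain in zip(stages[1:], gains):
    print(f"adding {name}: +{gain:.2f}% mAP")
print(f"total over the pipeline: +{total_gain:.2f}% mAP")
```

These incremental gains are not the same quantity as the removal-based gains reported in the per-FPN ablations, where one FPN is removed while the other three are kept.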

Ablation Study on DE-CO-FPN
We conduct two experiments with respect to DE-CO-FPN. Experiment 1 in Section 5.1.1 confirms the effectiveness of DE-CO-FPN directly. Experiment 2 in Section 5.1.2 confirms the advantage of the deformable convolution.
Table 12 shows the ablation study results on DE-CO-FPN. In Table 12, "✗" denotes removing DE-CO-FPN (the other three FPNs are retained) and "✓" denotes using DE-CO-FPN. From Table 12, DE-CO-FPN improves the accuracy by ~3% mAP, which shows its effectiveness. With it, the SAR ship features extracted by the network contain useful shape information and are less affected by complex background interference.
Table 13 shows the ablation study results on different convolution types. In Table 13, "Standard" denotes the traditional regular convolution in Figure 3a, "Dilated" denotes the dilated convolution in Figure 3b, and "Deformable" denotes the deformable convolution in Figure 3c. From Table 13, the deformable convolution achieves the best detection accuracy because its adaptive kernel offset learning can model the shapes of various ships more effectively. This adaptive kernel offset learning can extract the shape and edge features of ships accurately and thus suppress the interference of complex backgrounds, especially in complex inshore scenes. In this way, ships can be separated from complex backgrounds. Thus, the deformable convolution can be regarded as an extraction of salient objects in various scenes, which plays the role of spatial attention. Accordingly, the accuracy on the overall dataset is improved.
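To make the difference between the three convolution types concrete, the NumPy sketch below computes where a 3 × 3 kernel samples the feature map in each case and evaluates the response at a single location. It is an illustration under stated assumptions rather than the paper's implementation: the offsets are random stand-ins for the offsets a network would learn, out-of-range samples are clamped to the border for simplicity, and all function names are hypothetical.

```python
import numpy as np


def kernel_sampling_points(center, dilation=1, offsets=None):
    """Return the nine (row, col) positions sampled by a 3x3 kernel at `center`."""
    base = np.array([(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)], dtype=float)
    points = np.asarray(center, dtype=float) + dilation * base
    if offsets is not None:  # deformable case: shift every sampling point
        points = points + np.asarray(offsets, dtype=float)
    return points


def bilinear_sample(feature, point):
    """Read `feature` at a fractional position (border-clamped simplification)."""
    r, c = point
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    wr, wc = r - r0, c - c0
    h, w = feature.shape
    r0c, r1c = np.clip([r0, r0 + 1], 0, h - 1)
    c0c, c1c = np.clip([c0, c0 + 1], 0, w - 1)
    return ((1 - wr) * (1 - wc) * feature[r0c, c0c] + (1 - wr) * wc * feature[r0c, c1c]
            + wr * (1 - wc) * feature[r1c, c0c] + wr * wc * feature[r1c, c1c])


def conv_response(feature, weights, center, dilation=1, offsets=None):
    """Weighted sum of the nine sampled values at one output location."""
    points = kernel_sampling_points(center, dilation, offsets)
    return float(sum(w * bilinear_sample(feature, p) for w, p in zip(weights.ravel(), points)))


feature = np.arange(400.0).reshape(20, 20)
weights = np.ones((3, 3))
rng = np.random.default_rng(0)
stand_in_offsets = rng.normal(scale=0.5, size=(9, 2))  # placeholder for learned offsets

standard = conv_response(feature, weights, (10, 10))
dilated = conv_response(feature, weights, (10, 10), dilation=2)
deformable = conv_response(feature, weights, (10, 10), offsets=stand_in_offsets)
print(standard, dilated, deformable)
```

With zero offsets the deformable response reduces to the standard one, which is why deformable convolution can be viewed as a generalization of the regular sampling grid.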

Ablation Study on CA-FR-FPN
We conduct two experiments with respect to CA-FR-FPN. Experiment 1 in Section 5.2.1 confirms the effectiveness of CA-FR-FPN directly. Experiment 2 in Section 5.2.2 determines the appropriate feature amplification factor α in CA-FR-Module.
Table 14 shows the ablation study results on CA-FR-FPN. In Table 14, "✗" denotes removing CA-FR-FPN (i.e., not using the CA-FR-Module, while the other three FPNs are retained) and "✓" denotes using CA-FR-FPN. From Table 14, CA-FR-FPN improves the detection accuracy by ~1% mAP because it can be aware of more valuable information for feature up-sampling. Its adaptive content-aware kernel can improve the transmission of information flow and thereby the detection performance. This is because it can effectively capture the rich semantic information required by dense detection tasks, especially for densely distributed small ships, which avoids the feature loss caused by the poor conspicuousness of small ship features. Accordingly, the accuracy on the overall dataset is improved.
Table 15 shows the ablation study results on the feature amplification factor α in CA-FR-Module. In Table 15, "✗" denotes not amplifying features. From Table 15, amplifying features improves the detection accuracy compared with not amplifying them, whatever the value of α. Therefore, the feature amplification can indeed enhance the content-aware benefits of the kernel prediction. This is because, in the embedded feature amplification space, the amount of information in the feature maps is effectively increased, which promotes more correct kernel prediction. Finally, in our Quad-FPN, to obtain a better detection accuracy (95.29% mAP), α is set to 8, an optimal or saturated value.

Ablation Study on PA-SA-FPN
We conduct three experiments with respect to PA-SA-FPN. Experiment 1 in Section 5.3.1 confirms the effectiveness of PA-SA-FPN directly. Experiment 2 in Section 5.3.2 confirms the effectiveness of PA-SA-Module. Experiment 3 in Section 5.3.3 confirms the advantage of PA-SA-Module.
Table 16 shows the ablation study results on PA-SA-FPN. In Table 16, "✗" denotes removing PA-SA-FPN (the other three FPNs are retained) and "✓" denotes using PA-SA-FPN. From Table 16, PA-SA-FPN improves the detection accuracy by ~1.5% mAP because the low-level spatial location information in the pyramid bottom is transmitted to the top in PA-SA-FPN. In this way, the positioning of large ship bounding boxes becomes more accurate. Accordingly, the accuracy on the overall dataset is improved.
Table 17 shows the ablation study results on PA-SA-Module. From Table 17, PA-SA-Module enhances the detection accuracy by ~1% mAP because it enables more pivotal spatial information in the pyramid bottom to be effectively transmitted to the top. This can improve path aggregation benefits. In this way, the features of large ships might become richer and more discriminative. Accordingly, the accuracy on the overall dataset is improved.
Table 18 shows the ablation study results on different attention types. In Table 18, "SE" denotes the squeeze-and-excitation mechanism [36] and "CBAM" denotes the convolutional block attention module [28]. From Table 18, PA-SA-Module is superior to the others because it allows key global spatial information to be transmitted more efficiently, which means that it is more suitable for PA-SA-FPN. Moreover, different from the previous CBAM, our designed space encoder f_space-encode can encode the space information and can therefore represent the spatial correlation more effectively. This can improve spatial attention gains because the features in the coding space are more concentrated.

Ablation Study on BS-GA-FPN
We conduct three experiments with respect to BS-GA-FPN. Experiment 1 in Section 5.4.1 confirms the effectiveness of BS-GA-FPN directly. Experiment 2 in Section 5.4.2 confirms the effectiveness of GA. Experiment 3 in Section 5.4.3 confirms the advantage of GA.
Table 19 shows the ablation study results on BS-GA-FPN. In Table 19, "✗" denotes removing BS-GA-FPN (the other three FPNs are retained) and "✓" denotes using BS-GA-FPN. From Table 19, BS-GA-FPN plays an important role in ensuring higher detection accuracy because it improves the accuracy by ~1% mAP. With it, the multi-scale ship features can be effectively balanced, which gives the final FPN a stronger feature expression capacity. Accordingly, the accuracy on the overall dataset is improved.
Table 20 shows the ablation study results on GA. From Table 20, GA improves the detection accuracy because the multi-scale ship features become more discriminative once they are refined by it. This feature self-attention might amplify important global information and suppress irrelevant interference, which can enhance the feature expressiveness of the FPN. Essentially, GA is able to directly capture the long-range dependence of each location (global response) by calculating the interaction between any two positions. The whole GA refinement is essentially equivalent to constructing a convolutional kernel with the same size as the feature map, so as to retain more useful ship information. Accordingly, the accuracy on the overall dataset is improved.
Table 21 shows the ablation study results of different refinement types. In Table 21, we compare GA with three other refinement types, namely a convolutional layer, an SE [36], and a CBAM [28]. From Table 21, GA offers the best detection accuracy because it can directly capture the long-range dependence of each location (global response) and thus retain more useful ship information, which makes feature maps more discriminative. Different from the traditional convolutional refinement types, its receptive field is wider, i.e., it covers the whole input feature map, resulting in better spatial correlation learning. Accordingly, the accuracy on the overall dataset is improved.
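The idea that every location responds to every other location can be illustrated with a small non-local style self-attention in NumPy. This is a hedged sketch, not the GA module itself: the learned embedding projections are omitted, and the dot-product affinity, the scaling by the square root of the channel count, and the residual addition are assumptions chosen to keep the example short. The function name `global_attention_sketch` is hypothetical.

```python
import numpy as np


def global_attention_sketch(feature: np.ndarray):
    """Refine a (C, H, W) feature map with pairwise interactions between all positions.

    Returns the refined map and the (H*W, H*W) attention matrix.
    """
    c, h, w = feature.shape
    x = feature.reshape(c, h * w).T                # (HW, C): one row per spatial position
    scores = x @ x.T / np.sqrt(c)                  # affinity between every pair of positions
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)  # each row is a distribution over all positions
    context = attn @ x                             # global response aggregated per position
    refined = x + context                          # residual connection (assumption)
    return refined.T.reshape(c, h, w), attn


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 6, 5))           # toy feature map: C=4, H=6, W=5
refined_map, attention = global_attention_sketch(feature_map)
print(refined_map.shape, attention.shape)
```

Each row of the attention matrix spans all H × W positions, which is the sense in which the refinement behaves like a kernel as large as the feature map, in contrast to the local window of an ordinary convolutional layer.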

Conclusions
To address several challenges in SAR ship detection, e.g., complex background interferences, multi-scale ship feature differences, and indistinctive small ship features, a novel Quad-FPN is proposed for SAR ship detection in this paper. Quad-FPN consists of four unique FPNs that together support its detection performance, i.e., DE-CO-FPN, CA-FR-FPN, PA-SA-FPN, and BS-GA-FPN. In DE-CO-FPN, we adopt the deformable convolution to extract SAR ship features that contain more useful ship shape information while alleviating complex background interferences. In CA-FR-FPN, we design a CA-FR-Module to enhance feature transmission benefits when performing the up-sampling multi-level feature fusion. In PA-SA-FPN, we add an extra path aggregation branch with a space attention module from the pyramid bottom to the top. In BS-GA-FPN, we further refine features from each feature level in the pyramid to address the feature level imbalance of ships with different scales. We perform extensive ablation studies to confirm the effectiveness of each FPN. Experimental results on five open datasets jointly reveal that Quad-FPN offers the best SAR ship detection performance compared with the other 12 competitive state-of-the-art CNN-based SAR ship detectors. Moreover, the satisfactory detection results in two large-scene Sentinel-1 SAR images show Quad-FPN's excellent migration capability in ocean surveillance. Quad-FPN is an excellent two-stage SAR ship detector. The internal implementations of the four FPNs differ from previous work. They are well-designed improvements that ensure state-of-the-art detection performance without bells and whistles, and they enable Quad-FPN's excellent ship scale-adaptability and detection scene-adaptability.
Our future work is as follows:

4. We will consider the challenges within SAR data, e.g., the azimuth ambiguity, sidelobes, and the sea state, to optimize Quad-FPN's detection performance.
5. We will consider combining modern deep CNN abstract features with traditional concrete ones to further improve detection accuracy.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: No new data were created or analyzed in this study. Data sharing is not applicable to this article.