Detection of Small Ship Objects Using Anchor Boxes Cluster and Feature Pyramid Network Model for SAR Imagery

Abstract: Synthetic aperture radar (SAR) can detect objects in any climate and weather conditions. Consequently, SAR images are widely used for maritime object detection in maritime transportation safety and fishery law enforcement. Currently, deep-learning models are extensively used to detect objects in images. Among them, the feature pyramid network (FPN) uses pyramids to represent semantic information regardless of scale and achieves improved object detection accuracy. It is also suitable for detecting multiple small ship objects in SAR images. This study aims to resolve the problems associated with small-object and multi-object ship detection in complex scenarios, e.g., when a ship nears a port, by proposing a detection method based on an optimized FPN model. The feature pyramid model is first embedded in a traditional region proposal network (RPN) and mapped into a new feature space for object identification. Subsequently, the k-means clustering algorithm based on the shape similar distance (SSD) measure is used to optimize the FPN. Initial anchor boxes and tests are created using the SAR ship dataset. Experimental results show that the proposed object detection algorithm achieves an accuracy of 98.62%. Compared with Yolo, the RPN based on VGG/ResNet, the FPN based on VGG/ResNet, and other models in complex scenarios, the proposed model shows a higher accuracy rate and better overall performance.

information is usually lost. Therefore, an optimized feature pyramid model is proposed to solve the aforementioned problems.
By contrast, the detection of multi-scale ship targets on the FPN network is directly related to the quality of candidate boxes based on anchor boxes. Therefore, to solve the aforementioned problems, this paper proposes an optimized FPN model with the ability to generate anchor boxes.

Region Proposal Network (RPN) on a Backbone Network
A problem associated with ship object detection in SAR imagery is the low accuracy of multi-object ship identification in complex scenarios such as offshore ports and islands. Therefore, a more accurate object detection model is needed. The two-stage method constructs a multi-task loss function from the image classification loss and the bounding box regression loss, and mainly comprises two parts when training the network. Object detection models based on the region of interest, such as SPP-Net, Fast R-CNN, and Faster R-CNN, use only the topmost feature layer for prediction, i.e., the features of the last layer of the network, as shown in Figure 1.
Figure 1. Architecture of the region proposal network (RPN) based on the backbone network. After the original image passes through five convolutional layers, the last layer, Conv5, is used as the feature map for the positioning and detection stages.

Feature Pyramid Network (FPN) on Backbone
In the feature maps extracted by the CNN, the lower layers carry less semantic information but more precise location information, which benefits the detection of small objects. The higher layers carry richer semantic information, but the object location is less precise. The feature pyramid network (FPN) uses multi-scale features and a top-down architecture for object detection, mapping high-layer features with rich semantic information onto bottom-layer features with high resolution and adequate detail. The features of the various layers are thus integrated to improve the detection of small objects.
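As a rough illustration of the top-down integration described above, the following sketch upsamples a coarse high-layer map by a factor of two and adds it to a finer lateral map. Nearest-neighbour upsampling, element-wise addition, and the helper names `upsample2x` and `merge_top_down` are assumptions taken from the common FPN formulation, not details stated in this paper; the 1 × 1 lateral convolution is omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def merge_top_down(coarse, lateral):
    """Add the upsampled coarse (high-layer) map to the finer lateral map."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2],
          [3, 4]]                       # semantically strong, low resolution
lateral = [[1] * 4 for _ in range(4)]   # high resolution, same channel
merged = merge_top_down(coarse, lateral)
for row in merged:
    print(row)
```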
The FPN is embedded in the RPN, and each layer is predicted independently [22]. The image is then fed into the pre-trained backbone network. Rectangular boxes with three different pixel areas (32 × 32, 64 × 64, and 128 × 128 pixels) are allocated on the five feature mapping layers (P1, P2, P3, P4, P5), and multiple aspect ratios (1:2, 1:1, 2:1) are used. These rectangular boxes are called anchor boxes. On each feature mapping layer, sliding a window with the anchor boxes as fixed areas generates a large number of candidate frames, called proposals.
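A minimal sketch of how the nine default anchor shapes could be enumerated from the three areas and three aspect ratios listed above. The function name `make_anchor_shapes` and the reading of each ratio as height:width are assumptions made for illustration only.

```python
import math


def make_anchor_shapes(sides=(32, 64, 128), ratios=((1, 2), (1, 1), (2, 1))):
    """Return (width, height) pairs whose area stays equal to side * side.

    Assumption: a ratio (a, b) is read as height:width = a:b.
    """
    shapes = []
    for side in sides:
        area = float(side * side)
        for a, b in ratios:
            r = a / b                    # height / width
            width = math.sqrt(area / r)  # from w * h = area and h = r * w
            height = r * width
            shapes.append((width, height))
    return shapes


shapes = make_anchor_shapes()
for w, h in shapes:
    print(f"{w:7.2f} x {h:7.2f}  (area {w * h:8.1f})")
```

Each area is preserved across the three ratios, so only the box shape changes within one scale.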
The training labels are assigned to the proposals based on the intersection over union (IOU) between each proposal and the actual bounding box (ground truth). If a proposal has the highest IOU for a given ground-truth box, or its IOU with any ground-truth box exceeds 0.7, it is assigned a positive label; otherwise, it is assigned a negative label. Four parameters are recorded at the same time, representing the center-point coordinates (x, y), height (h), and width (w) of each proposal. The five feature levels provide a total of k proposals. Before entering the fully connected layer FC6, the FPN provides 4k proposal parameters and k category parameters.
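The labelling rule above can be sketched as follows. The corner format (x_min, y_min, x_max, y_max) and the function names `iou` and `assign_labels` are assumptions for illustration; the 0.7 threshold and the "highest IOU for a given ground truth" rule come from the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(proposals, ground_truths, pos_thresh=0.7):
    """Return 1 (positive) or 0 (negative) for each proposal."""
    labels = [0] * len(proposals)
    if not proposals:
        return labels
    for gt in ground_truths:
        ious = [iou(p, gt) for p in proposals]
        best = max(range(len(proposals)), key=lambda i: ious[i])
        if ious[best] > 0:
            labels[best] = 1      # best-matching proposal for this ground truth
        for i, value in enumerate(ious):
            if value > pos_thresh:
                labels[i] = 1     # IOU above the threshold
    return labels


proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
ground_truths = [(0, 0, 10, 10), (52, 52, 60, 60)]
labels = assign_labels(proposals, ground_truths)
print(labels)
```

In this example the third proposal has an IOU of only 0.64 with the second ground truth, but it is still labelled positive because it is the best match for that ground truth.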
The ground truth is used as the actual value, and regression training is performed on the proposals with positive and negative labels so that the proposals approach the ground truth as closely as possible. The mapping function f is defined on the anchor parameters, f(Ax, Ay, Aw, Ah), and the model has to learn the offsets dx(A), dy(A) and the scaling transformations dw(A), dh(A).
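The exact form of f is not recoverable from the text here. As a hedged illustration, the sketch below uses the parameterization common to the R-CNN family, in which the offsets shift the anchor center in units of the anchor size and the scaling terms act in log space. This parameterization and the names `encode` and `decode` are assumptions, not the formula confirmed by this paper.

```python
import math


def encode(anchor, gt):
    """Regression targets (dx, dy, dw, dh) that map an anchor onto a ground truth.

    Boxes are (center_x, center_y, width, height).
    """
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(anchor, deltas):
    """Apply offsets and log-scale factors to an anchor to obtain a predicted box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + aw * dx, ay + ah * dy, aw * math.exp(dw), ah * math.exp(dh))


anchor = (50.0, 50.0, 32.0, 32.0)
gt = (54.0, 46.0, 40.0, 24.0)
targets = encode(anchor, gt)
recovered = decode(anchor, targets)
print(targets)
print(recovered)
```

Decoding the encoded targets returns the ground-truth box, which is the fixed point the regression is trained towards.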

Anchor Boxes Generation Based on Shape Similar Distance (SSD)-Kmeans
Because the target ships are multi-scale, anchor-box-based algorithms are used to generate the proposals on the FPN feature maps. The shapes and sizes of the anchor boxes are a set of hyperparameters. In actual SAR images, the size of the target changes considerably. The default anchor box generation mechanism in the FPN leads to slower convergence of the bounding box regression; therefore, better initial anchor boxes need to be chosen.
The k-means algorithm is a prototype-based clustering algorithm that groups samples according to their distance to the cluster prototypes. It is widely used in clustering problems owing to its simplicity and efficiency. In this paper, the k-means algorithm with the shape similar distance (SSD) as the clustering measure is used to obtain k initial anchor boxes; this method is referred to as SSD-Kmeans.
First, k ground truths are randomly selected as the initial cluster centers. Subsequently, for the ground truth of each ship target xi, Equation (1) is used to calculate the cluster label of the sample with the SSD distance measure.
After the cluster labels of all the ground truths are obtained, each cluster center is updated using Equation (2).
Equations (1) and (2) are evaluated repeatedly until the preset number of iterations is reached or the squared error expressed in Equation (3) converges to a local optimum.
The shape-based distance dSSD is given in Equations (4)-(7), where dED is the Euclidean distance, dMD is the Manhattan distance, and dAD is the absolute value of the vector difference. The coordinates of the ground truth are subsequently stored, and the xmin, ymin, w, and h of each target are recorded.
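The clustering loop described above can be sketched as follows. Because Equations (1)-(7) are not reproduced in this text, the sketch takes the distance as a pluggable function and uses a plain Euclidean distance on (w, h) as a stand-in; it is not the paper's dSSD. The mean-based center update, the stopping rule (labels unchanged), and the name `kmeans_anchors` are likewise assumptions.

```python
import random


def euclidean(a, b):
    """Stand-in distance on (w, h) pairs; the paper's d_SSD would be plugged in here."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def kmeans_anchors(boxes, k, distance=euclidean, max_iter=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor shapes."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)          # random ground truths as initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center under the chosen distance.
        new_labels = [min(range(k), key=lambda j: distance(b, centers[j]))
                      for b in boxes]
        if new_labels == labels:            # assignments stable: stop
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [b for b, lab in zip(boxes, labels) if lab == j]
            if members:
                centers[j] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, labels


boxes = [(10, 12), (11, 11), (12, 10), (60, 20), (62, 22), (58, 18)]
centers, labels = kmeans_anchors(boxes, k=2)
print(sorted(centers))
```

Passing a different `distance` callable leaves the loop unchanged, which is how a shape-based measure would replace the stand-in.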

Anchor Boxes Training
The anchor boxes generated by clustering replace the anchor boxes generated in the FPN according to fixed proportions and scales. Sliding them over the generated feature maps yields a large number of proposals. The anchor box with the largest IOU is then obtained through non-maximum suppression, and regression training is performed to bring the anchor box as close as possible to the ground truth.
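For reference, a minimal sketch of greedy non-maximum suppression is given below. It ranks boxes by a generic score and discards boxes that overlap a kept box too strongly; the 0.5 overlap threshold, the corner box format, and the function names are assumptions, since the paper does not state these details.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```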
A typical pyramid can realize multi-scale feature representation; therefore, the CNN can be integrated with image positioning, combining the top-down pathway with lateral connections to create a feature representation with strong semantics at all scales, which serves as the re-input image. Next, the input image is rebuilt, and the fully connected layers are employed for classification. The loss feedback of the reconstructed input image is then combined with the original multi-task loss function.
First, the global receptive field of the fully connected layer is connected to the k convolution kernels (1 × 1 × 512) of the three fully connected layers, and the last fully connected layer corresponds to the Softmax layer. The maximum value is taken as the probability, which gives the output value pi. The multi-task loss function includes the classification loss and the regression loss of locating the target box.
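The equations for pi and for the loss are not legible in this text. As an illustration only, the sketch below assumes the usual softmax for the class probabilities and a multi-task loss of the form popularized by Fast/Faster R-CNN: cross-entropy for classification plus a smooth L1 regression term applied to non-background boxes. The weight `lam`, the convention that label 0 is background, and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def multitask_loss(logits, label, t_pred, t_true, lam=1.0):
    """Classification loss plus box regression loss for one proposal.

    Assumption: label 0 denotes background, which gets no regression term.
    """
    probs = softmax(logits)
    cls_loss = -math.log(probs[label])
    reg_loss = 0.0
    if label > 0:
        reg_loss = sum(smooth_l1(p - t) for p, t in zip(t_pred, t_true))
    return cls_loss + lam * reg_loss


loss = multitask_loss([0.0, 0.0], 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
print(loss)
```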

Dataset
China's Gaofen-3 satellite SAR and Sentinel-1 SAR datasets are the main data sources used in this study, comprising 102 Gaofen-3 images and 108 Sentinel-1 SAR images [23]. The SAR dataset includes 43,819 ship data slices. The imaging modes of Gaofen-3 include Strip-Map (UFS), Fine Strip-Map 1 (FSI), Full Polarization 1 (QPSI), Full Polarization 2 (QPSII), and Fine Strip-Map 2 (FSII). These five modes have resolutions of 3 m, 5 m, 8 m, 25 m, and 10 m, respectively. Sentinel-1's imaging modes are the strip-map modes (S3 and S6) with wide-field imaging. The ship target dataset is shown in Figure 3. In addition, labelling tools are used to label the vessel positions and classes. The training, validation, and testing sets constitute 70%, 20%, and 10% of the dataset, respectively. A large number of SAR images are used to train the network model in order to improve detection accuracy. The proposed workflow is illustrated in Figure 4.

Network Training
The experimental platform is Ubuntu 14.0, the graphics processing unit (GPU) is an NVIDIA Tesla V100, and the programming language is Python 3.6. The model was implemented in Keras for ship object detection. For training, the Adam [24] gradient descent method was used, which adaptively adjusts the learning step size of each parameter using first-order and second-order moment estimates of the gradient. The Adam decay coefficients were 0.9 and 0.999, respectively, and the batch size was set to 64. The dataset was randomly shuffled at each epoch. Training was terminated when the value of the loss function remained almost unchanged. Training FPN + VGG took 32 h in our GPU implementation, and training FPN + ResNet-101 took 48 h.
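To make the role of the two moment estimates concrete, the sketch below applies the standard Adam update to a single scalar parameter. The decay coefficients 0.9 and 0.999 follow the text; the learning rate, the epsilon value, the toy quadratic objective, and the name `adam_step` are assumptions and do not reflect the actual training configuration.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-order moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-order moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy example: minimise (x - 3)^2 starting from x = 0.
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.1)
print(round(x, 3))
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient magnitude.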

Anchor Boxes Generation
Considering that the number of anchor boxes at each position should be neither too large nor too small, the number k of anchor boxes was set to 6, 9, and 12. The FPN model was trained using alternating training. The detection model was evaluated by its overall detection accuracy on the test set. Table 1 compares the test results for different models and different values of k. With k = 9, higher accuracy was obtained for the different network models. Figure 5a shows the convergence of the loss function during the training of the FPN + VGG backbone network when k = 9 and the anchor boxes are obtained by SSD-Kmeans clustering. In Figure 5b, the backbone network is changed to FPN + ResNet-101. The loss of the FPN + ResNet-101 backbone network converges slightly faster than that of the FPN + VGG network.
After training, the test dataset was used to evaluate the model in different scenarios. In the sea scene, the detection accuracy for different ship sizes was close to 100%, and no missed or false detections occurred. This shows that the model works well under interference-free conditions. Near the islands and ports, ships of all sizes were detected with high accuracy, and no background was falsely identified as a ship. In the offshore area, owing to the complicated background and low resolution, a few items of background debris similar to ships were misidentified as ships. Table 2 presents the accuracy of this algorithm with different backbone networks.

Analysis and Discussion
After training, the test dataset is used to compare the RPN based on VGG/ResNet, the FPN based on VGG/ResNet, Yolo, and the SSD-Kmeans FPN models. The evaluation indicators calculated for each algorithm are the ship true detection rate (Pd), the false detection rate (Pj), and the F1 score. Here, TP is the number of true detections of ship objects, FP is the number of unrecognized ship objects, and FN is the number of false detections of ship objects; therefore, TP + FP is the actual number of ship objects and TP + FN is the total number of ship detections. The test results are detailed in Table 3. The Pd of the SSD-Kmeans FPN proposed in this paper is 98.62%, Pj is 10.07%, and the F1 score is 0.941. Specifically, compared with the models based on RPNs, the detection accuracies of the FPNs based on the backbone networks increased by 5% and 4%. The results show that the FPN works better in small-object ship detection for SAR images. Furthermore, compared with Yolo, the true detection accuracy of the SSD-Kmeans FPN improved by 12%, and the false detection rate decreased by 12%. The results show that the accuracy of two-stage detection based on the FPN is significantly higher than that of one-stage detection.
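The metric equations themselves are not legible in this text. The sketch below therefore assumes the usual definitions: the true detection rate is the share of actual ships that are detected, the false detection rate is the share of detections that are not ships, and the F1 score is the harmonic mean of the true detection rate and one minus the false detection rate. The argument names `true_det`, `missed`, and `false_alarms` are hypothetical and chosen to sidestep the FP/FN naming used above. Under these assumptions, the reported rates are consistent with the reported F1 score.

```python
def detection_metrics(true_det, missed, false_alarms):
    """Return (true detection rate, false detection rate, F1 score)."""
    pd = true_det / (true_det + missed)            # detected / actual ships
    pf = false_alarms / (true_det + false_alarms)  # false / all detections
    return pd, pf, f1_from_rates(pd, pf)


def f1_from_rates(pd, pf):
    """F1 as the harmonic mean of recall (pd) and precision (1 - pf)."""
    precision = 1.0 - pf
    return 2.0 * precision * pd / (precision + pd)


# Consistency check against the values reported in the text.
print(round(f1_from_rates(0.9862, 0.1007), 3))  # 0.941
```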
Therefore, the algorithm proposed in this paper has better accuracy than the traditional CNN models in detecting small-object ships in complex scenarios. The FPN based on VGG/ResNet and the SSD-Kmeans FPN both use FPNs embedded in RPNs, and owing to the introduction of high-resolution feature maps, these models are more conducive to small-object location and recognition. The anchor boxes based on the shape clustering algorithm provide more accurate proposals; therefore, the accuracy of object positioning is improved. Furthermore, in complex scenarios such as islands, ports, and offshore buildings, detections are prone to errors, and in such cases, two-stage models are significantly better than one-stage detection models. The detection results obtained in different complex scenarios are presented in Figure 6. All four models performed effectively when the background was the ocean; when the ship approached offshore land, islands, and ports, the models were still able to detect the ship, but false positives or underreports were generated. Underreporting occurs mainly because the RPN extracts features from the last convolution layer, whose relative positional deviation is large in the case of small-object detection. The FPN-based algorithm can improve the accuracy of small-object positioning. False reporting is the misidentification of machinery on buildings and in ports as ships. As can also be seen from Figure 6, the false reporting rates of the four models are similar. The underreporting rate of the proposed algorithm is low, and its overall detection result is the best among the four algorithms.

Conclusions
We presented an FPN model using anchor boxes obtained through SSD-Kmeans clustering for small-object ship recognition in complex backgrounds. Different scenarios of Sentinel-1 SAR images were used to verify the proposed model. The experimental results showed that the FPN with optimized anchor boxes has a detection accuracy rate close to 100% across different scales for small target objects (with a scale below 64 pixels). Compared with the two-stage object-detection methods (FPN and Faster R-CNN) and the one-stage detection method (Yolo) in complex scenarios, the proposed method improved the detection accuracy and reduced the false detection rate and the omission ratio of the target ship. The proposed model is also suitable for multi-scale and multi-target recognition in simple scenarios, and has several advantages over existing models for the detection of small target ships in complex scenarios.