MSR2N: Multi-Stage Rotational Region Based Network for Arbitrary-Oriented Ship Detection in SAR Images.

In synthetic aperture radar (SAR) images, ships are often arbitrary-oriented and densely arranged in complex backgrounds, posing enormous challenges for ship detection. However, most existing methods detect ships with horizontal bounding boxes, which leads to the redundancy of detected regions. Furthermore, the high Intersection-over-Union (IoU) between two horizontal bounding boxes of densely arranged ships can cause missing detection. In this paper, a multi-stage rotational region based network (MSR2N) is proposed to solve the above problems. In MSR2N, the rotated bounding boxes, which can reduce background noise and prevent missing detection caused by high IoUs, are utilized to represent ship regions. MSR2N consists of three modules: feature pyramid network (FPN), rotational region proposal network (RRPN), and multi-stage rotational detection network (MSRDN). First of all, the FPN is applied to combine high-resolution features with semantically strong features. Second, in RRPN, a rotation-angle-dependent strategy is employed to generate multi-angle anchors which can represent arbitrary-oriented ship regions more felicitously than horizontal anchors. Finally, the MSRDN with three sub-networks is proposed to regress proposals of ship regions stage by stage. Meanwhile, the incrementally increasing IoU thresholds are selected for resampling positive and negative proposals in sequential stages of MSRDN, which eliminates close false positive proposals successively. With the above characteristics, MSR2N is more suitable and robust for ship detection in SAR images. The experimental results on SAR ship detection dataset (SSDD) show that the MSR2N has achieved state-of-the-art performance.

Traditional ship detection methods are mainly based on the following three aspects: (1) statistics characteristics [8][9][10][11]; (2) wavelet transform [12,13]; and (3) polarization information [14,15]. Among these methods, constant false alarm rate (CFAR) and variants thereof [8][9][10] are most widely utilized. CFAR detectors adaptively calculate the detection thresholds by estimating the statistics of background clutter and maintain a constant probability of false alarm. However, the determination of detection IoU thresholds of 0.5 and 0.7, respectively. Some noisy background regions are detected as ship regions in Figure 2a, and the missing detection appears in Figure 2b. Due to the inherent drawback of horizontal bounding boxes, rotated bounding boxes were gradually developed in optical remote sensing [47][48][49][50]. Ref. [47] proposed a rotation dense feature pyramid network (R-DFPN) in which the dense FPN and multiscale region of interest (ROI) align are used to detect ships in different scenes. In [48], a multi-category rotation detector was proposed for small, cluttered and rotated objects. Li et al. [49] proposed a rotatable region-based residual network (R3-Net) for multi-oriented vehicle detection. The ROI Transformer [50] that is lightweight was proposed to decrease the computational complexity.
To address the problems in SAR ship detection, a multi-stage rotational region based network (MSR2N) is proposed for arbitrary-oriented ship detection. As shown in Figure 1b, rotated bounding boxes can locate ships more accurately with less redundant noise background, and would not overlap with each other even in a dense arrangement. Therefore, the rotated bounding box representation is utilized in this paper. In the feature extraction module, we apply the FPN to fuse the high-resolution features and semantically strong features from the backbone network, which enhances feature representation. To generate rotated anchors and proposals, a rotational RPN (RRPN) is utilized.
In addition, a multi-stage rotational detection network (MSRDN) is applied in MSR2N. The MSRDN, which contains an initial rotational detection network (IRDN) and two refined rotational detection networks (RRDN), is trained stage by stage. Three increasing IoU thresholds are selected to sample the positive and negative proposals in three stages, respectively. The increasing IoU thresholds guarantee sufficient positive samples to avoid overfitting, and reduce close false positive proposals in the meantime. Compared to other methods, the proposed MSR2N achieves state-of-the-art performance on SAR ship detection, especially for densely arranged ships in inshore complex backgrounds. The main contributions of proposed MSR2N are enumerated as follows: 1. Alluding to the characteristics of SAR images, the MSR2N framework is proposed in this paper, which is more beneficial for arbitrary-oriented ship detection than horizontal bounding box based methods. 2. In RRPN, a rotation-angle-dependent strategy is utilized to generate anchors with multiple scales, ratio aspects, and rotation angles, which can represent arbitrary-oriented ships more adequately. 3. The MSRDN is proposed, where three increasing IoU thresholds are chosen to resample and refine proposals successively. With the proposals refined more accurately, the number of refined proposals also increases. 4. The multi-stage loss function is employed to accumulate losses of RRPN and three stages of MSRDN to train the entire network. 5. Compared to other methods, the proposed MSR2N has achieved state-of-the-art performance on SAR ship detection dataset (SSDD).
This paper is organized as follows. Section 2 describes the proposed MSR2N in detail. In Section 3, the ablation and comparative experiments are carried out on SSDD, which verifies the effectiveness of proposed MSR2N. Section 4 draws the conclusions for this paper.

Proposed Approach
The overall framework of the proposed MSR2N is illustrated in Figure 3. The whole framework can be divided into three modules: FPN, RRPN, and MSRDN. The FPN fuses feature maps from different layers of the backbone to generate the feature pyramid. In the RRPN module, a rotated anchor generation strategy is utilized to produce anchors with various rotation angles. The RRPN head outputs the softmax probabilities of being a ship and the regression offsets that encode the coordinates of coarse proposals. Millions of proposals are generated, but most of them are redundant noisy proposals and the amount of computation is huge. Hence, we select N pre proposals with the highest probabilities per feature pyramid level and perform the skew NMS operation [51] on the selected proposals to obtain N post final proposals which are fed into the next MSRDN. The MSRDN containing the IRDN and two RRDNs is trained stage by stage with three increasing IoU thresholds. During training, the MSR2N is optimized under the supervision of softmax probabilities and rotated boxes regression offsets from RRPN and each stage of MSRDN. For the inference time, the final refined proposals from the last stage of MSRDN are selected by a probability threshold and sampled by the skew NMS operation. Finally, the predicted bounding boxes are obtained.

Feature Pyramid Network
In this paper, we select ResNet50 [17] as the backbone network, as shown in Figure 3. To extract sufficient ship features in SAR images, especially for small ships, the FPN [24] is utilized to combine the high-resolution, semantically weak features with low-resolution, and semantically strong features. As illustrated in Figure 4

Parameterization of Rotated Bounding Box
In the conventional SAR ship detection networks, the ship region is a horizontal rectangle, which is represented by four parameters (x min , y min , x max , y max ). (x min , y min ) and (x max , y max ) denote the coordinates of the top left and bottom right corners of a bounding box, respectively. For a rotated bounding box, five parameters (x, y, w, h, θ) are conducted. The coordinate (x, y) represents the center of the rotated bounding box. To avert the disorder of coordinate expression, we set w and h as the longer side and shorter side of the rotated bounding box. θ denotes the angle from the positive direction of the x-axis counterclockwise to the longer side of the rotated bounding box. As half angular space is enough for the description of the rotation angle, the range of θ is set to [−90 • , 90 • ). Figure 5 shows the geometric representation of two rotated bounding boxes with different rotation angles.

Rotated Anchors and Proposals
In [24], multi-scale anchors with multiple aspect ratios are generated on the SAR images with respect to the feature pyramid. However, the generation of horizontal anchors is not sufficient enough for arbitrary-oriented bounding boxes. On the other hand, the characteristic of ships with large aspect ratios should be taken into consideration. Corresponding to the representation of rotated bounding boxes, we utilize a rotation-angle-dependent strategy to generate rotated anchors. As illustrated in Figure 6, a set of cell anchors are generated at the points of a SAR image with the strides of FPN, respectively. As shown in Figure 7, three aspects are taken into consideration in the cell anchor generation. First, referring to [24], the scales of anchors are set to {16 2 , 32 2 , 64 2 , 128 2 } pixels on the feature pyramid levels of {P2, P3, P4, P5}, respectively. Second, the large aspect ratios {2, 5, 8} replace the usual aspect ratios {0.5, 1, 2} in [23,24] with a view to the characteristic of strip-like ships. Finally, we set multiple rotation angles for anchors of various scales and aspect ratios, which can represent proposals more effectively. Six rotation angles {−75 • , −45 • , −15 • , 15 • , 45 • , 75 • } are set in consideration of the trade-off between half angular coverage and computational efficiency. Hence, each point per level of feature pyramid generates 18 (1 × 3 × 6) anchors. As shown in Figure 3, the classification layer outputs 36 (2 × 18) softmax probabilities of being a ship and background, and the regression layer outputs 90 (5 × 18) regression offsets that encode the coordinates (x, y, w, h, θ) for 18 proposals per position.

Initial Rotational Detection Network
As shown in Figure 3, the feature pyramid {P2, P3, P4, P5} and the coarse proposals produced from RRPN are both fed into the next stage called IRDN. Different from the conventional object detection network, IRDN aims to detect rotated proposals in this paper. N RROI proposals are randomly sampled as an RROI mini-batch in which an IoU threshold is chosen to distinguish the positive and negative samples. Here, we utilize the skew IoU computation [51] to compute IoU between rotated bounding boxes. If the IoU between a proposal and any ground-truth box is higher than the threshold, the proposal is assigned to a positive sample and vice versa. The ratio of positive and negative proposals is 1:3. In the RROI mini-batch, if the number of positive proposals is less than 25% of N RROI , the negative proposals should be padded.  As illustrated in Figure 8, the rotational ROI (RROI) align layer is applied for arbitrary-oriented proposals. It converts the feature map of RROIs into fixed spatial feature maps. An RROI is divided into 7 × 7 sub-regions by the RROI align layer, and the max pooling is operated in each sub-region. Then, the feature map of an RROI with a fixed spatial extent of 7 × 7 can be obtained. The feature maps of RROIs are input to two successive fully connected (FC) layers with a dimension of 1024. Two 1 × 1 convolutional layers for classification and regression are attached to the FC layers. The classification layer outputs softmax probabilities of being a ship and background, and the regression layer outputs the regression offsets that encode the coordinates (x, y, w, h, θ) of refined proposals.

Multi-Stage Detector
As mentioned in Section 1, a single IoU threshold in IRDN can't separate positive and negative samples properly, which can cause false and missing detection. Inspired by [52], a multi-stage rotational detection strategy is proposed. As illustrated in Figure 3, the MSRDN contains an IRDN and two RRDNs which share the same structure with IRDN. The MSRDN is a multi-stage regression framework, which successively resamples proposals by increasing IoU thresholds. In the first stage of MSRDN, the distribution of coarse proposals is set by a low IoU threshold, which guarantees enough positive samples. The feature pyramid and sampled coarse proposals are fed into the IRDN to produce the refined proposals. In the second stage, the refined proposals from the first stage are resampled by a medium IoU threshold. These sampled refined proposals and the feature pyramid are fed into the RRDN which outputs new refined proposals. In the third stage, the repetitive procedures occur with a high IoU threshold. The process of MSRDN can be expressed as: where f denotes the feature maps from FPN, d 1 is the initial sampled distribution of proposals in IRDN,

Multi-Stage Loss Computation
During the training of RRPN, a sampling strategy for rotated anchors is used. A mini-batch of N RRPN anchors is selected for the loss computation, where the ratio of positive and negative samples is 1:1. In the mini-batch, if positive samples are not enough, the negative samples should be padded. The positive samples are defined by following rules: (i) the IoU between the anchor and any ground-truth is larger than 0.7; and (ii) an angular difference between the anchor and the ground-truth is smaller than 15 • . The negative samples should satisfy the following rules: (i) the IoU between the anchor and any ground-truth is lower than 0.3; or (ii) the IoU between the anchor and a ground-truth is larger than 0.7, while the angular difference is larger than 15 • . We minimize an objective function of RRPN with the multi-task loss [22] defined as: Here, i denotes the index of samples in a mini-batch, p * i is the label of sample i, where p * i equals 1 if the sample is a positive one and p * i equals 0 if the sample is a negative one. p i is the predicted softmax probablility of sample i being a ship. t * i represents the offsets between sample i and the assigned ground-truth, and t i represents the predicted bounding box regression offsets. In Equation (2), only if sample i is a positive one with p * i = 1 is the regression loss L reg calculated. The classification loss L cls is log loss which is defined as: The regression loss is defined by the robust loss function smooth L1 as: The classification loss L cls and regression loss L reg are respectively normalized by N cls and N reg , where N cls and N reg are both set to the mini-batch size. The regression loss L reg is weighted by a balancing parameter λ, which is set to 1 in the experiments.
For rotated bounding box regression, the five parameterized coordinate regression offsets are defined as: where x, y, w, h and θ denote the center coordinates, width, height, and rotation angle of a rotated bounding box.
x, x a , x * are respectively for the predicted box, anchor box, and ground-truth box, likewise for y, w, h, θ. The parameter k ∈ Z keeps θ in the range of [−180 • , 180 • ). As described in Section 2.4, during the training of MSRDN, the proposals are resampled into a mini-batch and refined by a sequence of rotational detection sub-networks. Meanwhile, each sub-network outputs the softmax probabilities of a proposal being a ship and rotated bounding box regression offsets for loss computation. The multi-task multi-stage loss function of MSRDN is defined as: where L IRDN , L RRDN 1 , and L RRDN 2 represent the losses of IRDN, first RRDN and second RRDN, respectively. The losses of three stages have the same definition as the multi-task loss function of RRPN.

Dataset
We evaluate the proposed framework on the public SAR ship detection dataset-SSDD [39]. The SSDD contains SAR ship images of various scenes from three sensors. The details of SSDD are shown in Table 1. There are 1160 SAR images including 2456 ships in total, with an average of 2.12 ships per image. Ships in SSDD are annotated with coordinates of the four vertices of rotated bounding boxes. The SSDD is divided into a training set and a test set by a ratio of 4:1.

Setup and Implementation Details
The experiments are implemented in the deep learning framework Pytorch [53] on four NVIDIA TITIAN Xp GPUs with 12 GB memory. Our network is initialized by the pre-trained ResNet50 model for ImageNet classification. The whole network is trained for 12 k iterations in total, and a batch of 48 images is fed into the network in an iteration. The initial learning rate is 0.01 for the first 8k iterations; then, learning rates of 0.001 and 0.0001 are set for the next 2.5 k iterations and the remaining 1.5 k iterations, respectively. The stochastic gradient descent (SGD) [54] optimizer with a weight decay of 0.0001 and a momentum of 0.9 is selected for the model.
Referring to faster R-CNN, we resize images such that the shorter side is 350 pixels under the premise that ensures the longer side less than 800 pixels. For data augmentation, we flip the images horizontally with a probability of 0.5.
As mentioned in Section 2, N pre denotes the number of pre-selected proposals per feature pyramid level, and N post represents the number of post-selected proposals. Here, we set N pre to 2000 and 1000 for training and inference, respectively. In addition, N post is set to 1000 for both training and inference. N RRPN and N RROI , which are the sizes of RRPN and RROI mini-batches, are set to 256 and 512, respectively. In inference time, an NMS threshold is used to select final predicted bounding boxes, and a predicted bounding box can be retained if its score is higher than the score threshold. The NMS threshold and score threshold are respectively set to 0.3 and 0.1 in all experiments.

Evaluation Metrics
To quantitatively evaluate the performance of proposed MSR2N, the precision, recall, mean Average Precision (mAP), and F-measure (F1) are utilized as evaluation metrics.
Precision is the rate that correctly detected ships in all detected results, and recall means the rate that correctly detected ships in all ground-truths. The definitions of precision and recall are as follows: where TP, FP, and FN denote the number of correctly detected ships, false alarms, and undetected ships, respectively. In addition, a bounding box is admitted as a correctly detected ship in the case that the IoU between the bounding box and a ground-truth is higher than the threshold of 0.5. The precision-recall (PR) curve shows the precision-recall pairs at different confidence score thresholds.
The mAP is a comprehensive metric that calculates the average value of precision under the recall in a range of [0, 1]. The definition of mAP is as follows: where R denotes a recall value and P represents the precision corresponding to a recall. F1 evaluates the comprehensive performance of a detector by taking the precision and recall into consideration. F1 is defined as:

Ablation Study
In this part, some ablation experiments are carried out to investigate the effectiveness of the main modules of the proposed MSR2N.

Effect of MSRDN
To illustrate the detection performance of MSRDN, the experimental results are listed in Table 2. We can find that the single-stage detectors have poor detection performance. The detector with a low IoU threshold of 0.5 achieves a high recall but a low precision due to the introduction of noisy regions. With the increase of the IoU threshold, the precision increases, but the recall decreases. When the IoU threshold reaches 0.7, the recall decrease to 84.50% because of the overfitting caused by inadequate positive proposals. When using the two-stage detectors with an IRDN and an RRDN, they can achieve better detection performance than the single-stage ones. The detection performance of three-stage detector with the increasing IoU thresholds of {0.5, 0.6, 0.7} is significantly improved, which achieves the recall of 92.05%, precision of 86.52%, mAP of 90.11%, and F1 of 89.20%. These experiments demonstrate the effectiveness of multiple stages of MSRDN. Comparing to the three-stage detectors with uniform thresholds, the detector with the increasing IoU thresholds still has superior performance, which proves the effectiveness of increasing IoU thresholds of MSRDN. With the multi-stage regression, the number of close false positive proposals is decreased and the location of proposals is more accurate in the meantime. Therefore, the MSRDN with the increasing IoU thresholds of {0.5, 0.6, 0.7} is chosen in this paper.

Effect of Multi-Stage Loss Computation
The experimental results of multi-stage loss computation are presented in Table 3. It can be observed that the detection performance with three-stage loss computation is preferable, and the experiment only using loss from the last stage obtains the inferior detection performance. With the multi-stage loss used, the precision increases significantly, which means that the false alarms are effectively suppressed. We can conclude that the multi-stage loss computation has a strong constraint on the training of the whole network.

Effect of Rotation Angles
Some experiments about rotation angles are carried out, and their results are shown in Table 4. It can be seen in Table 4 that the method which only generates horizontal and vertical anchors achieves inferior detection performance. As the interval of rotation angles decreases, the detection performance

Comparison with Other Object Detection Methods
To verify the performance of our proposed MSR2N, we compare the MSR2N with the multi-stage horizontal region based network (MSHRN), rotational Faster R-CNN [23] (Faster RR-CNN), rotational FPN [24] (R-FPN), and rotational RetinaNet [29] (R-RetinaNet). MSHRN is the horizontal variant of our proposed MSR2N, where the RPN sub-network is used and the ROI pooling layer replaces the RROI align layer for predicting horizontal bounding boxes. Faster RR-CNN and R-FPN are two-stage detectors, which are respectively rotational variants of Faster R-CNN and FPN. The one-stage detector R-RetinaNet is the rotational variant of RetinaNet. In [29], the feature pyramid with levels {P3, P4, P5, P6, P7} is used as the output of the feature extraction module. For consistency, we use the feature pyramid {P2, P3, P4, P5} in the experiment of R-RetinaNet, which is the same as the experiments of R-FPN and MSR2N. Here, ResNet50 is chosen as the backbone network of all methods. Table 6 summarizes the experimental results of different methods on SSDD, and Figure 9 shows the PR curves of different methods. From Table 6, we can find that MSR2N achieves state-of-the-art performance: 92.05% for recall, 86.52% for precision, 90.11% for mAP, and 89.20% for F1. Comparing to the other four methods, MSR2N obtains 2.27%, 7.89%, 6.45%, and 9.54% gains in mAP, respectively. In Figure 9, the PR curve of MSR2N is higher than those of Faster RR-CNN, R-FPN, and R-RetinaNet. The PR curve of MSHRN only can reach a recall of 90.36%, but MSR2N can achieve a recall of 92.05%. This phenomenon indicates that horizontal region based detection leads to more undetected ships than rotational region based detection. To further verify the capability of handling hard cases, we construct a subset by selecting SAR images, which contain densely arranged ships in inshore complex scenes, from the original test set. Table 7 shows the experimental results of different methods on hard cases. From Table 7, it can be observed that all the values of the evaluation metrics are relatively low, which indicates that these hard cases are difficult to detect correctly. For example, the mAP of R-RetinaNet only reaches 42.91%. Nevertheless, our proposed MSR2N achieves superior performance. Comparing to the other four methods, MSR2N obtains 12.43%, 8.52%, 13.08%, and 28.30% gains in mAP, respectively. Figure 10 illustrates the detection results of four images from the test set of SSDD using different methods. From the detection results of three inshore SAR images, it can be observed that there exist undetected ships in Figure 10e-g. This is on account of the high IoU between two overlapped horizontal bounding boxes in MSHRN. On the other hand, the background noise in the horizontal bounding boxes can interfere with the detection. In addition, the detection results of Faster RR-CNN, R-FPN, and R-RetinaNet still have problems with missing detection and false detection. On the contrary, MSR2N shows superior detection performance in Figure 10u-w. All ships are successfully detected, and the position of ships is located more accurately than other methods. For the methods of MSHRN, Faster RR-CNN, and R-RetinaNet, there exist several false alarms and undetected ships in the results of fourth SAR images. Generally speaking, all methods are competent to detect ships far from shore. Based on the above discussion, we can conclude that MSR2N has state-of-the-art performance on SAR ship detection, especially for densely arranged ships in complex backgrounds.

Conclusions
This paper proposes an arbitrary-oriented ship detection network called MSR2N which aims at detecting ships in various complex scenes. In our MSR2N, the rotating bounding boxes are utilized to represent ship regions, which can reduce redundant noisy regions and overlaps between bounding boxes. The SSDD is employed to verify the effectiveness of the proposed MSR2N. The ablation experiments illustrate that high precision and high recall can be achieved simultaneously due to the sequential resampling of proposals and training in MSRDN sub-networks. The experiments about loss computation show the significance of multi-stage loss computation. The experiments of rotation angles and FPN are also carried out, which proves the effectiveness of rotation angles and FPN. Compared to other methods, MSR2N obtains superior detection performance, especially when ships are densely arranged in inshore complex scenes.
Author Contributions: Z.P. proposed the method and performed the experiments; R.Y. performed the experiments; and Z.Z. supervised the study and reviewed this paper. All authors have read and agreed to the published version of the manuscript.