RADet: Refine Feature Pyramid Network and Multi-Layer Attention Network for Arbitrary-Oriented Object Detection of Remote Sensing Images

Abstract: Object detection has made significant progress in many real-world scenes. Despite this remarkable progress, detection in remote sensing images remains challenging even for leading object detectors, due to complex backgrounds, objects with arbitrary orientation, and large differences in object scale. In this paper, we propose a novel rotation detector for remote sensing images, named RADet, mainly inspired by Mask R-CNN. RADet obtains the rotation bounding box of an object from the shape mask predicted by the mask branch, which is a novel, simple and effective way to obtain rotation bounding boxes. Specifically, a refine feature pyramid network is devised with an improved building block for constructing top-down feature maps, to address the problem of large scale differences. Meanwhile, a position attention network and a channel attention network are jointly explored, modeling the spatial dependence between global pixels and highlighting object features, to detect small objects surrounded by complex backgrounds. Extensive experiments on two public remote sensing datasets, DOTA and NWPUVHR-10, show that our method outperforms existing leading object detectors in the remote sensing field.


Introduction
Remote sensing image processing is an active research topic that includes many tasks, such as image segmentation and object detection. Many methods have been proposed; for example, in [1][2][3][4], researchers proposed a series of machine-learning-based image segmentation methods to improve SAR remote sensing image segmentation. In this paper, we mainly study deep-learning-based object detection in optical remote sensing images.
With the development of deep neural networks, object detection in natural images has made great progress in recent years. Deep-learning-based object detection networks can be divided into two types: two-stage and single-stage detectors. Most current two-stage object detectors are developed on the basis of regions with CNN features (R-CNN) [5]. In a two-stage framework such as Faster R-CNN [6], category-independent region proposals are generated from an image in the first stage; based on these region proposals, features are then extracted individually from the feature maps produced by a CNN backbone.
For the problem of object scale variation, building a multi-layer network is the most effective strategy. The low-level, high-resolution feature maps of a deep neural network retain the location information of objects, while the high-level, low-resolution feature maps provide rich semantic cues. Many methods improve object detection and instance segmentation by using multi-scale feature maps. Fully convolutional networks (FCN) [17] improve semantic segmentation by summing the partial scores for each category over multiple scales. Other methods, such as HyperNet [18], ParseNet [19] and Inside-Outside Net (ION) [20], concatenate features from multiple layers to make predictions, which is equivalent to summing features from different scales. Both SSD and MS-CNN [21] detect targets on multi-scale feature maps without combining features or scores. Feature pyramid networks (FPN) [22] merge lower-layer feature maps with higher-level feature maps to obtain multi-scale feature maps, using a bottom-up pathway and a top-down pathway.
Based on SSD, [23] proposed RefineDet, which fuses the higher-level feature maps of the SSD backbone with the lower-layer feature maps to obtain multi-scale feature maps for object detection. Ref. [24] applied an attention mechanism to a recurrent neural network (RNN) model to improve image classification performance. In [25], Bahdanau et al. used an attention-like mechanism to simultaneously translate and align on a machine translation task, bringing the attention mechanism to the natural language processing (NLP) field. The intra-attention proposed by [26] attends to all positions in a sequence to compute the response of a given position. Ref. [27] further demonstrated that a machine translation model based on self-attention can achieve excellent performance. Ref. [28] designed a non-local neural network (NLNet) to model pixel-level pairwise relationships with an attention mechanism. Based on NLNet, Ref. [29] proposed the Self-Attention Generative Adversarial Network (SAGAN), which allows attention-driven, long-range dependency modeling for image generation tasks. Ref. [30] obtained the feature weight of each channel in the feature map through global average pooling, making the model pay different amounts of attention to each channel. Squeeze-and-Excitation Networks (SENet) are often added to other networks as a channel attention module to improve performance. Inspired by the related works mentioned above, we propose a novel object detector for remote sensing images that targets these difficulties. First, we use the shape mask predicted by Mask R-CNN to locate the target area; thus, our method can flexibly detect arbitrary-oriented objects without any predefined rotation anchor.
Second, in order to retain more positional information for small objects, we designed a refine feature pyramid network, which merges high-layer semantic features with low-layer positional features to obtain multi-scale feature maps, solving the problem that object scales vary greatly within the same image. Finally, inspired by the attention mechanism of the human brain, we designed a multi-layer attention network that enables the network to accurately detect small objects of interest against complex backgrounds and to focus on learning their features, just like focused attention in cognitive neuroscience.
Combined with the above techniques, our method significantly improves detection performance. The proposed method achieves 69.09% mAP on DOTA (a large remote sensing dataset), which is better than previous leading algorithms. The contributions of this paper are as follows: (1) For more robust handling of arbitrary-oriented objects, we use the instance segmentation branch of Mask R-CNN to generate shape masks of objects, and then use them to determine accurate rotation bounding boxes. Compared with existing rotation detectors, this is a simple and efficient way to obtain a rotation bounding box, because no rotation anchor or rotation branch needs to be designed in advance.
(2) Considering that object scales vary greatly, a refine feature pyramid network is developed to merge high-layer semantic features with low-layer positional features and obtain multi-scale feature maps. Compared with existing multi-scale feature map methods, our refine feature pyramid network effectively reduces the checkerboard and aliasing effects in feature fusion and improves its effectiveness.
(3) For complex backgrounds, a multi-layer attention network is designed to reduce the impact of background noise and highlight target features. Compared with existing attention networks, the proposed multi-layer attention network focuses simultaneously on the spatial positions and the features of objects, which is extremely helpful for detecting small objects overwhelmed by complex backgrounds.

Proposed Methods
In this section, we describe the various parts of our pipeline in detail. Figure 2 shows the overall framework of our method. Our pipeline is based on Mask R-CNN and consists of two key components: a Refine Feature Pyramid Network (RFPN) and a Multi-layer Attention Network (MANet). Specifically, for each input image, the RFPN generates a set of multi-scale feature maps by fusing features, and the MANet then further suppresses background noise and highlights target features through an attention mechanism. The RPN then produces high-quality region proposals for the subsequent Fast R-CNN and mask branches. In the second stage, horizontal bounding box regression, class prediction, and shape mask prediction are obtained. Finally, the predicted shape masks are used to calculate the objects' rotation bounding boxes.

Rotation Bounding Box Prediction Based on Mask
Mask R-CNN is an extension of Faster R-CNN that can simultaneously perform object detection and instance segmentation; this multi-task learning approach effectively improves object detection performance. In this paper, we use the object mask predicted by the mask branch of Mask R-CNN to obtain the rotation bounding box of the object, thereby achieving arbitrary-oriented object detection in remote sensing images.

Instance Label Generation
The instance label of a remote sensing image is shown in Figure 3. Unlike natural image datasets such as COCO and Pascal VOC, which provide pixel-level labels, remote sensing object detection datasets only provide coordinate labels of objects. Therefore, generating instance labels is a precondition for using Mask R-CNN. In this paper, we take the polygon connected by an object's coordinates, regard the pixels inside the polygon as object and the pixels outside as non-object, and thereby obtain an instance label for the object. Although this approach introduces some noise, the implementation is quite simple, and experiments show that such instance labels have little effect on the final instance segmentation results.
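The label-generation step above can be sketched in a few lines; `polygon_to_mask` is an illustrative name of our own, and we use a plain NumPy even-odd (ray-casting) test rather than a library rasterizer such as OpenCV's `fillPoly`:

```python
import numpy as np

def polygon_to_mask(polygon, height, width):
    """Rasterize a quadrilateral (or any polygon) annotation into a binary
    instance mask: pixels whose centers fall inside the polygon are object (1),
    pixels outside are background (0)."""
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs.ravel() + 0.5  # sample at pixel centers
    py = ys.ravel() + 0.5
    inside = np.zeros(px.shape, dtype=bool)
    n = len(polygon)
    # Even-odd (ray casting) test: toggle for every polygon edge crossed
    # by a horizontal ray to the right of the pixel center.
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        crosses = (y1 > py) != (y2 > py)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_int = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < x_int)
    return inside.reshape(height, width).astype(np.uint8)
```

For a DOTA-style quadrilateral the `polygon` argument is simply the four annotated corner points in order.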

Rotation Bounding Box Prediction
As we all know, the mask branch of Mask R-CNN predicts a shape mask for each object in the image. For the predicted shape mask, we calculate its minimum-area rectangle and use it as the object's rotation bounding box. In this way, we can easily obtain the rotation bounding box of an arbitrary-oriented object without using any rotation anchor. The prediction of the rotation bounding box is:

(x, y, w, h, θ) = minAreaRect(mask), (1)

where x, y, w, h and θ denote the rotation bounding box's center coordinates, width, height and angle, and minAreaRect(·) is a function that computes the minimum-area rectangle of the shape mask. The corner coordinates (x_i, y_i) of the rotation bounding box can then be recovered from these parameters.
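In practice minAreaRect(·) is typically OpenCV's `cv2.minAreaRect`. As a self-contained stand-in, the sketch below (names are our own) uses the standard fact that one side of the optimal rectangle lies along an edge of the point set's convex hull:

```python
import numpy as np

def _convex_hull(pts):
    """Andrew's monotone-chain convex hull, counter-clockwise order."""
    pts = sorted(set(map(tuple, pts)))
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = half(pts), half(pts[::-1])
    return np.array(lower[:-1] + upper[:-1], float)

def min_area_rect(points):
    """Minimum-area rotated rectangle enclosing 2-D points, e.g. the
    foreground pixel coordinates of a predicted shape mask.
    Returns (cx, cy, w, h, theta) with theta in radians."""
    pts = np.asarray(points, float)
    hull = _convex_hull(pts)
    best = None
    for i in range(len(hull)):
        # Try each hull edge direction; the optimal rectangle is
        # aligned with one of them (rotating-calipers argument).
        ex, ey = hull[(i + 1) % len(hull)] - hull[i]
        theta = np.arctan2(ey, ex)
        c, s = np.cos(theta), np.sin(theta)
        rx = pts[:, 0] * c + pts[:, 1] * s   # rotate points by -theta
        ry = -pts[:, 0] * s + pts[:, 1] * c
        w, h = rx.max() - rx.min(), ry.max() - ry.min()
        if best is None or w * h < best[0]:
            mx = (rx.min() + rx.max()) / 2
            my = (ry.min() + ry.max()) / 2
            # Rotate the rectangle center back into image coordinates.
            best = (w * h, mx * c - my * s, mx * s + my * c, w, h, theta)
    _, cx, cy, w, h, theta = best
    return cx, cy, w, h, theta
```

For a square of side 3√2 rotated by 45°, the returned rectangle has area 18, matching the square itself.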

Refine Feature Pyramid Network
There are now many deep convolutional neural networks with a strong ability to extract image features, such as ResNet [31]. However, due to the pooling layers, small objects lose most of their positional features in deep layers, while large objects still retain good positional and semantic features there. Therefore, if multi-scale feature maps with context information can be obtained by merging high-level features with low-level features, the problem that object scales vary greatly within the same image can be solved.
Inspired by RefineDet, we designed a Refine Feature Pyramid Network, which fuses higher-layer feature maps with lower-layer feature maps to obtain multi-scale feature maps with rich context information. Figure 4 shows the improved building block that constructs our top-down feature maps. Moreover, similar to the default box settings in SSD, we use single-scale, multi-ratio anchors at each level and different anchor scales on different levels. In other words, the large-scale anchors on the high-layer (small) feature maps are mainly responsible for detecting large objects, and the small-scale anchors on the low-layer (large) feature maps are mainly responsible for detecting small objects, overcoming the large scale variation of objects in remote sensing images. Specifically, for ResNet, our Refine Feature Pyramid Network only acts on the feature activations output by the last residual block of each stage, denoted C2, C3, C4 and C5. Only feature maps of the same size can be fused, so the high-level feature maps need to be up-sampled before being fused with the low-level ones. Interpolation and deconvolution are commonly used up-sampling methods. Ref. [23] used deconvolution to go from a low-resolution feature map to a higher-resolution one for fusion. Unfortunately, deconvolution produces uneven overlap (checkerboard artifacts) when up-sampling a feature map, as shown in Figure 5; the uneven overlap means that the convolution kernel operates more in some places than in others. FPN uses nearest-neighbor interpolation to obtain larger feature maps, but this produces aliasing effects. In fact, convolution can filter out aliased high-frequency signals.
Therefore, we use a combination of nearest-neighbor interpolation and convolution instead of deconvolution or simple interpolation, which effectively reduces the checkerboard and aliasing effects. Moreover, we use a 1 × 1 convolution (which reduces the channel dimension) and a 3 × 3 convolution to further extract low-layer detailed location information, with a ReLU layer between the two convolutions to obtain a non-linear representation. We then fuse the high-level semantic features with the low-level location features through element-wise addition, and obtain the merged map through a 3 × 3 convolution and two ReLU layers. This process is repeated until the finest-resolution feature map is generated. Finally, we obtain a set of multi-scale feature maps corresponding to the merged maps of each layer, defined as {P2, P3, P4, P5}. It is worth noting that P5 is obtained from C5 through a 1 × 1 convolution and a 3 × 3 convolution, in the same way as P5 in FPN.
Since all levels of the multi-scale feature maps are shared by the RPN and the detection head, we fix the feature dimension (denoted d) of each level and of all extra convolutional layers. In this study, we set d = 256, which satisfies the requirement of a fixed number of channels per layer while reducing memory consumption and maintaining good performance.
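One top-down building block of the RFPN might be sketched as follows in PyTorch; the module and layer names are our own, and the exact ordering of the final 3 × 3 convolution and the two ReLU layers around the merge is an assumption where the text is ambiguous:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineBlock(nn.Module):
    """Sketch of one RFPN top-down building block: 'resize-convolution'
    (nearest-neighbor upsampling followed by a 3x3 conv) replaces
    deconvolution to reduce checkerboard artifacts, and a 1x1 + 3x3
    convolution pair (with a ReLU in between) refines the lateral
    low-level features before element-wise addition."""

    def __init__(self, in_channels, d=256):
        super().__init__()
        self.lateral1 = nn.Conv2d(in_channels, d, 1)   # reduce channels
        self.lateral2 = nn.Conv2d(d, d, 3, padding=1)  # refine location cues
        self.up_conv = nn.Conv2d(d, d, 3, padding=1)   # conv after upsampling
        self.post = nn.Conv2d(d, d, 3, padding=1)      # smooth the merged map

    def forward(self, low, top):
        # Resize-convolution: nearest-neighbor upsample, then 3x3 conv.
        top = F.interpolate(top, scale_factor=2, mode="nearest")
        top = self.up_conv(top)
        # Lateral path: 1x1 conv -> ReLU -> 3x3 conv.
        lat = self.lateral2(F.relu(self.lateral1(low)))
        merged = top + lat                              # element-wise addition
        return F.relu(self.post(F.relu(merged)))
```

Chaining such blocks from C5 down to C2 yields {P2, P3, P4, P5}, all with d = 256 channels.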

Multi-Layer Attention Network
Referring to the human brain's focused attention mechanism, we design a multi-layer attention network, which enables the network to concentrate on key information, or information of interest, when faced with a large amount of input, so as to improve performance. The proposed multi-layer attention network consists of four identical attention layers, connected after {P2, P3, P4, P5}, with outputs {A2, A3, A4, A5}. As illustrated in Figure 6, each attention layer contains two parts: a position attention block and a channel attention block.
The position attention block is adopted to model the pairwise long-range dependencies, guiding the network to pay special attention to the location of the target. Then, the channel attention block aims to model the channel-wise relations, guiding the network to pay more attention to the features of targets, which are the key to determining which category the target belongs to.
As we all know, object detection is a position-sensitive visual task: once the position of an object in the image changes, the network needs to respond meaningfully. However, convolutional neural networks favor translation invariance, i.e., a shift of an object inside an image should be indiscriminative, which is obviously contrary to the principle of object detection. The proposed position attention block effectively models the relationships among widely separated spatial positions, making the network more sensitive to target positions and thus enhancing localization performance. On the other hand, the network distinguishes objects from non-objects based on the learned features and classifies the objects accordingly. Therefore, our goal is to enable the network to learn the importance of different features and strengthen the learning of important ones. To achieve this, we propose a channel attention block inspired by SENet [30]. In deep convolutional neural networks, each channel of the feature map is learned by a convolution kernel, owing to the weight-sharing property of convolution; that is, different channels of the feature map represent different features learned from images. In fact, different features contribute differently to the network. To enhance features with high contributions (target features) and weaken features with low contributions (non-target features), our channel attention block, which follows the position attention block, first quantifies the contribution of each feature in the feature map through global average pooling, and then aggregates it to the original input by broadcast element-wise addition, enabling features with greater contributions to receive more attention.
In the position attention block, the input is the image feature of the previous hidden layer, x ∈ R^(C×H×W), which is transformed into three feature spaces F1, F2 and F3.
First, the attention map that models the long-range dependence between pixels is obtained through

A_{i,j} = exp(F1(x_i)^T F2(x_j)) / Σ_{j=1}^{N} exp(F1(x_i)^T F2(x_j)), (2)

where A_{i,j} indicates the pairwise relationship between position i and position j. Here, C, H and W are respectively the number of channels, height and width of the feature map of the previous hidden layer. Then, the obtained attention map is applied to the feature space F3 to obtain the response z ∈ R^(C×H×W) of each query position over all positions on the feature map:

z_i = Σ_{j=1}^{N} A_{i,j} F3(x_j). (3)

In the above formulas, F1(x) = W1 ⊗ x, F2(x) = W2 ⊗ x and F3(x) = W3 ⊗ x, where W1 ∈ R^(C'×C), W2 ∈ R^(C'×C) and W3 ∈ R^(C×C) are the weight matrices learned by 1 × 1 convolutions, i is the index of a query position, N = H × W denotes the number of feature locations, and ⊗ denotes matrix multiplication. Because the feature spaces interact with each other through matrix multiplication, the memory footprint is large, especially for feature maps of large size. We can therefore improve memory efficiency by reducing the number of feature map channels. To keep the numbers of input and output channels of each attention layer equal, we only reduce the embedding dimension to C' = C/k, with k = 1, 2, 4, 8 or 16. We found that k = 8 gives minimal memory consumption with minimal performance loss; therefore, to balance the efficiency and performance of the model, we set C' = C/8.
The input of the channel attention block is the output of the position attention block. To make better use of the spatial location information learned by the position attention block, we first generate channel-wise statistics via global average pooling, which squeezes the global spatial information into a channel descriptor; this process is expressed by Formula (6). Then, to fully capture channel-wise dependencies, we designed a transform architecture (Equation (7)) that meets two criteria: (1) it can learn a non-linear interaction between channels, and (2) it can learn a non-mutually-exclusive relationship between channels.
s_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} z_c(i, j), (6)

where z_c, H and W denote channel c, the height and the width of the feature map z.
y = W_{v2} ReLU(LN(W_{v1} s)), (7)

where LN denotes layer normalization, which eases the optimization of the two-layer transform architecture, and W_{v1} ∈ R^((C/r)×C), W_{v2} ∈ R^(C×(C/r)). To make the channel attention block lighter, we reduce the dimension of the first fully connected (FC) layer by a ratio r. In fact, the FC layers contain some redundant features: setting r too small retains this redundancy and incurs high memory consumption, while setting r too large is memory efficient but may lose some important features. In this paper we set r = 4, which achieves a balance between the efficiency and performance of the model.
In addition, to further enhance the features at each position, we use residual connections between the inputs and outputs of each attention layer. The final output of the channel attention block is o_i = y_i ⊕ x_i, where ⊕ denotes broadcast element-wise addition.
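Under our reading of the text, one MANet attention layer can be sketched as follows; all module and variable names are our own, and details such as where the layer normalization sits inside the transform are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """Sketch of one MANet attention layer: a position attention block
    (self-attention with embedding channels reduced to C/k, k=8) followed
    by a channel attention block (global average pooling + a two-layer
    bottleneck transform with layer normalization, reduction ratio r=4),
    each aggregated back via broadcast element-wise addition."""

    def __init__(self, c, k=8, r=4):
        super().__init__()
        ck = c // k
        self.f1 = nn.Conv2d(c, ck, 1)  # query embedding W1 (C' = C/k)
        self.f2 = nn.Conv2d(c, ck, 1)  # key embedding W2
        self.f3 = nn.Conv2d(c, c, 1)   # value embedding W3
        self.transform = nn.Sequential(
            nn.Conv2d(c, c // r, 1),             # W_v1, reduce by ratio r
            nn.LayerNorm([c // r, 1, 1]),        # LN on the 1x1 descriptor
            nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, 1),             # W_v2, restore channels
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Position attention: A[i, j] = softmax_j(F1(x_i)^T F2(x_j)).
        q = self.f1(x).flatten(2).transpose(1, 2)        # B x N x C'
        kmat = self.f2(x).flatten(2)                     # B x C' x N
        attn = torch.softmax(q @ kmat, dim=-1)           # B x N x N
        v = self.f3(x).flatten(2)                        # B x C x N
        z = (v @ attn.transpose(1, 2)).view(b, c, h, w)  # z_i = sum_j A[i,j] F3(x_j)
        z = z + x                                        # residual connection
        # Channel attention: squeeze spatial info, transform, broadcast-add.
        s = F.adaptive_avg_pool2d(z, 1)                  # B x C x 1 x 1
        return z + self.transform(s)
```

Four such layers, applied to {P2, P3, P4, P5}, would produce {A2, A3, A4, A5}.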

Loss Function
When training the RPN, we assign a binary category label to each anchor. We assign a positive label to anchors that meet either of two conditions: (1) the anchor has the highest Intersection-over-Union (IoU) overlap with a ground-truth box; or (2) the IoU overlap between the anchor and a ground-truth box is greater than 0.7. When the IoU overlap with every ground-truth box is less than 0.3, the anchor is considered background (non-object) and is assigned a negative label. It is worth noting that anchors that are neither positive nor negative samples do not contribute to the training objective. We minimize an objective function following the multi-task loss of Mask R-CNN, defined as:

L = L_rpn + λ1 L_cls + λ2 L_reg + λ3 L_mask, (8)

where L_rpn, L_cls, L_reg and L_mask are the same as defined in [11], and λ1, λ2 and λ3 are the balance parameters between the task losses. In this paper, we set λ1 = λ2 = λ3 = 1.0. The mask branch of Mask R-CNN has a Km² dimensional output for each RoI, encoding K binary masks of resolution m × m, one for each of the K classes. To this end, we apply a per-pixel sigmoid to each output of the mask branch, and L_mask is defined as the average binary cross-entropy loss. When an RoI is associated with ground-truth class k, only the mask output of class k contributes to the loss. In addition, L_reg is defined as:

L_reg(t, t*) = Σ_i smooth_L1(t_i − t*_i), (9)

in which

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.

For the bounding box regression, we adopt the parameterization of four coordinates, defined as follows:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a),
t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a),

where x, y, w and h denote the box's center coordinates, width and height. Variables x, x_a and x* are for the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, h).
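The regression loss and box parameterization above can be written directly; `smooth_l1` and `encode_boxes` are our own illustrative names:

```python
import torch

def smooth_l1(x):
    """Smooth L1 used in the box-regression term L_reg:
    0.5*x^2 if |x| < 1, |x| - 0.5 otherwise."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def encode_boxes(box, anchor):
    """Faster R-CNN style four-coordinate parameterization of a box
    (x, y, w, h) relative to its anchor (x_a, y_a, w_a, h_a):
    t = ((x - x_a)/w_a, (y - y_a)/h_a, log(w/w_a), log(h/h_a))."""
    x, y, w, h = box.unbind(-1)
    xa, ya, wa, ha = anchor.unbind(-1)
    return torch.stack([(x - xa) / wa, (y - ya) / ha,
                        torch.log(w / wa), torch.log(h / ha)], dim=-1)
```

Summing `smooth_l1` over the encoded differences between predicted and ground-truth parameterizations yields L_reg.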

Experiments and Results
In this section, we introduce the datasets and implementation details used in our experiments. All experiments in this paper were implemented in PyTorch on a server with an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.

Datasets
We evaluate our approach on two public remote sensing datasets, DOTA and NWPUVHR-10, which are briefly introduced as follows:
DOTA [15] is a large dataset for object detection in aerial images, used for developing and evaluating object detectors in the remote sensing field. It contains 2806 aerial images from different sensors and platforms. Each image ranges in size from 800 × 800 to 4000 × 4000 pixels and contains objects with a wide variety of scales, orientations, and shapes. The images were annotated by aerial image interpreters using 15 common object categories. The fully annotated DOTA benchmark contains 188,282 instances, each labeled with an arbitrary quadrilateral. DOTA has two detection tasks: horizontal bounding box (HBB) and oriented bounding box (OBB). To ensure that the training and test data distributions approximately match, half of the original images were selected as the training set, 1/6 as the validation set, and 1/3 as the test set. We divided the DOTA images into sub-images of size 1024 × 1024 with an overlap of 200 pixels, scaled them to 1333 × 800, and removed blank samples that did not contain any object. With all these processes, we obtain 10,276 patches for training, 3626 for validation and 10,833 for testing.
NWPUVHR-10 [32,33] is a public detection dataset with 10 classes of geospatial objects, used for research purposes only. It contains a total of 800 very-high-resolution (VHR) remote sensing images culled from Google Earth and the Vaihingen dataset, manually annotated by experts using 10 common object categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge and vehicle. The NWPUVHR-10 dataset contains two sub-datasets: a positive dataset of 650 images and a negative dataset of 150 images.
For the positive dataset, each image contains at least one object to be detected; hence, we only use the positive dataset of NWPUVHR-10. Each image in the positive dataset is about 1000 pixels in size. In this paper, the split ratios of the training, validation and test sets are 60%, 20% and 20%, respectively.

RPN
The RPN is used to generate object proposals for the subsequent Fast R-CNN and mask branches. We adapt the RPN by replacing the single-scale feature map with our multi-scale feature maps, assigning anchors of different sizes at different stages. Specifically, on the five stages {A2, A3, A4, A5, A6}, the areas of the anchors are set to {32², 64², 128², 256², 512²} pixels, respectively. It is worth noting that A6 is obtained from A5 through max pooling. Meanwhile, anchors with aspect ratios {1:2, 1:1, 2:1} are adopted at each stage.
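The per-stage anchor scheme can be sketched as follows; `stage_anchors` is our own helper name, and anchors are expressed as (x1, y1, x2, y2) boxes centered at the origin:

```python
import numpy as np

def stage_anchors(area, ratios=(0.5, 1.0, 2.0)):
    """Generate the single-scale, multi-ratio base anchors for one pyramid
    stage: each anchor has the given area, with width/height equal to the
    ratio, centered at the origin."""
    anchors = []
    for ratio in ratios:
        w = np.sqrt(area * ratio)  # from w/h = ratio and w*h = area
        h = w / ratio
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# Three base anchors per stage; areas 32^2 ... 512^2 for A2 ... A6.
pyramid = {f"A{i + 2}": stage_anchors((32 * 2 ** i) ** 2) for i in range(5)}
```

At inference time these base anchors are tiled over every spatial position of the corresponding feature map.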

Training
Since Mask R-CNN is our baseline network, we set hyper-parameters mainly following Mask R-CNN. Our base network is ResNet-101, initialized with weights pre-trained on ImageNet; all new layers are initialized with Kaiming normal initialization. In the training stage of all experiments, we used SGD as the optimizer, with a batch size of 2 (one GPU computing 2 images), a momentum of 0.9 and a weight decay of 0.0001. We train our model for 12 epochs with a learning rate of 0.0025, and use a linear warmup strategy over 500 steps to accelerate convergence; the learning rate decreases to 0.00025 and 0.00003 at the 8th and 11th epochs. The mini-batch sizes of the RPN and Fast R-CNN are set to 256 and 512 per image, with a 1:3 ratio of positive to negative samples.

Inference
In the inference stage, the RPN first generates many object proposals; after NMS with a threshold of 0.7, 1000 proposals are fed into Fast R-CNN. Fast R-CNN then fine-tunes the target positions according to the proposals generated in the first stage, obtains each object's category and horizontal candidate boxes by regression, and removes redundant candidate boxes through NMS with a threshold of 0.5. The kept candidate boxes are input to the mask branch to generate the objects' shape masks. Finally, the objects' rotation bounding boxes are generated from the predicted shape masks.
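The greedy NMS applied at both thresholds is the standard algorithm; a minimal NumPy sketch (function name is our own):

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy non-maximum suppression over horizontal (x1, y1, x2, y2)
    boxes: repeatedly keep the highest-scoring box and discard boxes
    whose IoU with it exceeds iou_thresh (0.7 after RPN, 0.5 after
    Fast R-CNN in our pipeline)."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```
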

Evaluation Indicators
To quantitatively evaluate the proposed method, we use the Average Precision (AP), the precision-recall curve (PRC), and the mean Average Precision (mAP) as evaluation indicators. AP, PRC and mAP are well-known and widely applied indicators for evaluating object detection methods [34]. The PRC can be obtained through four evaluation components: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) [35]. TP and FP indicate the numbers of targets detected correctly and incorrectly, respectively, and FN represents the number of targets not detected. Based on these components, recall and precision are defined as:

recall = TP / (TP + FN), precision = TP / (TP + FP).

AP is the average precision of a target class over the range recall = 0 to recall = 1, and is generally the area under the PRC. mAP is the average of the AP values over all classes; the larger the mAP, the better the detection performance.
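These indicators can be computed as follows; the monotone precision envelope in `average_precision` follows the common VOC-style area-under-PRC computation, which we assume here:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, after enforcing
    a monotonically decreasing precision envelope (VOC-style)."""
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    # Make precision non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then simply the mean of `average_precision` over all classes.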

Peer Methods Comparison
The proposed RADet, with the Refine Feature Pyramid Network and the Multi-layer Attention Network, is compared with other object detectors on two datasets: DOTA and NWPUVHR-10. The results show that our model achieves competitive performance and outperforms the other models.

Results on DOTA
In addition to the official baselines given by DOTA, we also compared our results with R-DFPN [16], R2CNN [13], RRPN [14] and the method proposed by Yang et al. in [36]. The performance of these methods is shown in Table 2. As can be seen from Table 2, thanks to the proposed Multi-layer Attention Network, RADet significantly improves the detection of small objects surrounded by complex backgrounds, such as bridges, ships, swimming pools, small vehicles, and large vehicles. Moreover, with the proposed Refine Feature Pyramid Network, the detection of objects that coexist in the same image at very different scales, such as baseball diamonds and small vehicles, or harbors and ships, is also improved simultaneously. In conclusion, our method outperforms the existing published results, reaching 69.09% mAP.
Some detection examples of RADet on the DOTA dataset are shown in Figure 7. It can be seen that the false alarm rate of the proposed RADet is very low, while the recall rate is high. Figure 7 also shows that RADet deals well with complex background noise and can detect targets with large scale variations.

Results on NWPUVHR-10
On the NWPUVHR-10 dataset, we evaluate our method using AP, mAP and the PRC as evaluation indicators. Table 3 shows the overall performance comparison of our method and other classical object detection algorithms on NWPUVHR-10. Our method also achieves first place on this dataset, with 90.24% mAP. As Figure 8 shows, our method achieves the best detection results in more than half of the categories, such as airplane, vehicle, basketball court, ground track field, baseball diamond and tennis court. In short, a comprehensive analysis of the AP values, mAP values and PRCs shows that RADet achieves the best detection performance.

Quantitative Analysis
To verify the effectiveness of the proposed approach, we conduct two sets of ablation experiments on the test set of the DOTA dataset. All results were obtained by submitting predictions to the official DOTA evaluation server. In both sets of ablation experiments, we use AP and mAP as evaluation indicators. Table 4 shows the results of our model on DOTA with two different up-sampling methods in the Refine Feature Pyramid Network. Table 5 summarizes the results of our model with different settings on the DOTA dataset.
Baseline setting. We chose Mask R-CNN with FPN as our baseline. For fairness, all experiments use ResNet-101 as the base model, and all experimental data and parameter settings are kept strictly consistent.
Effect of the Refine Feature Pyramid Network. Replacing the FPN in the baseline with the proposed Refine Feature Pyramid Network increases the total mAP by 0.54%. As discussed in Section 2.2, our resize-convolution effectively reduces the checkerboard effect generated during up-sampling, which is also borne out by the results in Table 4: compared with the Refine Feature Pyramid Network using deconvolution, the version using resize-convolution increases mAP by 0.78%.

Effect of Multi-layer Attention Network.
To further suppress the influence of background noise and highlight object features, we propose the Multi-layer Attention Network. The results in Table 5 show that it significantly improves the detection of small objects, such as swimming pools and storage tanks, that may be surrounded by complex backgrounds. Adding the proposed Multi-layer Attention Network to the baseline increases the total mAP of the model by 0.88% to 65.98%, the AP of the storage tank category by 5.75%, and the AP of the swimming pool category by 3.15%. In addition, adding the Multi-layer Attention Network to our model with the Refine Feature Pyramid Network also improves performance, which further demonstrates the effectiveness and portability of the Multi-layer Attention Network.

Qualitative Analysis
Qualitative analysis of resize-convolution. As can be seen from Figure 9b, using deconvolution for up-sampling produces serious uneven overlap (checkerboard artifacts), which degrades the detection performance of the final network, as shown in Table 4. Similarly, as shown in Figure 9c, using only nearest-neighbor interpolation for up-sampling produces some aliasing effects. Our resize-convolution (nearest-neighbor interpolation followed by convolution) works best, as shown in Figure 9d, because the 3 × 3 convolution can filter out some of the aliased high-frequency signal.

Qualitative analysis of Multi-layer Attention Network. Due to the complexity of real-world data such as remote sensing images, there is abundant noise near the objects. Extensive noise can overwhelm the object information and blur the boundaries between objects, as shown in Figure 10b, leading to missed detections and increased false alarms. Figure 10c shows that the proposed Multi-layer Attention Network effectively suppresses background noise and highlights object information, which helps improve the final detection performance of the model. In addition, as shown in Figure 10d, thanks to the position attention block, the proposed RADet attends to contextual information around the swimming pool, such as small vehicles and houses, which greatly helps its precise localization.

Generality of RFPN and MANet. To verify that the proposed modules generalize beyond our framework, we also add them to Faster R-CNN; the results are reported in Table 6 (the larger the mAP, the better the detection performance). As can be seen from Table 6, the mAP of Faster R-CNN increases to 61.79% (by 1.33%) after adding the Refine Feature Pyramid Network, and to 61.82% (by 1.36%) after adding the Multi-layer Attention Network, which further proves the effectiveness of the proposed modules. In fact, although RFPN and MANet improve the overall detection performance, the per-class APs show that they weaken Faster R-CNN's performance on large targets to some extent.

Sensitivity Analysis of NMS Threshold for RADet
Non-maximum suppression (NMS) is the most commonly used post-processing method in the object detection field. NMS eliminates redundant boxes, leaving the best detection for each object, so the NMS threshold directly affects the final detection performance. Figure 11 analyzes the effect of the NMS threshold used in the post-processing of the proposed RADet on the AP of each category and on the overall mAP. As can be seen in Figure 11, when the NMS threshold is too high or too low, both AP and mAP decline, i.e., the detection performance of the algorithm deteriorates. This is because when the NMS threshold is too high, too few redundant boxes are removed and the probability of false detections rises; when the threshold is too low, too many boxes are removed, the recall is low, and the probability of missed detections rises.
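The greedy NMS procedure whose threshold is analyzed above can be sketched as follows; the toy boxes in the usage note are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    whose IoU with it exceeds `thresh`, then repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

For the boxes `[(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]` with scores `[0.9, 0.8, 0.7]`, the first two overlap with IoU 0.81: a threshold of 0.5 suppresses the duplicate (keeping indices 0 and 2), while a threshold of 0.9 keeps all three, illustrating how an overly high threshold leaves redundant boxes in place.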

Conclusions
In this paper, we propose an end-to-end multi-category detector for arbitrary-oriented objects in remote sensing images. Our method builds on Mask R-CNN and obtains the rotated bounding box of each object from the shape mask predicted by the network. In addition, considering that object scales in remote sensing images vary greatly, we adopt a pyramid-structured backbone to obtain multi-scale feature maps and further improve the up-sampling method to reduce the checkerboard effect produced by deconvolution. Based on this, we propose the Refine Feature Pyramid Network, which overcomes the difficulty of large differences in object scale while effectively reducing the checkerboard effect. Moreover, the proposed RADet weakens the influence of noise from complex backgrounds and highlights object features through the proposed Multi-layer Attention Network, which further improves the detection of small objects surrounded by complex backgrounds. Our method achieves the best detection performance on two public remote sensing image datasets: DOTA and NWPU VHR-10.
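For reference, extracting a rotated bounding box from a predicted shape mask is commonly done by taking the minimum-area enclosing rectangle of the mask's foreground pixels (OpenCV's `cv2.minAreaRect` provides this); whether the paper uses that exact routine is not stated here, so the following is a dependency-free sketch of the idea via convex hull and rotating calipers:

```python
import math

def convex_hull(points):
    """Andrew's monotone-chain convex hull; returns vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_area_rect(points):
    """Rotating calipers: the minimum-area enclosing rectangle has one side
    collinear with a hull edge. Returns (angle_radians, width, height)."""
    hull = convex_hull(points)
    best = None
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        theta = math.atan2(y2 - y1, x2 - x1)
        c, s = math.cos(theta), math.sin(theta)
        # project hull points onto the edge direction and its normal
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        w, h = max(us) - min(us), max(vs) - min(vs)
        if best is None or w * h < best[1] * best[2]:
            best = (theta, w, h)
    return best
```

For a mask shaped like a 45°-rotated square with corners (1, 0), (2, 1), (1, 2), (0, 1), the minimum-area rectangle has area 2, whereas the axis-aligned bounding box would cover area 4, which is why rotated boxes fit oriented remote-sensing objects much more tightly.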
There is no doubt that there is still room for improvement. Since our method derives the rotated bounding box from the predicted shape mask, a poorly predicted mask degrades the quality of the rotated box. In addition, like most two-stage object detectors, our method does not run in real time. Therefore, in the future we are interested in the following directions: 1) further improving the mask branch to obtain better shape masks; 2) implementing RADet in an anchor-free manner, to make it lighter and more flexible.