Single-Stage Rotation-Decoupled Detector for Oriented Object
Remote Sens. 2020, 12, 3262

Abstract: Oriented object detection has received extensive attention in recent years, especially for the task of detecting targets in aerial imagery. Traditional detectors locate objects by horizontal bounding boxes (HBBs), which may cause inaccuracies when detecting objects with arbitrary orientation angles, dense distribution and large aspect ratios. Oriented bounding boxes (OBBs), which add different rotation angles to the horizontal bounding boxes, can better deal with the above problems. However, new problems arise with the introduction of oriented bounding boxes for rotation detectors, such as an increase in the number of anchors and the sensitivity of the intersection over union (IoU) to changes of angle. To overcome these shortcomings while retaining the advantages of the oriented bounding boxes, we propose a novel rotation detector which redesigns the matching strategy between oriented anchors and ground truth boxes. The main idea of the new strategy is to decouple the rotated bounding box into a horizontal bounding box during matching, thereby reducing the instability that the angle introduces into the matching process. Extensive experiments on public remote sensing datasets including DOTA, HRSC2016 and UCAS-AOD demonstrate that the proposed approach achieves state-of-the-art detection accuracy with high efficiency.


Introduction
With the increasing number of applications based on convolutional neural networks (CNNs) in the field of computer vision, object detection algorithms have developed rapidly. Existing detectors [1][2][3][4] have achieved promising results on real-life datasets including MS COCO [5] and VOC2007 [6]. Related models typically use horizontal bounding boxes (HBBs) to locate targets. Most targets in remote sensing imagery, however, are characterized by arbitrary directionality, high aspect ratios and dense distribution; consequently, models based on HBBs may cause serious overlap and noise. The rotated bounding box was subsequently devised to deal with these targets, with the advantages of capturing the target more accurately and introducing minimal background noise. In addition, oriented bounding boxes (OBBs) separate densely distributed targets cleanly and thus avoid overlap between adjacent bounding boxes. Specifically, for the detection of ships and vehicles, oriented detectors [7][8][9][10][11] based on rotated bounding boxes perform well.
However, the introduction of the rotated bounding box also raises problems, due to the sensitivity of the intersection over union (IoU) to changes in angle. A small angle change causes a rapid drop in the IoU, which leads to inaccurate detection, and the use of oriented anchors leads to a sharp increase in the number of anchors. As a result of these problems, the IoU between the matched oriented anchor and the ground truth box fluctuates dramatically with the change of angle.
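The sensitivity described above can be reproduced numerically. The following self-contained sketch (illustrative only; the polygon-clipping approach and all names are ours, not the paper's implementation) computes the IoU between a box and a slightly rotated copy of itself via Sutherland-Hodgman clipping:

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Counterclockwise corners of a w x h rectangle centered at (cx, cy),
    rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c)
            for dx, dy in ((-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2))]

def poly_area(poly):
    """Shoelace formula for a simple polygon."""
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))) / 2

def _intersect(p, q, a, b):
    """Intersection of segment pq with the infinite line through a and b."""
    d = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
    t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / d
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

def clip(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by the convex CCW `clipper`."""
    out = subject
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        inp, out = out, []
        def inside(p):
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0
        for p, q in zip(inp, inp[1:] + inp[:1]):
            if inside(q):
                if not inside(p):
                    out.append(_intersect(p, q, a, b))
                out.append(q)
            elif inside(p):
                out.append(_intersect(p, q, a, b))
        if not out:
            return []
    return out

def rotated_iou(box1, box2):
    """IoU of two rotated boxes, each given as (cx, cy, w, h, theta)."""
    p1, p2 = rect_corners(*box1), rect_corners(*box2)
    inter_poly = clip(p1, p2)
    inter = poly_area(inter_poly) if inter_poly else 0.0
    return inter / (poly_area(p1) + poly_area(p2) - inter)

# A small rotation barely affects a square but sharply reduces the IoU
# of a high-aspect-ratio box.
square = rotated_iou((0, 0, 20, 20, 0), (0, 0, 20, 20, math.radians(10)))
thin = rotated_iou((0, 0, 40, 5, 0), (0, 0, 40, 5, math.radians(10)))
```

For a 10° rotation, the IoU of the 40 × 5 box falls far below that of the 20 × 20 square, illustrating why small angle errors penalize elongated objects much more heavily.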

Horizontal Object Detection
Object detection algorithms typically use horizontal bounding boxes to locate targets. At the beginning of the application of the convolutional neural network (CNN) to the object detection task, R-CNN [22] uses a selective search algorithm to generate category-independent region proposals and then extracts fixed-length feature vectors from each region proposal through CNN for classification. Due to the success of R-CNN, many models have been developed based on it. Fast R-CNN [23] introduces RoI Pooling to process region proposals, which effectively reduces computational redundancy. Faster R-CNN [24] uses a region proposal network (RPN) to generate region proposals so that the model can be trained end-to-end. At this point, the structure of the two-stage detector is basically determined: generating region proposals and then predicting the precise location of the targets and the corresponding category labels. According to the characteristics of the two-stage detector, Mask R-CNN [25] embeds the image segmentation task into the detector's second-stage task, effectively improving the accuracy of instance segmentation. In order to achieve real-time detection, single-stage detectors have appeared that perform two stages simultaneously. YOLO [26] grids the images and performs simultaneous category prediction and position regression directly on the feature map output from CNN. SSD [1] makes full use of multiple feature maps with different resolutions to naturally predict targets of different sizes. In order to solve the category imbalance problem of single-stage detectors, RetinaNet [2] proposes Focal Loss, which is a dynamically scaled cross entropy loss. RefineDet [3] uses the anchor refinement module (ARM) and the object detection module (ODM) to imitate the two-stage structure and produces accurate detection results with high efficiency. EfficientDet [4] realizes easy and fast multi-scale feature fusion through a weighted bi-directional feature pyramid network (BiFPN). EfficientDet [4] also proposes a new compound scaling method to make the model complexity and accuracy adjustable. The above-mentioned methods are all anchor-based; in recent years, anchor-free methods have begun to emerge. CornerNet [27] locates the target by learning a pair of key points: the top-left corner and the bottom-right corner. CenterNet [28] models the object as a key point and directly predicts the center point and other properties of the object. FCOS [29] further optimizes the performance of anchor-free detectors and unifies the detection process with other fully convolutional network (FCN)-solvable tasks. In general, the two-stage detector maintains a high detection accuracy rate, while the single-stage detector achieves a balance between efficiency and accuracy.

Oriented Object Detection
The application of rotating object detection to aerial imagery is being extensively studied. Anchor-based methods show strong stability across multiple oriented object detection benchmarks. Considering the difficulty anchor-based methods have with high-aspect-ratio objects, anchor-free methods are also widely applied. For remote sensing object detection, RoI Transformer [13] learns the transformation from HRoIs to RRoIs and then extracts rotation-invariant features from the RRoI through rotated position-sensitive RoI alignment. R3Det [30] uses a combined strategy, first performing rapid detection based on horizontal anchor boxes and then performing oriented object detection based on refined rotating anchor boxes. In [16], a novel rotated bounding box representation based on a gliding vertex on the horizontal bounding box is introduced to describe multi-oriented objects more accurately and avoid confusion issues. Considering the background noise interference caused by the horizontal bounding box, SCRDet++ [31] proposes instance-level denoising (InLD) for small and cluttered objects. Anchor-free methods also show strong competitiveness in remote sensing object detection. DHN [17] presents a dynamic refinement network which alleviates the misalignment between receptive fields and objects with a feature selection module (FSM) and refines the prediction in an object-wise manner using a dynamic refinement head (DRH). Regarding the angular periodicity problem in rotating object detection, APE [18] represents the angle as continuously changing periodic vectors to avoid ambiguity; APE [18] also designs a length-independent IoU (LIIoU) for long objects to make the detector more robust.

Proposed Method
The proposed RDD is designed on an FPN [32] architecture, which uses multi-scale feature maps for detection and is currently widely adopted. The structure of RDD is simple compared to many current models and, as a rotation detector, its learning of position parameters is also more concise. Arbitrarily oriented targets are represented as rotated bounding boxes, which are more accurate than horizontal bounding boxes; however, the IoU is sensitive to changes of angle. We found in experiments that, with appropriately designed learning targets, the angle can be learned accurately even without using rotating anchor boxes. Specifically, we designed a new rotated bounding box representation method for this purpose. Furthermore, a new rotation-decoupled anchor matching strategy is proposed to optimize the learning of the position information of arbitrarily oriented targets. A positive and negative sample balance strategy is adopted to deal with foreground-background class imbalance. The experimental results show that the proposed method achieves state-of-the-art accuracy on both single-category and multi-category rotation detection datasets.

Network Architecture
RDD has a lightweight network architecture, which is illustrated in Figure 1; the type of each operation is shown at the bottom of the figure. First, the multi-scale feature maps are obtained from the backbone network; the widely used ResNet101 [33] was chosen for the experiments in this paper. Second, the multi-scale feature maps are fed into the pyramid structure network for feature fusion; the details of the pyramid structure network are illustrated in Figure 1. The pyramid network realizes the transmission of semantic information, which is helpful for multi-scale object detection. To connect the feature layers of different scales, we up-sample the feature map and sum it with the feature map of the previous layer in an element-wise manner, adding a convolutional layer before and after the summation to ensure the discriminability of features for detection. Finally, the prediction layers output the classification and regression. Classification and regression use two prediction layers with the same structure, differing only in the number of output channels: for classification, the number of output channels is a × c; for regression, it is a × 5, where a and c refer to the number of anchors per location and the number of categories, respectively. The illustrated structure can also be extended to more layers in practice. In this study, a rotation-decoupled anchor matching strategy was designed for the training stage, so that only horizontal anchors, instead of oriented anchors, are employed by the proposed model.
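The element-wise top-down fusion and the head channel arithmetic described above can be sketched as follows (a minimal NumPy illustration using nearest-neighbor upsampling; the convolutional layers are omitted, and the values of a and c are the DOTA-style setting we assume for illustration):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(top, lateral):
    """Top-down FPN fusion: upsample the coarser map, then add element-wise.
    The convolutions placed before and after the sum are omitted here."""
    return upsample2x(top) + lateral

# Prediction-head channel counts (assumed illustrative values:
# a = 3 scales x 7 aspect ratios anchors per location, c = 15 categories).
a, c = 3 * 7, 15
cls_channels = a * c   # classification branch: a x c output channels
reg_channels = a * 5   # regression branch: a x 5 channels (x, y, w, h, theta)

p5 = np.zeros((256, 8, 8))        # coarser pyramid level
p4_lateral = np.ones((256, 16, 16))
p4 = fuse(p5, p4_lateral)          # fused map, shape (256, 16, 16)
```

The fused map keeps the lateral map's spatial resolution, and each head is just a convolution whose channel count encodes the per-anchor outputs.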

Rotated Bounding Box Representation
The HBB has good robustness but insufficient accuracy. It is usually represented by (x, y, w, h), where (x, y) is the center and w and h are the lengths of the bounding box along the X and Y axes, respectively. The OBB is more accurate but less robust. It is usually represented by (x, y, w, h, θ), where θ is the angle of the bounding box; however, the periodicity of θ often causes a sudden IoU drop, making inaccurate object detection unavoidable, especially in the case of large aspect ratios. The HBB and OBB thus have complementary advantages and disadvantages; therefore, the advantages of the two representation methods are combined to redefine a new bounding box.
Traditionally, an HBB defined as (x, y, w, h) differs from an HBB defined as (x, y, h, w); here, we redefine them as HBBh, the HBB in the horizontal direction, and HBBv, the same HBB in the vertical direction. Therefore, for any OBB, we can find a corresponding HBBh/v with the same shape and center point, such that the angle θ between them lies within [−π/4, π/4]. We redefine θ as the angle of the new bounding box. Figure 2 illustrates the redefinition of the bounding box intuitively. Under the new definition, the HBBh/v is used as the ground truth box to match with the anchors, which effectively avoids the angle periodicity problem induced by the OBB.
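The decoupling of an arbitrary OBB into its HBBh/v plus an angle near zero can be written compactly (an illustrative sketch; the half-open interval [−π/4, π/4) and the function name are our choices):

```python
import math

def decouple(x, y, w, h, theta):
    """Decouple an OBB (x, y, w, h, theta) into the equivalent representation
    whose angle lies in [-pi/4, pi/4): the corresponding HBBh/v keeps the
    same center and shape, with w and h swapped at each pi/2 step."""
    # Bring theta into [-pi/2, pi/2) using the pi-periodicity of a rectangle.
    theta = (theta + math.pi / 2) % math.pi - math.pi / 2
    if theta >= math.pi / 4:
        theta -= math.pi / 2
        w, h = h, w
    elif theta < -math.pi / 4:
        theta += math.pi / 2
        w, h = h, w
    return x, y, w, h, theta
```

For example, a 10 × 4 box at 80° becomes a 4 × 10 box at −10°: the same rectangle, now described relative to its nearest horizontal/vertical HBB.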

Anchor Setting
Previous rotation detectors usually set a large number of rotating anchors to obtain more accurate detection results for objects with arbitrary angles. In contrast, the proposed method adopts the anchor-selecting strategy of SSD and uses only horizontal anchors instead of oriented anchors, so that it largely eliminates the influence of the angle and focuses more on shape matching. Furthermore, several times fewer anchors are required compared to methods based on oriented anchors, which greatly accelerates the training and inference process.
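A back-of-the-envelope count illustrates the saving (assuming the settings used in the comparison in this paper: 3 scales, 7 aspect ratios and angles at 30° intervals over [0°, 180°)):

```python
# Anchors per feature-map location under each strategy.
scales, ratios = 3, 7
horizontal_anchors = scales * ratios            # 21 horizontal anchors

# Oriented anchors replicate every shape at each sampled angle; a rectangle
# repeats every 180 degrees, so 30-degree intervals give 6 distinct angles.
angles = 180 // 30
oriented_anchors = horizontal_anchors * angles  # 126 oriented anchors
```

With this setting, the oriented-anchor strategy needs six times as many anchors per location for the same shape coverage.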

Rotation-Decoupled Anchor Matching Strategy
Based on the redefined bounding box, we implement a new rotation-decoupled anchor matching strategy. Before matching, the rotated bounding box/ground truth box is decoupled into an HBBh/v and an acute angle, and the HBBh/v is used as the new ground truth box for matching. Subsequently, a matching strategy similar to that of SSD [1], based on horizontal anchors, is adopted. Figure 3 shows the difference between the proposed strategy and the strategy based on oriented anchors: the IoU between the horizontal anchor and the decoupled ground truth box does not consider the angle, whereas the IoU between the rotating anchor and the ground truth box does. Specifically, anchors are assigned to ground truth boxes and considered as foreground (positive samples) when the IoU is greater than a given threshold, and considered as background (negative samples) when the IoU is below another given threshold. In this study, the foreground IoU threshold is set to 0.5 and the background IoU threshold to 0.4, as in RetinaNet [2]. The proposed matching strategy suppresses the influence of angles and pays more attention to the matching of shapes. Thus, a ground truth box naturally matches the horizontal bounding boxes with the smallest angle to its principal axis, which avoids the periodicity of the angle.

For a better comparison, we simulate the matching process of the proposed strategy and the previous strategy based on oriented anchors. We use horizontal anchors at seven aspect ratios {1, 2, 1/2, 4, 1/4, 8, 1/8}, with three scales {2^0, 2^(1/3), 2^(2/3)} added for denser scale coverage. Oriented anchors are obtained by adding a series of angles at 30° intervals to the horizontal anchors. Figure 4 shows an example of the matching results under both strategies. It can be seen that, despite the dense anchor setting, the overlap between the ground truth box and the matched anchor is not high at some angles. This is because the angles of the oriented anchors are set at fixed intervals without considering the aspect ratio, so only anchors with a limited number of angles are available for matching; the problem is exacerbated by the sensitivity of the IoU to changes of angle.

Further, we plot the change curves of the maximum IoU and the number of matched anchors under both strategies in Figure 5, under the conditions of Figure 4, with the foreground IoU threshold set to 0.5. Figure 5 shows that, for the strategy based on oriented anchors, the maximum IoU curve fluctuates sharply as the aspect ratio increases, while the maximum IoU curve obtained with the proposed strategy is unaffected. The difference is even more pronounced in the number of matched anchors: as the aspect ratio increases, the difference in the number of matched anchors grows rapidly for oriented anchors with the same shape and different angles. In some cases, the number of anchors matching a ground truth box even reaches 0, which means the model cannot learn from that ground truth box. Such a large difference is unreasonable; the ideal situation is that the matching result is unaffected by angle changes. In contrast, the proposed strategy remains stable under various conditions.
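After decoupling, the assignment step of the proposed strategy reduces to plain axis-aligned IoU thresholding, which can be sketched as follows (illustrative NumPy code; boxes are (x1, y1, x2, y2) corners and the thresholds are the 0.5/0.4 values given above):

```python
import numpy as np

def hbb_iou(anchors, gt):
    """IoU between axis-aligned anchors (N, 4) and one ground truth box (4,),
    both given as (x1, y1, x2, y2)."""
    ix1 = np.maximum(anchors[:, 0], gt[0])
    iy1 = np.maximum(anchors[:, 1], gt[1])
    ix2 = np.minimum(anchors[:, 2], gt[2])
    iy2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def assign(anchors, gt, fg_thresh=0.5, bg_thresh=0.4):
    """Label anchors: 1 = foreground, 0 = background, -1 = ignored.
    The gt box here is the decoupled HBBh/v, so no angle is involved."""
    iou = hbb_iou(anchors, gt)
    labels = np.full(len(anchors), -1)
    labels[iou < bg_thresh] = 0
    labels[iou >= fg_thresh] = 1
    return labels
```

Anchors whose IoU falls between the two thresholds are ignored during training, exactly as in the two-threshold scheme described above.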

Positive and Negative Sample Balance Strategy
After the matching step, most anchors are labeled as background (the negative class), while few are labeled as foreground (the positive class). The foreground-background imbalance problem occurs during training and does not depend on the number of samples in each class [34]. We implement a balance strategy similar to Focal Loss [2], with the difference that the new strategy no longer dynamically scales the loss. We assign category labels of 1, 0 and −1 to foreground anchors, background anchors and ignored anchors, respectively. The corresponding binary cross-entropy loss is defined as follows:

L(p, y) = −log(p) if y = 1; −log(1 − p) if y = 0 (1)

where y is the class label of an anchor and p is the predicted probability; anchors labeled −1 are ignored. The classification loss is defined as follows:

L_cls = (1/N_pos) Σ_i L(p_i, y_i) (2)

where N indicates the number of anchors and N_pos indicates the number of foreground anchors.

Positive samples make a more stable contribution to the classification loss than negative samples, but the number of positive samples is small compared to that of negative samples; therefore, N_pos instead of N is used for averaging, which better mitigates the sample imbalance. Formulas (1) and (2) are the key to the sample balance strategy. As with Faster R-CNN [24], the ground truth box (x, y, w, h, θ) and the prediction box (x′, y′, w′, h′, θ′) are encoded relative to the anchor as the offset vectors v* and v for position regression. The definitions of v and v* are listed in Equations (3) and (4), where the anchors are expressed as (x_a, y_a, w_a, h_a):

v = ((x′ − x_a)/w_a, (y′ − y_a)/h_a, log(w′/w_a), log(h′/h_a), θ′) (3)

v* = ((x − x_a)/w_a, (y − y_a)/h_a, log(w/w_a), log(h/h_a), θ) (4)

We adopt the smooth-L1 loss for rotated bounding box regression, and only the foreground anchors are included:

L_reg = (1/N_pos) Σ_{i∈pos} smooth_L1(v_i − v*_i) (5)

Finally, the multi-task loss is defined as

L = L_cls + α L_reg (6)

The trade-off between the two terms is controlled by the balancing parameter α. In the experiments presented in this paper, α is set to 1.
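The sample-balanced classification loss (binary cross-entropy averaged by N_pos) and the foreground-only smooth-L1 regression loss can be sketched in NumPy as follows (an illustrative implementation, not the authors' code; the element-wise treatment of the angle residual is our assumption):

```python
import numpy as np

def cls_loss(p, y, eps=1e-9):
    """Binary cross-entropy over non-ignored anchors, averaged by N_pos.
    Labels: y = 1 foreground, y = 0 background, y = -1 ignored."""
    pos, neg = y == 1, y == 0
    n_pos = max(pos.sum(), 1)
    loss = -np.log(p[pos] + eps).sum() - np.log(1 - p[neg] + eps).sum()
    return loss / n_pos

def smooth_l1(d):
    """Smooth-L1 applied element-wise to a regression residual d."""
    d = np.abs(d)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5)

def reg_loss(v, v_star, y):
    """Smooth-L1 regression loss over foreground anchors only, where
    v and v_star are (N, 5) predicted and target offset vectors."""
    pos = y == 1
    n_pos = max(pos.sum(), 1)
    return smooth_l1(v[pos] - v_star[pos]).sum() / n_pos

def total_loss(p, y, v, v_star, alpha=1.0):
    """Multi-task loss: classification plus alpha times regression."""
    return cls_loss(p, y) + alpha * reg_loss(v, v_star, y)
```

Dividing both terms by N_pos rather than N keeps the foreground contribution from being drowned out by the far more numerous background anchors.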

Experiments
We evaluate the proposed detector on three public remote sensing datasets annotated with oriented bounding boxes: DOTA [19], HRSC2016 [20] and UCAS-AOD [21].

Datasets and Settings
DOTA is a large remote sensing dataset for object detection which contains 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). DOTA contains 2806 aerial images collected from different sensors and platforms, including 1411 images for training, 937 images for testing and 458 images for validation. The size of each image ranges from approximately 800 × 800 to 4000 × 4000 pixels. The dataset labels a total of 188,282 targets with both horizontal bounding boxes and oriented bounding boxes. We cropped the original images into sub-images of different sizes {512, 768, 1024, 1536} with an overlap of 0.25 and resized them to 768 × 768 for training and testing. We trained on the training and validation sets and evaluated the model on the test set. We trained the proposed network for a total of 250,000 iterations, with an initial learning rate of 0.001, which was then set to 1 × 10^−4 at 100,000 iterations and 2 × 10^−5 at 200,000 iterations.
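The sliding-window cropping with fractional overlap can be sketched as follows (an illustrative helper; the authors' exact cropping tool is not specified, and a window larger than a small image would be padded in practice):

```python
def crop_windows(img_w, img_h, size, overlap=0.25):
    """Top-left/bottom-right corners of sliding crop windows with the given
    fractional overlap; the last window is clamped to the image border."""
    stride = max(1, int(size * (1 - overlap)))
    xs = list(range(0, max(img_w - size, 0) + 1, stride))
    if xs[-1] != max(img_w - size, 0):
        xs.append(max(img_w - size, 0))
    ys = list(range(0, max(img_h - size, 0) + 1, stride))
    if ys[-1] != max(img_h - size, 0):
        ys.append(max(img_h - size, 0))
    return [(x, y, x + size, y + size) for y in ys for x in xs]
```

For a 1000 × 1000 image and a 512-pixel window, the 0.25 overlap yields a 384-pixel stride and a 3 × 3 grid of windows, with the final row and column snapped to the image edge.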
The HRSC2016 dataset was built for the ship recognition task and contains 1061 images collected from Google Earth. The dataset contains 436 images (1207 samples) for training, 181 images (541 samples) for validation and 444 images (1228 samples) for testing. The image sizes range from 300 × 300 to 1500 × 900 pixels. We resized the images to 768 × 768 for training and testing, trained on the training and validation sets and evaluated the model on the test set. We trained the proposed network for a total of 12,000 iterations, with an initial learning rate of 0.001, which was then set to 1 × 10^−4 at 7500 iterations.
UCAS-AOD contains 1510 aerial images collected from Google Earth. Among them, 7482 planes are annotated in 1000 images and 7114 vehicles are annotated in another 510 images. These images have two sizes: 1280 × 659 pixels and 1714 × 1176 pixels. Since the dataset is not divided into training and test sets, the authors of [19,30,35] randomly selected 1110 images for training and 400 for testing. Similarly, we selected 400 images at regular intervals for testing, and the remaining 1110 images were used for training. We cropped the images into a series of sub-images whose length and width did not exceed 768 pixels. The model was trained for 30,000 iterations in total; the initial learning rate was set to 0.001 and reduced to 1 × 10^−4 and 2 × 10^−5 at 15,000 and 24,000 iterations, respectively. We trained the model with a batch size of 12 on one Titan RTX GPU. The network was trained with an SGD optimizer, with the momentum and weight decay set to 0.9 and 5 × 10^−4, respectively. The anchors had areas of 24^2 to 384^2 on pyramid levels P3 to P7. At each pyramid level, we used anchors at three scales {2^0, 2^(1/3), 2^(2/3)} and set aspect ratios of {1, 2, 1/2, 4, 1/4, 8, 1/8}, {1.5, 1/1.5, 3, 1/3, 5, 1/5, 8, 1/8} and {1, 2, 1/2} for DOTA, HRSC2016 and UCAS-AOD, respectively. To improve the robustness of the model, we used several data augmentation strategies, such as random photometric distortion [36], random horizontal and vertical flipping and random rotation. Additional experiments with ResNet152 [33] as the backbone network kept the same settings, except that the batch size was set to 6. The code will be made public at https://github.com/Capino512/pytorch-rotation-decoupled-detector.
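The per-level anchor shapes follow from the stated areas, scales and aspect ratios. A sketch of the generation (illustrative only; the doubling of the base size from 24 at P3 to 384 at P7 is our assumption, consistent with the stated range):

```python
def level_anchors(base, scales, ratios):
    """(w, h) anchor shapes for one pyramid level with base size `base`:
    each anchor has area (base * scale)^2 and aspect ratio w/h = r."""
    shapes = []
    for s in scales:
        side = base * s
        for r in ratios:
            shapes.append((side * r ** 0.5, side / r ** 0.5))
    return shapes

scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
ratios = [1, 2, 1/2, 4, 1/4, 8, 1/8]   # the DOTA setting
bases = [24, 48, 96, 192, 384]          # assumed P3..P7 base sizes
anchors = {f"P{l}": level_anchors(b, scales, ratios)
           for l, b in zip(range(3, 8), bases)}
```

Each level then carries 3 × 7 = 21 anchor shapes per location, and each shape preserves its target area while stretching to the requested aspect ratio.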

Experimental Results
Results on DOTA. We compare our results on DOTA with other state-of-the-art methods, as shown in Table 1. The results are obtained by submitting the predictions to the official DOTA evaluation server. Existing detectors in DOTA research are mainly two-stage, and their detection speed is usually slower than that of one-stage detectors. Benefiting from the designed rotation-decoupled anchor matching strategy, the proposed single-stage detector achieves performance comparable even to the most advanced two-stage detectors while maintaining a simple network structure. Compared to various methods, our method achieves relatively stable detection results in all categories without any extra network design such as cascade refinement or an attention mechanism; furthermore, our method achieves the highest mAP, higher even than all the other listed two-stage detectors. The effectiveness of the foreground-background class balance strategy also plays an important role.

Results on HRSC2016. The HRSC2016 dataset poses a huge challenge to the accuracy of rotation detectors, since it contains a large number of ship instances with high aspect ratios and arbitrary orientations. Table 2 shows the comparison of the proposed model with other models. The inference and post-processing times are included when calculating frames per second (FPS). The proposed method detects such targets effectively, learning the position information of the oriented object accurately without adding oriented anchors for angle regression. At the same time, benefiting from the simplicity of its implementation, the proposed detector maintains a fairly high detection speed.

Results on UCAS-AOD. The UCAS-AOD dataset annotates two types of targets, airplanes and cars, which have relatively small sizes and cannot occupy the entire image. Therefore, only the feature maps on pyramid levels P3 to P6 are used.
We train the model on the cropped images and then make predictions on the uncropped original images. As shown in Table 3, the proposed detector also performs well on UCAS-AOD.
Table 3. Accuracy comparison on UCAS-AOD. "Ours*" indicates that ResNet152 is used as the backbone.
Figure 6 provides a visual representation of our results on the DOTA, HRSC2016 and UCAS-AOD datasets. It shows that our method yields fairly accurate detection results on each dataset.


Discussion
The experimental results show that the proposed method achieves desirable results on three different benchmarks without additional components. This demonstrates that our method is feasible and has good applicability. However, the proposed method shares the common problems of current anchor-based methods: (1) the anchors need to be set according to the shapes of the objects to be detected; (2) multi-scale anchors are used at each pyramid level, which results in a large amount of computation when the objects are of various shapes; and (3) it is difficult to detect highly overlapping objects. In the future, we will study how to reduce the dependence of the detector on multi-scale anchors and try to design a new single-stage detector that balances performance and efficiency.

Conclusions
In this study, we proposed a novel rotation-decoupled anchor matching strategy to simplify the anchor matching process for anchor-based methods. The new strategy optimizes the way in which the model learns the object position information without using rotating anchors. Instead of learning the shape and angle simultaneously, as with traditional rotating anchors, the proposed strategy learns the shape (using a horizontal anchor) and the angle separately. Based on the proposed strategy, we built a single-stage rotation detector with a simple structure. The detector is accurate and efficient and can be further improved by adding advanced structures to it. The positive and negative sample balance strategy is applied to deal with the foreground-background imbalance problem encountered by single-stage detectors. We performed comparative experiments on multiple rotation detection datasets including DOTA, HRSC2016 and UCAS-AOD. The experimental results demonstrated that our method achieves state-of-the-art detection accuracy with high efficiency.