Point RCNN: An Angle-Free Framework for Rotated Object Detection

Rotated object detection in aerial images remains challenging due to arbitrary orientations, large variations in scale and aspect ratio, and the extreme density of objects. Existing state-of-the-art rotated object detection methods mainly rely on angle-based detectors. However, angle regression easily suffers from the long-standing boundary problem. To tackle this problem, we propose a purely angle-free framework for rotated object detection, called Point RCNN, which mainly consists of PointRPN and PointReg. In particular, PointRPN generates accurate rotated RoIs (RRoIs) by converting learned representative points in a coarse-to-fine manner, motivated by RepPoints. Based on the learned RRoIs, PointReg performs corner point refinement for more accurate detection. In addition, aerial images are often severely imbalanced across categories, an issue that existing methods largely ignore. In this paper, we also experimentally verify that re-sampling images of rare categories stabilizes training and further improves detection performance. Experiments demonstrate that our Point RCNN achieves new state-of-the-art detection performance on commonly used aerial datasets, including DOTA-v1.0, DOTA-v1.5, and HRSC2016.


Introduction
Although object detection has achieved significant progress in natural images, rotated object detection in aerial images remains challenging due to the arbitrary orientations, large scale and aspect ratio variations, and extreme density of objects [40]. Rotated object detection aims at predicting a set of oriented bounding boxes (OBBs) and the corresponding classes in an aerial image, which serves as an essential step in many applications, e.g., urban management, emergency rescue, and precision agriculture [9]. Modern rotated object detectors can be divided into two categories in terms of the representation of the OBB: angle-based detectors and angle-free detectors.
In angle-based detectors, the OBB of a rotated object is usually represented as a five-parameter tuple (x, y, w, h, θ). Most existing state-of-the-art methods are angle-based detectors relying on two-stage RCNN frameworks [8,16,20,41,48]. Generally, these methods use an RPN to generate horizontal or rotated RoIs, then use a designed RoI pooling operator to extract features from these RoIs. Finally, an RCNN head predicts the OBB and the corresponding class. Compared to two-stage detectors, one-stage angle-based detectors [15,31,44,45,52] directly regress the OBB and classify it based on dense anchors for efficiency.
• We propose PointRPN and PointReg to reformulate angle prediction as the more straightforward points regression. Both of them are angle-free and have consistent parameter units. We further propose to resample images of rare categories to stabilize training and improve overall performance.
• Compared with the state-of-the-art methods, our Point RCNN framework attains higher detection performance on several large-scale datasets.

Horizontal Object Detection
In the past decade, object detection has become an important computer vision task and has received considerable attention. One line of research focuses on two-stage detectors [3,11,12,17,18,25,35], which first generate a sparse set of Regions of Interest (RoIs) with a Region Proposal Network (RPN), and then perform classification and bounding box regression. While two-stage detectors still attract much attention, another line of research develops efficient one-stage detectors [10,22,26,28,34,37,50], among which SSD [28] and YOLO [34] are the fundamental methods that use a set of pre-defined anchor boxes to predict object categories and anchor box offsets. Recently, some anchor-free methods [10,22,50] detect objects by predicting center, corner, or representative points, which also inspires us to develop an angle-free detector for rotated objects.

Rotated Object Detection
In terms of the representation of the oriented bounding box (OBB), modern rotated object detectors can be mainly divided into two categories: angle-based detectors and angle-free detectors.
Angle-based detectors detect rotated objects by learning a five-parameter OBB (x, y, w, h, θ), in which (x, y, w, h) denotes a horizontal bounding box and θ denotes the angle between the longer edge and the horizontal axis. RRPN [31] and R²PN [52] make use of multiple rotated anchors with different angles, scales, and aspect ratios, which improves the performance at the cost of increased computational complexity (see Fig. 1(a)). R²CNN [20] proposes to detect horizontal and rotated bounding boxes simultaneously with multi-task learning. RoI Transformer [8] proposes a rotated RoI (RRoI) learner to transform a horizontal RoI into an RRoI, which provides more accurate RRoIs with a complex pipeline (see Fig. 1(b)). SCRDet [48] enhances features with an attention module and proposes an IoU-smooth L1 loss to alleviate the loss discontinuity issue. CSL [45] reformulates angle prediction from regression to classification to alleviate the discontinuous boundary problem. GWD [46] and KLD [49] propose more efficient loss functions for OBB regression. S²A-Net [15] proposes a single-shot alignment network to realize full feature alignment and alleviate the inconsistency between regression and classification. Recently, ReDet [16] proposes a rotation-equivariant network to encode rotation equivariance explicitly and presents rotation-invariant RoI Align to extract rotation-invariant features. Oriented R-CNN [41] proposes a two-stage detector that consists of an oriented RPN for generating RRoIs and an oriented RCNN head for refining them. Both ReDet and Oriented R-CNN provide promising accuracy.
However, the boundary problem in angle regression still causes unstable training and limits the performance. While angle-based detectors still find many applications, angle-free methods are receiving more and more attention from the community.

Figure 2. The proposed angle-free Point RCNN framework for rotated object detection. Point RCNN mainly consists of two modules: PointRPN for generating rotated proposals, and PointReg for refining them for more accurate detection. "RRoI" denotes rotated RoI, "FC" denotes fully-connected layer, "C" and "B" represent the predicted category and rotated box coordinates of each RRoI, respectively.

Angle-free detectors reformulate rotated object regression as learning an eight-parameter OBB (x1, y1, x2, y2, x3, y3, x4, y4), which represents the four corner points of a rotated object. ICN [1] proposes to directly estimate the four vertices of a quadrilateral to regress an oriented object based on image pyramids and feature pyramids. RSDet [33] and Gliding Vertex [42] achieve more accurate rotated object detection via direct quadrilateral regression. Recently, BBAVectors [51] extends the horizontal keypoint-based object detector to the oriented object detection task. CFA [13] proposes a convex-hull feature adaptation approach for configuring convolutional features. Compared to angle-based methods, angle-free detectors are more straightforward and can alleviate the boundary problem to a large extent. However, their performance is still relatively limited.
In this paper, we propose an effective angle-free framework for rotated object detection, i.e., Point RCNN, which mainly consists of PointRPN and PointReg. Compared with other RRoI generation methods, our PointRPN generates accurate RRoI in an anchor-free and angle-free manner (see Fig. 1(c)).

Point RCNN
The overall structure of our Point RCNN is depicted in Fig. 2. We start by revisiting the boundary discontinuity problem of angle-based detectors. Then, we describe the overall pipeline of Point RCNN. Finally, we elaborate on the PointRPN and PointReg modules, and propose a balanced dataset strategy to rebalance long-tailed datasets during training.

Boundary Discontinuity
The boundary problem [43,47] is a long-standing issue in angle-based detectors. Take the commonly used five-parameter OBB representation (x, y, w, h, θ) as an example, where (x, y) denotes the center coordinates, (w, h) denotes the shorter and longer edges of the box, and θ denotes the angle between the longer edge and the horizontal axis. As shown in Fig. 3, when the target box is approximately square, a slight variation in edge length may cause w and h to swap, leading to an abrupt change of π/2 in the angle θ.
This boundary discontinuity issue in angle prediction will confuse the optimization of the network and limit the detection performance.
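The w/h swap described above can be reproduced with a tiny sketch. The longer-edge angle convention used here is one common parameterization, not necessarily the exact one used by every detector:

```python
import math

def to_longer_edge_obb(w, h, theta):
    """Normalize a five-parameter OBB so that h is the longer edge and
    theta is measured against it. When w and h swap roles, the angle
    jumps by pi/2 -- the boundary discontinuity."""
    if w > h:
        w, h = h, w
        theta += math.pi / 2
    return w, h, theta

# A nearly square target: a ~1% change in one edge length...
_, _, t1 = to_longer_edge_obb(10.0, 10.1, 0.0)
_, _, t2 = to_longer_edge_obb(10.1, 10.0, 0.0)
print(abs(t2 - t1))  # ...changes theta by pi/2 (about 1.5708)
```

A regression target that jumps by π/2 under a sub-pixel perturbation of the box is exactly what makes angle-based losses unstable near the boundary.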

Overview
The overall pipeline of Point RCNN is shown in Fig. 2. During training, the Backbone-FPN first extracts feature maps from an input image. Then, PointRPN performs representative points regression and generates pseudo OBBs as rotated RoIs (RRoIs). Finally, for each RRoI, PointReg refines the corner points and classifies the RRoI to produce the final detection results. In addition, we propose to resample images of rare categories to stabilize training and improve the overall performance.
The overall training objective is described as:

L = L_PointRPN + L_PointReg,

where L_PointRPN denotes the losses in PointRPN, and L_PointReg denotes the losses in PointReg. We will describe them in detail in the following sections.

PointRPN
Existing rotated object detection methods generate rotated proposals indirectly by transforming the outputs of an RPN [36] and suffer from the boundary discontinuity problem caused by angle prediction. For example, [8,16] use the RoI Transformer to convert horizontal proposals to rotated proposals with an additional angle prediction task. Unlike these methods, in this paper, we propose to directly predict the rotated proposals with representative points learning. The learning of points is more flexible, and the distribution of points can reflect the angle and size of the rotated object. The boundary discontinuity problem can thus be alleviated without angle regression.
Representative Points Prediction. Inspired by RepPoints [50] and CFA [13], we propose PointRPN to predict the representative points in the RPN stage. The predicted points can effectively represent the rotated box and can be easily converted to rotated proposals for the subsequent RCNN stage.
As shown in Fig. 4, PointRPN learns a set of representative points for each feature point. To make the features better adapt to representative points learning, we adopt a coarse-to-fine prediction manner, in which the features are refined with DCN [7] using the offsets predicted in the initial stage. For each feature point located at (x, y), the predicted representative points of the two stages are:

R_init = {(x + Δx_k, y + Δy_k)}, k = 1, ..., K,
R_refine = {(x + Δx_k + Δx'_k, y + Δy_k + Δy'_k)}, k = 1, ..., K,

where K denotes the number of predicted representative points (we set K = 9 by default), {(Δx_k, Δy_k)} denote the offsets learned in the initial stage, and {(Δx'_k, Δy'_k)} denote the learned offsets in the refine stage.
Label Assignment. PointRPN predicts representative points for each feature point in the initial and refine stages. This section will describe how we determine the positive samples among all feature points for these two stages.
For the initial stage, we project each ground-truth box to the corresponding feature level according to its area, and then select the feature point closest to its center as the positive sample. The rule used for projecting the ground-truth box b*_i to feature level l_i is defined as:

l_i = ⌊log2(√(w_i · h_i) / s)⌋,

where s is a hyper-parameter set to 16 by default, and w_i and h_i are the width and height of b*_i. For the refine stage, we use the representative points predicted in the initial stage to help determine the positive samples. Specifically, for each feature point with its corresponding prediction R_init, if the maximum convex-hull IoU (defined in Eq. (6)) between R_init and the ground-truth boxes exceeds the threshold τ, we select this feature point as a positive sample. We set τ = 0.1 in all our experiments.
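The level-projection rule can be sketched as follows. Only the floor(log2(·)) mapping with s = 16 comes from the text; the clamped level range is our assumption for a five-level feature pyramid:

```python
import math

def assign_feature_level(w, h, s=16, min_level=0, max_level=4):
    """Project a ground-truth box of size (w, h) onto an FPN level via
    l = floor(log2(sqrt(w * h) / s)). Larger boxes land on coarser
    levels; the [min_level, max_level] clamp is an assumption."""
    l = int(math.floor(math.log2(math.sqrt(w * h) / s)))
    return max(min_level, min(max_level, l))
```

For example, a 64 x 64 box maps to level 2 (sqrt(64 * 64) / 16 = 4, log2(4) = 2), while very large boxes are clamped to the top level.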
Optimization. The optimization of the proposed PointRPN is driven by a classification loss and rotated object localization losses. The learning objective is formulated as:

L_PointRPN = λ1 · L_loc^init + λ2 · L_cls^refine + λ3 · L_loc^refine,

where λ1, λ2, and λ3 are trade-off parameters set to 0.5, 1.0, and 1.0 by default, respectively. L_loc^init denotes the localization loss of the initial stage. L_cls^refine and L_loc^refine denote the classification loss and localization loss of the refine stage. Note that the classification loss is only calculated in the refine stage, and the two localization losses are only calculated for positive samples.
In the initial stage, the localization loss is computed between the convex hulls converted from the learned points R_init (see the initial stage in Fig. 4) and the ground-truth OBBs. We use the convex-hull GIoU loss [13] to calculate the localization loss:

L_loc^init = (1 / N_pos^0) · Σ_i (1 − CIoU(Γ(R_i^init), Γ(b*_i))),

where N_pos^0 indicates the number of positive samples of the initial stage and b*_i is the matched ground-truth OBB. CIoU represents the convex-hull GIoU between the two convex hulls Γ(R_i^init) and Γ(b*_i), which is differentiable and can be calculated as:

CIoU(Γ(R_i^init), Γ(b*_i)) = IoU(Γ(R_i^init), Γ(b*_i)) − |P_i \ (Γ(R_i^init) ∪ Γ(b*_i))| / |P_i|,

where the first term denotes the convex-hull IoU, and P_i denotes the smallest enclosing convex object of Γ(R_i^init) and Γ(b*_i). Γ(·) denotes the Jarvis March algorithm [19] used to calculate the convex hull from points.
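For illustration, the Γ(·) operator (Jarvis March, a.k.a. gift wrapping) and the polygon area needed for the IoU terms can be written in a few lines of plain Python. This is a generic textbook implementation, not the authors' code, and it does not specially handle collinear points:

```python
def convex_hull(points):
    """Jarvis March (gift wrapping): returns the convex hull of a set
    of 2-D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); < 0 means b is clockwise of a
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull, p = [], pts[0]  # the leftmost point is always on the hull
    while True:
        hull.append(p)
        q = pts[0] if pts[0] != p else pts[1]
        for r in pts:
            if cross(p, q, r) < 0:
                q = r  # r is "more clockwise": wrap around it instead
        p = q
        if p == hull[0]:
            break
    return hull

def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1]
            - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))
    return abs(s) / 2.0
```

In practice the hull and the intersection/enclosing areas are computed with differentiable surrogates so the loss can back-propagate to the point offsets.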
The learning of the refine stage, which is responsible for outputting more accurate rotated proposals, is driven by both classification and localization losses. L_cls^refine is a standard focal loss [26], which for a single prediction can be written as FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the target class and α_t and γ are the focusing hyper-parameters. With the refined representative points, a pseudo OBB is computed using the MinAreaRect function of OpenCV [2], which is then used to generate RRoIs for PointReg.
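A minimal scalar version of the focal loss term can be written as follows; the α and γ defaults are the RetinaNet values, since this excerpt does not state the exact hyper-parameters used:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction:
    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted foreground probability; y: label in {0, 1}."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)   # numerical safety for log
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Well-classified examples (p_t close to 1) are down-weighted by the (1 − p_t)^γ factor, which keeps the many easy negatives from dominating the classification loss.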
As illustrated in Fig. 6, our PointRPN can automatically learn extreme points and semantic key points of rotated objects.

PointReg
Corner Points Refinement. The rotated proposals generated by PointRPN already provide a reasonable estimate of the target rotated objects. To avoid the problems caused by angle regression and to further improve the performance, we instead refine the four corners of the rotated proposals in the RCNN stage. As shown in Fig. 5, with the rotated proposals as input, we use an RRoI feature extractor [8,16] to extract RRoI features. Then, two consecutive fully-connected and ReLU layers encode the RRoI features. Finally, two fully-connected layers predict the class probability P and the refined corners C of the corresponding rotated object. The refined corners are represented as:

C = {(x_i + Δx_i, y_i + Δy_i)}, i = 1, ..., 4,

where {(x_i, y_i)} denote the corner coordinates of the input rotated proposal, and {(Δx_i, Δy_i)} denote the predicted corner offsets.
Instead of directly performing angle prediction, we refine the four corners of the input rotated proposals. Corner points refinement has three advantages: (1) it alleviates the boundary discontinuity problem caused by angle prediction; (2) the parameter units are consistent among the eight parameters; (3) it enables improving localization accuracy in a coarse-to-fine manner.
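The corner-refinement step itself is a simple per-corner shift; a sketch, assuming offsets in absolute pixels (an implementation may instead normalize them by the RoI size):

```python
def refine_corners(corners, offsets):
    """Shift each corner of a rotated proposal by its predicted offset.
    corners / offsets: four (x, y) and (dx, dy) pairs."""
    return [(x + dx, y + dy)
            for (x, y), (dx, dy) in zip(corners, offsets)]
```

Because every output is a coordinate in pixels, all eight regression targets share the same unit, unlike the mixed (pixel, angle) units of five-parameter OBB regression.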
We can easily extend PointReg to cascade structure for better performance. As shown in Fig. 2, in the cascade structure, the refined rotated proposals of the previous stage are used as the input of the current stage.
Optimization. The learning of PointReg is driven by a classification loss and a rotated object localization loss:

L_PointReg = μ1 · L_cls + μ2 · L_loc,

where μ1 and μ2 are trade-off coefficients, both set to 1.0 by default. L_cls indicates the classification loss, which is a cross-entropy loss:

L_cls = −(1/N) · Σ_{i=1}^{N} Σ_{c=0}^{C} Y_{i,c} · log P_{i,c},

where N denotes the number of training samples in PointReg, C is the number of categories excluding background, and P_{i,c} is the predicted probability that the i-th RRoI belongs to class c. Y_{i,c} = 1 if the ground-truth class of the i-th RRoI is c; otherwise it is 0. L_loc represents the localization loss between the refined corners and the corners of the ground-truth OBB. We use an L1 loss to optimize the corner refinement:

L_loc = (1/N_pos) · Σ_i |C_i − ϑ(b*_i)|,

where C_i = {(x_j, y_j)}_{j=1}^4 denotes the refined corners of the i-th rotated proposal, b*_i = {(x*_j, y*_j)}_{j=1}^4 denotes the corners of the matched ground-truth OBB, and ϑ(b*_i) denotes the permutation of the four corners of b*_i with the smallest L1 loss |C_i − ϑ(b*_i)|. Note that L_loc is only calculated for positive training samples.
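The ϑ(b*) operator can be sketched as a search over orderings of the ground-truth corners. Matching over the four cyclic shifts is our assumption; the text only states that the permutation with the smallest L1 loss is used:

```python
def corner_l1_loss(pred, gt):
    """L1 loss between predicted corners and the best-matching ordering
    of the ground-truth corners (the theta(b*) operator).
    pred, gt: lists of four (x, y) corner pairs."""
    def l1(a, b):
        return sum(abs(ax - bx) + abs(ay - by)
                   for (ax, ay), (bx, by) in zip(a, b))
    # candidate orderings: the four cyclic shifts of the gt corners
    candidates = [gt[k:] + gt[:k] for k in range(4)]
    return min(l1(pred, c) for c in candidates)
```

This makes the loss invariant to which ground-truth corner is labeled "first", so the network is not penalized for predicting the same quadrilateral with a rotated corner ordering.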

Balanced Dataset
The extremely nonuniform object densities of aerial images usually make the dataset long-tailed, which may cause the training process to be unstable and limit the detection performance. For instance, DOTA-v1.0 contains 52,516 ship instances but only 678 ground track field instances [9]. To alleviate this issue, we resample the images of rare categories, inspired by [14]. More concretely, first, for each category c ∈ C, we compute the fraction F_c of images that contain this category. Then, we compute the category-level repeat factor for each category:

r_c = max(1, sqrt(β_thr / F_c)),

where β_thr is a threshold indicating that there is no oversampling for categories with F_c > β_thr. Finally, we compute the image-level repeat factor r_I for each image I:

r_I = max_{c ∈ C_I} r_c,

where C_I denotes the set of categories contained in I. In other words, images that contain long-tailed categories have a greater chance of being resampled during training.
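The two formulas above can be sketched as a small helper; the β_thr = 0.3 default matches the best setting reported in the ablation:

```python
import math
from collections import defaultdict

def repeat_factors(image_categories, beta_thr=0.3):
    """Image-level repeat factors for category-rebalanced sampling.
    image_categories: one set of category names per image.
    Returns a repeat factor per image."""
    n = len(image_categories)
    count = defaultdict(int)
    for cats in image_categories:
        for c in cats:
            count[c] += 1
    # F_c = count[c] / n; r_c = max(1, sqrt(beta_thr / F_c)),
    # so frequent categories (F_c > beta_thr) keep r_c = 1
    r_c = {c: max(1.0, math.sqrt(beta_thr * n / k))
           for c, k in count.items()}
    # r_I = max over the categories present in the image
    return [max((r_c[c] for c in cats), default=1.0)
            for cats in image_categories]
```

Each image is then drawn roughly r_I times per epoch (fractional parts handled stochastically), so images containing rare categories appear more often while common-category images are sampled once.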

Datasets
To evaluate the effectiveness of our proposed Point RCNN framework, we perform experiments on two popular large-scale datasets: DOTA [40] and HRSC2016 [30].
DOTA [40] is the largest dataset for oriented object detection and has multiple versions. Among them, DOTA-v1.5 is a more challenging version, which introduces a new category, Container Crane (CC), and many more small instances.

Implementation Details
We implement Point RCNN using the MMDetection toolbox [6]. We follow ReDet [16] and use ReResNet with ReFPN as our backbone (ReR50-ReFPN), which has shown the ability to extract rotation-equivariant features. We also verify with the more general transformer backbone (Swin-Tiny) to show the generalization and scalability of our Point RCNN.
On the DOTA dataset, following previous methods [8,15,16], we crop the images to 1024 × 1024 with a stride of 824 pixels, and we also resize the images to three scales {0.5, 1.0, 1.5} for multi-scale data. Random horizontal flipping and random rotation ([−45°, 45°]) are adopted for multi-scale training. On the HRSC2016 dataset, following the previous method [16], we resize all the images to (800, 512), and random horizontal flipping is applied during training. Unless otherwise specified, we train all the models for 19 epochs on DOTA and 36 epochs on HRSC2016. Specifically, we train the models using AdamW [21] (β1 = 0.9, β2 = 0.999) on 8 Tesla V100 GPUs with an initial learning rate of 0.0002, a weight decay of 0.05, and a mini-batch size of 16 (2 images per GPU). The learning rate decays by a factor of 10 at each decay step.

Main Results
We compare our Point RCNN framework with other state-of-the-art methods on three datasets: DOTA-v1.0, DOTA-v1.5, and HRSC2016. As shown in Tab. 1, Tab. 2, and Tab. 3, without bells and whistles, our Point RCNN demonstrates superior performance against state-of-the-art methods.
On DOTA-v1.5, which is more challenging than DOTA-v1.0, Point RCNN achieves 79.31 mAP, significantly improving the performance by 2.51%. With the more general transformer backbone Swin-T, Point RCNN further improves the performance by 0.83% (from 79.31 to 80.14). The results are reported in Tab. 2. On HRSC2016, as reported in Tab. 3, Point RCNN attains new state-of-the-art performance under both the VOC2007 and VOC2012 metrics.

Ablation Study
In this section, if not specified, all the models are trained only on the training and validation set with scale 1.0 for simplicity, and are tested using multi-scale testing. The metric mAP is evaluated on the DOTA-v1.5 test set and obtained by submitting prediction results to DOTA's evaluation server.

Effect of PointRPN
To analyze the effectiveness of PointRPN, we evaluate the detection recall of PointRPN on the validation set of DOTA-v1.5. For simplicity, we train the models on the training set only. We also analyze the impact of the oversampling threshold β_thr of the balanced dataset strategy. As shown in Tab. 6, we achieve the best detection accuracy of 77.60% at β_thr = 0.3. Therefore, we set β_thr = 0.3 in all other experiments on DOTA.

Factor-by-factor Experiment
To explore the effectiveness of each module of our proposed Point RCNN framework, we conduct a factor-by-factor experiment on the proposed PointRPN, PointReg, and balanced dataset strategy. The results are depicted in Tab. 7: each component has a positive effect, and combining all components obtains the best performance.

Table 7. Factor-by-factor ablation experiments. The detection performance is evaluated on the test set of the DOTA-v1.5 dataset.

Visualization Analysis
We visualize some detection results on the DOTA-v1.0 test set. Fig. 6 shows examples of the points learned by PointRPN, which indicates that PointRPN is capable of learning representative points of rotated objects. Fig. 7 shows final detection results of Point RCNN: the red points denote the corner points learned by PointReg, and the colored OBBs, converted by the MinAreaRect function of OpenCV, are the final results.

Limitations
Although experiments substantiate the superiority of Point RCNN over state-of-the-art methods, our method does not perform well enough on some categories, e.g., PL (Plane), which needs further exploration. Point RCNN also needs rotated NMS to remove duplicate results, which may mistakenly remove true positives. Transformer-based methods [4] may be a potential solution, which we leave as future work.

Conclusion
In this work, we revisit rotated object detection and propose a purely angle-free framework, named Point RCNN, which mainly consists of a PointRPN for generating accurate RRoIs and a PointReg for refining corner points based on the generated RRoIs. In addition, we propose a balanced dataset strategy to overcome the long-tailed distribution of object classes in aerial images. Extensive experiments on several large-scale benchmarks demonstrate the significant superiority of our proposed framework over state-of-the-art methods.