Rotated Object Detection with Circular Gaussian Distribution

Rotated object detection is a challenging task due to the difficulty of locating rotated objects and separating them effectively from the background. For rotation prediction, researchers have explored numerous regression-based and classification-based approaches to predict the rotation angle. However, both paradigms suffer from flaws that make accurate angle prediction difficult, such as the multi-solution and boundary issues, which limit the performance upper bound of detectors. To address these issues, we propose a circular Gaussian distribution (CGD)-based method for angular prediction. We convert the labeled angle into a discrete circular Gaussian distribution spanning a single minimal positive period, and let the model predict the distribution parameters instead of directly regressing or classifying the angle. To improve the overall efficiency of the detection model, we also design a rotated object detector based on CenterNet. Experimental results on various public datasets demonstrate the effectiveness and superior performance of our method. In particular, our approach achieves better results than state-of-the-art competitors, with improvements of 1.92% and 1.04% in AP on the HRSC2016 and DOTA datasets, respectively.


Introduction
Rotated object detection has emerged as a fundamental component of visual analysis across various types of images, including aerial images [1,2], panoramic images [3][4][5], and scene text [6]. It is a more general approach than traditional horizontal object detection [7,8]. As conventional Horizontal Bounding Boxes (HBBs) cannot tightly enclose oriented objects, Rotated Bounding Boxes (RBBs) have been introduced in recent works [9,10]. For rotated object detection, the rotation angle is a sensitive parameter, and even a small angular deviation can lead to a significant drop in Intersection over Union (IoU) between predicted boxes and ground truth. This effect is especially pronounced when the aspect ratio of an object is large. Therefore, accurate angular prediction is crucial for improving the performance of oriented object detectors.
The rotation angle is a numerical attribute with intrinsic periodicity, which gives rise to the multi-solution issue and the boundary issue [11], as illustrated in Figure 1a. Accurate prediction of this attribute is challenging. A rotated bounding box (RBB) is produced by rotating a horizontal bounding box (HBB) around its center, and the period of its rotation angle is 180° (long-edge representation). RBBs rotated by angles offset by several periods (e.g., 1° and 181°) are exactly identical. This means that multiple solutions exist in angular space, which increases the uncertainty of model optimization. When considering only a single period, RBBs rotated by a large positive angle and a large negative angle (e.g., 88° and −88°) are actually similar in pose. This may cause the model to take long detours during optimization, as noted in the analysis of GWD [12]. Researchers have explored numerous regression-based [3][4][5][12][13][14] and classification-based [15,16] approaches to predict the rotation angle. Naive regression predicts a continuous value and directly measures the numerical deviation of the angle. By itself, it handles neither the multi-solution issue nor the boundary issue. To remedy these natural imperfections, periodic constructions, such as trigonometric functions and modulus operators, have been introduced into the loss calculation. However, the angular values output by the model can still lie in any period; these methods merely decorate angular predictions by projecting them into a single period. Other integrated methods, such as GWD [12], ingeniously transform the rotated box into a continuous Gaussian distribution, but this is essentially equivalent to applying trigonometric functions to the optimization in a sophisticated way, which never completely avoids the previous shortcoming. Another appealing approach, classification, predicts a category of fine-grained angular ranges that can naturally aggregate angular values across periods into their equivalence classes, fixing the multi-solution issue. However, it still suffers from the boundary issue, as a result of treating different angle ranges indiscriminately and completely ignoring the near-far relation between angle ranges/categories. Some improved versions, such as CSL [15] and VGL [16], utilize window functions to smooth category labels, but they do not completely eliminate the defects of classification. In a word, it is difficult to accurately predict angles through the above paradigms due to these issues, which limits the performance upper bound of detectors.
To address the above issues, we propose a circular Gaussian distribution (CGD)-based method for angular prediction, as shown in Figure 1b. Specifically, we convert the labeled angle into a discrete circular Gaussian distribution spanning a single minimal positive period as the new ground truth, and let the model predict the distribution parameters instead of directly regressing or classifying the angle. We then compute the loss as the Kullback-Leibler divergence between the predicted distribution and the ground-truth distribution. Here, the Gaussian distribution reasonably reflects the adjacency between angles, and each angle is assigned a probability based on its offset from the actual angle of the object. In this way, the angular distribution as a whole effectively overcomes the disadvantage of classifying each angular bin in isolation. The circularization of the Gaussian solves the boundary issue, and discretization avoids the multi-solution issue. Additionally, we design a rotated object detector based on CenterNet [17,18] to improve the overall efficiency of detection.
In summary, the main contributions of this paper are as follows:
1. We propose a new paradigm for angular prediction, namely CGD, which effectively avoids the shortcomings of previous approaches.
2. We design a rotated object detector based on CenterNet, which improves the overall efficiency of the detection model.
3. We conduct extensive experiments on various public datasets to verify the effectiveness and superior performance of our approach.

Related Work
In this section, we introduce related works, including horizontal object detection, rotated object detection, and loss functions for rotated object detection.

Horizontal Object Detection Method
Due to the development of CNNs, significant progress has been made in horizontal box-based object detection over the past few years. Currently, object detectors can generally be classified into two paradigms: two-stage detectors and one-stage detectors.
RCNN [19] was the first two-stage detector, in which the first stage generates candidate boxes through a selective search algorithm and the second stage uses CNNs to extract features from the candidate boxes. Fast-RCNN [20] improved on RCNN by extracting region-of-interest (RoI) features from shared feature maps, saving the computational cost of running the backbone network separately for each candidate box. Faster-RCNN [7] introduced the region proposal network (RPN) for candidate box generation, allowing the entire network to be trained end-to-end. To detect objects of different scales, the feature pyramid network (FPN) [21] was proposed to establish a pyramidal hierarchical structure, which effectively improves the performance of the detector.
One-stage detectors remove the RoI extraction process of two-stage detectors and directly perform bounding box regression and classification. Early representative one-stage detectors include SSD and YOLOv1. SSD densely places anchor boxes on input images, while YOLOv1 divides input images into grids of different sizes. To address the class imbalance issue, the focal loss was designed in RetinaNet [8] to dynamically adjust the weight of each anchor box. FCOS further improves on RetinaNet by removing predefined anchor boxes and directly regressing and classifying reference points. CenterNet [17] regresses the width and height of bounding boxes from the center of objects. During inference, CenterNet does not use the NMS algorithm, thus improving the inference speed of the detector.

Rotated Object Detection Method
Rotated object detection is a relatively new task in the field of object detection that has been widely applied to aerial image and panoramic image object detection. Unlike horizontal object detectors, rotated detectors generally use rotated rectangles as bounding boxes because rotated boxes can better enclose oriented objects. To solve the mismatch problem in densely packed object scenarios, RoI Transformer [1] applies a spatial transformation to RoIs. Oriented RCNN [22] designs an oriented RPN to directly generate high-quality oriented candidate boxes at almost no cost. S2A-Net [23] proposes a feature alignment module to obtain high-quality anchor points. Based on Faster-RCNN, CAD-Net [24] proposes a context-aware detection network to learn global and local contexts in images. SCRDET++ [25] designs a new feature map-based instance-level denoising module for detecting small, cluttered, and rotated objects.

Circular Gaussian Distribution Construction
In this section, we present our approach. First, we adopt a discrete circular Gaussian distribution spanning a single minimal positive period as the ground truth of the angle. The circularization of the Gaussian solves the boundary problem, and discretization avoids the multi-solution problem. For instance, an original angle label (−89°) can be used to generate a circular Gaussian distribution over [−90°, 89°], as shown in Figure 1b. The loss of accuracy from converting the continuous angle to a discrete one is minimal in the rotation detection task, as analyzed in CSL [15]. Next, the Gaussian distribution of the angle can be represented as a multi-dimensional vector D_t = {d_{−90}, d_{−89}, . . ., d_{89}}, with the l-th dimension given by

d_l = (1 / (√(2π) σ)) exp(−(l − θ_t)² / (2σ²)), l = −90, −89, . . ., 89,

where l denotes the l-th binned angle, θ_t is the binned ground-truth angle, and σ is the standard deviation of the Gaussian distribution. For the hyperparameter σ, we experimentally find that its selection is reliable within a particular range; details are provided in Table 1. The training set can then be represented as {(I_i, D_t^i), 1 ≤ i ≤ B}, and the objective of model learning is to obtain a group of model parameters ω that produce a probability distribution F_p(I_i; ω) matching the label set.
Finally, the loss function quantifying the similarity between the predicted distribution F_p(I_i; ω) and the ground-truth distribution D_t^i is constructed using the Kullback-Leibler divergence. The objective of the label distribution learning is to minimize the following loss function:

L_ang = (1/B) Σ_{i=1}^{B} KL(D_t^i ‖ F_p(I_i; ω)),

where I_i is the i-th input image and B is the number of images in the batch.
Algorithm 1 provides the pseudo-code of circular Gaussian distribution (CGD).
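Since Algorithm 1 itself is not reproduced here, the label construction described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: 180 one-degree bins over [−90°, 90°), circularity implemented by wrapping the angular offset onto the period, and normalization so the vector sums to one; the paper's exact normalization may differ, and `cgd_label` is a hypothetical name.

```python
import numpy as np

def cgd_label(theta_t, sigma=6.0, num_bins=180):
    """Build a discrete circular Gaussian label for a ground-truth angle.

    theta_t : binned ground-truth angle in degrees, in [-90, 90).
    Returns a length-180 vector over bins -90, -89, ..., 89 whose mass
    wraps around the period, so -89 and 89 are treated as neighbors.
    """
    bins = np.arange(num_bins) - num_bins // 2            # -90 ... 89
    diff = bins - theta_t
    # Wrap the offset onto the ring of period 180 (circularization)
    diff = (diff + num_bins / 2) % num_bins - num_bins / 2
    dist = np.exp(-diff ** 2 / (2 * sigma ** 2))
    return dist / dist.sum()                              # normalize to sum to 1
```

For θ_t = −89°, the bin at 89° receives nearly as much probability mass as the bin at −88°, which is exactly the cross-boundary adjacency that plain one-hot classification labels ignore.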

Overall Architecture
Our overall framework is illustrated in Figure 2. Anchor-based detectors require the calculation of an angle for each anchor, and discretizing the angle for anchor-based detectors dramatically increases the computational complexity of the model. Therefore, we use the anchor-free detector CenterNet [17] as our baseline, which models an object as a single point (i.e., the center point of the bounding box) and predicts the center offset, object size, and angle. Specifically, the first branch detects the center points of the bounding boxes and predicts the offsets of the center points. The second branch determines the size (i.e., width w and height h) of the bounding box for each object. The last branch predicts the angle distribution of the bounding boxes.

Backbone
Different from the raw CenterNet, we use ResNet as the backbone and build a feature pyramid network (FPN). The FPN enhances a conventional convolutional network with a top-down pathway and lateral connections to effectively build a rich, multi-scale feature pyramid from a single-resolution input image.

Center Branch
For oriented objects in the input image, a heatmap is utilized to localize their center locations. Following the original CenterNet, a 2-D Gaussian kernel is adopted to produce a heatmap Y ∈ [0, 1]^{(W/4)×(H/4)×C}. The peak of the Gaussian distribution, which is the pixel at the center of the box, is treated as the positive sample, while every other pixel is treated as a negative sample. The element-wise maximum is used when two Gaussians of the same class overlap. The training objective is a modified focal loss:

L_cls = −(1/N) Σ_{xyc} { (1 − Ŷ_xyc)^α log(Ŷ_xyc),                 if Y_xyc = 1
                        (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc),   otherwise },

where α and β are hyper-parameters of the focal loss, and N is the number of keypoints in image I. We set α = 2 and β = 4 in all our experiments, following CenterNet. The model predicts an additional center offset O ∈ [0, 1]^{(W/R)×(H/R)×2} to remove the discretization error introduced by the output stride R. For the regression of the center offset, we use the Smooth L1 loss.
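This modified focal loss can be sketched directly in NumPy. The sketch below is a minimal illustration rather than the authors' implementation; `centernet_focal_loss` is a hypothetical name, and the target heatmap is assumed to hold Gaussian-splatted values with exact 1s at object centers.

```python
import numpy as np

def centernet_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """CenterNet-style modified focal loss over a heatmap.

    pred, gt : arrays of shape (H, W, C); gt holds Gaussian-splatted
    targets in [0, 1], with exact 1 at positive center pixels.
    """
    pos = gt == 1.0
    neg = ~pos
    # Positive pixels: down-weight easy (already confident) centers
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    # Negative pixels: (1 - gt)^beta softens the penalty near centers
    neg_loss = ((1 - gt[neg]) ** beta * pred[neg] ** alpha
                * np.log(1 - pred[neg] + eps)).sum()
    num_pos = max(pos.sum(), 1)                     # normalize by keypoint count
    return -(pos_loss + neg_loss) / num_pos
```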

Size Branch
In contrast to the raw CenterNet, which directly regresses the variables w and h, we indirectly regress log(w/R) and log(h/R) to lessen the effect of varied object aspect ratios. We use the Smooth L1 loss for the estimation of {log(w/R), log(h/R)}.
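As a small illustration, the size encoding and its inverse (used at decoding time) might look like the following sketch. `encode_size`/`decode_size` are hypothetical helper names, and the output stride R = 4 follows the heatmap resolution above.

```python
import numpy as np

def encode_size(w, h, stride=4):
    """Encode box width/height as logs of stride-normalized values."""
    return np.log(w / stride), np.log(h / stride)

def decode_size(lw, lh, stride=4):
    """Invert the log encoding back to pixel width/height."""
    return stride * np.exp(lw), stride * np.exp(lh)
```

The log transform compresses the dynamic range of sizes, so a regression error of a fixed magnitude corresponds to a relative (rather than absolute) size error, which is what lessens the impact of extreme aspect ratios.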

Angle Branch
For the object orientation, the model outputs an angle distribution F_p ∈ [0, 1]^{(W/4)×(H/4)×180}. The Kullback-Leibler divergence loss is then applied to measure the difference between the predicted and target distributions.
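A per-location version of this loss can be sketched as below, assuming the 180 raw scores are turned into a distribution with a softmax before the KL divergence is computed (the paper does not spell out this detail, so treat it as an assumption; `angle_kl_loss` is a hypothetical name).

```python
import numpy as np

def angle_kl_loss(pred_logits, target, eps=1e-12):
    """KL divergence from a target angle distribution to a predicted one.

    pred_logits : raw scores over the 180 angle bins (softmax applied here).
    target      : ground-truth circular Gaussian distribution (sums to 1).
    """
    z = pred_logits - pred_logits.max()        # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    # KL(target || p) = sum_l t_l * (log t_l - log p_l)
    return float(np.sum(target * (np.log(target + eps) - np.log(p + eps))))
```

At inference time, a single angle can be recovered from the predicted distribution, e.g., by taking the arg-max bin.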
Thus, the overall training objective of our model is

L = L_cls + λ_size L_size + λ_off L_off + λ_ang L_ang,

where L_cls, L_size, and L_off are the losses of center point recognition, scale regression, and offset regression, which are the same as in CenterNet, and λ_size, λ_off, and λ_ang are constant factors, set to 0.1 in our experiments.

DOTA
DOTA [9] is a large dataset dedicated to rotated object detection in aerial images. The size of each image varies from 800 × 800 to 4000 × 4000. The annotated DOTA contains 2806 aerial images, which include 188,282 instances of 15 object categories. The whole dataset is divided into 1411, 458, and 937 images for training, validation, and testing, respectively. Furthermore, the training images are cropped into patches of 1024 × 1024 pixels with an overlap of 256 pixels to fit the limited GPU memory.

HRSC2016
HRSC2016 [10] is an aerial image dataset for ship detection. The dataset contains 1061 images of ships from two scenarios: the open sea and inshore scenes at six famous harbors. The size of each image varies from 300 × 300 to 1500 × 900. HRSC2016 is divided into 436, 181, and 444 images for training, validation, and testing. Furthermore, the long side of each image is resized to a fixed size (e.g., 640 px) in the experiments; to keep structural information, the original aspect ratio of each image is preserved.
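The long-side resize with preserved aspect ratio can be sketched as a tiny helper (hypothetical name; rounding to integer pixels is an assumption):

```python
def resize_long_side(w, h, target=640):
    """Scale an image size so the longer side equals `target`,
    preserving the original aspect ratio."""
    scale = target / max(w, h)
    return round(w * scale), round(h * scale)
```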

Evaluation Metric
mAP is a classical metric for detection methods, and we use it in all our experiments to evaluate performance. For the DOTA dataset, we obtain the test results from the official evaluation server. For the HRSC2016 dataset, we report the VOC07 AP and VOC12 AP with an IoU threshold of 0.5. For the PANDORA dataset, we report mAP with an IoU threshold of 0.5.

Implementation Details
Our implementation is based on PyTorch and 8 NVIDIA GeForce RTX 3090 GPUs. We choose a ResNet model pretrained on ImageNet as our backbone. The models are optimized by Adam for 140 epochs, with the learning rate dropped by 10× at epochs 100 and 130 for all datasets. For all datasets, the batch size is set to 64, and the initial learning rates are set to 2 × 10^−4 and 1.25 × 10^−4, respectively. Data augmentations, including random graying, random flipping, and random rotation, are used to improve model performance. ResNet-50-FPN is used for the ablation studies.

Ablation Study
We design the following ablation studies to reduce the uncertainty introduced by the hyper-parameter and to verify the feasibility of the proposed approach. We use CenterNet-FPN as the detection model in the ablation study. Because the HRSC2016 dataset contains a large number of ships with large aspect ratios, accurate detection results are difficult to obtain; we therefore use HRSC2016 as the dataset for our ablations.

Influence of Different Hyper-Parameter Values
We studied the impact of the standard deviation σ of the Gaussian distribution by experimenting with different constant values. The results are shown in Table 1. We observe that the AP varies only slightly while adjusting σ within a specific range (from 4 to 10), which indicates that the choice of σ is robust in this range. The best performance is obtained when σ = 6, as shown in Table 1. Therefore, we fix σ to 6 in all subsequent experiments.

Effectiveness of CGD
We conducted a number of baseline experiments, including direct regression-based angle prediction (Smooth L1), indirect regression-based angle prediction (trigonometric functions), CSL-based angle classification, and our CGD approach, to verify the effectiveness of our method. The above methods share the same network structure except for the orientation branch. The results are shown in Table 2: using a regression method to directly predict the angle of the rotated object achieves 85.39% mAP07 and 90.25% mAP12 on the HRSC2016 test set. The indirect regression-based trigonometric-function loss improves on the direct regression approach by 2.43% mAP07 and 3.42% mAP12 by removing the discontinuous boundary problem caused by the angular periodicity. However, the angular value output by the model can still lie in any period, causing the multi-solution problem. The CSL-based algorithm achieves a better result than the indirect regression-based method, because it can naturally aggregate values across periods into their equivalence classes to fix the multi-solution issue; it achieves 89.98% mAP07 and 95.13% mAP12 on the HRSC2016 test set. As our CGD eliminates both the discontinuous boundary issue caused by the angular periodicity and the multi-solution issue caused by values outside the defined range, it achieves 90.52% mAP07 and 97.76% mAP12. Finally, Table 2 also reports the running speed of our CGD method and the baseline methods, measured as the time to process four images in one run on an NVIDIA GeForce RTX 3090 GPU. As shown in Table 2, our CGD method also has a speed advantage over the baselines (0.3912 ms vs. 0.4950 ms, 0.4606 ms, and 0.5910 ms). All experimental results demonstrate that the overall performance of CGD is superior to the baselines.

Comparison with State-of-the-Art Methods
To validate the effectiveness of our approach, we compared it against other state-of-the-art methods on the HRSC2016 and DOTA datasets. The performances of all methods were taken from a single model without cross-model test augmentation, such as model ensembles, by default.
The results on the HRSC2016 dataset are shown in Table 3. Our CGD obtains 90.61% and 98.14% mAP based on R-101-FPN when using the VOC07 and VOC12 metrics, respectively. These results are quite competitive compared with the most recent state-of-the-art methods, including DRN [13], GWD [12], DAL [14], VGL [16], and S2A-Net [23] (see Table 3). We visualize the detection results in Figure 3. The DOTA dataset contains numerous categories and complex scenes. We assessed the performance of state-of-the-art rotated object detection methods on the DOTA dataset and report the results of oriented detectors in Table 4. Our CGD surpasses existing advanced oriented detectors, achieving 76.41% mAP based on R-50-FPN and 77.34% mAP based on R-101-FPN. Our models also achieve the best results in some very challenging categories, such as bridge (BR), ship (SH), storage tank (ST), harbor (HA), and swimming pool (SP). We visualize some detection results in Figure 4.

Conclusions
In this paper, we analyzed the limitations of regression-based and classification-based oriented object detectors. To address their issues, we proposed a circular Gaussian distribution (CGD)-based method for angular prediction. Our method's key insight is to convert the labeled angle into a discrete circular Gaussian distribution spanning a single minimal positive period as the ground truth, and to let the model predict the distribution. To improve the overall efficiency of the detection model, we also designed an oriented object detector based on CenterNet. The experimental results on the challenging HRSC2016 and DOTA datasets indicate that the proposed CGD achieves superior performance over state-of-the-art competitors, with improvements of 1.92% and 1.04% in AP on the HRSC2016 and DOTA datasets, respectively. It is worth noting that the proposed CGD method can be applied to any task that requires angle prediction, including 3D object detection, panoramic object detection, and so on. In the future, we plan to extend the CGD method to additional tasks.

Figure 1 .
Figure 1. (a) The angular space is a 1-D periodic space with a multi-solution issue (marks of the same shape on different rings represent equivalent angles) and a boundary issue (adjacent marks of different shapes on the same ring are far apart in this space). Regression paradigms directly predict angles across multiple periods, while classification paradigms predict the equivalence classes to which angles belong. (b) Our proposed circular Gaussian distribution (CGD)-based method for angular prediction.

Figure 2 .
Figure 2. Overall architecture of our method for detecting rotated objects in images. We propose a circular Gaussian distribution (CGD)-based method for angular prediction. (Best viewed by zooming in.)

Figure 3 .
Figure 3. Visualization results of our method on the HRSC2016 dataset. (Best viewed by zooming in.)

Figure 4 .
Figure 4. Visualization results of our method on the DOTA dataset. (Best viewed by zooming in.)

Table 1 .
Comparison between different standard deviations σ for the Gaussian label on the HRSC2016 dataset. The CenterNet-FPN model is used as the detector.

Table 2 .
Comparison between different losses for the rotated bounding box on the HRSC2016 dataset. The CenterNet-FPN model is used as the detector.

Table 3 .
Comparison with state-of-the-art methods on the HRSC2016 dataset.

Table 4 .
Comparison with state-of-the-art methods on the DOTA dataset.