1. Introduction
Great performance has been achieved in object detection in natural images (Faster-RCNN [1], SSD [2], YOLO [3], RetinaNet [4], etc.), and object detection in aerial images has attracted more attention recently given the advances in remote sensing. Object detection in aerial images aims to locate objects of interest (e.g., vehicles and ships) on the ground and recognize their types. In the object detection of natural scenes, objects are generally observed from a horizontal view angle and labeled with horizontal bounding boxes. Aerial images are typically taken from a bird's eye view, as in DOTA [5] (shown in Figure 1) and HRSC2016 [6], which means that the objects are often small in size and arbitrarily oriented.
Specifically, the challenges in object detection in aerial images are analyzed with respect to the following:
Scale variations. Due to the varying spatial resolutions of sensors, the size of objects in aerial images is often small. Furthermore, there are shape variations within objects of the same category, which causes scale variation problems in detection.
Dense targets. It is typical that some targets in aerial images are densely arranged, such as ships in a harbor or vehicles in a parking lot. Dense scenes require methods to extract distinguishing features to identify each target.
Arbitrary orientations. Objects in natural scenes are generally oriented upward, while objects in aerial images are often oriented arbitrarily.
In addition, some real-time detection scenarios present further difficulties in aerial images. For example, detection on embedded devices on UAVs (Unmanned Aerial Vehicles) or satellites brings the challenge that computational complexity must be taken into consideration so that inference is less time-consuming.
In general, methods for object detection can be divided into two categories, namely two-stage methods and one-stage methods, usually distinguished by whether they regress objects directly (one-stage) or refine the detection results step by step (two-stage).
Benefiting from the work of R-CNN [7], some studies propose outstanding two-stage orientated object detectors (RRPN [8], TextBoxes++ [9], RoITrans [10], SCRDet [11], FOTS [12], etc.), which have achieved great performance on aerial image datasets such as DOTA and on natural scene-text detection datasets such as MSRA-TD500.
However, their higher computational complexity may not meet the efficiency required by real applications. Hence, some one-stage detectors [2,4,13,14] have been put forward to exploit the strength of fully convolutional networks (FCN) and feature pyramid networks (FPN) [15]. TextBoxes++ [9] effectively utilizes multi-layer features to detect orientated scene text. However, there still exists a feature misalignment problem between the receptive field and objects.
To address the problems of anchor-based methods outlined above, some one-stage anchor-free methods (CornerNet [16], ExtremeNet [17], CenterNet [18], FCOS [19], FoveaBox [20], etc.) have been shown to detect horizontal objects by keypoint detection or per-pixel prediction and have achieved extremely promising performance. Some studies have managed to detect orientated objects in such an anchor-free fashion. By the power of deep neural networks, these anchor-free methods have shown potential in the trade-off between computational complexity and performance.
In this paper, we propose a new one-stage anchor-free orientated object detector for aerial images. The method works in a per-pixel prediction fashion to predict the axis of objects, i.e., the line that connects the head and tail of an object, while its width is perpendicular to the axis. In addition, a new aspect-ratio-aware orientation centerness (OriCenterness) method is proposed to better weigh the importance of positive pixel points so as to guide the network to distinguish foreground objects from a complex background. The proposed method was evaluated on the public aerial image datasets DOTA [5] and HRSC2016 [6], achieving better performance than most other one-stage methods and many two-stage anchor-based detectors. It shows potential to be applied in real-time detection situations, with less computational complexity than anchor-based methods. Our contributions are as follows:
We propose a new one-stage anchor-free detector for orientated objects, which locates objects by predicting their axis and width. This detector not only simplifies the format of detection but also avoids elaborate hyperparameters and reduces the computational complexity compared with anchor-based methods.
We design a new aspect-ratio-aware orientation centerness method to better weigh the importance of positive pixel points in labeled boxes of different scales and aspect ratios, so that the method is able to learn discriminative features to distinguish foreground objects from a complex background.
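The axis-plus-width representation above can be sketched in a few lines. This is a minimal illustration, not the paper's exact decoding: `axis_to_corners` is a hypothetical helper, and we assume a box is encoded as a head point, a tail point, and a width measured perpendicular to the head-to-tail line.

```python
import numpy as np

def axis_to_corners(head, tail, width):
    """Convert an axis (head -> tail line) plus a width into the four
    corners of an oriented box. Illustrative only; the actual decoding
    in the detector may differ."""
    head = np.asarray(head, dtype=float)
    tail = np.asarray(tail, dtype=float)
    axis = tail - head
    # Unit normal, perpendicular to the axis.
    normal = np.array([-axis[1], axis[0]]) / np.linalg.norm(axis)
    half = 0.5 * width * normal
    # One corner on each side of the head and of the tail.
    return np.stack([head + half, tail + half, tail - half, head - half])

# A horizontal axis of length 10 with width 4 yields a 10 x 4 box.
print(axis_to_corners((0, 0), (10, 0), 4))
```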
3. Results
In this section, we first compare the proposed method with several published one-stage and two-stage orientated detectors separately, as shown in Table 2 and Table 3. The results in Table 2 show that our method performs better than the one-stage anchor-based orientated detectors based on SSD [2], YOLO [35], or RetinaNet-R [4]. When compared with the one-stage anchor-free detector IENet, our method outperforms it by 8.84% mAP. The best performance was achieved by our method on Ground Track Field (GTF), Small Vehicle (SV), Ship, Storage Tank (ST), Roundabout (RA), and Swimming Pool (SP). Visualizations of detection results on DOTA are given in Figure 4.
We also compared the proposed method with several two-stage orientated detectors, as shown in Table 3. The results show that our method performs better in mAP than many two-stage anchor-based methods such as FR-O [5], R-DFPN [36], R2CNN [37], and RRPN [8]. Although the method cannot achieve as good a performance as some two-stage anchor-based detectors such as ICN [38] and RoI Transformer [10], it still performs better than ICN in 6 of 15 categories (Baseball Diamond (BD), Small Vehicle (SV), Large Vehicle (LV), Ship, Storage Tank (ST), and Swimming Pool (SP)) and better than RoI Transformer in 4 of 15 categories (Basketball Court (BC), Storage Tank (ST), Roundabout (RA), and Swimming Pool (SP)).
We also evaluated the proposed method on the HRSC2016 dataset and compared it with several two-stage anchor-based and one-stage anchor-free orientated detectors, as shown in Table 4. The method achieves 73.91% mAP without OriCenterness and 78.15% after adding OriCenterness, and the comparisons show that our method performs better than some two-stage anchor-based methods such as BL2 [21] and R2CNN [37]. When compared with a one-stage anchor-free detector such as IENet, our method outperforms it by 3.14% mAP. In addition, there is a 4.24% increase in performance after using OriCenterness, as shown in Table 4. Visualizations of the detection results on HRSC2016 are given in Figure 5.
4. Discussion
4.1. Effectiveness of OriCenterness
As discussed in Section 2.3.4, OriCenterness alleviates the problem that the original centerness drops sharply from the target's center to the edge for large-aspect-ratio objects, and it can better weigh the importance of positive pixel points to guide the network to learn discriminative features. We conducted an ablation study on DOTA to prove the effectiveness of OriCenterness. As shown in Table 5, a checkmark in the OriCenterness column denotes that we adopted OriCenterness, and a short dash denotes that we adopted the original centerness of [19], adapted to orientated objects. The results show a 3.76% increase in mAP after using OriCenterness with the ResNet50 backbone. When the backbone is ResNet101, there is a 0.48% increase in mAP after using OriCenterness. For objects such as bridges, harbors, and some ships, whose aspect ratios are usually large, there is a substantial increase in performance.
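For reference, the original centerness of FCOS [19] is sqrt((min(l, r)/max(l, r)) · (min(t, b)/max(t, b))), where l, t, r, b are the distances from a pixel to the four box sides. The sketch below contrasts it with one plausible aspect-ratio-aware weighting; the exponent scheme here is an illustrative assumption, not the paper's exact OriCenterness formula.

```python
import math

def fcos_centerness(l, t, r, b):
    """Original FCOS centerness: it falls off quickly along the long
    side of elongated boxes."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def ori_centerness(l, t, r, b):
    """Aspect-ratio-aware variant (illustrative sketch): soften the
    falloff along the long axis by raising that side's ratio to
    1/aspect_ratio, so pixels far from the center of an elongated
    object keep a high weight."""
    w, h = l + r, t + b
    ar = max(w, h) / min(w, h)  # aspect ratio >= 1
    rx = (min(l, r) / max(l, r)) ** ((1.0 / ar) if w >= h else 1.0)
    ry = (min(t, b) / max(t, b)) ** ((1.0 / ar) if h > w else 1.0)
    return math.sqrt(rx * ry)

# A pixel near the tail of a 100 x 10 box: the original centerness
# collapses, while the aspect-ratio-aware weight stays high.
print(fcos_centerness(90, 5, 10, 5))  # ~0.33
print(ori_centerness(90, 5, 10, 5))   # ~0.90
```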
Furthermore, we visualize the predictions of OriCenterness and the original centerness on the test set of DOTA in Figure 6. The first column shows images of a bridge, a harbor, and a storage tank from the testing data with their ground truth. The second column is a visualization of the original centerness adapted for orientated objects. The third column is a visualization of our proposed OriCenterness. Both prediction results are taken from F3 in the ResNet-101 FPN architecture with a resolution of 100 × 100, and the values range from 0 to 1; the higher the value, the closer the color is to red. The results in the figure show that our network with OriCenterness learns a more explicit significance for pixel points to distinguish the foreground and background compared with the original centerness adapted for orientated objects. Not only do objects with a large aspect ratio, such as the bridge and harbor, obtain a more significant prediction, but the centerness prediction for some square objects, such as the storage tank, is also more significant.
4.2. Speed–Accuracy Trade-Off
Speed–accuracy trade-off results for our method on DOTA are shown in Table 6. The results show that the proposed method achieves a 2% improvement after substituting the ResNet50 backbone with ResNet101, with almost no additional computation during the inference stage. Results of other methods tested on different devices are also listed in Table 6. The anchor-based detector R3Det runs at 4 fps on a 2080Ti GPU, while the proposed method runs at 14 fps on a Titan Xp, whose compute performance is inferior to that of the 2080Ti. As for the one-stage anchor-free detector IENet, although that method has an advantage in inference speed, our method outperforms it by 8.8% mAP.
4.3. Advantages and Limitations
For anchor-based orientated detectors such as RetinaNet-R, hyperparameters relevant to anchors include the anchor base size, ratios, scales per feature level, angles, and the foreground and background IoU thresholds. To fit as many differently orientated objects as possible, the number of predefined anchors ranges from 45 (3 scales × 3 ratios × 5 angles) to 105 (3 scales × 5 ratios × 7 angles) at each pixel point of one feature level, and there are about 600,000 to 1,400,000 anchors in total for an 800 × 800 input resolution. The IoU between each anchor and each target must then be calculated during the training stage. Some exploratory experiments with the RetinaNet-R method on the DOTA dataset indicated that detection performance is sensitive to these anchor hyperparameters. For example, minor changes in the anchor base size and number of scales could bring about a 7% improvement in mAP.
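The anchor counts above can be checked with a few lines of arithmetic, assuming the usual five FPN levels with strides 8 to 128 and feature maps of size ceil(800/stride):

```python
import math

def total_anchors(img_size=800, strides=(8, 16, 32, 64, 128),
                  anchors_per_loc=45):
    """Total predefined anchors across FPN levels for a square input.
    Each level contributes ceil(img_size / stride)^2 locations."""
    locations = sum(math.ceil(img_size / s) ** 2 for s in strides)
    return locations * anchors_per_loc

# 45 anchors/location (3 scales x 3 ratios x 5 angles)  -> ~600k anchors
# 105 anchors/location (3 scales x 5 ratios x 7 angles) -> ~1.4M anchors
print(total_anchors(anchors_per_loc=45))   # 600435
print(total_anchors(anchors_per_loc=105))  # 1401015
```

An anchor-free per-pixel method, by contrast, handles only the ~13k locations themselves, with no IoU matching against a predefined anchor set.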
In contrast, the proposed anchor-free detector does not need such elaborately designed anchors. The results in Table 2, Table 3 and Table 4 show that this anchor-free method achieves competitive mAP compared with anchor-based methods on the DOTA and HRSC2016 datasets. When the proposed method is compared with other methods, better performance is achieved on Storage Tank (ST), Roundabout (RA), and Swimming Pool (SP). What these categories share is that their shapes are circular or square, which is likely to cause boundary discontinuity of the rotation angle: for anchor-based methods, the angle may change abruptly at the boundary of its defined range. This may cause unstable optimization during the training stage [11]. We avoid this problem by predicting the axis, which is determined directly by the label information, so that the target angle never has to be predicted explicitly.
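The boundary discontinuity can be made concrete with a toy angle metric. Two rectangles rotated by 89° and −89° are almost identical (a rectangle's orientation is 180°-periodic), yet a naive regression loss on the raw angle treats them as maximally different:

```python
def naive_angle_loss(pred_deg, target_deg):
    """L1 loss on the raw angle: blind to periodicity."""
    return abs(pred_deg - target_deg)

def wrapped_angle_diff(pred_deg, target_deg, period=180.0):
    """Smallest rotation (degrees) between two box orientations,
    accounting for the 180-degree periodicity of a rectangle."""
    d = (pred_deg - target_deg) % period
    return min(d, period - d)

# Two almost-coincident boxes on either side of the boundary:
print(naive_angle_loss(89.0, -89.0))    # 178.0 -- huge penalty
print(wrapped_angle_diff(89.0, -89.0))  # 2.0   -- true geometric gap
```

Predicting the axis endpoints directly, as our method does, removes the angle variable from the regression target and with it this discontinuity.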
There are also some limitations of this method. Firstly, the axis learning relies on high-quality label data, which requires that the labeled vertices of oriented boxes be arranged in clockwise order, with the first labeled point being the top-left corner of the box. However, there is no guarantee that all images are well labeled; in fact, there are some noisy labels within the DOTA dataset whose top-left corner is not the first point. We added a data calibration step in the preprocessing stage and found that it brings about a 3% improvement in mAP. On the one hand, additional information such as the center coordinate or angle of the label could be introduced to calibrate the noisy data. On the other hand, we are going to apply the method to natural scene object detection. Furthermore, the proposed method still falls short in mAP compared with other state-of-the-art orientated detectors. We aim to be less dependent on high-quality label data and will continue to improve the method in the future.
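As an illustration of such a calibration step, one could reorder the four labeled vertices so that they run clockwise starting from the point nearest the top-left. This is a minimal sketch of one possible calibration, using a coordinate-sum heuristic to pick the starting vertex:

```python
import numpy as np

def calibrate_quad(pts):
    """Reorder four labeled vertices of an oriented box to run clockwise
    (in image coordinates, y pointing down), starting from the vertex
    nearest the top-left. Illustrative sketch of a calibration step."""
    pts = np.asarray(pts, dtype=float)
    center = pts.mean(axis=0)
    # Ascending arctan2 order is clockwise on screen when y points down.
    angles = np.arctan2(pts[:, 1] - center[1], pts[:, 0] - center[0])
    pts = pts[np.argsort(angles)]
    # Rotate the cycle so the vertex with the smallest x + y comes first.
    start = np.argmin(pts.sum(axis=1))
    return np.roll(pts, -start, axis=0)

# A square labeled in a noisy order is normalized to a canonical order.
print(calibrate_quad([[10, 0], [0, 0], [0, 10], [10, 10]]))
```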
5. Conclusions
In this paper, we propose an effective one-stage anchor-free detector for aerial images. We conducted several experiments on the DOTA and HRSC2016 datasets to prove the effectiveness of one-stage anchor-free detection. The results show that our method achieves better mAP than most other one-stage orientated detectors, as well as many two-stage anchor-based orientated detectors, with fewer hyperparameters. The speed–accuracy trade-off results show that the proposed method is more computationally efficient than some anchor-based methods, which shows its potential for real-time detection, such as real-time inference on the embedded devices of UAVs or satellites. Furthermore, we propose a new OriCenterness to better weigh positive pixel points and guide the network to learn discriminative features from a complex background, which brings mAP improvements for objects with a large aspect ratio. While the method simplifies orientated object detection, there are some limitations, such as the requirement for high-quality label data and the remaining mAP gap to other state-of-the-art orientated detectors. In future work, we will continue to improve the method and explore its potential in real-time detection applications.