Multi-Oriented Object Detection in High-Resolution Remote Sensing Imagery Based on Convolutional Neural Networks with Adaptive Object Orientation Features

Abstract: In high-resolution earth observation systems, object detection in high spatial resolution remote sensing images (HSRIs) is the key technology for automatic extraction, analysis and understanding of image information. With respect to the multi-angle features of object orientation in HSRIs object detection, this paper presents a novel HSRIs object detection method based on convolutional neural networks (CNN) with adaptive object orientation features. First, an adaptive object orientation regression method is proposed to obtain object regions in any direction. In the adaptive object orientation regression method, five coordinate parameters are used to regress the object region with any direction. Then, a CNN framework for object detection of HSRIs is designed using the adaptive object orientation regression method. Using multiple object detection datasets, the proposed method is compared with some state-of-the-art object detection methods. The experimental results show that the proposed method can more accurately detect objects with large aspect ratios and densely distributed objects than state-of-the-art object detection methods that use a horizontal bounding box, and obtains better object detection results for HSRIs.


Introduction
In high-resolution earth observation systems, object detection in high spatial resolution remote sensing images (HSRIs) is the key technology for automatic extraction, analysis and understanding of image information [1][2][3]. It also plays an important role in the application of high-resolution earth observation systems to ocean monitoring, precision strike and military reconnaissance [4][5][6]. Object detection for HSRIs refers to the process of determining whether there are objects of interest and locating the objects of interest in the image [7]. In this paper, the objects detected are artificial geographical objects (e.g., storage-tanks, cars or airplanes) that have a clear boundary and have nothing to do with the HSRI background.
For object detection in HSRIs, scholars have carried out a large amount of research. Most object detection methods use a three-stage mode of ① extracting object candidate regions, ② obtaining the features of the object candidate regions, and ③ classifying the object candidate regions using these features to detect objects in HSRIs [7]. Such methods generally locate objects with a horizontal bounding box (HBB); to improve the detection accuracy for densely packed objects with a large aspect ratio, the oriented bounding box (OBB) has been adopted, as shown in Figure 1b. For the use of OBB to detect objects in HSRIs, some studies have been carried out. For example, Ding et al. [36] designed a Rotated Region of Interest (RRoI) learner to transform a Horizontal Region of Interest (HRoI) into an RRoI. The designed RRoI transformer was embedded into an object detector for oriented object detection. Li et al. [37] proposed a feature-attentioned object detection framework to detect oriented objects in HSRIs. The proposed framework consists of three components: feature-attentioned feature pyramid networks, a multiple-receptive-fields attention-based RPN, and a proposal-level attention-based RoI module. Yang et al. [38] proposed a multi-category rotation detector for small, cluttered and rotated objects. In the rotation detector, a supervised pixel attention network and a channel attention network were jointly explored for small and cluttered object detection by suppressing the noise and highlighting the object features. For more accurate rotation estimation, an IoU constant factor was added to the smooth L1 loss to address the boundary problem of the rotating bounding box. Wang et al. [39] provided a semantic attention-based mask oriented bounding box representation for multi-category object detection in HSRIs. In the proposed oriented object detector, an inception lateral connection network was used to enhance the FPN, and a semantic attention network was adopted to provide semantic features that help distinguish objects of interest from the cluttered background effectively.
Compared with HBB, object detectors based on OBB are more suitable for object detection of HSRIs. Therefore, object detection based on OBB has become a research hotspot.
With respect to the multi-angle features of object orientation in HSRIs object detection, this paper presents a novel HSRIs object detection method based on CNN with adaptive object orientation features (CNN-AOOF). First, an adaptive object orientation regression method is proposed. Then, a CNN framework for object detection in HSRIs is designed using the adaptive object orientation regression method.
The main contributions of this paper are as follows: 1. An HSRI object detection dataset with OBB, WHU-RSONE-OBB, is established and published to promote the development of object detection for HSRIs. 2. An adaptive object orientation regression method is proposed to obtain object regions in any direction. 3. An object detection framework based on CNN with adaptive object orientation features is designed to detect various objects for HSRIs. 4. The proposed method can more accurately detect objects with large aspect ratios and densely distributed objects than object detectors using a horizontal bounding box.
The rest of this paper is organized as follows. In Section 2, the CNN-AOOF framework is introduced in detail. In Sections 3 and 4, the datasets are described, and experimental results are discussed and analyzed. In Section 5, the experimental results are summarized, and the conclusions are drawn.

Materials and Methods
In this paper, the CNN-AOOF framework is obtained using two steps. First, the adaptive object orientation regression method is proposed. Second, the CNN-AOOF framework is designed using the adaptive object orientation regression method.

The Adaptive Object Orientation Regression Method
At present, single-stage object detectors (such as YOLO [17], YOLOv2 [20] and SSD [18]) and two-stage object detectors (such as Fast-RCNN [15] and Faster-RCNN [16]) use four parameters (x, y, w, h) to train and regress the coordinates of the object region. x and y are the coordinates of the center point of the object region, and w and h are the width and height of the object region. The object region that is trained and regressed using the four parameters (x, y, w, h) is a horizontal bounding box, which has difficulty tightly enclosing the object region in HSRIs, as shown in Figure 1a. In order to tightly enclose the object region in HSRIs, the adaptive object orientation regression method is proposed in this paper.
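To make this limitation concrete, the following hypothetical sketch (not from the paper) computes how much larger the horizontal bounding box of a rotated w × h object is than the object's own oriented box. For an elongated object rotated by 45°, the horizontal box covers several times the object area, which is the redundancy the oriented representation avoids.

```python
import math

def hbb_redundancy(w, h, theta):
    """Ratio of the axis-aligned (horizontal) bounding box area to the true
    oriented box area, for an object of size w x h rotated by theta radians.
    Illustrates why an HBB fits elongated, rotated objects poorly."""
    # width/height of the tightest axis-aligned box around the rotated rectangle
    W = w * abs(math.cos(theta)) + h * abs(math.sin(theta))
    H = w * abs(math.sin(theta)) + h * abs(math.cos(theta))
    return (W * H) / (w * h)
```

For a 10 × 1 object at 45°, the horizontal box is about six times the object area, so most of the detected region is background.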
In the adaptive object orientation regression method, five parameters (x, y, w, h, θ) are used to train and regress the object region, as shown in Figure 2a. x and y are the coordinates of the center point of the object region, and w and h are the width and height of the object region. In remote sensing image processing, the upper left corner is the coordinate origin (0, 0), the horizontal axis is the X axis, and the vertical axis is the Y axis, as shown in Figure 3. θ represents the angle of the clockwise rotation from the X axis to the Y axis. Therefore, in the adaptive object orientation regression method, θ is the angle between the corner point with the smallest y value among the four corner points of the object region and the X axis. The value range of θ is (0, π/2]. When θ is π/2, θ is the angle between the point P1 and the X axis, as shown in Figure 2b. w and h are the lengths of |P3P4| and |P4P1|, respectively. The object region that is trained and regressed using the five parameters (x, y, w, h, θ) is an arbitrarily oriented bounding box, which can tightly enclose the object region in HSRIs.

In the adaptive object orientation regression method, the five parameters (x, y, w, h, θ) of the object region are trained and regressed based on the anchor, as shown in Figure 4. In Figure 4, the dotted rectangle is the anchor at position (i, j) of the output feature map. In the process of training and regressing the object region, the five parameters (x, y, w, h, θ) of the object region at position (i, j) are calculated as follows:

x = i + σ(t_x)    (1)
y = j + σ(t_y)    (2)
w = w_a · e^(t_w)    (3)
h = h_a · e^(t_h)    (4)
θ = (π/2) · σ(t_θ)    (5)

where (t_x, t_y, t_w, t_h, t_θ) are the regressed five parameters corresponding to (x, y, w, h, θ) of the object region, σ(·) is the sigmoid function, and w_a and h_a are the w and h of the anchor, respectively, as shown in Figure 4.

In the adaptive object orientation regression method, the four corner point coordinates of the object region can be obtained using the five parameters (x, y, w, h, θ), and the oriented bounding box can be drawn using these four corner point coordinates. The calculation formula is as follows:

x_k = x + u_k · cos θ − v_k · sin θ,  y_k = y + u_k · sin θ + v_k · cos θ,  (u_k, v_k) ∈ {(±w/2, ±h/2)},

where (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) are the coordinates of the four corner points P_1, P_2, P_3 and P_4 of the object region, respectively; that is, each corner is obtained by rotating the corresponding axis-aligned offset about the center (x, y) by θ.
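The two steps above — decoding anchor-relative outputs into (x, y, w, h, θ), then converting the five parameters into corner coordinates — can be sketched in Python. The sigmoid/exponential decoding forms are an assumption in the style of YOLO-family detectors (the paper's exact Formulas (1)–(5) may differ), and the corner ordering is chosen for illustration only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_obb(t, i, j, wa, ha, stride=1.0):
    """Decode raw outputs t = (tx, ty, tw, th, ttheta) at grid cell (i, j)
    with anchor size (wa, ha) into an oriented box (x, y, w, h, theta).
    Assumed YOLO-style mappings; theta is mapped onto (0, pi/2]."""
    tx, ty, tw, th, ttheta = t
    x = (i + sigmoid(tx)) * stride          # centre x in image coordinates
    y = (j + sigmoid(ty)) * stride          # centre y in image coordinates
    w = wa * math.exp(tw)                   # width, scaled from the anchor
    h = ha * math.exp(th)                   # height, scaled from the anchor
    theta = sigmoid(ttheta) * math.pi / 2   # clockwise angle from the X axis
    return x, y, w, h, theta

def obb_corners(x, y, w, h, theta):
    """Corner coordinates of the oriented box: each axis-aligned offset
    (+/- w/2, +/- h/2) is rotated clockwise by theta about the centre.
    Image convention: origin at the upper left, y axis pointing down."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(x + u * c - v * s, y + u * s + v * c) for u, v in offsets]
```

With zero raw outputs the decoded box sits at the cell centre with the anchor's size, and at θ = 0 the corners reduce to the usual horizontal box.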

CNN-AOOF Framework Design
Using the adaptive object orientation regression method, a novel object detection framework based on CNN-AOOF is proposed for object detection in HSRIs. The CNN-AOOF framework is a single-stage object detector, as shown in Figure 6. In the CNN-AOOF framework, darknet-53 [21] is used to generate the feature maps, and the size of the input image is 416 pixels × 416 pixels. The object region is trained and regressed based on anchors on feature maps at three different scales. On the feature map with the size of 13 pixels × 13 pixels, object candidate regions are trained and regressed based on three anchors at each position of the feature map; the sizes of the three anchors are 116 pixels × 90 pixels, 156 pixels × 198 pixels, and 373 pixels × 326 pixels, as shown in the blue dotted rectangle in Figure 6. The 13 × 13 feature map is then upsampled and combined with the 26 × 26 feature map to form a new feature map with the size of 26 pixels × 26 pixels, on which object candidate regions are regressed with three anchors of 30 pixels × 61 pixels, 62 pixels × 45 pixels, and 59 pixels × 119 pixels at each position, as shown in the green dotted rectangle in Figure 6. Finally, the new 26 × 26 feature map is upsampled and combined with the 52 × 52 feature map to form a new feature map with the size of 52 pixels × 52 pixels, on which three anchors of 10 pixels × 13 pixels, 16 pixels × 30 pixels, and 33 pixels × 23 pixels are used at each position, as shown in the red dotted rectangle in Figure 6.
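The three detection scales just described can be summarized in a small sketch. The anchor sizes are taken from the text; the stride of each grid follows from 416 divided by the grid size, and the dictionary layout is purely illustrative.

```python
def detection_grid(input_size=416):
    """Grid sizes and anchors for the three detection scales of the
    CNN-AOOF framework as described in the text. Anchor sizes are
    (width, height) in pixels; stride = input_size / grid_size."""
    anchors = {
        13: [(116, 90), (156, 198), (373, 326)],  # coarsest map: large objects
        26: [(30, 61), (62, 45), (59, 119)],      # middle map: medium objects
        52: [(10, 13), (16, 30), (33, 23)],       # finest map: small objects
    }
    return {g: {"stride": input_size // g, "anchors": a} for g, a in anchors.items()}
```

Summing over the three grids, the framework evaluates 13² · 3 + 26² · 3 + 52² · 3 = 10,647 anchor positions per 416 × 416 image.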
In the CNN-AOOF framework, a multi-scale training method is used. Three object candidate regions are generated based on the three anchors at each position of the three feature maps of different scales. If the intersection-over-union (IoU) overlap of an anchor and a ground truth box is the greatest among those of all anchors and that ground truth box, a positive label is assigned to the anchor; otherwise, a negative label is assigned to the anchor. In training the CNN-AOOF framework, there are W × H × K × (C + 6) predicted value outputs on each feature map. The loss function of the CNN-AOOF framework is calculated as follows:

L = L_coord + L_class + L_conf,

where L is the training loss of the CNN-AOOF framework, and L_coord, L_class and L_conf are the training losses of the coordinates, class and confidence of the object regions generated from the anchors, respectively.

W and H are the width and height of the feature map, and K is the number of anchors at each position of the feature map.

I_ij^k indicates whether a positive label is assigned to the label k anchor at position (i, j) of the feature map: if a positive label is assigned, I_ij^k is 1; otherwise it is 0. w_g and h_g are the width and height of the ground truth box corresponding to the label k anchor at position (i, j) of the feature map, respectively. (x_g, y_g, w_g, h_g, θ_g) are the five parameters of the ground truth box. (t_x*, t_y*, t_w*, t_h*, t_θ*) are the framework output values used to calculate the five parameters of the generated object region based on the label k anchor. w_a and h_a are the width and height of the label k anchor, respectively.

C is the classification number of the object. p* is the framework output value for the different classifications of the generated object region based on the label k anchor. c* is the framework output value of the object confidence of the generated object region based on the label k anchor.
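The anchor labelling rule described above (the anchor with the greatest IoU overlap with a ground truth box is positive, all others negative) can be sketched as follows. The IoU matrix itself would come from an oriented-box IoU computation not shown here, and the tie-breaking behaviour is an implementation choice.

```python
def assign_labels(iou_matrix):
    """Assign anchor labels from an IoU matrix of shape
    [num_anchors][num_ground_truth].

    For each ground-truth box, the anchor with the highest IoU receives a
    positive label (1); every remaining anchor is negative (0)."""
    num_anchors = len(iou_matrix)
    num_gt = len(iou_matrix[0]) if num_anchors else 0
    labels = [0] * num_anchors
    for g in range(num_gt):
        # index of the anchor best overlapping ground-truth box g
        best = max(range(num_anchors), key=lambda a: iou_matrix[a][g])
        labels[best] = 1
    return labels
```

Note that one anchor can be the best match for several ground truth boxes; a full implementation would also resolve which ground truth box such an anchor regresses toward.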
In the testing process of the CNN-AOOF framework, all x, y, θ, confidence and classification values among all the output values of the CNN-AOOF framework are processed using Formula (11). Then the five parameters of the generated object region based on the anchor at each position of the feature map are obtained using Formulas (1)-(5). If the confidence of the generated object region is greater than the threshold, it is retained; otherwise, it is removed. The confidence threshold of the object detection results of CNN-AOOF is set to 0.05 for quantitative evaluation. The classification of each retained generated object region is determined based on the classification output values of the CNN-AOOF framework. To reduce redundancy, the non-maximum suppression (NMS) algorithm is applied to the retained generated object regions based on their confidence. The IoU threshold is set to 0.3 in the NMS algorithm. After NMS, the object detection result of an HSRI is obtained.
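The confidence filtering and NMS steps above can be sketched as follows. For brevity the overlap function shown is a plain axis-aligned IoU; a full implementation for this detector would use a polygon IoU between oriented boxes, which the greedy loop accepts through the `iou_fn` parameter.

```python
def aabb_iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2); a stand-in here for
    the rotated-box IoU an oriented detector would use."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def filter_and_nms(boxes, scores, iou_fn=aabb_iou,
                   score_thresh=0.05, iou_thresh=0.3):
    """Keep detections above the confidence threshold, then apply greedy
    non-maximum suppression by descending confidence. The default
    thresholds (0.05 and 0.3) follow the values given in the text."""
    order = [i for i in sorted(range(len(scores)), key=lambda i: -scores[i])
             if scores[i] > score_thresh]
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence remaining detection
        keep.append(best)
        # suppress everything overlapping it beyond the IoU threshold
        order = [i for i in order if iou_fn(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

The returned indices identify the detections that survive both thresholds; NMS is applied per class in practice.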

Results
Some state-of-the-art object detection algorithms (Faster-RCNN, CNN-SOSF, YOLOv2 and YOLOv3) have been effectively applied to object detection for HSRIs. To examine the object detection effectiveness of CNN-AOOF, four HSRI object detection datasets (WHU-RSONE-OBB, UCAS-AOD, HRSC2016 and DOTA) are used to compare CNN-AOOF with Faster-RCNN, CNN-SOSF, YOLOv2 and YOLOv3. CNN-AOOF is based on the darknet framework and programmed in C++. The experiments are carried out on a server with an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20 GHz, an NVIDIA Quadro M4000 (8 GB GPU memory), 16 GB RAM, and the Windows 10 operating system.

Object Detection for WHU-RSONE-OBB
Large-scale object detection datasets are the basis and key for supporting CNN-based object detection methods in achieving high performance [40,41]. Therefore, an object detection dataset with OBB for HSRIs, WHU-RSONE-OBB, is established and made publicly available in this paper to promote the development of HSRI object detection. In WHU-RSONE-OBB, images were obtained from SuperView-1 images, Tianditu, and Google Earth images. There are 5977 images in WHU-RSONE-OBB. The size of the images ranges from 600 pixels × 600 pixels to 1372 pixels × 1024 pixels, and the spatial resolution ranges from 0.5 m to 0.8 m. There are three kinds of geospatial objects (airplane, storage-tank and ship) in WHU-RSONE-OBB, and object samples are labeled using OBB. The number of each of the three kinds of geospatial objects in WHU-RSONE-OBB is shown in Table 1.

In this paper, mean average precision (mAP) is used as the evaluation criterion for the object detection results of the object detectors [42]. If the IoU of the bounding box of an object detection result and the bounding box of the ground truth is equal to or greater than 0.5, the object detection result is considered correct, and vice versa. The larger the mAP value, the higher the accuracy of the object detector, and vice versa. The mAP is obtained using the following formula:

mAP = (1/N) · Σ_{i=1}^{N} AP_i,

where i is the label of an object class, N is the class number of the detected objects, and AP_i is the average precision of the label i class. Its value is the area under the precision-recall curve (PRC), as shown in Figure 7.

In WHU-RSONE-OBB, we randomly select 4781 images as the training set, 598 images as the validation set, and 598 images as the testing set. Using WHU-RSONE-OBB, CNN-AOOF and some state-of-the-art object detection algorithms are trained and tested. Table 2 shows the quantitative comparison results of CNN-AOOF and some state-of-the-art object detection algorithms.
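The AP and mAP computation just described can be sketched as follows, using all-point interpolation (the precision envelope) to integrate the precision-recall curve. This is one common convention; the paper's exact interpolation scheme may differ.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve via all-point interpolation.
    recalls must be sorted in increasing order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate: sum of recall steps times the interpolated precision
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(aps):
    """mAP is the mean of the per-class AP values."""
    return sum(aps) / len(aps)
```

A perfect detector (precision 1.0 at recall 1.0) yields AP = 1.0, and mAP simply averages the per-class APs, matching the formula above.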
In Table 2, the AP values of airplane, storage-tank and ship are 0.9857, 0.8831 and 0.792, respectively, in the object detection results using CNN-AOOF. The AP values of the various objects using CNN-AOOF are greater than those of the other object detection algorithms. Moreover, the mAP value of CNN-AOOF is the largest among the five object detection algorithms. These results show that CNN-AOOF can obtain more accurate object detection results than the other object detection algorithms for HSRIs in WHU-RSONE-OBB.

Figure 8 shows the PRCs of the five object detection algorithms for the object detection results of WHU-RSONE-OBB. In Figure 8, for airplane, storage-tank and ship, the PRC areas of CNN-AOOF are greater than those of the other object detection algorithms. The experimental results show that CNN-AOOF outperforms the other object detection algorithms, and can obtain more accurate object detection results.

Table 3 shows the average time consumption of the five object detection algorithms for per-image object detection in WHU-RSONE-OBB.

Object Detection for UCAS-AOD
To further verify the object detection effectiveness of CNN-AOOF, UCAS-AOD [43] is used to compare CNN-AOOF with some state-of-the-art object detection algorithms (Faster-RCNN, CNN-SOSF, YOLOv2 and YOLOv3). UCAS-AOD is an HSRI object detection dataset that contains two kinds of objects: airplane and car. Object samples are labeled using OBB, and images are cropped from Google Earth. There are 1510 images in UCAS-AOD, and the size of the images ranges from 1280 pixels × 659 pixels to 1372 pixels × 941 pixels. In line with [44,45], we randomly select 1060 images for training and 450 images for testing. Table 4 shows the quantitative comparison results of the five object detection algorithms for UCAS-AOD. In Table 4, the AP values of airplane and car are 0.9488 and 0.8996, respectively, in the object detection results using CNN-AOOF. The AP values of the two kinds of objects using CNN-AOOF are greater than those of the other object detection algorithms. Moreover, the mAP value of CNN-AOOF is the largest among the five object detection algorithms. The experimental results show that CNN-AOOF is superior to the other four object detection algorithms for UCAS-AOD. Table 4. Performance comparisons of the five object detection algorithms in terms of AP values for the UCAS-AOD dataset.

Method            Airplane  Car     mAP
Faster-RCNN [16]  0.9270    0.7582  0.8426
CNN-SOSF [33]     0.9339    0.7965  0.8652
YOLOv2 [20]       0.7426    0.1501  0.4463
YOLOv3 [21]       0.9414    0.8805  0.9109
CNN-AOOF          0.9488    0.8996  0.9242

Figure 9a,b are the PRCs of the five object detection algorithms for airplane and car in UCAS-AOD, respectively. In Figure 9, we can see that for airplane and car, the PRC areas of CNN-AOOF are greater than those of the other object detection algorithms. The experimental results show that CNN-AOOF outperforms the other object detection algorithms, and can obtain more accurate airplane and car detection results for the UCAS-AOD dataset.
Figure 9. PRCs of the five object detection algorithms for (a) airplane, and (b) car in the UCAS-AOD dataset.

Object Detection for HRSC2016
The HRSC2016 [46] dataset is used to compare CNN-AOOF with the other object detection algorithms (Faster-RCNN, CNN-SOSF, YOLOv2 and YOLOv3) to further verify the object detection effectiveness of CNN-AOOF. HRSC2016 is a ship detection dataset of HSRIs. Ship samples are labeled using OBB, and images are cropped from Google Earth. The size of the images ranges from 300 pixels × 300 pixels to 1500 pixels × 900 pixels. There are 1061 images in HRSC2016, including 436 images for training, 181 images for validation, and 444 images for testing. Table 5 shows the quantitative comparison results of the five object detection algorithms for HRSC2016. In Table 5, the AP values of the ship in the object detection results of the five algorithms are 0.8349, 0.8301, 0.4230, 0.8144 and 0.8567, respectively. The AP value of the ship using CNN-AOOF is greater than those of the other four object detection algorithms. The experimental results show that CNN-AOOF outperforms the other four object detection algorithms for HRSC2016. Table 5. Performance comparisons of the five object detection algorithms in terms of AP values for the HRSC2016 dataset.

Method            Ship
Faster-RCNN [16]  0.8349
CNN-SOSF [33]     0.8301
YOLOv2 [20]       0.4230
YOLOv3 [21]       0.8144
CNN-AOOF          0.8567

Figure 10 is the PRC of the five object detection algorithms for ship in HRSC2016. In Figure 10, we can see that for ship, the PRC area of CNN-AOOF is greater than those of the other four object detection algorithms. The experimental results show that CNN-AOOF is superior to Faster-RCNN, CNN-SOSF, YOLOv2 and YOLOv3, and can obtain more accurate ship detection results for the HRSC2016 dataset.

Object Detection for DOTA
DOTA [40] is a multi-category object detection dataset for HSRIs. There are 2806 images in the dataset. The training set, validation set and test set account for 1/3, 1/6 and 1/2 of the dataset, respectively. The images range from about 800 pixels × 800 pixels to 4000 pixels × 4000 pixels. There are 15 kinds of objects (plane (PL), ship (SH), storage-tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), harbor (HA), bridge (BR), large vehicle (LV), small vehicle (SV), helicopter (HC), roundabout (RA), soccer ball field (SBF) and swimming pool (SP)) in the dataset. In DOTA, object samples are labeled using OBB. Table 6 shows the quantitative comparison results of CNN-AOOF and some state-of-the-art object detection algorithms (RoI Trans, SCRDet, Li et al., Mask OBB). In Table 6, the mAP values of the five object detection algorithms are 0.6956, 0.7261, 0.7328, 0.7533 and 0.7571, respectively. The mAP of CNN-AOOF is 0.7571, which is the largest value among the five object detection algorithms. The experimental results show that CNN-AOOF outperforms the other four object detection algorithms, and can obtain better object detection results for the DOTA dataset.

Discussion
In this section, CNN-AOOF is compared with the other object detection algorithms by visual evaluation. Among the two-stage object detectors, the object detection accuracy of CNN-SOSF is better than that of Faster-RCNN for the WHU-RSONE-OBB, UCAS-AOD and HRSC2016 datasets. Among the single-stage object detectors, the object detection accuracy of YOLOv3 is better than that of YOLOv2. Therefore, CNN-AOOF is compared by visual discrimination with CNN-SOSF and YOLOv3, which have the greater detection accuracy. Figure 11 shows some object detection result samples of CNN-SOSF, YOLOv3 and CNN-AOOF for the WHU-RSONE-OBB, UCAS-AOD and HRSC2016 datasets. Figure 11a-c are the object detection results of CNN-SOSF, YOLOv3 and CNN-AOOF, respectively.
In Figure 11(a1,b1), due to the dense distribution of the airplanes, the airplanes indicated by the yellow arrow cannot be detected correctly using CNN-SOSF and YOLOv3. In Figure 11(c1), airplanes are correctly detected using CNN-AOOF.
In Figure 11(a2,b2), due to the ship with large aspect ratios, the ship indicated by the yellow arrow cannot be detected correctly using CNN-SOSF and YOLOv3. There are large redundant areas in the detection results of other ships. In Figure 11(c2), ships are correctly detected using CNN-AOOF.
In Figure 11(a3), due to the large aspect ratios and dense distribution of the ships, the ships indicated by the yellow arrow cannot be detected correctly using CNN-SOSF. In Figure 11(b3), the storage-tank indicated by the yellow arrow cannot be detected correctly using YOLOv3. In Figure 11(c3), ships and storage-tanks are correctly detected using CNN-AOOF.
In Figure 11(a4,b4), due to the ships with large aspect ratios, the ships indicated by the yellow arrow cannot be detected correctly using CNN-SOSF and YOLOv3. There are large redundant areas in the detection results of other ships. In Figure 11(c4), ships are correctly detected using CNN-AOOF.
In Figure 11(a5,b5), due to the ships with large aspect ratios, the ships indicated by the yellow arrow cannot be detected correctly using CNN-SOSF and YOLOv3. There are large redundant areas in the detection results of other ships. In Figure 11(c5), ships are correctly detected using CNN-AOOF.
In Figure 11(a7,b7), due to the dense distribution of the cars, the cars indicated by the yellow arrow cannot be detected accurately using CNN-SOSF and YOLOv3. In Figure 11(c7), dense cars are detected accurately using CNN-AOOF.
In Figure 11(a8), due to the dense distribution of the cars, the cars indicated by the yellow arrow cannot be detected accurately using CNN-SOSF. In Figure 11(b8,c8), cars are detected accurately using YOLOv3 and CNN-AOOF.
The experimental results show that it is difficult for CNN-SOSF and YOLOv3 to accurately detect objects with large aspect ratios and densely distributed objects, because they use horizontal bounding boxes to detect objects. In contrast, CNN-AOOF uses OBB to detect objects, and can accurately detect objects with large aspect ratios and densely distributed objects. Therefore, CNN-AOOF is superior to CNN-SOSF and YOLOv3 for the WHU-RSONE-OBB, UCAS-AOD and HRSC2016 datasets.

Conclusions and Future Work
With respect to the multi-angle features of object orientation in HSRIs object detection, a novel HSRIs object detection method based on convolutional neural networks with adaptive object orientation features (CNN-AOOF) is proposed in this paper. First, an adaptive object orientation regression method is proposed to obtain object regions in any direction. Then, a CNN framework for object detection of HSRIs is designed using the adaptive object orientation regression method. To verify the object detection effectiveness of CNN-AOOF, the WHU-RSONE-OBB, UCAS-AOD, HRSC2016 and DOTA datasets are used to qualitatively and quantitatively compare CNN-AOOF with some state-of-the-art object detection algorithms. The experimental results show that CNN-AOOF is superior to the other state-of-the-art object detection algorithms, and can accurately detect objects with large aspect ratios and densely distributed objects in the different HSRI object detection datasets. Object anchor scales are a vital factor affecting the object detection results of HSRIs. In future work, how to adaptively adjust the object anchor scales in the proposed method for different object detection tasks will be studied to obtain more accurate object detection results.