Object Detection in Remote Sensing Images Based on Improved Bounding Box Regression and Multi-Level Features Fusion

The objective of detection in remote sensing images is to determine the location and category of all targets in these images. Anchor-based methods are the most prevalent deep-learning-based methods, but they still have some problems that need to be addressed. First, the existing metric (i


Introduction
Great progress in remote sensing technologies means that a large number of remote sensing images (RSIs) with high spatial or spectral resolution are available. Thus, the valuable information represented in RSIs requires better analysis methods to meet the needs of many military and civilian applications.
On the other hand, the existing methods that adopt a hierarchical deep network have several problems in the feature extraction of region proposals. In existing methods, the RPN first generates a large number of region proposals from all the layers. Next, the size of each region proposal is quantized to determine the feature level to which the proposal belongs. Finally, the features of each region proposal are extracted from the selected feature level. This approach has two problems: (1) two proposals with similar sizes are likely to be assigned to different levels due to quantization error; (2) each proposal maps to only a single feature map, which does not make full use of the advantages of multi-level features. In fact, the multi-level features are used only at the RPN stage and are not considered during the subsequent feature extraction of region proposals.
To resolve the above problems, a novel ODRSIs method based on improved bounding box regression and multi-level features fusion is proposed in this paper. First, a new metric named GIoU [43] is introduced, which can handle both overlapping and nonoverlapping cases between two bounding boxes. This solves the problem that existing metrics (i.e., IoU) fail to measure the distance in the nonoverlapping cases for ODRSIs. Then, to solve the problem that the existing bounding box regression loss cannot directly optimize the metric in ODRSIs, a new bounding box regression loss named IGIoU loss is proposed, which can optimize the new metric (i.e., GIoU) directly and solve the constant gradient problem of the GIoU loss [43]. Furthermore, a multi-level features fusion (MLFF) module is proposed to combine the multi-level features for each proposal, which overcomes the weakness that existing multi-level feature methods do not make full use of the advantages of multi-level features.
The main contributions of this paper are as follows: (1) A new metric (i.e., GIoU) for ODRSIs is adopted to replace the existing metric (i.e., IoU). The GIoU overcomes the problem that the IoU cannot measure the distance when the predicted bounding box and the ground truth box are nonoverlapping. (2) A novel bounding box regression loss named IGIoU loss is proposed. The proposed IGIoU loss can optimize the new metric (i.e., GIoU) directly, which thus solves the problem that the existing bounding box regression loss cannot directly optimize the metric in ODRSIs. Furthermore, it can overcome the problem that existing GIoU loss [43] cannot adaptively change the gradient based on the GIoU value. (3) A multi-level features fusion module named MLFF module is proposed. The proposed MLFF can be incorporated into the existing hierarchical deep network, which can address the problem that the existing methods which adopt a hierarchical deep network do not make full use of the advantages of multi-level features for the feature extraction of region proposals.

Methods
The overall architecture of our method is illustrated in Figure 1. First, given a remote sensing image with an arbitrary size as an input, the FPN was adopted as the backbone network, which yields multi-scale feature maps at different levels. Next, the obtained multi-scale feature maps were fed into an MLFF module, which pools features via RoIAlign [44] (i.e., an alternative to RoI pooling [33]) across levels $P_2$ to $P_5$ for each proposal and then fuses them by concatenating the pooled features along the channel dimension. Finally, the fused features of each proposal were utilized for bounding box regression and classification. Note that our IGIoU loss was adopted to replace the smooth L1 loss used in the original FPN for bounding box regression.
Remote Sens. 2020, 12, 143

Multi-Level Features Fusion
In the original FPN, each proposal generated by the RPN was mapped back to the feature map of a single level k (i.e., $P_k$ in Figure 1a) among the multi-level feature maps, based on the size of the proposal. The level k was defined by:

$$k = \left\lfloor k_0 + \log_2\left(\frac{\sqrt{wh}}{224}\right) \right\rfloor \qquad (1)$$

where $\lfloor \cdot \rfloor$ denotes the rounding down operation, w and h represent the width and height of the proposal, respectively, $k_0$ was the level corresponding to a proposal with $wh = 224^2$, and 224 × 224 was the size of the input layer of ResNet50 (the backbone network). The default value of $k_0$ was 4, which is analogous to Faster R-CNN with a ResNet backbone that employs $C_4$ as a single-scale feature map. As shown in Equation (1), the smaller proposals were assigned to lower levels and the larger proposals were assigned to higher levels. However, this assignment scheme has some problems. For instance, proposals with similar sizes may be assigned to feature maps of different levels because of the quantization error of Equation (1). Furthermore, the advantages of the hierarchical feature representation of FPN were not fully utilized, since the features of every proposal were extracted from only a single-level feature map of FPN.
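The quantization error of Equation (1) can be illustrated with a minimal Python sketch. The function name `fpn_level` and the clamping of the result to levels 2 to 5 are assumptions made for this illustration; the text above only states the formula itself.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Level assignment of Equation (1): floor(k0 + log2(sqrt(w*h) / 224)).

    Clamping to [k_min, k_max] is an assumption of this sketch so that the
    result always names one of the levels P2..P5.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))


# Two proposals that differ by one pixel per side fall on different levels.
print(fpn_level(223, 223))  # 3
print(fpn_level(224, 224))  # 4
```

A one-pixel difference in side length is enough to move a proposal from $P_3$ to $P_4$, which is the quantization problem described above.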
An MLFF module is proposed in this paper to tackle the aforementioned problems. The framework of MLFF is given in Figure 1b. The feature maps of all levels (i.e., $P_2 \sim P_5$) were used by the MLFF module for feature extraction of each proposal. The details of the MLFF module are presented as follows.
First of all, each proposal generated by the RPN was mapped back to the feature maps across all levels, which are denoted by pink regions throughout all levels in Figure 1a. The size and spatial location of the pink region in each level feature map can be calculated based on the size ratio between the input image and the feature map. The top-left and right-down coordinates of the pink region in the ith level feature map can be obtained by using the following equations:

$$x_i^{TL} = \mathrm{Round}\left(p^{TL} \cdot \frac{w_i}{w_{img}}\right), \quad y_i^{TL} = \mathrm{Round}\left(q^{TL} \cdot \frac{h_i}{h_{img}}\right), \quad x_i^{RD} = \mathrm{Round}\left(p^{RD} \cdot \frac{w_i}{w_{img}}\right), \quad y_i^{RD} = \mathrm{Round}\left(q^{RD} \cdot \frac{h_i}{h_{img}}\right) \qquad (2)$$

where $(x_i^{TL}, y_i^{TL})$ and $(x_i^{RD}, y_i^{RD})$ denote the top-left and right-down coordinates of the pink region in the ith level feature map, respectively, $(p^{TL}, q^{TL})$ and $(p^{RD}, q^{RD})$ denote the top-left and right-down coordinates of the region proposal in the input image, respectively, $(w_{img}, h_{img})$ denotes the width and height of the input image, $(w_i, h_i)$ denotes the width and height of the ith level feature map, and Round(·) denotes the rounding operation. Afterwards, the four pink regions of each proposal were transformed into four groups of 7 × 7 feature maps (denoted as $F_2$, $F_3$, $F_4$, and $F_5$, as shown in Figure 1b) through the RoIAlign operation [44], and the fused features of each proposal can be obtained by the following equation:

$$F = F_2 \oplus F_3 \oplus F_4 \oplus F_5 \qquad (3)$$

where F denotes the fused features of each proposal and ⊕ denotes the concatenation operation along the channel dimension. A convolution with a 7 × 7 kernel was applied to F to obtain $FC_1$, and $FC_1$ was followed by a fully connected layer (i.e., $FC_2$) for bounding box regression and classification.
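The mapping of a proposal onto every level and the channel-wise concatenation can be sketched in plain Python. Everything named here is an assumption for illustration: the helper names, the half-up rounding convention (the text only says "rounding"), the 800 × 800 input, the feature map sizes derived from strides 4, 8, 16, and 32, and the 256 channels per level. The RoIAlign step itself is not reproduced; dummy 7 × 7 grids stand in for its output.

```python
import math


def round_half_up(x):
    # Assumed rounding convention for Round(.) in this sketch.
    return math.floor(x + 0.5)


def map_proposal(box, img_size, feat_size):
    """Scale a proposal (p_tl, q_tl, p_rd, q_rd) from image to feature map coordinates."""
    p_tl, q_tl, p_rd, q_rd = box
    w_img, h_img = img_size
    w_i, h_i = feat_size
    return (
        round_half_up(p_tl * w_i / w_img),
        round_half_up(q_tl * h_i / h_img),
        round_half_up(p_rd * w_i / w_img),
        round_half_up(q_rd * h_i / h_img),
    )


def fuse(per_level_features):
    """Concatenate per-level features along the channel axis.

    Each level is a list of channels; each channel is a 7x7 grid.
    """
    fused = []
    for level in per_level_features:
        fused.extend(level)
    return fused


image = (800, 800)
proposal = (100, 200, 300, 440)
# Hypothetical feature map sizes for P2..P5 at strides 4, 8, 16, 32.
levels = {2: (200, 200), 3: (100, 100), 4: (50, 50), 5: (25, 25)}
for i, size in levels.items():
    print(i, map_proposal(proposal, image, size))

# Dummy stand-ins for the RoIAlign outputs F2..F5 (256 channels each, assumed).
channel = [[0.0] * 7 for _ in range(7)]
pooled = [[channel] * 256 for _ in levels]
print(len(fuse(pooled)))  # 1024 channels after concatenation
```

Under these assumptions the fused tensor keeps the 7 × 7 spatial size and has four times as many channels as a single level.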

Generalized Intersection over Union
The IoU is a normalized measure for evaluating the proximity of two bounding boxes. The IoU is insensitive to the scales of the bounding boxes and has been widely used in ODRSIs. As shown in Figure 2, the IoU between the ground truth box $B_{GT}$ and the predicted bounding box $B_{PT}$ can be calculated as follows:

$$IoU = \frac{area(B_{GT} \cap B_{PT})}{area(B_{GT} \cup B_{PT})} \qquad (4)$$

where $area(B_{GT} \cap B_{PT})$ and $area(B_{GT} \cup B_{PT})$ denote the areas of the intersection and union of $B_{GT}$ and $B_{PT}$, respectively. Note that the IoU is suitable for evaluating the proximity of the two bounding boxes shown in Figure 2a; however, $area(B_{GT} \cap B_{PT})$ is always zero in the case shown in Figure 2b. In other words, the IoU cannot measure the distance when two bounding boxes are nonoverlapping. In this paper, a new metric (i.e., GIoU) was adopted to address this problem. The GIoU is defined as follows:

$$GIoU = IoU - \frac{area(B_{EC}) - area(B_{GT} \cup B_{PT})}{area(B_{EC})} \qquad (5)$$

where $B_{EC}$ and $area(B_{EC})$ denote the smallest enclosing box of $B_{GT}$ and $B_{PT}$ and its area, respectively. As shown in Figure 2, the IoU decreases as the distance between $B_{GT}$ and $B_{PT}$ increases while they are overlapping, and the IoU remains at zero once they are nonoverlapping. In contrast, $area(B_{EC})$ always increases with the distance between the two bounding boxes. In summary, as shown in Equation (5), the GIoU decreases monotonically with the distance between $B_{GT}$ and $B_{PT}$, regardless of whether or not the two bounding boxes are overlapping. Therefore, the GIoU can overcome the aforementioned shortcoming of the IoU.
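The difference between the two metrics for nonoverlapping boxes can be checked with a short Python sketch of the IoU and GIoU definitions. The function names and the `(x1, y1, x2, y2)` box convention are assumptions of this sketch, and both boxes are assumed to have positive area.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(b_gt, b_pt):
    """Return (IoU, GIoU) for two boxes with positive area."""
    # Intersection box; it is empty when the boxes do not overlap.
    inter = area((max(b_gt[0], b_pt[0]), max(b_gt[1], b_pt[1]),
                  min(b_gt[2], b_pt[2]), min(b_gt[3], b_pt[3])))
    union = area(b_gt) + area(b_pt) - inter
    # Smallest enclosing box B_EC.
    enclose = area((min(b_gt[0], b_pt[0]), min(b_gt[1], b_pt[1]),
                    max(b_gt[2], b_pt[2]), max(b_gt[3], b_pt[3])))
    iou = inter / union
    giou = iou - (enclose - union) / enclose
    return iou, giou


gt = (0, 0, 2, 2)
print(iou_and_giou(gt, (1, 0, 3, 2)))   # overlapping boxes
print(iou_and_giou(gt, (3, 0, 5, 2)))   # nonoverlapping, close
print(iou_and_giou(gt, (8, 0, 10, 2)))  # nonoverlapping, far
```

For the two nonoverlapping predictions the IoU is zero in both cases, while the GIoU drops from -0.2 to -0.6 as the predicted box moves away from the ground truth.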

Bounding Box Regression Based on IGIoU Loss
Existing object detection methods for RSIs usually adopt the smooth L1 or L2 loss as the bounding box regression loss. However, these two loss functions do not directly optimize the metric. Specifically, the smooth L1 or L2 loss optimizes the four parameters of the predicted bounding box independently, while the IoU emphasizes the degree of overlap between two bounding boxes. Therefore, it is necessary to adopt a more reasonable loss function for bounding box regression.
Incorporating the IoU into the bounding box regression loss is a logical consideration. However, as mentioned in the last section, the IoU remains zero when two bounding boxes are nonoverlapping, so bounding box regression cannot proceed in this case if an IoU-based loss function is adopted. Therefore, designing the bounding box regression loss function based on the IoU is impractical.
To address this problem, Rezatofighi et al. [43] proposed a GIoU loss for natural image object detection, which incorporates the GIoU into the bounding box regression loss. The GIoU loss is formulated as follows:

$$L_{GIoU} = 1 - GIoU \qquad (6)$$

where $L_{GIoU}$ denotes the GIoU loss, and the curve of $L_{GIoU}$ is shown in Figure 3. The GIoU loss alleviates the aforementioned shortcoming of an IoU-based loss by directly optimizing the metric (i.e., GIoU). However, as shown in Figure 3, the GIoU loss has a constant gradient during the whole training process, which restricts the effect of bounding box regression to some extent. The strength of training should be enhanced when the predicted bounding box is far away from the ground truth box. In other words, the absolute value of the gradient should be larger when the GIoU is small. In addition, the value of the bounding box regression loss should decrease as the GIoU increases. Following the above analysis, an improved GIoU loss (IGIoU loss) was proposed for bounding box regression in this paper. The formulation of the IGIoU loss is given in Equation (7), where $L_{IGIoU}$ denotes the IGIoU loss. The curve of $L_{IGIoU}$ is also shown in Figure 3 for an intuitive comparison between the GIoU loss and the IGIoU loss.

As shown in Figure 3, both $L_{IGIoU}$ and $L_{GIoU}$ decrease monotonically with GIoU; the absolute value of the gradient of $L_{IGIoU}$ also decreases monotonically with GIoU, whereas the gradient of $L_{GIoU}$ does not change with GIoU. Consequently, the IGIoU loss is better suited to bounding box regression. Specifically, a larger absolute value of the gradient is required when the distance between the predicted bounding box and the ground truth bounding box is large (i.e., GIoU is small). As shown in Figure 3, the absolute value of the gradient of $L_{IGIoU}$ is larger than that of $L_{GIoU}$ when GIoU is small, which coincides with the above analysis. A similar conclusion can be derived when GIoU is large.
In summary, the design of $L_{IGIoU}$ is more reasonable than that of $L_{GIoU}$ from the perspective of theoretical analysis.
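The contrast between a constant and a GIoU-dependent gradient can be made concrete with a small Python sketch. The linear loss follows the GIoU loss of [43], 1 - GIoU. The logarithmic loss below is only a hypothetical stand-in chosen because it has the two properties discussed above (it decreases with GIoU, and its gradient magnitude grows as GIoU shrinks); it is not claimed to be the paper's Equation (7), and all function names are invented for this illustration.

```python
import math


def giou_loss(giou):
    """GIoU loss of [43]: 1 - GIoU."""
    return 1.0 - giou


def giou_loss_grad(giou):
    """Derivative of the GIoU loss with respect to GIoU (constant)."""
    return -1.0


def adaptive_loss(giou):
    """Hypothetical stand-in with an adaptive gradient (NOT the paper's Eq. (7)).

    Defined for GIoU in (-1, 1]; equals 0 at GIoU = 1.
    """
    return 2.0 * math.log(2.0) - 2.0 * math.log(1.0 + giou)


def adaptive_loss_grad(giou):
    """Derivative of the stand-in loss with respect to GIoU."""
    return -2.0 / (1.0 + giou)


for g in (-0.5, 0.0, 0.5, 1.0):
    print(f"GIoU={g:+.1f}  "
          f"linear: loss={giou_loss(g):.3f} grad={giou_loss_grad(g):+.3f}  "
          f"stand-in: loss={adaptive_loss(g):.3f} grad={adaptive_loss_grad(g):+.3f}")
```

With this stand-in, the gradient magnitude is 4 at GIoU = -0.5 and falls to 1 at GIoU = 1, whereas the linear loss pushes with magnitude 1 everywhere.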

Dataset
To validate the effectiveness of our method, various experiments were conducted on the NWPU VHR-10 [42,45,46] and DIOR benchmark datasets, which contain a total of 3775 and 192,472 object instances, respectively. The details of both datasets are listed in Figure 4.
The NWPU VHR-10 dataset includes 800 very-high-resolution RSIs acquired from the Vaihingen dataset [47] and Google Earth. The dataset is divided into two parts. One is a positive set of 650 images, each of which contains at least one object from the 10 categories, with some examples shown in Figure 5: (a) airplane, (b) baseball diamond, (c) basketball court, (d) bridge, (e) ground track field, (f) harbor, (g) ship, (h) storage tank, (i) tennis court, and (j) vehicle. The other is a negative set of 150 images that do not contain any objects belonging to the above categories. The negative set is intended for weakly supervised learning methods [48] or semi-supervised learning methods [49]; therefore, it was not utilized in our experiments. The positive set was divided into training, validation, and testing sets, which include 20%, 20%, and 60% of the images of the positive set, respectively [37,42].

Considering that the number of samples in the NWPU VHR-10 dataset is limited, the DIOR dataset recently proposed by Li et al. [50] was also utilized to verify the effectiveness and generalization of the proposed method. The DIOR dataset is a large-scale benchmark whose size is comparable to that of another well-known large-scale dataset, DOTA [51,52]. Specifically, the DIOR dataset includes 23,463 images and 192,472 object instances covering 20 categories, where each category contains around 1200 images. The 20 categories cover all categories in the NWPU VHR-10 dataset as well as ten other categories, with some examples shown in Figure 5: (k) airport, (l) chimney, (m) dam, (n) expressway service area, (o) expressway toll station, (p) golf course, (q) overpass, (r) stadium, (s) train station, and (t) wind mill. The training set, validation set, and testing set include 5862 images, 5863 images, and 11,738 images, respectively [50].

Evaluation Metrics
The evaluation metrics adopted in this paper are similar to the metrics used on MS COCO [53]. Specifically, the average precision (AP) under multiple thresholds and the mean average precision (mAP) were adopted to quantitatively evaluate the experimental results. Note that both GIoU and IoU were utilized as metrics to comprehensively demonstrate the advantages of our method. The evaluation metrics include:
(1) The curve of AP under different thresholds. The AP was calculated as the area under the precision-recall curve (p-r curve) [54][55][56]. The precision and recall are defined as follows:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

where TP, FP, and FN represent the number of true positives, false positives, and false negatives, respectively. The precision denotes the ratio of correctly detected results (i.e., true positives) to all the detected results, and the recall denotes the ratio of ground truth objects that were correctly detected. A detection result was considered to be a true positive when the metric (i.e., GIoU or IoU) exceeded a fixed threshold; otherwise, it was treated as a false positive. Consequently, the value of AP changes with the threshold. Therefore, the curve of AP under different thresholds (i.e., 0.5-0.95) was adopted for comprehensively evaluating the effectiveness of our method, as shown in Figures 6-9.
(2) mAP. AP50 denotes the AP value when the threshold is 0.5, and AP55, ..., AP90, and AP95 are defined analogously. The mAP is the mean value of AP50-AP95. The mAP, AP50, and AP75 were utilized for quantitative evaluation of our method, as shown in Tables 1-4. Note that the "mAP" used on PASCAL VOC [57] is quite different from the mAP used in this paper, as it is equivalent to AP50.
In essence, compared with the traditional "mAP" used in PASCAL VOC, the evaluation metrics used in this paper are richer and can evaluate our method more comprehensively.
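The relation between precision, recall, the threshold sweep, and the mAP can be sketched in Python. The helper names are hypothetical, the step of 0.05 between thresholds is inferred from the names AP50, AP55, ..., AP95, and the AP values in the example are placeholders rather than results from the paper.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0 when there are no detections."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0 when there are no ground truth objects."""
    return tp / (tp + fn) if tp + fn else 0.0


# Thresholds 0.50, 0.55, ..., 0.95 (AP50 ... AP95).
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_by_threshold):
    """mAP as the mean of AP50 ... AP95."""
    return sum(ap_by_threshold[t] for t in THRESHOLDS) / len(THRESHOLDS)


# A detection whose GIoU (or IoU) with its ground truth is 0.72 counts as a
# true positive only at the thresholds it exceeds.
hits = [t for t in THRESHOLDS if 0.72 > t]
print(hits)  # [0.5, 0.55, 0.6, 0.65, 0.7]

print(precision(8, 2), recall(8, 8))  # 0.8 0.5

# Placeholder AP values purely for illustration.
toy_ap = {t: 1.0 - t for t in THRESHOLDS}
print(round(mean_ap(toy_ap), 3))  # 0.275
```

Because the same detection stops being a true positive as the threshold rises, AP75 and the higher-threshold APs reward tighter localization than AP50 does.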

Implementation Details
(1) Data Augmentation. The total number of training and validation samples of NWPU VHR-10 was 260, which was insufficient for training the FPN network. Therefore, a data augmentation strategy was adopted to enlarge the set of training samples. First of all, each training sample was rotated 90° clockwise, and this process was repeated four times to obtain 4× training samples. Then, each training image was horizontally flipped to obtain 8× training samples, i.e., 2080 samples in total. Considering the large number of samples in the DIOR dataset (i.e., 5862 training samples and 5863 validation samples), data augmentation was not applied to it.
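The 8× augmentation can be sketched on a toy image stored as nested lists. This is a minimal illustration with hypothetical helper names; it handles only the pixels, whereas a real pipeline would also have to transform the bounding box annotations, which the text does not detail.

```python
def rotate90_cw(img):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Flip a 2D grid horizontally (mirror left-right)."""
    return [row[::-1] for row in img]


def augment(img):
    """Return the 4 rotations (0, 90, 180, 270 degrees) and their horizontal flips."""
    rotations = []
    current = img
    for _ in range(4):
        rotations.append(current)
        current = rotate90_cw(current)
    return rotations + [hflip(r) for r in rotations]


toy = [[1, 2],
       [3, 4]]
variants = augment(toy)
print(len(variants))  # 8
print(260 * len(variants))  # 2080 samples, matching the count in the text
```

For an image without symmetry the eight variants are all distinct, which is why 260 samples grow to 2080.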
(2) Parameters Setting. The ResNet50 model pretrained on ImageNet was adopted as the backbone network in this paper. The shorter side of each image was resized to 800 pixels, while the longer side was no more than 1333 pixels. The threshold of non-maximum suppression (NMS) was 0.7 in both the training and inference stages of RPN, and the threshold of NMS was 0.5 in both the training and inference stages of the detection network. In the training stage of RPN, each layer (i.e., P2-P5 shown in Figure 1) generated 2000 proposals and a total of 2000 proposals was left after various elimination strategies (such as NMS). In the inference stage of RPN, each layer generated 1000 proposals and a total of 1000 proposals was finally left. The whole model was optimized by using stochastic gradient descent (SGD) [58]. The above settings were set by referring to the program provided by Facebook AI Research (available at https://github.com/roytseng-tw/Detectron.pytorch).
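The resizing rule can be read as the following Python sketch. How the two constraints interact (scale the shorter side to 800 first, then shrink further if the longer side would exceed 1333) is an assumption of this sketch, as is the function name; the text states only the two limits.

```python
def resized_shape(width, height, min_side=800, max_side=1333):
    """Return the (width, height) after resizing under the two side limits."""
    # Scale so that the shorter side becomes min_side.
    scale = min_side / min(width, height)
    # If the longer side would exceed max_side, cap it instead.
    if scale * max(width, height) > max_side:
        scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)


print(resized_shape(800, 800))   # (800, 800)
print(resized_shape(500, 400))   # (1000, 800)
print(resized_shape(1600, 900))  # (1333, 750)
```

In the last case the aspect ratio forces the longer side to 1333, so the shorter side ends up below 800.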
All the experiments were implemented in the PyTorch framework and run on a workstation with two E5-2650V4 CPUs (i.e., a total of 2.2 GHz 12 × 2-cores), 512 GB memory, and 8 NVIDIA RTX Titan GPUs (i.e., a total of 24 GB × 8 memory).

Comparison Methods
To validate the effectiveness of the proposed IGIoU loss and MLFF module, six methods were used for quantitative comparisons, which are denoted as FPN (baseline), FPN+MLFF, FPN+GIoU, FPN+IGIoU, FPN+MLFF+GIoU, and FPN+MLFF+IGIoU, respectively, as shown in the first column of Tables 1 and 3. The FPN [35] was adopted as the baseline in this paper. The FPN+MLFF was obtained by incorporating the MLFF module into the FPN, which was used to validate the effectiveness of the MLFF module. The FPN+GIoU replaced the bounding box regression loss adopted by FPN (i.e., smooth L1 loss) with the GIoU loss [43] given in Equation (6). Similarly, the FPN+IGIoU adopted the proposed IGIoU loss given in Equation (7). The comparisons between FPN+IGIoU and FPN, as well as between FPN+IGIoU and FPN+GIoU, were used for validating the superiority of our IGIoU loss. The FPN+MLFF+GIoU and FPN+MLFF+IGIoU used the GIoU loss and the IGIoU loss, respectively, for bounding box regression based on the FPN+MLFF model, and were utilized to verify the combined effect of the MLFF module and the proposed IGIoU loss.
Furthermore, in order to evaluate the overall performance of proposed method, four state-of-the-art methods including Faster R-CNN [33], Mask R-CNN [44], FPN [35], and PANet [59] were used for comparison with the proposed method, as shown in the first column of Tables 2 and 4.

Results
In this section, various quantitative evaluations on the NWPU VHR-10 and DIOR datasets are given according to the evaluation metrics mentioned in Section 2.2.2.

Evaluation of Proposed IGIoU Loss and MLFF Module
To validate the effectiveness of the proposed IGIoU loss and MLFF module, the quantitative comparisons between our method and five other methods on the NWPU VHR-10 dataset are listed in Table 1. As mentioned in Section 2.2.2, both GIoU and IoU based metrics were used for a more comprehensive evaluation. Table 1 shows that FPN+MLFF is superior to FPN in all six evaluation metrics, which demonstrates the effectiveness of the MLFF module. The comparisons between FPN+IGIoU and FPN, as well as between FPN+IGIoU and FPN+GIoU, show that the IGIoU loss is better than the existing smooth L1 loss and GIoU loss. The performance of FPN+MLFF+IGIoU is better than that of FPN+IGIoU and FPN+MLFF, which indicates that the combination of the MLFF module and IGIoU loss is effective. A similar conclusion also applies to the MLFF module and GIoU loss. A comparison between FPN+MLFF+IGIoU and FPN+MLFF+GIoU demonstrates that the combination of the MLFF module and IGIoU loss is superior to the combination of the MLFF module and GIoU loss. In essence, the overall performance of the proposed method (FPN+MLFF+IGIoU) on the NWPU VHR-10 dataset is the best among the six methods examined, and the effectiveness of the MLFF module, the IGIoU loss, and their combination is also validated.
To further evaluate our method, the curves of AP with different GIoU (IoU) thresholds of all the comparison methods are shown in Figure 6. Regardless of whether the GIoU or the IoU threshold is used, the curves of FPN+MLFF+IGIoU are better than those of the other five methods, especially when the GIoU (IoU) threshold is in the range of 0.65 to 0.8. In short, the effectiveness of the proposed method is validated for various thresholds, and its superiority is more obvious when the threshold is relatively high.

Comparison with the State-of-the-Art Methods
To further validate the overall performance of the proposed method, four state-of-the-art methods were compared with the proposed method on the NWPU VHR-10 dataset, as shown in Table 2. As in Section 3.1.1, both GIoU and IoU were adopted as the metrics. Table 2 demonstrates that the proposed method obtains absolute gains of 1.7% and 1.4% in terms of GIoU mAP and IoU mAP, respectively, which validates the overall performance of the proposed method. Note that the advantage of the proposed method is more obvious in terms of AP75, which demonstrates that the proposed method can improve the precision of object localization; the details are discussed in Section 4.
As in Section 3.1.1, the curves of AP with different GIoU (IoU) thresholds of all the comparison methods are shown in Figure 7. It can be seen that the proposed method is superior to the state-of-the-art methods, especially when the threshold is in the range of 0.65-0.85, which coincides with the analysis of the previous paragraph.

Evaluation of Proposed IGIoU Loss and MLFF Module
Table 3. Comparison with baseline methods on the DIOR dataset in terms of six evaluation metrics. Bold fonts denote the best results.

The quantitative comparisons completed in Section 3.1.1 were also implemented on the DIOR dataset, as shown in Table 3 and Figure 8. Similar conclusions can be obtained by comparing Table 3 with Table 1. The only difference is that the advantage of the proposed method over the baseline method on the DIOR dataset is not as obvious as the advantage on the NWPU VHR-10 dataset. This may be because the DIOR dataset is larger than the NWPU VHR-10 dataset and its testing images are more challenging. As illustrated in Figure 8, regardless of whether the GIoU or the IoU threshold is used, the curves of FPN+MLFF+IGIoU are better than those of the other five methods, especially when the GIoU (IoU) threshold is in the range of 0.6 to 0.8. Moreover, comparing Figure 8 with Figure 6 shows that the AP value of our method at different thresholds is more balanced on the DIOR dataset than on the NWPU VHR-10 dataset. Consequently, the superiority of the proposed method is further validated by considering the challenges and size of the DIOR dataset.

Comparison with the State-of-the-Art Methods
Comparisons similar to Section 3.1.2 were also implemented on the DIOR dataset, as shown in Table 4 and Figure 9, and similar conclusions can be obtained by comparing Table 4 with Table 2. The only difference is that the advantage of the proposed method over the state-of-the-art methods on the DIOR dataset is not as obvious as the advantage on the NWPU VHR-10 dataset. The reason behind this was analyzed in Section 3.2.1. As shown in Figure 9, the overall performance of the proposed method is superior to the four state-of-the-art methods, especially when the threshold is in the range of 0.7-0.9. The reason for this can also be seen in Section 4.

Subjective Evaluation
To demonstrate the performance of our method intuitively, its detection results on the DIOR dataset are visualized. As shown in Figure 10, the detected objects are enclosed by red bounding boxes. The classes of the detected objects and the corresponding probabilities are also attached to the bounding boxes. Figure 10 demonstrates that our method performs well on different classes regardless of the scales and orientations of the objects.

Discussion
As shown in Tables 1 and 3, although the proposed method has the best performance in terms of mAP, AP50, and AP75 based on GIoU (IoU), the size of its advantage over other methods differs across the three evaluation metrics. Specifically, the advantage is most pronounced in terms of AP75, followed by mAP, and finally AP50. In other words, the advantage is most pronounced when the GIoU (IoU) threshold is relatively high. These observations agree with the curves of AP at different GIoU (IoU) thresholds. As shown in Figure 8 (evaluations on the DIOR dataset), the advantage of the proposed method is clear when the GIoU (IoU) threshold is in the range of 0.6-0.8, whereas the improvement is only slight when the threshold is in the range of 0.5-0.6 or 0.8-0.95. Similar observations can be made from Figure 6 (evaluations on the NWPU VHR-10 dataset).
The major contributions of this paper focus on improving bounding box regression, which improves the precision of object localization. Therefore, the advantage of the proposed method over other methods is clear when the localization criterion is relatively strict (e.g., the IoU or GIoU threshold is in the range of 0.6-0.8 for the DIOR dataset). However, the advantage shrinks to a slight improvement when the criterion is loose or too strict (e.g., the IoU or GIoU threshold is in the range of 0.5-0.6 or 0.8-0.95 for the DIOR dataset).
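The ordering of the gains (AP75, then mAP, then AP50) follows from how the three summary metrics are derived from the per-threshold AP curve. The sketch below illustrates this under the usual COCO convention that mAP averages AP over thresholds 0.50 to 0.95 in steps of 0.05. The AP values are made-up placeholders chosen only to mimic a gain concentrated in the 0.6-0.8 band; they are not results from this paper, and the helper name `summarize` is hypothetical.

```python
# Thresholds in percent (50, 55, ..., 95) to avoid floating-point dictionary keys.
THRESHOLDS = list(range(50, 100, 5))


def summarize(ap):
    """Compute COCO-style summary metrics from a {threshold_percent: AP} mapping."""
    return {
        "mAP": sum(ap[t] for t in THRESHOLDS) / len(THRESHOLDS),
        "AP50": ap[50],
        "AP75": ap[75],
    }


# Placeholder baseline curve: AP falls as the threshold becomes stricter.
# These numbers are illustrative assumptions, not values from the paper.
baseline = {t: 0.90 - 0.015 * (t - 50) for t in THRESHOLDS}

# Placeholder improved curve: a larger gain inside the 0.6-0.8 band,
# and only a slight gain outside it.
improved = {t: baseline[t] + (0.03 if 60 <= t <= 80 else 0.005) for t in THRESHOLDS}

base_summary = summarize(baseline)
new_summary = summarize(improved)
gains = {name: new_summary[name] - base_summary[name] for name in base_summary}
print(gains)
```

Because AP75 samples the curve inside the band where the gain is concentrated, AP50 samples it outside that band, and mAP averages over both regions, a gain localized around 0.6-0.8 shows up most strongly in AP75, less in mAP, and least in AP50.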

Figure 10. Subjective evaluations on 10 testing images from the DIOR dataset.

Conclusions
A novel ODRSIs method based on improved bounding box regression and multi-level features fusion was proposed in this paper. First, a new metric named GIoU, which handles both overlapping and non-overlapping bounding boxes, was employed to tackle the problem that IoU cannot measure the distance between two non-overlapping bounding boxes. Second, a novel bounding box regression loss named IGIoU loss was proposed, which not only optimizes the metric (i.e., GIoU) directly but also overcomes the problem that the existing GIoU-based bounding box regression loss cannot adaptively change the gradient based on the GIoU value. Finally, to handle the problem that the feature extraction scheme for region proposals in existing methods cannot make full use of multi-level features, an MLFF module was proposed and incorporated into the existing hierarchical deep network. The quantitative evaluations on the DIOR and NWPU VHR-10 datasets demonstrate the effectiveness of the proposed method. Specifically, incorporating MLFF, IGIoU loss, and their combination into the baseline method yields absolute gains of about 0.7%, 1.4%, and 2.2%, respectively, in terms of COCO mAP on the DIOR dataset, and of 1.0%, 2.0%, and 2.7%, respectively, on the NWPU VHR-10 dataset. Moreover, the curves of AP at different thresholds show that the advantage of the proposed method over other methods is clear when the threshold is relatively high (e.g., the IoU or GIoU threshold is in the range of 0.6-0.8 for the DIOR dataset), which indicates that the proposed method improves the precision of object localization. Finally, the comparison with four state-of-the-art methods demonstrates that the overall performance of the proposed method reaches a state-of-the-art level.
The GIoU employed in this paper can be applied to other ODRSIs methods, and the proposed IGIoU loss can be used as an alternative to the existing bounding box regression loss in other object detection methods. Moreover, the proposed MLFF module can be easily embedded into two-stage object detection methods whose backbone is a hierarchical deep network. Our future work involves extending the proposed method to detect objects with the oriented bounding boxes proposed by Xia et al. [51] and Ding et al. [52], as well as conducting evaluations on the DOTA dataset.