High Quality Object Detection for Multiresolution Remote Sensing Imagery Using Cascaded Multi-Stage Detectors

Abstract: Deep-learning-based object detectors have substantially improved state-of-the-art object detection in remote sensing images in terms of precision and degree of automation. Nevertheless, the large variation of object scales makes it difficult to achieve high-quality detection across multiresolution remote sensing images, where the quality is defined by the Intersection over Union (IoU) threshold used in training. In addition, the imbalance between positive and negative samples across multiresolution images worsens detection precision. Recently, it was found that the Cascade region-based convolutional neural network (R-CNN) can potentially achieve higher-quality detection by introducing a cascaded three-stage structure with progressively increased IoU thresholds. However, the performance of Cascade R-CNN degraded when a fourth stage was added. We investigated the cause and found that the mismatch between the RoI features and the classifier could be responsible for the degradation of performance. Herein, we propose a Cascade R-CNN++ structure to address this issue and extend the three-stage architecture to multiple stages for general use. Specifically, for cascaded classification, we propose a new ensemble strategy for the classifier and region of interest (RoI) features to improve classification accuracy at inference. For localization, we modified the loss function of the bounding box regressor to obtain higher sensitivity around zero. Experiments on the DOTA dataset demonstrated that Cascade R-CNN++ outperforms Cascade R-CNN in terms of precision and detection quality. We conducted further analysis on multiresolution remote sensing images to verify model transferability across different object scales.


Introduction
Object detection in remote sensing images plays an important role in several civilian and military applications, such as urban planning, geographic information system updating, and search-and-rescue operations. Compared with traditional methods (template-matching-based methods [1,2], knowledge-based methods [3,4], etc.), deep-learning-based methods automatically extract features from raw data, shifting the burden of manual feature design to the underlying learning system and enabling more powerful feature representations at higher semantic levels. With this advantage, deep-learning-based detection approaches have achieved great success in both the computer vision and remote sensing communities [5,6].

To overcome this limitation, we propose a new ensemble strategy for cascaded classification that takes the RoI features produced by the same stage for classification, rather than uniformly using those from the final stage. The final classification results are obtained by integrating the classifier outputs of all stages. In addition, the loss function of bounding box regression [26] is modified to improve sensitivity, allowing the cascaded regressor to further converge as the number of stages increases. The modified cascade structure is denoted as Cascade R-CNN++ throughout this paper.
The main contributions of this study are as follows: (1) we investigated the causes of performance degradation in cascaded detectors when more stages are added, (2) we propose a new ensemble strategy to minimize the mismatch between the classifiers and input RoIs at inference and to improve classification accuracy, and (3) we propose a modified loss function for bounding box regression to enable further convergence of bounding box regression with more stages built. The proposed Cascade R-CNN++ approach can achieve state-of-the-art detection performance on the remote sensing dataset DOTA [27,28]. It can be implemented in most cases where region-proposal-based methods are needed. In experiments with multiresolution remote sensing images, the proposed approach outperforms Cascade R-CNN both in detection quality and precision.
The rest of this paper is arranged as follows. Section 2 reviews previous studies most relevant to this research. Section 3 introduces the employed dataset and evaluation metrics. Section 4 analyzes the reasons for performance degradation when more stages are added in Cascade R-CNN. Section 5 describes the proposed method, Cascade R-CNN++. Section 6 presents experimental results with discussion, and Section 7 draws the conclusions.

Related Works
In the computer vision field, deep-learning-based detectors can generally be divided into two categories. The first is one-stage approaches, which are more efficient with simpler structures, represented by YOLO [29][30][31], single-shot detection (SSD) [32], and RetinaNet [33]. The other category is two-stage approaches (i.e., region-proposal-based methods), represented by region-based convolutional neural networks (R-CNNs) [26], Fast R-CNN [34], Faster R-CNN [35], Feature Pyramid Network (FPN) [36], and Cascade R-CNN [24]. In the second category, multi-scale region proposals are generated first, followed by feature extraction and bounding box regression. Although one-stage models have achieved high precision in object detection, two-stage methods are generally more flexible and extensible across different computer vision tasks, such as object detection, instance segmentation, and key point detection. Thus, this research focuses on two-stage detectors and aims to alleviate the problem limiting the further extension of the cascaded structure.
R-CNN [26] was proposed in 2014. It employs a two-stage structure for object detection, combining region proposals with CNN-extracted features. R-CNN uses a selective search algorithm to generate approximately 2000 candidate region proposals from the input image and applies CNNs to create a feature vector for each object proposal. The performance of R-CNN was validated on natural scene images using the PASCAL VOC 2012 dataset, reaching a mean Average Precision (mAP) of 53.3%. Fast R-CNN [34] improves computational efficiency by integrating the three training stages of R-CNN, achieving a mAP of 68.4% on the PASCAL VOC 2012 test set. R-CNN and Fast R-CNN both employ the selective search approach to generate object proposals, which is computationally expensive. Faster R-CNN [35] replaces the selective search algorithm with a region proposal network (RPN), introducing anchors to identify region proposals using a fully convolutional network, which significantly reduces time consumption. It achieved a detection accuracy of 70.4% mAP on the PASCAL VOC 2012 dataset.
The RPN has a fixed receptive field size, whereas objects are of various scales; using only the topmost feature layer for proposal generation can lead to missed detection of small objects. FPN [36] extracts top-down multiscale feature layers for the RPN to generate region proposals. As different layers have different receptive fields, objects of different scales can be matched to feature layers with appropriate receptive fields [36]. More recent studies have also contributed to improving the feature pyramid for object detection in optical remote sensing images, such as the aware feature pyramid network (AFPN) [37] and the Feature Enhancement Network (FENet) [38], achieving 74.3% and 74.89% mAP (PASCAL VOC metric) on the DOTA-v1.0 dataset, respectively.
Besides two-stage architectures, multistage detectors, including Cascade R-CNN [24,25], have recently been proposed. Cascade R-CNN uses a three-stage structure and can achieve better performance than two-stage detectors through cascaded bounding box regression and an ensemble of cascaded classification results. In the design of cascaded detectors, the regressed bounding boxes from the previous stage act as region proposals for the current stage, progressively improving the quality of region proposals in the cascaded structure. Linearly increasing IoU thresholds (0.5, 0.6, and 0.7) are used for training at each stage to match the improving quality of the input proposals and thus train high-quality detectors. Cascade R-CNN obtained 38.9% AP on MS-COCO 2017. However, degraded performance is observed when a fourth stage is added to Cascade R-CNN [25].

Datasets and Evaluation Metrics
The DOTA-v1.5 dataset [27], which contains 2806 images and 403,318 instances, was employed in this study (Sections 4, 6, and 7). It consists of 16 object categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, swimming pool, and container crane. The proportions of the training, validation, and testing sets are 1/2, 1/6, and 1/3, respectively.
Another remote sensing dataset, NWPU VHR-10 [6,12,39], was also used in the comparison with state-of-the-art detectors in Section 6. NWPU VHR-10 is a publicly available geospatial object detection dataset. It contains 800 very-high-resolution (VHR) remote sensing images cropped from Google Earth and the Vaihingen dataset and annotated by experts. The dataset consists of 10 object categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle.
In this research, Intersection over Union (IoU) was used to measure the overlap between the predicted and ground-truth bounding boxes. It is a ratio from 0 to 1 that quantifies the accuracy of object localization, and the IoU threshold used in training defines the detection quality. The metrics AP, AP50, AP75, AP90, APS, APM, and APL, as defined in the metric standard of the MS COCO object detection challenge [40], were used to assess detection precision. These metrics have been widely used to evaluate object detection tasks.
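As a concrete illustration, the IoU of two axis-aligned boxes can be computed as follows (a minimal sketch using [x1, y1, x2, y2] corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in [x1, y1, x2, y2] form."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction covering half of a 10 x 10 ground-truth box:
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 1/3
```

A detection is counted as correct under a given quality level when this ratio exceeds the corresponding threshold (e.g., 0.5 for AP50, 0.9 for AP90).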

Causes of Performance Degradation in a Four-Stage Cascade R-CNN
Cascade R-CNN [25] uses three-stage cascaded detectors to progressively improve the IoU distribution of training samples. Higher precision of object localization can be achieved through cascaded regression. However, when a fourth stage is added, the metrics AP, AP50, AP60, AP70, and AP80 all decrease, and only AP90 slightly increases. Herein, we investigate the causes of this performance degradation.

Cascaded Bounding Box Regression
We compared the IoU distribution of training samples among different stages in Cascade R-CNN (Figure 1). The training samples were generated by the original Cascade R-CNN with an extended five-stage structure on the training set of DOTA-v1.5. The IoU distribution of the first stage was analyzed on the region proposals, i.e., the output of the Region Proposal Network (RPN). The IoU distributions of the second, third, fourth, and fifth stages were plotted using the output of the bounding box regression of the previous stage. Thus, the IoU distribution of the fifth stage indicates the quality of bounding boxes obtained from the fourth stage, and so on. IoU thresholds of 0.5, 0.6, 0.7, and 0.75 were used for the first to fourth stages, as empirically set in the literature [25]. For the fifth stage, an empirical threshold of 0.85 was chosen. Note that the IoU threshold at the fifth stage does not affect the output of the bounding box regression of the fourth stage and thus has no effect on the precision of bounding box regression from the first to fourth stages.
As shown in Figure 1, from the first to the fifth stage of the detector, the IoU distribution of training samples is improved, with the histogram peak gradually moving from low to high IoU values, and no performance decrease is observed. This suggests that the quality of the input proposals is still being improved even in the fifth stage. In other words, the outputs of bounding box regression from the first, second, third, and fourth stages were all improved over the corresponding training samples. Thus, cascaded bounding box regression, even under the empirical settings of the IoU threshold, is not responsible for the degraded performance of the four-stage model of Cascade R-CNN.
In addition, it is found that the improvement of the IoU distribution is not linear and it slows down with the increase of stages. Thus, the IoU thresholds at each stage should not be increased linearly in the cascaded structure. Further tuning of the thresholds has the potential to further improve the precision of bounding box regression.
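The stage-wise analysis above amounts to histogramming the IoUs between each stage's input proposals and their matched ground truth and tracking the histogram peak. A minimal sketch with synthetic IoU samples (hypothetical stand-ins; the actual distributions come from running the five-stage detector on DOTA-v1.5):

```python
import numpy as np

def iou_stats(sample_ious, bins=10):
    """Summarize an IoU distribution: histogram over [0, 1] and its peak bin."""
    hist, edges = np.histogram(sample_ious, bins=bins, range=(0.0, 1.0))
    peak = int(hist.argmax())
    return hist, (edges[peak], edges[peak + 1])

rng = np.random.default_rng(0)
# Synthetic stand-ins: later stages concentrate at higher IoU, with shrinking gains,
# mimicking the nonlinear, slowing improvement observed in Figure 1.
stage_ious = [np.clip(rng.normal(mu, 0.12, 5000), 0.0, 1.0)
              for mu in (0.55, 0.68, 0.74, 0.76, 0.77)]

for s, ious in enumerate(stage_ious, start=1):
    _, peak_bin = iou_stats(ious)
    print(f"stage {s}: histogram peak in bin {peak_bin}")
```

On such data the peak bin moves upward across stages but nearly stops moving after the fourth stage, which is the behavior motivating the non-linear threshold schedule discussed above.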

Mismatch between RoI Features and the Classifier
In Cascade R-CNN, each classifier is trained using RoI features produced by the same stage. At inference, the RoI features produced by the final stage are sent to all trained classifiers, the outputs of which are then averaged to obtain the final classification results (Figure 2). RoI features provided by the final stage are usually the closest match to the real object, as they are produced using the most accurate bounding box after cascaded regression. However, the most accurate RoIs may not be the best match to the classifiers trained in previous stages.

Figure 2. Workflow of cascaded bounding box regression, with the classification components grayed out. "Conv" denotes the convolution layer; "pool" denotes the pooling layer; "H1", "H2", and "H3" represent the network heads; "B0" denotes region proposals generated by RPN; "B1", "B2", and "B3" represent the bounding boxes in each stage; and "C1", "C2", and "C3" are the classifiers in each stage.
To investigate the impacts of the mismatch, under the same conditions, we compared the detection precision measurements of different combinations of classifiers and RoI features. First, a five-stage example of Cascade R-CNN was built. Each classifier was trained using the RoI features produced by the same stage. Then, the detection precision measurements of different combinations were compared. For instance, the RoI features produced in the fifth stage were sent to the classifier of the first stage, and the precision was calculated using a single pair of 1# stage and 5# RoI only. The same procedure was applied to other combinations, as listed in Table 1. The experiment was carried out using original settings of Cascade R-CNN with ResNet50 backbone. The settings of IoU threshold are the same as Section 4.1, i.e., 0.5, 0.6, 0.7, 0.75, and 0.85 for the first to fifth stages in sequence.
From Table 1, we can see that the combination of 1# classifier and 1# RoI outperforms the pair of 1# classifier and 5# RoI, the pair of 2# classifier and 2# RoI outperforms that of 2# classifier and 5# RoI, and the pair of 3# classifier and 3# RoI outperforms that of 3# classifier and 5# RoI. However, the pair of 4# classifier and 4# RoI achieves similar performance to that of 4# classifier and 5# RoI. The reason could be that the improvement of the IoU distribution is not significant after the fourth stage, which implies that the 5# stage produces RoI features similar to those of 4# RoI. Thus, the 4# classifier exhibits similar performance using either 4# or 5# RoI.
The above findings suggest that the mismatch between the classifier and RoI features is the main cause of performance degradation in Cascade R-CNN when more stages are added. The bounding boxes predicted by the early-stage detection heads are coarse, resulting in a large difference between their RoI features and the target instance features, whereas the last stage's RoI features are much closer to the target instance features. However, as an early stage's classifier was trained using that stage's own RoI features, it is difficult for its head to predict an ideal result from the last stage's RoI features.

Proposed Method: Cascade R-CNN++
In this section, we propose the Cascade R-CNN++ approach by modifying Cascade R-CNN. First, we propose a new ensemble strategy for the classifier and RoI features at inference. Second, we propose an improved loss function for bounding box regression to achieve high sensitivity around zero, which allows further convergence with more stages added. A five-stage example of the proposed Cascade R-CNN++ is shown in Figure 3. RPN is adopted for region proposal generation.

Figure 3. Workflow of cascaded bounding box regression (blue lines) embedded with a modified regression loss function. "Conv" denotes the backbone convolution layer; "pool" is the RoI pooling layer; "H1", "H2", "H3", "H4", and "H5" are the network heads; "B0" denotes region proposals; "B1", "B2", "B3", "B4", and "B5" represent the bounding boxes in each stage; and "C1", "C2", "C3", "C4", and "C5" denote the classifiers in each stage.

New Ensemble Strategy for Classification
As defined in [25], a classifier is denoted as a function h(x_i) that assigns a feature vector x_i to one of m + 1 classes, where class 0 is the background and each of the remaining values represents an object category. The output of the classifier is an (m + 1)-dimensional vector, whose maximum value indicates the category to which the object in the bounding box belongs. The classifier is trained by minimizing the cross-entropy loss R_cls as follows:

R_cls = (1 / N_cls) Σ_i L_cls(h(x_i), y_i),

where i is the index of the feature vector, N_cls is the number of feature vectors, (x_i, y_i) are the training samples, x_i is the feature vector, y_i is the class label, and L_cls is the classical cross-entropy loss function.

As shown in Figure 3a, in the proposed ensemble strategy, instead of using RoI features from the last stage for all classifiers, at inference we use the RoI features produced at the same stage as the input of each classifier. In other words, the same RoI features are used for the same classifier during both training and inference. The classification results of each stage are then integrated by averaging to generate the final classification result.
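The pairing-and-averaging logic can be sketched as follows; the linear classification heads and random RoI features are hypothetical stand-ins for the per-stage network heads, and only the stage-matched pairing followed by averaging reflects the proposed strategy:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_scores(classifiers, stage_roi_feats):
    """Proposed ensemble: the stage-k classifier scores the stage-k RoI features
    (same pairing as in training), then the per-stage posteriors are averaged."""
    probs = [softmax(clf(feat)) for clf, feat in zip(classifiers, stage_roi_feats)]
    return np.mean(probs, axis=0)

# Toy stand-ins: 3 stages, 2 RoIs, 8-dim features, 4 classes (class 0 = background).
rng = np.random.default_rng(1)
weights = [rng.normal(size=(8, 4)) for _ in range(3)]          # per-stage linear heads
classifiers = [lambda x, w=w: x @ w for w in weights]
stage_roi_feats = [rng.normal(size=(2, 8)) for _ in range(3)]  # per-stage RoI features

scores = ensemble_scores(classifiers, stage_roi_feats)
labels = scores.argmax(axis=1)  # final category per RoI
```

The key difference from Cascade R-CNN is that `stage_roi_feats[k]`, not the final stage's features, is fed to `classifiers[k]`, so each classifier sees the same feature distribution at inference as it did during training.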

Modified Loss Function for Bounding Box Regression
At each stage, a bounding box regressor is used to gradually move the candidate proposals closer to the ground-truth position by minimizing the offsets between the real and candidate bounding boxes. An input proposal p can be transformed into a predicted ground-truth box g through the following transformation:

g_x = p_w (t_x / c_x) + p_x,
g_y = p_h (t_y / c_y) + p_y,
g_w = p_w exp(t_w / c_w),
g_h = p_h exp(t_h / c_h),

where p = (p_x, p_y, p_w, p_h) denotes the position of the input proposal and g = (g_x, g_y, g_w, g_h) is the predicted ground-truth box. Δ = (t_x/c_x, t_y/c_y, t_w/c_w, t_h/c_h) represents the distance vector, i.e., the minor adjustments performed by the bounding box regressor. (c_x, c_y, c_w, c_h) are the weights affecting the magnitude of the distance vector; they are initially set to (10, 10, 5, 5) and progressively increase with the number of stages. As the bounding box regressor performs fine-tuning over the offset vector Δ, these values are usually very small; thus, Δ is normalized [25,35,36,41].
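This is the standard R-CNN box decoding; a minimal sketch:

```python
import math

def apply_deltas(proposal, deltas, weights=(10.0, 10.0, 5.0, 5.0)):
    """Decode a regressed box from a proposal (px, py, pw, ph) and weighted
    offsets (tx, ty, tw, th), following the standard R-CNN parameterization."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    cx, cy, cw, ch = weights
    gx = pw * (tx / cx) + px     # shift the center by a fraction of the box size
    gy = ph * (ty / cy) + py
    gw = pw * math.exp(tw / cw)  # scale width and height exponentially
    gh = ph * math.exp(th / ch)
    return gx, gy, gw, gh

# Zero offsets leave the proposal unchanged:
print(apply_deltas((50.0, 50.0, 20.0, 10.0), (0.0, 0.0, 0.0, 0.0)))
# (50.0, 50.0, 20.0, 10.0)
```

Because the center shift is scaled by the proposal size and the size change is exponential, the decoding is invariant to the scale of the proposal, which is what makes the same regressor usable across object scales.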
For an image patch x_j, the loss function of bounding box regression used in Cascade R-CNN [25] can be expressed as follows:

R_loc = (1 / N_loc) Σ_j L_loc(f(x_j, p_j), g_j),

where R_loc is the bounding box regression loss, j is the index of a candidate proposal, N_loc is the number of candidate proposals, and L_loc denotes the smooth L1 function [24]:

L_loc(a, b) = Σ_{k ∈ {x, y, w, h}} smooth_L1(a_k − b_k),

where f(x_j, p_j) is the bounding box regression function, and p_j = (p_x^j, p_y^j, p_w^j, p_h^j) is the jth candidate proposal with four coordinates, i.e., the center position (p_x^j, p_y^j) and the box width and height (p_w^j, p_h^j). g_j represents the predicted ground-truth box and is specified in the same way (g_j = (g_x^j, g_y^j, g_w^j, g_h^j)). Henceforth, unless needed, we omit the superscript j for simplicity.
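A minimal sketch of the smooth L1 localization loss, assuming the commonly used transition point beta = 1:

```python
def smooth_l1(z, beta=1.0):
    """Smooth L1: quadratic near zero, linear in the tails."""
    z = abs(z)
    return 0.5 * z * z / beta if z < beta else z - 0.5 * beta

def l_loc(pred, target):
    """Localization loss summed over the four box coordinates (x, y, w, h)."""
    return sum(smooth_l1(a - b) for a, b in zip(pred, target))

print(l_loc((0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))  # 0.125
```

The quadratic region keeps gradients small for small offsets, while the linear tails bound the gradient magnitude for outliers.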
To enable further convergence in the cascaded regression with more stages, we improve the loss function for the bounding box regression to achieve higher sensitivity around zero. The modified regression loss replaces each weighted offset term t_k / c_k with

sgn(t_k / c_k) |t_k / c_k|^(4/3), k ∈ {x, y, w, h},

where sgn is the signum function; the exponent 4/3 of the terms (t_k/c_k)^(4/3) is designed to increase the nonlinearity and maintain a tradeoff between the sensitivity and the gradient of the loss. The input (t_x, t_y, t_w, t_h) is plotted against the bounding box output offsets (g_x − p_x, g_y − p_y, g_w/p_w, g_h/p_h) for the original and modified loss functions with different weights c_k, k ∈ {x, y, w, h}, for different stages (Figures 4 and 5). The modified loss function has smoother curves around zero, indicating that the regressor has a smaller step size (i.e., higher sensitivity) when the offsets between the candidate bounding box and the ground truth approach zero. This modification enables the further convergence of the cascaded bounding box regressor as the stages increase, as demonstrated in Section 6.
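A sketch of the modified offset term, as reconstructed from the definition above (the signum factor and the exponent 4/3 are as stated in the text; the helper name is ours):

```python
def modified_offset(t, c):
    """Modified regression offset term: sgn(t/c) * |t/c|**(4/3)."""
    z = t / c
    sign = 1.0 if z >= 0 else -1.0
    return sign * abs(z) ** (4.0 / 3.0)

# Near zero the modified term shrinks relative to the original t/c,
# flattening the curve (i.e., raising sensitivity) around zero:
for t in (0.05, 0.2, 1.0):
    print(f"t={t}: original={t:.4f}, modified={modified_offset(t, 1.0):.4f}")
```

For |t/c| < 1 the modified term is strictly smaller in magnitude than t/c, so near-zero offsets produce near-zero targets and correspondingly fine adjustment steps, while at |t/c| = 1 the two coincide.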
The effects of taking different exponent values in the loss function are illustrated in Figure 6. A larger exponent value corresponds to a higher sensitivity of the loss function around zero, but a slower convergence rate. The exponent of 4/3 is a tradeoff between sensitivity and convergence rate, chosen empirically after multiple experiments.

Implementation Details
The experiments were carried out on the DOTA-v1.5 dataset [27], using several popular baseline detectors, including Faster R-CNN [35], FPN [36], and RetinaNet [33], with the ResNet50 backbone [42]. Eight TITAN X Pascal GPUs were used for training and testing. All experiments were implemented on the Detectron codebase [43], which is powered by the Caffe2 deep learning framework. As most of the state-of-the-art object detection methods provided by the Detectron codebase do not predict oriented bounding boxes (OBBs), for ease of comparison with other detectors, the horizontal bounding box (HBB) annotation is used throughout this paper. With reference to the settings of Detectron [43], all images in the dataset were cropped to 600 × 1000 pixels. Stochastic gradient descent with momentum (set to 0.9) was adopted for training. Training started at a learning rate of 0.02, which was decreased to 0.002 and 0.0002 at 120 k and 160 k iterations, respectively, and completed at 180 k iterations. The penalty factor was 0.0001. Each synchronized GPU held one image per iteration. We also warmed up training using a smaller learning rate of 0.005 × 0.3 for the first iteration [44]. We used up to 2000 RoIs for training and 1000 RoIs for testing. The RoI-Align technique [41] was also adopted.
In two-stage detectors, due to the low quality of input region proposals, an IoU threshold of 0.5 is widely used as a standard compromise between effective training and less noisy detection. In Cascade R-CNN, owing to the progressively improved IoU distribution of the input proposals, the IoU threshold was empirically set to 0.6 and 0.7 for the second and third stages, respectively, to achieve higher-quality training. In a cascaded structure, as the number of stages increases, increasing the IoU threshold linearly results in too few positive samples. For Cascade R-CNN++, to ensure effective training at each stage, we propose a method to automatically determine the IoU threshold for every stage from the second onward, according to the IoU distribution of the training samples. The threshold for the first stage remains 0.5, as in two-stage detectors and Cascade R-CNN. To guarantee effective training, at least 20% of the training samples should be positive. Thus, from the second stage onward, the IoU threshold can be automatically determined as the 20% quantile of the IoU distribution. Following this method, the IoU thresholds for a five-stage implementation of Cascade R-CNN++ were determined as U = {0.5, 0.673, 0.723, 0.743, 0.745}. Using the determined new thresholds, the proposed Cascade R-CNN++ was re-trained.
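The quantile-based threshold selection can be sketched as follows; the synthetic IoU distributions are hypothetical stand-ins for the real per-stage training-sample IoUs (from which the paper derives U = {0.5, 0.673, 0.723, 0.743, 0.745} on DOTA-v1.5):

```python
import numpy as np

def next_iou_threshold(sample_ious, quantile=0.2):
    """IoU threshold for a stage: the 20% quantile of its input IoU distribution,
    so at least ~20% of the training samples remain positive."""
    return float(np.quantile(sample_ious, quantile))

def determine_thresholds(stage_iou_samples, first=0.5):
    """Thresholds for all stages: 0.5 for the first, quantile-based afterwards."""
    return [first] + [next_iou_threshold(s) for s in stage_iou_samples]

# Synthetic stand-in distributions for the inputs of stages 2-5:
rng = np.random.default_rng(2)
samples = [np.clip(rng.normal(mu, 0.1, 10000), 0.0, 1.0)
           for mu in (0.75, 0.81, 0.83, 0.83)]
print(determine_thresholds(samples))
```

The same quantile also gives a stopping rule: when a candidate stage's 20% quantile no longer improves on the previous stage's threshold, adding that stage is not worthwhile, which is how the sixth stage is rejected below.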
In addition, we decide whether to add one more cascade stage by evaluating the IoU distribution of the training samples. A new stage detector will not be added if the IoU distribution is not improved compared with the previous stage. For example, in this study, the 20% quantile of IoU of the sixth stage is 0.745, indicating that there is no need to add the sixth stage.

Stage-Wise Comparison
On the DOTA dataset, we compared the detection performance of three-, four-, and five-stage Cascade R-CNN++, all with the ResNet-50 backbone. The results are shown in Table 2, where AP indicates the average precision, and AP50, AP75, and AP90 indicate the detection precision at IoU thresholds of 0.5, 0.75, and 0.9, respectively. APS, APM, and APL indicate the detection precision for small, medium, and large objects, respectively. As shown in Table 2, the overall detection precision improves as stages are added, with no significant reduction in inference speed as measured in frames per second (FPS).

Ablation Experiments of the Proposed Modifications
The detection performance of Cascade R-CNN++ was further analyzed through ablation experiments using the ResNet-50 backbone. The results are shown in Table 3. The cascade detectors with both the new ensemble strategy and the modified loss function for bounding box regression achieved the best precision, with a significant improvement in AP90.

Table 3. Ablation experiments for a five-stage Cascade R-CNN++ on the DOTA-v1.5 dataset, all with the ResNet-50 backbone. "Ens" indicates using the new ensemble strategy for classification at inference; "Reg" indicates using the modified loss function for bounding box regression at both training and inference.
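As a sketch of the "Ens" setting, the ensemble strategy can be read as averaging the per-stage classifier outputs, where each stage scores the RoI features produced at that same stage rather than re-using the final stage's features. The function below is an illustrative simplification, not the paper's implementation:

```python
import numpy as np

def ensemble_scores(per_stage_scores):
    """Average the classification scores from the cascade stages.
    per_stage_scores: list of (num_rois, num_classes) arrays, one per
    stage, each computed from that stage's own RoI features."""
    return np.mean(np.stack(per_stage_scores, axis=0), axis=0)
```

Averaging keeps each classifier paired with the features it was trained on, which is the mismatch the ensemble strategy is designed to avoid.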

Comparison with State-of-the-Art Detectors
The performance of the proposed Cascade R-CNN++ was compared with that of state-of-the-art two-stage/multi-stage detectors on the DOTA-v1.5 dataset, as detailed in Table 4.

Table 4. Comparison with popular baseline detectors on the DOTA-v1.5 dataset, all with the ResNet-50 backbone. Entries denoted by * used enhancements including multi-scale training/inference and Soft-NMS, as in [43,45].

As shown in Table 4, the proposed Cascade R-CNN++ achieved higher precision than Faster R-CNN, FPN, RetinaNet, and the original Cascade R-CNN on all metrics. The most significant improvement is found on AP90, followed by the detection precision on medium and small objects (i.e., APM and APS). These results show that the proposed method is effective and outperforms state-of-the-art detectors, especially in high-quality detection. We also implemented Cascade R-CNN++ with multi-scale training/inference and Soft-NMS, denoted by Cascade R-CNN++ * in Table 4. With these enhancements, Cascade R-CNN++ surpassed Cascade R-CNN by 4.6 points. It is worth noting that other recent studies explored improvements from different perspectives (e.g., feature pyramids), suggesting that the cumulative effect of these enhancements can further improve the performance of multi-stage detectors in remote sensing object detection, instance segmentation, key point detection, etc.

A comparison was also conducted on another remote sensing dataset, NWPU VHR-10 (Table 5). Training started at a learning rate of 0.01, which was decreased to 0.001 and 0.0001 at 30 k and 40 k iterations, respectively, and completed at 50 k iterations. The other implementation settings were the same as in the experiments on DOTA-v1.5. From Table 5, we can see that the proposed Cascade R-CNN++ yielded the best performance on the NWPU VHR-10 dataset, especially in high-quality detection as revealed by AP75 and AP90. The detection precision on small objects achieved a 4.8% improvement over the original Cascade R-CNN.
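The schedule above is a plain step decay; a minimal sketch, with the milestone values taken from the text and the function name our own:

```python
def step_lr(iteration, base_lr=0.01, milestones=(30_000, 40_000), gamma=0.1):
    """Step learning-rate decay: start at 0.01 and multiply by 0.1 at
    30 k and again at 40 k iterations, as in the NWPU VHR-10 setup."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

In a framework such as PyTorch, the equivalent behavior is usually obtained with a multi-step scheduler rather than a hand-rolled function.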

Model Transferability on Multiresolution Remote Sensing Images
Further analysis was conducted to compare the transferability of the proposed model and the original Cascade R-CNN across multiresolution remote sensing images. Images in the DOTA-v1.5 dataset were upscaled by different factors (e.g., two, three, and four). The detection model trained on original-resolution images was directly used for inference, simulating the common case in which detectors trained with limited data variability are employed to detect objects in images of different resolutions. As previously described, the DOTA-v1.5 dataset contains 16 categories of objects, such as airplane, ship, storage tank, large vehicle, and small vehicle. Here we take the detection of airplanes and harbors as examples. The performance achieved on object detection across multiresolution images is shown in Figure 7 (single object) and Figure 8 (multiple objects).
As can be seen from Figure 7, in single and large object detection, the IoUs obtained by Cascade R-CNN++ are slightly better than those of Cascade R-CNN. In multi-object detection (Figure 8), the performance of Cascade R-CNN decreases rapidly as the upscale factor increases, whereas Cascade R-CNN++ exhibits much better transferability across different resolutions. In particular, for small object detection, as indicated by the white ellipse in Figure 8c,g, when images were upscaled by a factor of three, Cascade R-CNN++ could detect most of the small objects, whereas Cascade R-CNN missed most of them. When images were upscaled by a factor of four, as in Figure 8d,h, both models missed a number of small objects; in the detection of airplanes (Figure 8d), Cascade R-CNN missed almost all small airplanes, whereas Cascade R-CNN++ could still detect several.
We conducted the above experiments on all 16 categories of objects on all test images in DOTA-v1.5, using upscale ratios of 1, 3/2, 2, 5/2, 3, 24/7, and 4. Boxplots of the IoU obtained after bounding box regression are shown in Figure 9. An upscale ratio of 3/2, for example, means that images were upscaled by a factor of three and then downscaled by a factor of two; the other fractional ratios are defined analogously.
From Figure 9, we can see that for remote sensing images upscaled by different ratios, Cascade R-CNN++ yielded higher IoU than Cascade R-CNN, and the improvement in the IoU distribution became more significant as the upscale ratio increased. These results indicate that Cascade R-CNN++ achieves higher detection quality than Cascade R-CNN on multiresolution remote sensing images.
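The fractional rescaling used in these experiments can be sketched as follows. This nearest-neighbour version is only an illustration (a bilinear or cubic kernel would normally be used in practice), and the function name is our own:

```python
import numpy as np

def rescale_nearest(img, up, down):
    """Rescale a 2-D image by the fractional ratio up/down using
    nearest-neighbour sampling; e.g. up=3, down=2 gives the 3/2 ratio
    (upscale by three, then downscale by two)."""
    h, w = img.shape[:2]
    nh, nw = (h * up) // down, (w * up) // down
    rows = (np.arange(nh) * down) // up  # source row for each target row
    cols = (np.arange(nw) * down) // up  # source column for each target column
    return img[rows[:, None], cols]
```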

Discussion
In this section, we discuss the impacts of IoU thresholds on the detection performance of cascaded detectors.
The majority of region proposals produced by an RPN or selective search have low quality, with a distribution concentrated around low IoU values. A high threshold leads to too few positive samples, resulting in model overfitting, whereas a low threshold produces noisy detections. An IoU threshold of 0.5 is the standard compromise widely used in object detection models, and this value is used as the first-stage threshold in both Cascade R-CNN and the proposed Cascade R-CNN++.
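This trade-off can be made concrete by counting how many proposals survive a given threshold; a toy illustration with hypothetical names:

```python
def positive_fraction(ious, thr):
    """Fraction of proposals labeled positive (IoU >= thr): a high
    threshold starves training of positives, a low one admits noise."""
    return sum(1 for v in ious if v >= thr) / len(ious)
```

For a proposal set concentrated around low IoUs, raising the threshold from 0.5 to 0.7 can cut the positive fraction severely, which is exactly why the cascade raises the threshold only as the proposal distribution improves.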
In the original Cascade R-CNN, the IoU thresholds for the second and third stages were empirically set to 0.6 and 0.7, increasing linearly with a step size of 0.1. In Section 4.1, it was found that as the number of cascade stages increases, the improvement in the IoU distribution of the training samples becomes smaller. Thus, for the five-stage model in Section 4, the IoU thresholds for the fourth and fifth stages were empirically set to 0.75 and 0.85. We compared the detection performance of the proposed Cascade R-CNN++ structure under different IoU threshold settings, with results shown in Table 6, where the "empirical thresholds" are U = {0.5, 0.6, 0.7, 0.75, 0.85} and the "auto-determined thresholds" are U = {0.5, 0.673, 0.723, 0.743, 0.745}, i.e., the thresholds automatically determined by the 20% quantile of the IoU distribution at each stage except the first, as described in Section 6.1. Table 6 shows that, under the same conditions, the auto-determined thresholds achieved better overall performance. A likely reason is that thresholds estimated from the IoU distribution better match the quality of the training samples and thus achieve a better balance between effective training and high-quality detection.

Conclusions
In this study, we proposed Cascade R-CNN++, an improved cascade structure for high-quality object detection across multiresolution remote sensing images. The new model overcomes the extension problem of the original Cascade R-CNN by employing a new ensemble strategy for classification at inference, which eliminates the mismatch between the classifier and the RoI features. Further, we modified the loss function of bounding box regression to achieve higher sensitivity around zero, which allows further convergence as the number of cascade stages increases. The effectiveness of the proposed method was verified on the DOTA-v1.5 and NWPU VHR-10 datasets. Cascade R-CNN++ achieved higher precision as the number of stages increased, with significant improvements in high-quality detection (e.g., AP90). We conducted a further analysis of detection quality to verify model transferability across multiresolution remote sensing images. Compared with Cascade R-CNN, the proposed Cascade R-CNN++ achieved higher IoU values in the detection of different categories of objects across multiresolution images, and this advantage becomes more significant as the image resolution decreases.
Owing to the limited variability of remote sensing training datasets, the transferability of a deep learning model across multiresolution imagery is essential for remote sensing object detection; "train once, apply at multiple scales" is the ultimate goal. The cascade structure and loss function presented in this paper help the model improve its transferability across multiresolution images. They are independent components that can also be applied to other multistage models. In the future, we will explore the use of the cascaded structure in other tasks, such as instance segmentation and key point detection.