MsRi-CCF: Multi-Scale and Rotation-Insensitive Convolutional Channel Features for Geospatial Object Detection

Geospatial object detection is a fundamental but challenging problem in the remote sensing community. Although deep learning has shown its power in extracting discriminative features, there is still room for improvement in its detection performance, particularly for objects with large variations in scale and direction. To this end, a novel approach, entitled multi-scale and rotation-insensitive convolutional channel features (MsRi-CCF), is proposed for geospatial object detection by integrating robust low-level feature generation, classifier generation with outlier removal, and detection with a power law. The low-level feature generation step consists of rotation-insensitive and multi-scale convolutional channel features, which were obtained by learning a regularized convolutional neural network (CNN) and by integrating multi-scale convolutional feature maps followed by the fine-tuning of high-level connections in the CNN, respectively. These features were then fed into AdaBoost (chosen for its lower computation and storage costs) with outlier removal to construct an object detection framework that facilitates robust classifier training. In the test phase, we adopted a log-space sampling approach instead of fine-scale sampling by using the fast feature pyramid strategy based on a computable power law. Extensive experimental results demonstrate that, compared with several state-of-the-art baselines, the proposed MsRi-CCF approach yields better detection results, with 90.19% precision on the satellite dataset and 81.44% average precision on the NWPU VHR-10 dataset. Importantly, MsRi-CCF incurs no additional computational cost, requiring only 0.92 s and 0.7 s per test image on the two datasets. Furthermore, we determined that most previous methods fail to achieve an acceptable detection performance when they face obstacles such as object deformations (e.g., rotation, illumination, and scaling). These factors are effectively addressed by MsRi-CCF, yielding a robust geospatial object detection method.


Introduction
In recent years, the successful launch of optical broadband (multispectral) and very high resolution (VHR) RGB satellites has made spaceborne remote sensing images available on a large and even global scale. This has attracted increasing interest in the analysis and interpretation of optical remote sensing images (RSIs), including activities such as classification and recognition [1][2][3], object detection and tracking [4,5], and spectral unmixing [6,7]. In particular, geospatial object detection [8][9][10][11][12] has gained considerable attention, owing to its important applications in hazard response, urban monitoring, and management. In [8], Cheng et al. roughly categorized geospatial object detection approaches into template matching-, knowledge-, object-, and learning-based methods. Notably, the objects in remote sensing datasets inevitably suffer from complex image deformations (e.g., multi-resolution, illumination, direction variation, and occlusion). By ignoring the embedding of local and global information, these methods fail to obtain highly distinguishable semantic information. This poses a major challenge for extracting discriminative and generalized features.
With a powerful learning ability, deep learning-based techniques [13][14][15] have been widely applied to geospatial object detection. Deep neural networks (DNNs) have been proven to be effective for extracting hierarchical feature representations (from low-level to high-level) [16][17][18]. Nevertheless, their limited receptive fields for multi-resolution images and their sensitivity to rotation prevent these networks from performing better [19][20][21][22]. Therefore, the elaborate design of features that are robust to scaling and rotation plays a critical role in the detection task. Recently, some advanced methods [19][23][24][25] have been proposed accordingly, but each may be effective for only one of the issues mentioned above. For instance, a deep adaptive proposal network [19] was established by jointly considering low-level and high-level outputs to enhance feature representation. Cheng et al. [23] learned a new rotation-invariant layer on the basis of existing convolutional neural network (CNN) architectures with a new loss function. Chen et al. [24,25] presented a hybrid CNN to extract multi-scale features for vehicle detection in satellite images. To date, CNNs (and DNNs in general) continue to deepen, increasing from the 8 layers of AlexNet [26] to the 152 layers of ResNets [27] within 3 years. Although these state-of-the-art deep networks have achieved competitive detection results by utilizing a variety of feature maps from the original input to the output of the soft-max layer, they incur substantial additional time and space costs. Therefore, it is important to develop a relatively lightweight network architecture with scale and direction robustness for geospatial object detection.
AdaBoost [28,29], a typical boosting algorithm, iteratively selects weak classifiers (e.g., binary decision trees) from a pool of candidates and targets the hard examples from the previous round. Compared with the end-to-end CNN method, it has lower computation and storage costs. In this study, we designed an object detection framework with low-level multi-scale and direction-insensitive feature representations for optical RSIs to address the tedious fine-tuning of high-level connections in CNNs during adaptation to various classification/regression problems. To the best of our knowledge, this is the first work of this kind to combine the convolutional channel feature (CCF) [30] with AdaBoost and a CNN for applications to various detection tasks (e.g., pedestrian, face, and edge detection). There are other extended algorithms based on using CNNs as weak classifiers [31] or weighting the input samples in order to optimally perform CNN learning [32]. However, object detection based on these methods is mostly conducted on street view images, and their use on remote sensing imagery is less investigated. Therefore, such approaches usually fail to work well when applied to geospatial object detection because of the complex nature of geospatial data, including variations in scale and direction.
To this end, we propose a novel geospatial object detection framework using multi-scale and rotation-insensitive convolutional channel features (MsRi-CCF), as illustrated in Figure 1. Diverging from the CCF, we started with rotation-insensitive feature learning to alleviate the performance degradation due to large-scale object rotation. To locate and recognize differently sized objects more effectively, we then modeled the multi-scale feature representation by integrating multi-resolution convolutional maps. Prior to feeding these features into the AdaBoost classifier, the outlier removal method was used to select high-quality samples for training. Such a strategy can effectively correct the bias and variance of the trained classifier caused by the outliers, yielding a more robust detector. In the test phase, a fast feature pyramid was embedded to achieve fast yet approximately lossless finely sampled feature extraction. More specifically, the main highlights of our work can be summarized as follows:
• Proposal of a geospatial object detection framework by jointly investigating robust low-level feature generation, classifier generation with outlier removal, and detection with a power law, which can simultaneously cope with large ranges of scale, directional variation, and interference from pseudo-label samples;
• Generation of robust low-level feature maps based on adding two modules to the original CNN, namely, the rotation-insensitive descriptor and the multi-scale convolutional channel feature. We implemented these modules by adding the regularization constraint to the objective function of the network model. These features were generated in an extended and complementary way to ensure the integrity of the information;
• Introduction of the Gamma Mixture Model (GaMM) outlier removal method, which suppresses the influence of outliers on the exponential loss function of AdaBoost and minimizes the classification error caused by pseudo-label samples, among other factors.
The remainder of this paper is organized as follows. In Section 2, we introduce the proposed MsRi-CCF framework. Experimental results on a satellite dataset and NWPU VHR-10 are presented in Section 3. Finally, Section 4 concludes our work and briefly discusses possible future works.

Methodology
The MsRi-CCF object detection framework for optical RSIs consists of three phases: robust low-level (shallow) feature generation, classifier generation with outlier removal, and detection with a power law. The architecture of the proposed MsRi-CCF framework is illustrated in Figure 1. Due to the limited size of the training sets, we rotated, flipped, and rescaled the training samples and processed their hue and saturation in advance, and we then fed them to the revised CNN for automatic feature learning. The specific workflow is summarized in Algorithm 1.

Algorithm 1 MsRi-CCF Detector
Input: The set of training samples for the current class, $D = \{(x_1, y_1), (x_2, y_2), \ldots\}$

For robust low-level feature generation, two submodules, namely, the rotation-insensitive descriptor and the multi-scale aggregated descriptor, were designed and linked to the original VGG-16 network to cope with direction variation and a large scale range, which are usually caused by different shapes and structures. (To maintain consistency and comparability, we started with the same VGG-16 architecture as the CCF for the feature extractor. Furthermore, we aimed to improve the robustness to scaling and rotation rather than aggressively pursuing performance gain. ResNet has proven effective for reducing the training error of very deep networks; for not-so-deep networks, plain networks and ResNet should not differ greatly. As a trade-off, the VGG-16 network was applied in our case.) The detailed framework of the module, illustrated in Figures 2 and 3, allows for increasing the step size of the direction rotation and the depth of the network to a certain extent, thereby improving the distinguishability of the features and the generalization performance of the framework. In detail, the regularized constraint term, inspired by Reference [23], was embedded in the objective function of the network model to realize the rotation-insensitive (RI) property. Feature maps in multiple medial layers (low-level feature maps) were fed into AdaBoost for multi-scale object detection. Compared with the original CNN feature maps, low-level feature maps are neither abstract nor sensitive to edge information.

[Figure panels: Original image; After rotation]

More specifically, the features in the shallow layer were applied to small-scale objects while deeper features were used for large-scale ones.
In the next step, considering that the loss function of the boosting decision tree is an exponential loss function, we adopted a probabilistic outlier model that is tightly integrated into the learning algorithm in order to minimize the error caused by manually annotated labels, among other factors, as shown by the yellow line in Figure 1. Lastly, in the detection phase, given a new test image, a fast feature pyramid generated by a power law was used to extract the low-level feature maps, and each sliding window was classified to generate its class and bounding box. It is worth mentioning that the power law [33] on the scale was used to accelerate the feature pyramid generation; the details are introduced in Section 2.3.

Rotation-Insensitive Feature Representation
CNNs are sensitive to direction variations when attempting to recognize the objects of interest. Following the architecture of [23], we propose to augment the data by rotating the training samples with multiple rotation angles and by horizontally flipping them to obtain mirror images. Then, we embedded a regularized constraint term into the objective function of the network model, which explicitly forces the feature representations of the training samples before and after rotation to map closely to each other, making the learned features rotation-insensitive. Figure 2 illustrates the architecture for extracting rotation-insensitive convolutional channel features. The resulting regularized term is given by Equation (1), where $X$ denotes the samples before rotation, $g_\phi X$ denotes the samples after rotation, and $N$ is the total number of initial training samples in $X$. (In theory, more rotation angles should provide a better result. We found, however, that this could weigh the network down and degrade the performance. In our case, we empirically determined the rotation angles to range from 0° to 180° at a 45° interval.) $F_a(x_i)$ represents the feature maps of the specific layer; $F_a(g_\phi x_i)$ represents the average feature maps on this layer for the $K$ directional samples attached to each sample, and it is defined by Equation (2), where $K$ is the total number of rotation transformations for each $x_i \in X$.
Clearly, the specific feature maps can be approximated as rotation-insensitive feature maps when Equation (1) takes its minimum value. To this end, a new loss function with a regularization constraint term is defined by Equation (3). Note that we denote the network weight here as $^{\mathrm{net}}W_I$ so as to distinguish it from the weight of AdaBoost. In Equation (3), $\phi$ and $\theta$ denote a subclassifier and its weight; one further term represents the output of the strong classifier from the previous iteration; $x_i$ is the $i$-th sample; and $y_i$ is the ground truth of the $i$-th sample. The objective function defined by Equation (3) minimizes the detection loss, which includes the classification loss (the first term of Equation (3)) and the loss of automatic feature generation (the second term of Equation (3)). In this paper, we solve this optimization problem using the stochastic gradient descent (SGD) method [34], which has been widely used in complicated optimization problems such as neural network training.
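As a rough illustration of the regularization term described above, the sketch below computes the squared distance between each sample's feature vector and the average feature vector of its $K$ rotated copies, averaged over the $N$ samples. The function name `rotation_regularizer`, the flattened array shapes, and the $1/(2N)$ normalization are assumptions made for this example; the normalization follows a common convention for such regularizers and is not a value confirmed by the text.

```python
import numpy as np

def rotation_regularizer(feats, rotated_feats):
    """Hypothetical helper: rotation-insensitivity penalty.

    feats:         array of shape (N, D), features of the N original samples.
    rotated_feats: array of shape (N, K, D), features of the K rotated
                   copies of each sample.
    Returns sum_i ||F(x_i) - mean_k F(g_k x_i)||^2 / (2N); the 1/(2N)
    factor is an assumed normalization.
    """
    feats = np.asarray(feats, dtype=float)
    rotated_feats = np.asarray(rotated_feats, dtype=float)
    n = feats.shape[0]
    # Average the features over the K rotations of each sample.
    mean_rotated = rotated_feats.mean(axis=1)
    diff = feats - mean_rotated
    return float(np.sum(diff ** 2) / (2.0 * n))
```

The penalty is zero when the features of every rotated copy coincide with those of the original sample, which is the behavior the regularizer is meant to encourage.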

Multi-Scale Convolutional Channel Feature
The objects in optical RSIs have different sizes, and within-class object sizes differ greatly since the images are taken from a bird's-eye view. A good descriptor should be able to tolerate variations in object size. For feature extraction using the classic CNN, all objects have a single receptive field on a particular layer, which results in incomplete feature representation of multi-scale objects and reduces the generalization ability of the network. To this end, a multi-scale convolutional channel feature was designed by using low-level feature maps to detect small objects and high-level feature maps to detect large objects. In line with the requirements of optical RSIs, we redesigned the parameters and layers of the network after fine-tuning. Compared with the original deep CNNs, this removes at least half of the parameters. Figure 3 shows the detailed architecture of the multi-scale convolutional channel feature.

Classifier Generation with Outlier Removal
Traditional end-to-end CNNs require the manual design of many hyperparameters, e.g., the convolution kernel size and the depth and width of the convolutional network, which is time-consuming and laborious. Even with a pretrained network, some of the hyperparameters still need to be tuned or redesigned according to the requirements of new training samples. To address the tedious fine-tuning of high-level connections in CNNs during adaptation to various classification/regression problems, we adopted an AdaBoost method to classify and locate the low-level feature map representations. AdaBoost [28,29], a typical boosting algorithm, iteratively selects weak learners (binary decision trees) from a pool of weak candidates and targets the hard examples from the previous round. Compared with an end-to-end CNN trained with SGD, the binary decision tree has lower computation and storage costs. The number of its hyperparameters, e.g., max_depth and class_weight, is much smaller than that of a CNN. Using optimized code, training a decision stump (depth = 1) on about 4608 features (3 × 3 × 512) with 10,000 iterations requires about 70 ms on a single-core computer and only about 7 ms on a 12-core computer, with no need for a graphics processing unit (GPU). It is worth noting that the boosting algorithm performs poorly on classification tasks with outliers, which constrains the generalization error of the classifiers. The best paper [35] at the 5th International Conference on Learning Representations (ICLR 2017) demonstrated that a powerful deep model can easily fit completely random pixels (for example, Gaussian noise) with almost zero training error. However, as the noise level increases, the testing error of the classifier can severely deteriorate. To this end, we introduced a probabilistic outlier model for the weights $\omega$ using the Gamma Mixture Model (GaMM) [36][37][38], with two mixture components representing inliers and outliers, respectively, i.e., $p(\omega \mid \text{outlier} \cup \text{inlier}, \Theta)$, where $\Theta$ denotes the parameter set $\{p_l, \alpha_l, \beta_l;\ l = 1, 2\}$ and $p_1 + p_2 = 1$. The Expectation-Maximization (EM) algorithm [36,37] was employed to estimate the parameter set $\Theta$ of the GaMM (please refer to [38,39] for more details). The distributions of the inlier and outlier samples can then be estimated to calculate their posterior probabilities. Theoretically, samples with relatively large losses can be considered outliers, which is referred to as $p(l = 1)$. Its posterior probability is calculated from the Bayesian posterior.

Figure 4 shows the details of our classifier generation. Outlier removal is performed after the sample weight update, and the number of iterations is set according to the actual requirements. It is tightly integrated into the AdaBoost algorithm and can fundamentally correct the bias and variance of the trained classifier caused by the outliers.
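To make the posterior step concrete, the following minimal sketch assumes the standard two-component mixture form $p(\omega \mid \Theta) = \sum_{l=1}^{2} p_l\, \mathcal{G}(\omega \mid \alpha_l, \beta_l)$ with a shape/rate parameterization of the Gamma density, and applies Bayes' rule to obtain the probability that a sample weight belongs to the outlier component ($l = 1$). The function names, the rate parameterization, the 0.5 decision threshold, and the parameter values in the usage note are all illustrative assumptions; in the method described above the parameters would come from EM, which is not shown here.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta (assumed parameterization)."""
    if x <= 0:
        return 0.0
    log_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
               + (alpha - 1.0) * math.log(x) - beta * x)
    return math.exp(log_pdf)

def outlier_posterior(w, p, alpha, beta):
    """Posterior probability that weight w belongs to component l = 1 (outlier).

    p, alpha, beta are length-2 sequences; index 0 holds the outlier
    component (l = 1) and index 1 the inlier component (l = 2).
    """
    joint = [p[l] * gamma_pdf(w, alpha[l], beta[l]) for l in range(2)]
    total = joint[0] + joint[1]
    return joint[0] / total if total > 0 else 0.0

def keep_inliers(weights, p, alpha, beta, threshold=0.5):
    """Indices of samples whose outlier posterior stays below the threshold
    (the threshold value is a hypothetical choice)."""
    return [i for i, w in enumerate(weights)
            if outlier_posterior(w, p, alpha, beta) < threshold]
```

For example, with hypothetical parameters `p = (0.05, 0.95)`, `alpha = (9.0, 2.0)`, and `beta = (1.0, 2.0)`, a large weight such as 10.0 is assigned almost entirely to the outlier component, whereas weights near 1.0 are kept as inliers.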

Detection with Power Law
Traditional detection is performed by sliding windows over a finely sampled image pyramid. It has higher accuracy but often suffers from expensive computation. CNN-based object detection approaches instead operate on object proposals, which improves speed and performance by preselecting a small number of candidate samples. Weighing the pros and cons of these two approaches, and inspired by Reference [40], we adopted sliding windows with a power law [33] to accelerate the generation of fast feature pyramids. The power law is expressed by Equation (7), where $F$ is the convolutional channel feature of the input image, $R(F, s)$ is the feature $F$ resampled by $s$, and $\kappa$ is a scaling factor to be estimated. Using Equation (7), we can quickly obtain the feature pyramid with the $\kappa$ calculated in the training phase, and the obtained feature maps are subjected to object detection using the sliding windows. MsRi-CCF detects objects on three different scales, as illustrated in Figure 3. More specifically, we set the sizes of the sliding windows to 3 × 3, 6 × 3, and 3 × 6.
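The idea behind the power law can be sketched as follows, assuming the form used in fast feature pyramids [40], $F_s \approx R(F, s) \cdot s^{-\kappa}$: instead of recomputing the features on a rescaled image, an existing feature map is resampled and multiplied by a scale-dependent factor. The function names, the nearest-neighbor resampling, and the sign convention of the exponent are assumptions made for this example rather than details confirmed by the text.

```python
import numpy as np

def resample(feat, s):
    """Nearest-neighbor resampling of an (H, W, C) feature map by factor s."""
    h, w = feat.shape[:2]
    new_h = max(1, int(round(h * s)))
    new_w = max(1, int(round(w * s)))
    # Map every output row/column back to its source index.
    rows = np.minimum((np.arange(new_h) / s).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / s).astype(int), w - 1)
    return feat[rows][:, cols]

def approx_feature_at_scale(feat, s, kappa):
    """Approximate the feature map at relative scale s via the power law
    F_s ~= R(F, s) * s**(-kappa) (assumed form)."""
    return resample(feat, s) * s ** (-kappa)
```

In such a scheme, features would typically be computed exactly at only a few scales (for instance, once per octave), and the intermediate scales would be filled in with `approx_feature_at_scale`.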
Please note that the parameter settings, e.g., the number of scales and the sizes of the sliding windows, are determined by minimizing the performance loss on the validation set. Since the sliding windows are applied to the shallow-layer feature maps, the amount of calculation is greatly reduced. The final detection result is obtained after non-maximum suppression and thresholding.
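The final post-processing step can be illustrated with a minimal greedy non-maximum suppression sketch. Boxes are assumed to be given as `(x1, y1, x2, y2)` tuples, and the IoU and score thresholds are hypothetical defaults; the text does not state the values actually used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5, score_thr=0.0):
    """Greedy NMS: drop detections below score_thr, then keep the
    highest-scoring box and suppress boxes overlapping it by more than iou_thr."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

The returned list contains the indices of the surviving detections in descending score order.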

A Quick Look at Illustrative Examples
Figure 5 illustrates some representative examples to clarify the effectiveness and superiority of MsRi-CCF under three different conditions. The first and second rows show the detection results with multi-scale and rotated objects, respectively. The original CCF produces not only a false positive (in blue) but also a false negative (in red), leading to a relatively poor detection performance. YOLO2 outperforms the CCF in the multi-scale case, although some objects are still missing. Unfortunately, both CCF and YOLO2 fail to effectively detect the rotated objects. Compared with these two methods, the proposed MsRi-CCF is clearly better able to handle multi-scale and rotated objects. Complex scenes are prone to generate clusters of false positives and false negatives, as shown in the last row for the CCF and YOLO2 (see Figure 5), while MsRi-CCF benefits from outlier removal, reducing false retrievals by a large margin.
(1) NWPU VHR-10 dataset: This dataset is a very high resolution (VHR) optical remote sensing image dataset. It consists of two acquisition modes: color images with a spatial resolution of 0.5-2 m obtained from Google Earth and infrared images with a spatial resolution of 0.08 m obtained from the Vaihingen dataset (the Vaihingen data were provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF)); please refer to [8,41]. In this dataset, there are 650 images with 10 object classes, namely, baseball diamond, ground track field, basketball court, airplane, ship, storage tank, tennis court, harbor, bridge, and vehicle. Table 1 shows the size of each object class.
(2) Satellite dataset: This is a small dataset of optical remotely sensed vehicles used by Heitz et al. at ECCV 2008 [42]. The dataset was acquired from Google Earth. Each image is a color image of 792 × 636 pixels, and the dataset contains 1319 manually labeled vehicle objects with an average size of 45 × 45. The vehicle objects have a large direction variation and a small range of scale. Note that the presence of obstructions and the low resolution increase the difficulty of vehicle detection.

Experimental Setup
Due to the limited number of training samples, data augmentation is a feasible solution for effective network training. For the two datasets, rotation and mirror operations were performed to enlarge the training set. More specifically, we rotated the training images by angles ranging from 0° to 180° at a 45° interval. We also converted the training images to the HSV (hue, saturation, value) color space as a preprocessing step to improve the robustness to illumination and atmospheric effects. The negative images were randomly selected from the set of images without a detected object in the current class. In our work, 60% of the samples were assigned to the training set and the rest composed the test set. To stably evaluate the performance of the proposed method, we conducted five-fold cross-validation and report the average result across the folds below.
In addition, all the experiments were implemented using the TensorFlow framework and carried out on a PC with a single Intel Core i7 CPU, an NVIDIA GTX-1070 GPU (4 GB memory), and 32 GB RAM. The PC operating system is Ubuntu 15.04.
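The rotation and mirror augmentation described above can be sketched in a few lines. This is only an illustration: the nearest-neighbor rotation, the zero filling of pixels that fall outside the image, and the choice to mirror every rotated copy are assumptions, and the helper names are hypothetical.

```python
import numpy as np

# Rotation angles from 0 to 180 degrees at a 45-degree interval.
ANGLES_DEG = list(range(0, 181, 45))

def rotate_nearest(img, angle_deg):
    """Rotate an (H, W) or (H, W, C) image about its center using
    nearest-neighbor sampling; pixels without a source are set to zero."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: find the source pixel for every output pixel.
    src_x = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    src_y = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def augment(img):
    """Return every rotated copy of img together with its horizontal mirror."""
    samples = []
    for angle in ANGLES_DEG:
        rotated = rotate_nearest(img, angle)
        samples.append(rotated)
        samples.append(np.fliplr(rotated))
    return samples
```

With five angles and one mirror per angle, `augment` produces ten samples per input image; bounding-box annotations would need to be transformed in the same way, which is omitted here.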

Evaluation Criteria
Following common practice in object detection evaluation, the precision-recall curve (PRC) and average precision (AP) were adopted to quantitatively evaluate the detection performance. More precisely, when the intersection-over-union (IoU) overlap between a detected bounding box and the ground truth exceeds 50%, the detection is counted as a true positive (TP). If multiple detections overlap with the same ground truth, only the one with the highest overlap is counted as the predicted result. Detections that do not satisfy these conditions are false positives (FP), and ground truth objects without a matching detection are false negatives (FN). The final precision (P) is therefore computed as $P = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$, and the recall (R) as $R = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$. AP is a global indicator for assessing the performance of a method. Moreover, we evaluated the detection performance of the proposed MsRi-CCF in comparison with seven state-of-the-art baselines.
• The collection of part detectors (COPD) [41] is composed of a set of representative and discriminative linear support vector machine (SVM) classifier part detectors. In our experiments, we adopted the original setting for fair comparison.
• The Exemplar-SVM detector [43] adopts template integration instead of a single template to realize object detection. In our experiments, we used a sizing heuristic method for each sample to create an 8-pixel-sized descriptor based on its ground truth bounding box.
• The fast feature pyramid [40] is a fast object detection framework which estimates features at a coarsely sampled set of scales. In our experiments, this applies to all three channel features, namely, color, gradient magnitude, and gradient orientation.
• The convolutional channel feature (CCF) [30] is a lightweight model with deep representations. In our experiments, we used a VGG-16 model as the feature extractor.
• Bag of visual words and SVM classifier (BOW-SVM) [44] is a simplified representation achieved by transforming the text into a "bag of words". In our experiments, we still represented each image block as a histogram with a similar visual vocabulary generated by a k-means algorithm.
• You only look once (YOLO1) [45] performs the object detection task, which consists of determining the location on the image where certain objects are present, as well as classifying those objects, with a single feed-forward convolutional network. In our experiments, we adopted the detection network from darknet-24, which has 24 convolutional layers followed by 2 fully connected layers.
• YOLO9000 (YOLO2) [46] is an enhancement of YOLO1. It removes the fully connected layers and uses anchor boxes to predict bounding boxes. In our experiments, we adopted the detection network from darknet-19, which has 19 convolutional layers.
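The evaluation criteria described before the baseline list can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and each ground truth is matched to the unused detection with the highest IoU above the 50% threshold. The helper names and the handling of ties are assumptions of this example, not the evaluation code used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_matches(det_boxes, gt_boxes, iou_thr=0.5):
    """Return (TP, FP, FN): each ground truth is matched to the unused
    detection with the highest IoU, provided that IoU exceeds iou_thr."""
    used = set()
    tp = 0
    for gt in gt_boxes:
        best_iou, best_i = iou_thr, None
        for i, det in enumerate(det_boxes):
            if i in used:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_i = overlap, i
        if best_i is not None:
            used.add(best_i)
            tp += 1
    fp = len(det_boxes) - tp   # unmatched or duplicate detections
    fn = len(gt_boxes) - tp    # ground truths without a matching detection
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping a score threshold over the detections and recomputing these two values at each step traces out the precision-recall curve from which AP is derived.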

Parameter Setting
In general, the hyperparameters in MsRi-CCF are determined by maximizing the performance on the validation set. In addition, we provide a more specific discussion and analysis of the selection of feature maps and the rate of outlier removal in the following subsections.

Feature Map Selection
The distinguishability of the feature maps is very important for designing a classifier. The depth of the network, the number of parameters, and the convergence speed of parameter estimation all directly affect the speed and performance of the network. However, the scale and direction variation of optical RSIs make it difficult to directly fine-tune networks pretrained on natural scene images. Therefore, an additional seven-layer network was designed to reduce the sensitivity of the network to scale and direction variation, thereby improving its generalization capability. An inception module with 1 × 1 convolution [47] was also introduced to improve the expressive ability of the network and extend the network's depth and width without increasing computational costs. We used the Rectified Linear Unit (ReLU) as the activation function. It is fast, promotes sparsity in the network, and reduces the likelihood of a vanishing gradient. Table 2 shows the structure of the convolutional feature extractor, which has eight convolutional layers in total: the conv_3 of VGG-16 is applied as the first layer, the others form the additional seven-layer network, and a dedicated symbol in the table stands for the rotation-insensitive descriptor. The distinguishability of feature maps in automatic feature learning determines the performance of object detection and classification. Theoretically, as the network deepens, the local distinguishability becomes greater. Considering the large scale variation of objects in optical RSIs, we chose 3 medial layers as candidate low-level features to achieve a good balance between feature representativeness and generalization ability. Since the deeper feature maps have lower resolutions, they were considered for detecting large objects, while higher-resolution layers were considered for detecting small-scale objects.

Tables 3 and 4 give the precision of each class in the NWPU VHR-10 dataset and the precision for the single class in the satellite dataset. The following observations are made. (1) Compared with the optimal precision of single feature maps, the AP of the combined 3rd + 6th + 7th layers increased by about 5% on the NWPU VHR-10 dataset, but on the satellite dataset, the AP is only slightly improved. This is because the scale variation of the objects in the satellite dataset is small. (2) For the storage tank, tennis court, and vehicle classes, the precision of the 3rd layer is higher than that of the other layers. This is because their appearance and size are relatively simple. In contrast, for the baseball diamond and ground track field, the highest precision is obtained in the 7th layer. (3) Compared with the NWPU VHR-10 dataset, the precision for vehicles in the satellite dataset is higher, which is due to the high degree of similarity between the vehicles and their relatively simple spatial semantic information.

Outlier Removal in Classifier Generation
The ground truth in feature maps is the mapping of the ground truth in the original image.It is used as the positive sample input of the AdaBoost classifier.We sample or interpolate the feature maps to ensure the size consistency between objects of the same class.The addition of outlier removal can further optimize the training samples and remove the hard samples to obtain a "clean" training set.For the details on the parameter estimation and convergence rate of the GaMM distribution, refer to Reference [38,39].It is worth noting that the proportion of outliers in this paper is unknown, and we did not add extra outliers.Table 5 shows the AP under each iteration.It is shown that both datasets achieve optimal performance after the first iteration.From the conclusion in Reference [38], it is shown that the outlier ratio is less than 5%, which can be removed with only one iteration.Also, since the number of training samples is too small, as iterations increases, inliers decrease, directly affecting the performance of the classifier, especially for satellite dataset.For training samples larger than 224 × 224 × 3, we cut them into an image block set of this size and recorded the coordinates of the diagonal.In order to prevent the object from splitting, we set an overlap for objects of the same class that were larger than the average size for that class.For fairness, we adopted the same preprocessing to ensure sample consistency for all methods.Specifically, our method was computed with optimal parameters and feature maps.Figure 6 shows the PRC of the eight methods.It is shown that the precision and recall of three classes, namely, baseball diamond, ground track field, and airplane, are higher using all the listed methods.This occurs because their appearance, structure, and local semantic information are relatively distinguishable.Table 6 lists the quantitative results of the eight studied methods in terms of four different metrics: AP value, running time, as well as 
precision and recall for each class, while Figure 7 visually highlights some detection results for the 10 classes of the NWPU VHR-10 dataset, where each class is marked in a different color, yellow bounding boxes show false detections, and red bounding boxes show missed detections. We can conclude the following.

(1) The AP value of BOW-SVM is lower than that of the other methods. This is because BOW-SVM represents each image block as a histogram over a visual vocabulary generated by the K-means algorithm. Because it ignores the spatial relationships among local features, it can only detect objects with simple shapes, such as baseball diamond, storage tank, and ship. Although Exemplar-SVM trains a separate classifier for each class, its histogram of oriented gradients (HOG) descriptor is sensitive to deformation, which limits its generalization ability. Similarly, it is not surprising that the detection performance of the COPD algorithm and ACF is also limited by the feature representation capability of the HOG.

(2) YOLO1 is the fastest approach, but its speed comes at the cost of detection accuracy. It generalizes poorly to objects with a large range of scales and rotation variations against complex backgrounds. Because YOLO2 uses multi-scale images for training and convolutional feature maps for testing, its AP value rises from 0.6584 (YOLO1) to 0.7846. However, its generalization ability degrades considerably when objects of the same class have different aspect ratios.

(3) Compared with the CCF, which directly uses the VGG-16 model, adding the rotation-insensitive descriptor and the multi-scale aggregated descriptor yields a gain of about 0.2 in mean AP. This shows that our method is effective for detecting objects in multi-scale optical RSIs.

For feature generation, we chose the convolutional layers as the input to the subsequent feature learning. It may seem more intuitive to use fully connected layers for classification and detection; however, (i) a convolutional layer is locally connected and accepts input of any size, whereas a fully connected layer is globally connected; and (ii) compared with a fully connected layer, a convolutional layer shares a large number of computations and can therefore substantially reduce the computational cost. Moreover, feature learning on the feature maps (such as edge removal and dimensionality reduction) was added to train our detector with boosted decision trees. This idea was inspired by the algorithm implementation in [40].

As expected, the proposed MsRi-CCF obtains the best detection performance in terms of mean AP, despite running relatively slowly compared with YOLO-like methods. This can be explained by our purpose-designed end-to-end feature learning. More specifically, the multi-scale design effectively improves the detection performance, particularly for objects with irregular sizes (e.g., Ground track field), while embedding rotation-insensitive features greatly helps detect objects that are sensitive to direction (e.g., Airplane, Vehicle). Moreover, the robustness of our detector can be further enhanced once the learned features pass through the outlier removal module. An illustrative example can be found in Figure 8. Additionally, MsRi-CCF is more efficient than the CCF, with a decrease of about 1 s per image, owing to fast feature pyramid modeling and our proposed multi-scale strategy.
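
The contrast drawn above between convolutional and fully connected layers can be illustrated with a simple parameter count. The following Python sketch is not part of the MsRi-CCF implementation; the function names and the layer sizes (512 channels, 3 × 3 kernels, 4096 output units) are assumptions chosen for illustration only. It shows that the number of convolutional weights does not depend on the spatial size of the input, whereas the fully connected layer is tied to one fixed input size and grows with it.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2-D convolutional layer.

    The count does not depend on the spatial size of the input,
    because the same kernels are shared across all positions.
    """
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def fc_params(in_channels, height, width, out_units):
    """Weights plus biases of a fully connected layer applied to a
    flattened feature map of size in_channels x height x width."""
    return out_units * (in_channels * height * width + 1)


# Hypothetical layer sizes, used only for illustration.
for side in (7, 14, 28):
    print(side, conv_params(512, 512, 3), fc_params(512, side, side, 4096))
```

Under these assumed sizes, the convolutional count stays constant across the three input resolutions, while the fully connected count must be recomputed (and the layer retrained) for each of them.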

Performance Analysis on Satellite Dataset
Figure 9 shows the PRC of the eight different detection algorithms, and Table 7 correspondingly lists the running times, as well as precision (P value) and recall (R value). A visual showcase is also given in Figure 10, where the green, red, and blue bounding boxes represent true positives, false positives, and missed detections, respectively. Overall, BOW-SVM and Exemplar-SVM are only robust to vehicles with similar shape variation, and their generalization ability is relatively weak. The HOG descriptor used in the ACF and COPD algorithms is sensitive to object rotation, which leads to limited precision. YOLO2 is an improved version of YOLO1 that generalizes better to scale transformation and direction variation. Unfortunately, the multi-resolution objects and the narrow distance between them degrade the detection performance of YOLO-based networks. In the CCF, the VGG-16 network framework is explored for feature extraction, yet it is sensitive to multi-scale and multi-direction effects in optical RSIs and cannot achieve desirable detection results. Not surprisingly, the performance of MsRi-CCF is superior to that of the others. As with the NWPU VHR-10 dataset, the learned features in MsRi-CCF are robust against rotation with the satellite dataset, as the rotation-insensitive term is regularized in our network, while the use of multi-scale feature maps reduces the rate of missed detection of larger or smaller objects. It should be noted that the biggest challenge with this dataset is the black vehicles that are obscured by trees, as they are difficult to distinguish from the ground. A straightforward way to address this problem is to train a more robust classifier by removing the "bad" samples (outliers), as is done by the outlier removal in our framework. Furthermore, although MsRi-CCF cannot beat the YOLO-based approaches in running speed, it is much faster than the original CCF and some previous methods owing to our efficiency-oriented improvements (e.g., fast feature pyramid, multi-scale feature design).
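
To make the relation between true positives, false positives, missed detections, and the reported P and R values concrete, the following Python sketch computes precision and recall for one image. It is a generic illustration rather than the evaluation code used in this work: the greedy matching rule, the intersection-over-union (IoU) threshold of 0.5, and all function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (score, box); ground_truth: list of boxes.

    Detections are matched greedily, in descending score order, to the
    unmatched ground-truth box with the highest IoU (assumed rule).
    """
    matched = set()
    tp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = None, iou_threshold
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp        # detections without a matching object
    missed = len(ground_truth) - tp  # objects without a matching detection
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + missed) if ground_truth else 0.0
    return precision, recall
```

Sweeping a score threshold over the detections and recording the resulting (precision, recall) pairs would trace a curve of the kind shown in Figure 9.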

Robustness Analysis to Noises
To intuitively evaluate the robustness of MsRi-CCF, we investigated the detection performance of three representative algorithms on the two datasets by adding Gaussian white noise at different signal-to-noise ratios (SNRs), from 10 to 50 dB with a 10 dB interval. As can be seen in Figure 8, the performance of the CCF degrades sharply as the SNR decreases, and the CCF is more sensitive to noise than YOLO2. In contrast, MsRi-CCF shows a comparatively stable trend. This suggests that the outlier removal strategy helps correct the decision boundary to some extent.
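
The noise injection described above can be sketched as follows. This is a minimal Python illustration, not the procedure used in the experiments: it assumes that the signal power is the mean squared pixel value, that no clipping to the valid intensity range is applied, and that an image is given as a flat list of pixel values; the function names are hypothetical.

```python
import math
import random


def add_white_gaussian_noise(pixels, snr_db, rng=None):
    """Return a noisy copy of a flat list of pixel values at a target SNR (dB)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Assumption: signal power is the mean squared pixel value.
    signal_power = sum(p * p for p in pixels) / len(pixels)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def measured_snr_db(clean, noisy):
    """Empirical SNR (dB) between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# The five noise levels considered in the text: 10 to 50 dB in 10 dB steps.
clean = [float(i % 256) for i in range(20000)]
for snr in range(10, 60, 10):
    noisy = add_white_gaussian_noise(clean, snr)
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

A lower SNR corresponds to a larger noise standard deviation, so the 10 dB setting is the most severe of the five.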

Discussion on the Selection of Feature Extractor in MsRi-CCF
The feature extractor in the proposed MsRi-CCF consists of a deep neural network, such as AlexNet, VGG, or ResNet. Table 8 lists the performance comparisons for the three network architectures used as the feature extractor for the satellite and NWPU VHR-10 datasets. As observed, AlexNet runs faster than the other two (VGG-16 and ResNet-34), yet its detection precision is considerably lower

Figure 1. The architecture of the proposed multi-scale and rotation-insensitive convolutional channel features (MsRi-CCF) method. The feature generation step in the training phase is detailed in Figures 2 and 3. These generated features are then fed into the AdaBoost classifier with outlier removal (see Figure 4 for more details) for the final classification and localization. In the test phase, a fast feature pyramid is applied for the final predictions.

Figure 2. The detailed architecture of the rotation-insensitive convolutional channel features.

where $W_I^{\mathrm{net}} = \{w_1^{\mathrm{net}}, w_2^{\mathrm{net}}, \ldots, w_a^{\mathrm{net}}\}$, $B_I = \{b_1, b_2, \ldots, b_a\}$, $\theta$, and $\phi$. The first term $J_B(\theta, \phi)$ in Equation (3) is the additive model of the exponential loss function. It is designed to minimize classification errors for the given training samples and is computed by

Figure 4. The detailed framework of classifier generation with outlier removal.

Figure 5. Visual comparison of three different methods (CCF, YOLO2, and MsRi-CCF) with regard to multi-scale, direction variation, and outliers.

Figure 6. Precision-recall curve (PRC) of the proposed method and seven competitive methods using the NWPU VHR-10 dataset for 10 object classes.

Figure 8. Evaluation of robustness to noise of the MsRi-CCF framework.

Figure 9. PRC of the eight competitive methods with the satellite dataset.

Figure 10. A showcase of MsRi-CCF with the satellite dataset (false detection in red, true positive in green, and missed detection in blue).

Table 1. The statistics of object size in the NWPU VHR-10 dataset.

Table 2. The network architecture in feature extraction of our MsRi-CCF.

Table 3. The precision of three intermediate layers for the NWPU VHR-10 dataset.

Table 4. The precision of three intermediate layers for the satellite dataset.

Table 5. Comparison of average precision (AP) under different iteration times of the Gamma Mixture Model (GaMM) distribution with the two datasets.

Table 6. Quantitative performance comparisons and average running time for the NWPU VHR-10 dataset. The optimal value is shown in bold.

Figure 7. Some visual detection results with MsRi-CCF; false positive samples are marked in yellow and true positives are in the other colors.

Table 7. Quantitative performance comparisons and running time with the satellite dataset. The optimal value is shown in bold.