Multi-Scale Geospatial Object Detection Based on Shallow-Deep Feature Extraction

Multi-class detection in remote sensing images (RSIs) has garnered wide attention and enabled service applications in many fields, both civil and military. However, several factors make detection in aerial images far more challenging than in natural scene images: objects have no fixed size, often appear at widely varying scales, sometimes appear in dense groups (like vehicles and storage tanks), and sit against diverse surroundings or backgrounds. All of this also makes the manual annotation of objects very complex and costly. Feature extraction has a powerful effect on object detection, and deep convolutional neural networks (CNNs) extract deeper features than traditional methods. This study introduces a novel network structure with a unique feature extractor, named shallow-deep feature extraction (SDFE), which employs the squeeze-and-excitation network (SENet) and the residual network (ResNet) to obtain feature maps that improve resolution and localization at the same time. Furthermore, this novel model reduces the loss of dense groups and small objects, and provides higher and more stable detection accuracy that is not significantly affected by changing the intersection-over-union (IoU) threshold, overcoming the difficulties of RSIs. Moreover, this study presents strong evidence about the factors that affect detection in RSIs. The proposed shallow-deep and multi-scale (SD-MS) method outperforms other approaches on the ten classes of the NWPU VHR-10 dataset.


Introduction
Object detection in remote sensing images (RSIs) is a framework that determines whether an input aerial image contains an object belonging to a category of interest and provides the location of the predicted object inside the image together with its class. Object detection in RSIs is used to detect man-made objects, such as buildings, ships, vehicles, airports and bridges. However, RSIs differ from natural imagery: natural images are obtained from any kind of camera, with a horizontal view, whereas RSIs are obtained from satellites, with a vertical view. The most distinctive feature of remote sensing images is their very large size. Furthermore, technological developments keep improving the resolution of RSIs, and these very high resolution (VHR) images allow for many uses in geospatial object detection. Object detection in RSIs has introduced many service applications in fields such as military investigation, environmental monitoring, urban traffic management and geographic information systems. In region-proposal-based detectors such as R-CNN, candidate regions are first scored by class; then, non-maximum suppression (NMS) filters these background regions using a predefined threshold value to get the final bounding boxes (BBs).
R-CNN is costly in space and time. To resolve this, the authors of [33] used the theory of spatial pyramid matching (SPM) [34,35] to propose spatial pyramid pooling (SPPNet). The idea behind SPPNet is that feature maps are first obtained from the whole input image using a CNN, so feature extraction happens only once; fixed-length representations of the regions are then generated. In this way, repeating feature extraction for every region proposal is avoided, and SPPNet is 20 times faster than R-CNN [1]. Fast R-CNN [36] combines the advantages of R-CNN and SPPNet and uses a multi-task loss to improve accuracy: the feature maps of the input image are first produced by CNN layers; then a fixed-size feature is extracted from each region proposal by the RoI pooling layer and fed into a sequence of fully connected (FC) layers before the final operation. The final operation has two outputs: classification, which uses Softmax to obtain the category of each predicted bounding box, and bounding box regression, which obtains the coordinates of each detected bounding box (x, y, w, h). However, fast R-CNN is still limited by region proposal detection. To solve this problem, faster R-CNN [25] was developed with two stages. In the first stage, faster R-CNN uses a separate network, called a region proposal network (RPN), to generate region proposals, which are then fed into the RoI pooling layer in the second stage. The second stage is fast R-CNN, which generates the object detections for each category. In faster R-CNN, the RPN shares the same feature maps with fast R-CNN, which speeds up the overall operation. Even though faster R-CNN is faster than fast R-CNN, its feature extraction still needs improvement. The feature pyramid network (FPN) was presented by Lin et al. [26].
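The two outputs of the final operation can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the authors' implementation; the feature dimension (1024), the class count (10 classes plus background) and the random weights are all hypothetical stand-ins for learned parameters.

```python
import numpy as np

def detection_heads(roi_feature, w_cls, w_reg):
    """Illustrative fast R-CNN output heads: a Softmax over class scores
    and a 4-value bounding box regression (x, y, w, h)."""
    scores = roi_feature @ w_cls            # raw per-class scores
    exp = np.exp(scores - scores.max())     # numerically stable Softmax
    class_probs = exp / exp.sum()
    box = roi_feature @ w_reg               # predicted (x, y, w, h)
    return class_probs, box

rng = np.random.default_rng(0)
feat = rng.standard_normal(1024)            # pooled RoI feature vector
probs, box = detection_heads(feat,
                             rng.standard_normal((1024, 11)),
                             rng.standard_normal((1024, 4)))
```

The Softmax output sums to one across the 11 hypothetical categories, and the regression head yields one coordinate tuple per proposal.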
It is a feature extraction network with three components: a bottom-up (BU) pathway, a top-down (TD) pathway and lateral connections. The FPN is constructed to extract high-resolution features with strong semantics by combining the outputs of the BU and TD pathways, but it takes a long time and consumes memory. At the same time, the development of computational processors such as graphics processing units (GPUs) has contributed to the improvement of image classification and recognition by enabling effective methods like the fully convolutional network (FCN) [37], the residual network (ResNet) and the squeeze-and-excitation network (SENet) [20,21,37-39].
Even though deep learning methods for object detection have achieved great success on natural images, they were not specifically designed to detect small objects in aerial images, which are known for their large size and present several challenges. The main reasons are as follows.

1.
In remote sensing images, objects do not have a fixed size and often appear at various scales. Further, the datasets were mostly collected from different sources with different resolutions.

2.
The remote sensing image is very large and contains many small objects that sometimes appear in dense groups, such as vehicles and storage tanks. This adds significant challenges for geospatial object detection methods. With standard object detection methods, the loss of small objects in RSIs is very large, so there is an urgent need to optimize the detection methods.

3.
The aerial images are enormous and crowded with many kinds of small objects. Therefore, the manual annotation of objects is very complex and costly. Additionally, object detection in RSIs is a small-sample-setting problem. Although there are methods specifically designed for small sample settings, like the rank-1 feedforward neural network (FNN) in [40], deep learning architectures are data hungry, and thus the training samples for object detection from RSIs are inadequate for training them.

4.
The surrounding area of each class differs, as can be seen in Figure 1. For example, ships and airplanes mostly have a clear background, as well as distinctive colors and shapes. In contrast, many objects, like vehicles and harbors, do not have a distinctive appearance or properties, so they need to be treated differently.

This paper deals with all of the above issues and difficulties. An effective deep CNN framework is introduced to detect multi-scale and multi-class objects in RSIs; this framework can handle these large variabilities and challenges and fit the properties of RSIs. Our framework is region-proposal based: its pipeline first generates region proposals and then classifies each proposal into categories. The method is divided into two stages. The first stage comprises the feature extraction and the RPN. A shallow and deep CNN is constructed to extract features from each input image; these features are designed to improve resolution, reduce the loss of small objects and optimize localization. The output of the shallow-deep feature extraction (SDFE) is then fed into the RPN to extract multi-scale region proposals. In the RPN, to improve accuracy, multi-scale anchor boxes with specific CNN filters are used on the multi-scale feature maps. The outputs of these two networks are then combined and fed into the second stage, fast R-CNN, for the accurate detection of each object. Since RSIs are large-scale and manual annotations are limited, horizontal flipping is used during training. The main contributions of our framework are:

1.
A novel framework suitable for VHR remote sensing imagery was designed, which can detect multi-scale and multi-class objects in large-scale complex scenes.

2.
A feature pyramid extractor was designed with a bottom-up (BU) pathway, a top-down (TD) pathway and many lateral connections between the BU and TD layers, to obtain higher semantic and resolution information suited to remote sensing images, which contain objects of various sizes.

3.
The feature map of each different layer was allocated to objects of specific scales, which increases the detection accuracy.

4.
The multiple feature maps produced were combined. Therefore, the resolution increased, and multiple levels of detail could be considered simultaneously. Additionally, objects of various sizes and densely packed objects are detected more accurately.

5.
A shallow network was used, which improves localization and reduces training and testing time.
The structure of the rest of this paper is given as follows. Section 2 describes the framework of the method in detail, the dataset description and evaluation metrics. An explanation of the results of the three different experiments and their comparisons is given in Section 3. The conclusion and future work are given in Section 4.

Methods
The architecture of our approach is represented in Figure 2. It includes two stages and is based on faster R-CNN. In the first stage, the feature extractor and the multi-scale RPN are used to extract feature maps with high semantic information and enhanced resolution by combining deeper and shallower networks; additionally, a multi-scale and multi-filter (MS-MF) region proposal network is used. The second stage categorizes and locates each object. These two stages share the same multi-scale feature maps. These methods are introduced in detail below.


Shallow-Deep Feature Extraction Network (SDFE)
Designing the feature extractor is very important for enhancing detection accuracy and performance. A very deep model needs a large number of training samples, and labeled RSI datasets are comparatively scarce. Deeper models are also costly and lead to losing small objects. Based on the requirement to extract feature maps that capture large and small objects effectively, a shallow-deep network was designed to obtain feature maps that combine strong features with low loss. Deep CNNs are efficient at obtaining stronger features and extracting large objects, whereas shallow CNNs are used to detect small objects, reduce the computational cost and improve localization.
The pyramid architecture of the feature extractor has been widely used in many studies. He et al. [23] introduced ResNet, an effective network for increasing the accuracy of deeper models; however, very deep models are not suitable for all kinds of data [16,41]. The principal advantage of hierarchical feature extraction structures is that each level produces a multi-scale feature with strong semantics and higher resolution. There are many kinds of pyramid feature extractor architectures [10,26], such as the feature image pyramid [42], the single feature map, the pyramidal feature hierarchy and the FPN [26]. The first CNN pyramidal feature hierarchy is SSD [27]. SSD reuses the multi-scale feature maps computed in the forward pass of different layers; however, as SSD does not reuse the high-resolution maps of the lower layers of its feature hierarchy (Figure 3a), it is not ideal for detecting small objects. The last layer of the bottom-up feature maps has higher semantic information, but its localization performance is poor [19,26,43]. In contrast, in the feature pyramid network [26], the features of the top-down pathway are enhanced by lateral connections, which merge the outputs of the bottom-up pathway with the top-down pathway. Each lateral connection merges feature maps of the same spatial level from these two pathways. As a result, high resolution and strong semantics are obtained at the same time. In this study, the idea of the FPN with ResNet-101 as the BU network is adopted.
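The lateral merging performed by the FPN can be sketched as follows. This is a minimal NumPy illustration assuming 2x nearest-neighbour up-sampling of the top-down map followed by element-wise addition with the same-level bottom-up map; the channel-matching 1x1 convolutions and the 3x3 smoothing convolution of the real FPN are omitted for brevity.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def lateral_merge(td_map, bu_map):
    """Merge a coarser top-down map with the same-level bottom-up map
    by up-sampling and element-wise addition (the lateral connection)."""
    return upsample2x(td_map) + bu_map

C4 = np.ones((256, 8, 8))    # bottom-up map from the backbone
P5 = np.ones((256, 4, 4))    # coarser top-down map
P4 = lateral_merge(P5, C4)   # merged map with shape (256, 8, 8)
```

The merged map keeps the spatial resolution of the bottom-up level while inheriting the stronger semantics of the top-down pathway.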

In their study, He et al. [23] proposed ResNet as a deeper and more effective framework, providing a solution for deeper networks, which are difficult to train due to the vanishing gradient problem. In the deep residual learning framework, the residual block, it can be assumed that x is the input and H(x) is the true output, as shown in Figure 4. The residual is the difference between H(x) and x: F(x) := H(x) - x. To get the original function, the equation is rearranged to H(x) = F(x) + x. Such shortcut connections add neither extra parameters nor complex calculations, so ResNet is easy to use and combine with any kind of network. The equation of the ResNet building block is defined as:

Y = F(X, {W_i}) + X,

where Y, X and W_i represent the output, the input and the parameters of the i-th convolutional layer to be learned, respectively, and F(X, {W_i}) represents the residual mapping to be learned.

ResNet has various architectures: ResNet-50, ResNet-101 and ResNet-152, where the number corresponds to how many residual network layers are used. In this study, ResNet-101 was used as the BU network to obtain a good balance between feature extraction and computing resources. The last blocks of Conv2, Conv3, Conv4 and Conv5, which are 3x3 convolutional kernels with strides {4, 8, 16, 32}, respectively, are named {C2, C3, C4, C5}, as shown in Figure 2. Each up-sampled TD feature map was added element-wise to the corresponding BU map (lateral connection); Figure 5 shows this merging operation. Finally, a 3x3 convolutional layer was added on each merged map to minimize the aliasing effect of the up-sampling operation and to get {P2, P3, P4, P5}, the final feature maps.

To meet the requirements of the RSI situation, SENet [44] was added to the feature extraction to improve performance and to combine the deep and shallow networks. The goal of SENet is to increase the network's sensitivity, keeping and using the useful features and omitting the others via two steps, called squeeze and excitation, before feeding these features to the next operation. As shown in Figure 6, the first FC layer is followed by a ReLU function, adding the necessary nonlinearity; its output channel dimension is also reduced by a ratio r = 16, which gives a balance between complexity and accuracy. The second FC layer is followed by a Sigmoid activation function, giving each channel a smooth gating output. With all these steps, this function can be added to any model without additional computing costs. SENet in our model is fed by the output of C3 and followed by another 3x3 convolution.
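The squeeze and excitation steps can be sketched as follows. This is an illustrative NumPy sketch with reduction ratio r = 16; random weights stand in for the two learned FC layers, and the channel count is hypothetical.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map.
    Squeeze: global average pooling to one value per channel.
    Excitation: FC (reduce by r) -> ReLU -> FC (restore) -> Sigmoid gate."""
    z = x.mean(axis=(1, 2))                  # squeeze: shape (C,)
    s = np.maximum(0.0, w1 @ z)              # FC to C/r channels + ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # FC back to C channels + Sigmoid
    return x * g[:, None, None]              # rescale each channel by its gate

C, r = 256, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 14, 14))
out = se_block(x, rng.standard_normal((C // r, C)),
               rng.standard_normal((C, C // r)))
```

Because the Sigmoid gate lies in (0, 1), the block can only attenuate channels, emphasizing the informative ones relative to the rest.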


MS-MF Region Proposal Network
A good detector in geospatial object detection should have the ability to cover most objects at their various sizes. In [45], the authors broke down the R-CNN execution into several layers. In the anchor generation layer, a fixed number of anchors (bounding boxes) is generated; the number and shapes of the anchors depend on the chosen scales and ratios. Then, in the region proposal layer, each region proposal is scored as background or foreground and its bounding box regression coefficients are obtained, using a 3x3 convolutional layer followed by two 1x1 convolutional layers. This study slides a small network over each feature map, i.e., the outputs of the P3, P4 and SENet layers, with three types of convolution filter (kernel) layers (3x3, 5x5, 7x7); this is called a multi-filter (MF). The 5x5 convolution slides over SENet, the 3x3 convolution over P4 and the 7x7 convolution over P3, as shown in Figure 7. Each sliding window is mapped to a lower-dimensional feature (512 features). Each sliding-window position predicts one anchor box B = (x, y, w, h), where x and y are the coordinates of the top-left corner of the predicted region, and w and h are its width and height, respectively. In the RPN, this study used the same two scales on each layer of the feature map to catch as many objects as possible; see Table 1.
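Generating anchors from a set of scales and ratios can be sketched as follows. This is an illustrative sketch, not the authors' exact parameterization: the convention w = base*s/sqrt(r), h = base*s*sqrt(r) and the base size of 32 are assumptions, while the scale and ratio lists come from the study.

```python
import numpy as np

def make_anchors(cx, cy, base, scales, ratios):
    """Generate (x, y, w, h) anchors centred at (cx, cy); x, y is the
    top-left corner. Assumed convention: w = base*s/sqrt(r), h = base*s*sqrt(r)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = base * s / np.sqrt(r)
            h = base * s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, w, h))
    return np.array(boxes)

# Scales and ratios as listed in the study (7 ratios x 2 scales = 14 anchors).
anchors = make_anchors(64, 64, 32,
                       scales=[1.0, 0.12],
                       ratios=[1, 0.5, 2, 1/3, 3, 1.5, 1/1.5])
```

Each feature-map position thus contributes a small family of boxes spanning squares, tall shapes and wide shapes at two sizes.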
In addition, because this study deals with a multi-category dataset (NWPU VHR-10), whose images have different resolutions and object shapes, the anchor ratios were set to [1, 0.5, 2, 1/3, 3, 1.5, 1/1.5] and the anchor scales to [1, 0.12] to create multi-scale (MS) anchors over each feature map. Furthermore, an anchor was assigned to an object as a foreground sample if its intersection over union (IoU) ≥ 0.7, and as background if IoU < 0.3; in the second stage, the thresholds were 0.5 and 0, respectively. Finally, the mini-batch size was set to 256. The foreground boxes were selected by computing the IoU overlap of all the anchor boxes inside the input image with all ground-truth boxes; a box was marked as foreground if it had the maximum IoU overlap with a ground-truth box or if its overlap exceeded the threshold value. IoU was also used to evaluate object detector performance. The formula for IoU is:

IoU = area(Bpb ∩ Bgt) / area(Bpb ∪ Bgt),

where area(Bpb ∩ Bgt) is the area of the overlap between the predicted bounding box and the ground-truth bounding box, and area(Bpb ∪ Bgt) is the area of the union, i.e., the area covered by the predicted bounding box and the ground-truth bounding box together.
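With boxes given as (x, y, w, h), the IoU formula above can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes,
    where (x, y) is the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy                                     # area(Bpb ∩ Bgt)
    union = aw * ah + bw * bh - inter                   # area(Bpb ∪ Bgt)
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving IoU = 1/7.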
To accelerate the operation of the RPN, only the 12,000 highest-scoring regression boxes were passed to the NMS operation [46,47] to get 2000 proposals. At test time, 1000 proposals were taken from 6000 regression boxes, also by the NMS operation.
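The NMS filtering step can be sketched as greedy suppression: keep the highest-scoring box, drop any remaining box that overlaps it beyond the threshold, and repeat. This is an illustrative sketch over (x, y, w, h) boxes with a self-contained IoU helper; the threshold value here is an example.

```python
def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression over (x, y, w, h) boxes.
    Returns the indices of the kept boxes, highest score first."""
    def iou(a, b):
        ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)            # best remaining box
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

# Two heavily overlapping boxes and one distant box: the duplicate is dropped.
kept = nms([(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)],
           [0.9, 0.8, 0.7], thresh=0.5)
```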
To calculate the loss of each detection layer in this stage, the loss values of the classification and the bounding box regression [36] are combined as:

L(p(X), Y, B_rb, B_gt) = L_cls(p(X), Y) + λ[Y ≥ 1] L_loc(B_rb, B_gt),

where L_cls(p(X), Y) = -log p_Y(X) is the log loss for the true class Y (cross-entropy loss), X is the predicted probability anchor, λ = 1 is the balancing parameter, [Y ≥ 1] evaluates to 1 when Y ≥ 1 and 0 otherwise, and B_rb is the bounding box regression output.
We calculate the loss of the bounding box regression as the following: In the RPN, this study used the same two different scales on each different layer of the feature map to catch as many objects as possible, see Table 1. In addition, because this study deals with a multi-category dataset (NWPU VHR-10), and this kind of dataset has images with different types of resolution and object shapes, these values were set [1, 0.5, 2, 1/3., 3., 1.5, 1/1.5] to the anchor ratios and [1.,0.12] to the anchor scales to create a multi-scale (MS) anchor over each feature map. Furthermore, the anchor assigned the object as a foreground sample if the threshold of intersection over union (IoU) ≥ 0.7, and as a background, if IoU < 0.3. In the second stage, the thresholds were 0.5 and 0, respectively. Finally, 256 was set to be the size of the mini-batch. The foreground boxes were selected by computing the IoU overlap of all the anchor boxes inside the input image with all ground-truth boxes, and those boxes were marked as foreground boxes if their maximum IoU overlaps with the ground-truth box or exceeds the value of the threshold. IoU was used to evaluate object detector performance. The formula for IoU is: where the area (Bpb ∩ Bgt) is the area of the overlap between the predicted bounding box and the ground-truth bounding box area, and area (Bpb ∪ Bgt) is the area of the union, which is the area surrounded by both the predicted bounding box and the ground-truth bounding box area.
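The regression loss equation did not survive extraction; in Faster R-CNN-style detectors it is normally the smooth L1 function, and under that assumption a sketch looks like:

```python
def smooth_l1(x):
    """Smooth L1 penalty on a single coordinate difference:
    quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def bbox_regression_loss(pred, target):
    """Sum of smooth L1 over the four box offsets (tx, ty, tw, th)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))
```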

Object Detection Network
In the previous section, this study obtained the predicted region proposals. RoI pooling was used to extract the characteristics of each region proposal; it speeds up the training and testing operations and improves both the object classification and bounding box regression accuracy. The RPN, as mentioned in Section 2.2, does not label a region box with its category; it only determines whether a predicted region proposal is background or foreground (i.e., whether the region proposal contains a target or not). Therefore, the benefit of RoI pooling is that it takes the output of the RPN together with the feature vectors, which were cropped out of the SDFE by array slicing and then resized to a fixed size with N strides [25], Figure 2. In this study, the RoI pooling layer was applied to each box with a fixed size of (14, 14) and 2 strides. It was then followed by two FC layers with 1024 neurons each. Finally, the output of these last two layers was fed to two separate FC layers: one to obtain the class label predictions and the other to obtain the bounding box location for each proposal. Moreover, NMS was used in the RPN and in this stage to reduce redundancy.
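The RoI pooling step can be illustrated with a simplified NumPy sketch: each RoI is divided into a fixed 14×14 grid of bins and max-pooled per bin. `roi_pool` is an illustrative name, and the sketch omits the stride handling of the real layer:

```python
import numpy as np


def roi_pool(feature_map, roi, output_size=14):
    """Crop one RoI from a (H, W, C) feature map and max-pool it to a
    fixed (output_size, output_size, C) grid, as RoI pooling does."""
    x1, y1, x2, y2 = roi  # RoI already projected to feature-map coordinates
    crop = feature_map[y1:y2, x1:x2, :]
    h, w, c = crop.shape
    out = np.zeros((output_size, output_size, c), dtype=crop.dtype)
    # Bin edges: split the crop into output_size roughly equal strips.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            cell = crop[ys[i]:max(ys[i + 1], ys[i] + 1),
                        xs[j]:max(xs[j + 1], xs[j] + 1), :]
            out[i, j] = cell.max(axis=(0, 1))
    return out
```

Every proposal thus yields the same fixed-size tensor, which is what allows the subsequent FC layers to have a fixed input width.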
Here, the loss function is similar to that of the RPN; the difference is that the classification layer of the RPN deals with only two classes, foreground and background, while the last classification layer of this stage deals with all object classes [16,19].
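The NMS filtering used in both stages to reduce redundancy can be sketched as a greedy procedure (illustrative code, not the authors' implementation):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    suppress every box that overlaps it by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```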

Experiments
This study implemented and evaluated the model using Keras and TensorFlow, executed on a PC with a Core i7-4790 CPU, an NVIDIA GTX-1070 GPU (8 GB memory), 8 GB RAM and the Windows 10 operating system.

Dataset Description
Cheng et al. [15] proposed the NWPU VHR-10 dataset, which is widely used in remote sensing object detection and covers 10 classes. Its images were collected from two different sources with two different levels of resolution: 715 images from Google Earth and 85 images from the Vaihingen dataset. The dataset has a separate folder containing the ground-truth file of each image in the positive image set folder, which has 650 images. Each image in this folder contains at least one object belonging to these 10 classes; the other 150 images are in the negative image set folder and contain no targets. Each ground-truth file lists the bounding box information of all target objects.
Bounding boxes were manually annotated. The format of each bounding box is (x1, y1), (x2, y2), where (x1, y1) is the top-left coordinate of the bounding box and (x2, y2) is the bottom-right coordinate. NWPU VHR-10 was chosen for the following reasons. First, it was collected from different sources with various resolutions. Second, it contains 10 different classes, including 124 bridges (B), 302 ships (S), 163 ground track fields (GTF), 757 airplanes (A), 390 baseball diamonds (BD), 524 tennis courts (TC), 150 basketball courts (BC), 655 storage tanks (ST), 224 harbors (H), and 477 vehicles (V). The other reason is that each class has a different size, and, at the same time, the same class has objects of different sizes, colors and appearances, as shown in Figure 8. All of that serves the purpose of this study, which is to build a model that can deal with all of these differences and challenges. In this paper, the dataset was divided into 80% for training (520 images) and 20% for testing (130 images). The network was trained for 30K iterations with an initial learning rate of 1e-3, a momentum of 0.9, and a weight decay of 0.0001. Additionally, horizontal flipping was used during training.
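A ground-truth file of this dataset can be parsed in a few lines. The `(x1,y1),(x2,y2),class_id` line layout is assumed here from the bounding box format described above and may differ in detail from the published files:

```python
import re


def parse_ground_truth(text):
    """Parse one NWPU VHR-10 style annotation file: each non-empty line is
    assumed to hold '(x1,y1),(x2,y2),class_id'. Returns a list of
    ((x1, y1), (x2, y2), class_id) tuples."""
    boxes = []
    for line in text.splitlines():
        nums = re.findall(r'-?\d+', line)
        if len(nums) == 5:
            x1, y1, x2, y2, cls = map(int, nums)
            boxes.append(((x1, y1), (x2, y2), cls))
    return boxes
```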

Evaluation Metrics
To assess the performance of the object detection approach, measures such as the precision-recall curve (PRC) and average precision (AP) are widely used [11,15,48,49]. The PRC depends on the area of overlap between the ground-truth and the detection. Precision computes the fraction of detections that are true positives (TP), while recall measures the fraction of positives that are correctly identified. The AP and mean AP (mAP) average precision over different levels of recall; a higher AP value therefore corresponds to better performance.
The output is TP if the IoU between the ground-truth bounding box and the predicted bounding box, obtained from Equation (2), is greater than 0.5; otherwise, it is a false positive (FP). Furthermore, if numerous detections overlap with the same ground-truth bounding box, only one is counted as TP and the others as FP.
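The TP/FP assignment rule above, including counting duplicate detections on the same ground-truth box as FP, can be sketched as follows (hypothetical helper, not the evaluation code used in the paper):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate_detections(dets, gts, iou_thresh=0.5):
    """dets: list of (box, score); gts: list of boxes.
    Detections are visited in descending score order; a detection is TP if
    its IoU with a not-yet-matched ground-truth box exceeds iou_thresh,
    otherwise FP (duplicate hits on the same box count as FP)."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    matched = [False] * len(gts)
    tp = fp = 0
    for box, _ in dets:
        hit, best = -1, iou_thresh
        for k, gt in enumerate(gts):
            ov = iou(box, gt)
            if not matched[k] and ov > best:
                hit, best = k, ov
        if hit >= 0:
            matched[hit] = True
            tp += 1
        else:
            fp += 1
    precision = tp / (tp + fp) if dets else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Sweeping a score threshold over the detections and recording (precision, recall) pairs yields the PRC, whose area gives the AP.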

Experimental Results and Comparisons
This model was trained with three different architectures, as shown in Figures 2, 9 and 10 and Table 1. In the first experiment, this study trained end-to-end all backbone layers of the feature extraction and extracted anchor boxes by using the multi-scale (MS) sliding window over each feature map of [P5, P4, P3, P2] (FPN), Figure 9. The second stage of the implementation is the shallow-deep feature extraction (SDFE): the P2 layer was removed, the output of C3 was connected with the SENet layer by a lateral connection, and the output of P3 was not combined with the SENet layer. This method improves the quality of the feature extraction stage and increases localization performance. Furthermore, only the feature maps of P3, P4 and the SENet were used, which improved both the time performance and the detection accuracy. The same multi-scale anchors were also used in this experiment, as shown in Table 1b and Figure 10. Figure 2 shows the architecture design of the third experiment. It is like the second experiment with only one addition: a different filter was added for each different feature map (MS-MF). The structure is described in Section 2.2. Tables 1 and 2 show the differences between all of these structures. The results of these three experiments are in Table 3.
From this study and the results on the ten classes of NWPU VHR-10, very high detection accuracy was obtained, and from the different methods described above, the following observations were made: (1) For objects with a clear background or shape that do not appear in dense groups, such as the airplanes and vehicles in this dataset, the first experiment with the deep feature extraction network gives very good results. (2) Alternatively, a shallow feature extraction network is the best choice to detect objects that appear in dense groups, such as storage tanks and harbors. (3) Any object with a large size and a clear shape obtained almost the same results in all experiments, like the ground track field, basketball court, tennis court and baseball diamond. (4) The last experiment with the multi-filter does not have a significant effect, but it gives evidence that a deep network is good at extracting and detecting large objects, and that small objects are probably lost when using small filters. (5) Additionally, the third experiment was tested by applying a small filter (3×3 conv.) on the SENet layer, and this had an unacceptable effect on the result: the mAP was 87.03. This gives evidence that using small filters negatively affects small object detection and does not optimize the results. Big filters improve the final results, especially for some targets, but they are not better than the second experiment (SD-MS) in general.
In contrast, some results are good using different filters, such as airplanes, harbors and bridges, as can be seen in Table 3. The following figures plot the PRCs of each experiment method among the testing classes. The recall ratio assesses the ability to detect more targets, whereas the precision assesses the performance of detecting correct objects rather than producing many false alarms. It can be observed from Figure 11 that the SD-MS obtains the best performance for all classes. It also shows how effective the model is at extracting and detecting the different objects, obtaining extremely high AP and recall values, as can also be seen in Table 3. The FPN+MS is not sufficiently balanced to deal with all of the various classes, especially the storage tank class, as shown in Figure 11a. The SDFE+MS-MF improves bridge detection more than the other approaches. As discussed previously, our purpose was to detect multi-scale and multi-class objects in RSIs, which have various resolutions, many small objects and dense groups. The SD-MS model detected objects of all ten classes of the test dataset; Figure 12 shows that it detected the objects effectively.
In addition, to quantitatively evaluate the proposed SD-MS model, this study compared it with eight existing methods: rotation-invariant CNN (RICNN) [15], region proposal networks with faster R-CNN (R-P-F-R-CNN) [50], deformable R-FCN (D-R-FCN) [51], collection of part detectors (COPD) [11], position-sensitive balancing (PSB) [20], deformable faster R-CNN (D-F-R-CNN) [52], recurrent detection with activated semantics (RDAS512) [53], and multi-scale CNN (MS-CNN) [19]. As can be seen from Table 4, the proposed SD-MS obtains the best mAP. The table also indicates that our FPN with the MS method has the best results for the basketball court, tennis court, ground track field and vehicle classes, and that the best results for the ship and baseball diamond classes are obtained by SD-MS. This means that our model has had significant success in extracting and detecting the different objects. In contrast, RDAS512 is only 2% better than our FPN with MS at detecting airplane objects, SD-MS is only 3.48% below the PSB and RDAS512 methods for bridge detection, and COPD is the best model for only one object: the storage tank. As a result, the proposed SD-MS framework outperforms the comparison approaches overall on the ten classes of the NWPU VHR-10 dataset, which demonstrates the superiority of the proposed method compared with the eight other methods.
When the model was tested in each different experiment, it was noted that in many cases the model of the first experiment (FPN+MS) detected targets by drawing their bounding boxes, but the value of AP was very low. For this reason, the first and the second models were tested with three different IoU evaluation thresholds (0.3, 0.4, 0.5). Table 5 shows that in the first experiment, the storage tank class was strongly affected: its AP changed from 12.95 to 92.59. The AP of the harbor class changed from 71.14 to 89.20, and the AP of the bridge class changed from 70.75 to 84.11. These three classes were strongly affected, but the other classes did not show a notable change. In contrast, the change in the AP values with the SD-MS model is very small: the AP of the storage tank class only changed from 88.45 to 99.52, and the harbor and bridge classes did not change much either, as shown in Table 5. Most of the 10 classes were not affected at all by changing the IoU threshold, and the mAP only changed by 3.64%, which means that the SD-MS model effectively detected the different classes and achieved the goal of RSI object detection.
As shown in Figure 13, the two models detect the storage tank perfectly, but the difference is that the bounding box of SD-MS is smaller and surrounds the object more tightly than that of the FPN+MS model; for that reason, the mAP of SD-MS is higher. Based on the above results and comparisons, the SD-MS model deals with most of the difficulties and challenges of RSIs. Each class has its own situation and difficulties: vehicles, ships and airplanes are small objects and sometimes appear in very complex environments, and storage tanks are small and often appear in dense groups.
Baseball diamonds, basketball courts, tennis courts and ground track fields appear in different sizes and colors. Shallow and deep feature extraction with multi-scale anchor boxes was used, which offered many benefits for extracting more objects and reducing the loss. The model's accuracy is extremely high, it can detect most of these various objects in their different situations, and it is not significantly affected by changes in the IoU threshold. In contrast, the run time of the proposed model is long for several reasons:

1. VHR images take a long time to process.

2. The CPU of the device is very slow, and the RAM of the GPU and of the device are both small, only 8 GB each. These were important factors that slowed down processing.

3. The structure of the SDFE makes the process run serially, one step after the other from C1 to P2, which also slows down the feature extraction. Therefore, a good way to improve the speed is to have the feature extraction structures work in parallel.

Conclusions
In this paper, a unique shallow-deep feature extraction framework with multi-scale anchor sizes was proposed to enhance object detection in remote sensing images. The feature extractor was redesigned by combining deep and shallow CNNs based on ResNet-101 and SENet to enhance feature extraction. After that, a different multi-scale anchor was applied over each level of the feature map to improve the cropping of each different object. The experiments of our model on the NWPU VHR-10 dataset show the following: (1) Our model has the ability to extract the different object categories.
(2) Our model shows the best detection for dense groups and small objects. (3) A very effective feature extraction was provided, overcoming most RSI challenges. (4) By using three different experimental stages, this study demonstrated various factors that affect detection. (5) The AP values of the SD-MS model barely change when the IoU evaluation threshold is changed, which means its accuracy is exceptionally high and it does not produce many false detections. (6) Our model obtained the best mAP when compared with eight state-of-the-art methods. In future work, the authors will focus on improving the run time and the feature extraction. Additionally, focus will be placed on enhancing the IoU operation to optimize the cropping of each object.