Extended Feature Pyramid Network with Adaptive Scale Training Strategy and Anchors for Object Detection in Aerial Images

Abstract: Multi-scale object detection is a basic challenge in computer vision. Although many advanced methods based on convolutional neural networks have succeeded on natural images, progress on aerial images has been relatively slow, mainly due to the huge scale variations of objects and the many densely distributed small objects. In this paper, considering that the semantic information of small objects may be weakened or even disappear in the deeper layers of a neural network, we propose a new detection framework called Extended Feature Pyramid Network (EFPN) to strengthen the information extraction ability of the network. In the EFPN, we first design the multi-branched dilated bottleneck (MBDB) module in the lateral connections to capture more semantic information. Then, we devise an attention pathway for better locating the objects. Finally, an augmented bottom-up pathway is added to make shallow-layer information easier to propagate and to further improve performance. Moreover, we present an adaptive scale training strategy to enable the network to better recognize multi-scale objects. Meanwhile, we present a novel clustering method to obtain adaptive anchors so that the network can better learn the data features. Experiments on public aerial datasets indicate that the presented method obtains state-of-the-art performance.


Introduction
With the rapid development of deep convolutional neural networks (CNNs) [1] in recent years, the conventional object detection methods [2,3] have made some remarkable achievements in natural images. However, due to the huge scale variations of the vast majority of objects and the compact distribution of many small objects in remote sensing images, it still remains a tremendous challenge for locating and predicting the target objects [4,5].
In order to detect objects at different scales, a basic method is to leverage a multi-scale featurized image pyramid (Figure 1a) [6], which is popular in both hand-crafted feature-based approaches [7,8] and deep CNN-based approaches. Strong evidence [9,10] has shown that current standard deep detectors can benefit from a multi-scale learning strategy. However, many object detectors based on deep learning avoid this multi-scale image pyramid representation, mainly because it requires a large amount of computation and memory.
Thus, Lin et al. [11] exploited the multi-scale pyramid structure in deep CNNs to construct the Feature Pyramid Network (FPN) at a small additional cost. The FPN (Figure 1b) adopts a bottom-up pathway, a top-down pathway and lateral connections to construct high-level semantic information at each scale. This structure brings an obvious improvement as a common feature extractor in some practical applications. However, since large-scale objects are usually produced and predicted in the deeper convolution layers of the FPN, the boundaries of these objects might be too fuzzy to obtain accurate regression. Furthermore, the FPN usually predicts small-scale objects in the shallower layers, whose low-level semantic information might not be enough to identify the class of the objects. The designers of the FPN were aware of this problem and adopted a top-down structure with lateral connections to fuse shallow layers with high-level semantic information to relieve it. However, if the small-scale objects disappear in the deep convolution layers, the context cues disappear at the same time. Besides, Li et al. [12] presented a backbone network called DetNet (Figure 1c), the first designed specifically for object detection. They pointed out that a larger down-sampling factor brings a larger effective receptive field, which is beneficial to image classification but may damage the localization ability of the detector. Therefore, compared with conventional backbones designed only for classification, DetNet includes additional stages and retains high spatial resolution in the deeper convolution layers. Due to this specifically designed backbone, DetNet is more powerful, especially in finding small objects and locating the boundaries of large objects. However, it contributes little to locating the small objects or to finding more ground-truth large objects.
In this work, to solve the above problems, we propose a new FPN-based structure called Extended Feature Pyramid Network (EFPN), which takes the huge scale variations of object instances into account. In the proposed EFPN, we first design a low-complexity multi-branched dilated bottleneck (MBDB) module for capturing more semantic information. Dilated convolution [13,14] is commonly used in convolutional network models to expand the receptive field without increasing the computational complexity. The designed MBDB module therefore combines multiple branches with different dilated convolution layers, and is added at the lateral connections of the FPN to obtain feature maps with more details at all scales. Unfortunately, dilated convolution is likely to cause the loss of some local information due to the enlarged receptive field, which is not beneficial for locating the objects. Recently, some researchers [15] found that attention mechanisms can not only focus on the objects of interest, but also improve their representation. Thus, we further design an attention pathway to better locate the objects and improve detection accuracy. Furthermore, an augmented bottom-up pathway is added to the EFPN to make shallow-layer information easier to propagate and to further improve small object detection.
In addition, some aerial images are too large for training CNNs. Therefore, shrinking large images by resizing is a common way to save computation and memory during training. However, resizing may make small objects even smaller and more likely to be lost in the deeper layers. The general solution to this problem is to simply cut large-scale images into small chunks [5,16]. However, when the cut images include relatively large objects, such as a ground track field, these objects may be broken up into small pieces that are hard for the network to recognize. To ease this problem, we present an adaptive scale training strategy that tries to keep large objects intact after cutting the remote sensing images: an adaptive adjustment rate is used to resize the original images before they are divided into smaller sub-images. That is to say, an image is likely to include some relatively large objects whose size may exceed the sub-image size (usually set to 800 or 1000 pixels). We first multiply the original image size by the proposed adaptive adjustment rate to bring the large objects to a proper size. Then, we divide the resized image into smaller sub-images. By adaptively resizing before cutting the original image, we keep the large objects more intact in the sub-images and improve the recognition ability of the neural network.
Moreover, our proposed EFPN detection framework is built on the faster region-based convolutional neural network (Faster R-CNN) [2] and FPN [11]. Faster R-CNN introduced the region proposal network (RPN) to replace the original selective search algorithm (SS algorithm) [17] for generating the regions of interest (ROIs) [18]. The ROIs are a set of class-independent candidate boxes that may include any objects. By sliding a tiny network over the convolution feature map, the RPN outputs a set of rectangular object proposals relative to reference boxes called anchors, and each anchor is associated with an aspect ratio and a scale. The anchors are the initialization of the candidate boxes in Faster R-CNN, and their aspect ratios and scales are generally set manually and empirically to several initial values. Unlike natural images, some objects in aerial images have very different shapes and large aspect ratios, such as bridges and harbors. Improper prior scales and aspect ratios generally reduce the localization accuracy of the detector. Therefore, it may not be appropriate to directly use the prior scales and aspect ratios of natural images for remote sensing image detection. To solve this problem, we analyze the training data and propose a special clustering method to obtain appropriate aspect ratios and scales of anchors.
We conducted experiments on public aerial datasets, and the results indicate that the presented method obtains state-of-the-art performance. DOTA [19] is a large-scale dataset for object detection in aerial images, and DOTA-v1.5 is an updated version of DOTA-v1.0. The NWPU VHR-10 dataset [20] is a publicly available 10-class geospatial object detection dataset. RSOD [21] is an open dataset for object detection in remote sensing images.
The main contributions of this work are summarized as follows: 1. We propose a new framework called Extended Feature Pyramid Network (EFPN) for object detection in aerial images. 2. In the EFPN, we first design the multi-branched dilated bottleneck (MBDB) module in the lateral connections to capture much more semantic information. Then, for better locating the objects, we further design an attention pathway in the deeper layer of EFPN. Finally, an augmented bottom-up pathway is conducted for making shallow layer information easier to spread and further improving performance. 3. We propose an adaptive scale training strategy to try to keep the large objects intact after cutting down in the aerial images and improve the recognition ability of the presented network. Meanwhile, we develop a new clustering method for getting adaptive anchors to replace the initial values which are set artificially. 4. The presented method obtains optimal performance in the challenging DOTA-v1.5 dataset [19], NWPU VHR-10 dataset [20] and RSOD dataset [21].

Multi-scale Object Detectors.
Object detection with various aspect ratios and scales is an extremely challenging problem in the domain of computer vision. In recent years, CNN has become one of the most effective techniques for object detection. Generally, these CNN-based methods are roughly summarized to two technology paths: one-stage detection methods and two-stage detection methods. Both of these two methods apply a variety of techniques to handle the scale variation problem in multi-category target detection tasks.
In general, the one-stage detection methods are more efficient, because they can classify the predefined anchors directly and further refine them without the proposal generation step. YOLO9000 [3] is a real-time object detection system that can detect over 9000 object categories, and it simply used multi-scale training by selecting new image size randomly per 10 batches to make the neural network model scale-invariant. The single-shot multibox detector (SSD) method [22] achieved multi-scale features by fusing different scale features from different layers without adding additional computation. RetinaNet [23] applied FPN as the backbone and used the focal loss to deal with the imperfection of one-stage object detection that the network suffers from an extreme class imbalance between foreground and background during training. RefineDet [24] selected four feature layers with different stride sizes to deal with objects of different scales.
In contrast, the two-stage detection methods first generate a set of region proposals and then refine them through CNNs. Thus, they usually have better localization accuracy than the one-stage detection methods. Faster R-CNN [2] improved Fast R-CNN [25] by developing the RPN to replace the original SS algorithm [17] for generating the ROIs [18]. R-FCN [26] presented a region-based fully convolutional network and designed position-sensitive score maps for accurate and efficient object detection. The unified multi-scale CNN (MS-CNN) [27] detected multi-scale objects at multiple layers. Faster FPN [11] is one of the predominant detectors for objects of different scales; it further introduced a top-down structure to enrich the semantic information of low-level features. The presented EFPN is inspired by this architecture.

Dilated Convolution.
Owing to its powerful feature extraction ability, the deep CNN has achieved great success in the field of object detection. However, there are still some defects in the deep CNN, especially in the design of the up-sampling and pooling layers. First of all, the up-sampling (e.g., bilinear interpolation) and pooling layers are deterministic, which means that they have no learnable parameters. Secondly, in the up-sampling and pooling process, the internal data structures and spatial hierarchy information may be lost, and thus the information of small objects cannot be rebuilt.
To solve the above problems, researchers have proposed many effective structures, and dilated convolution [13] is one of the most successful. Dilated convolution inserts holes between the elements of the convolution kernel, which enlarges the kernel while keeping the original weights; the number of inserted holes is determined by a single parameter called the dilation rate. The purpose of this structure is to provide a larger receptive field without pooling and with the same amount of computation. Dilated convolution retains the internal data structures and avoids down-sampling. Therefore, combining layers with different dilation rates can enrich the semantic information. Dilated convolution has been widely used in the field of semantic segmentation [28] to better combine local and global context information [14]. In the object detection field, DetNet [12] designed a detection backbone network by introducing dilated convolution in the deeper layers, so that it can keep the spatial resolution and expand the receptive field simultaneously. In this work, we use dilated convolution with different dilation rates in the presented multi-branched dilated bottleneck (MBDB) module to extract richer detail information.
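As a minimal illustration of the idea described above, the following sketch applies the same three weights at two dilation rates to a one-dimensional signal. It is not the implementation used in this paper: the function name `dilated_conv1d`, the toy signal and the weights are assumptions made only for this example, and the operation is written in the cross-correlation form with no padding.

```python
def dilated_conv1d(signal, weights, dilation=1):
    """Valid (no padding) 1-D dilated convolution in cross-correlation form."""
    k = len(weights)
    # Number of input positions spanned by one output value.
    span = (k - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        # The k weights are applied to inputs that are `dilation` steps apart.
        out.append(sum(weights[j] * signal[start + j * dilation] for j in range(k)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
weights = [1, 0, -1]  # the same three weights are reused at every rate

plain = dilated_conv1d(signal, weights, dilation=1)    # each output spans 3 inputs
dilated = dilated_conv1d(signal, weights, dilation=2)  # each output spans 5 inputs
```

With a dilation rate of 2, every output still costs three multiplications, but it depends on a window of five input values instead of three.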
In the k-means algorithm, distance is used as the measure of similarity: the closer two objects are, the greater their similarity. Since a cluster is composed of objects that are close to each other, the algorithm takes compact and well-separated clusters as its ultimate goal. The mini-batch k-means algorithm is a variant of k-means that uses a small, randomly selected subset of the data at each iteration. It can greatly reduce the computation time, and its results are usually similar to those of the standard k-means algorithm. The iterative steps of mini-batch k-means are as follows: (1) randomly draw some samples from the dataset to form a mini-batch and assign each of them to the nearest cluster center (centroid); (2) update the cluster centers.
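The two iterative steps above can be sketched as follows. This is a hedged illustration rather than the optimized procedure used in this paper: the function names, the default batch size and the per-center learning rate `1 / count` (a common choice for mini-batch k-means) are assumptions introduced for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mini_batch_kmeans(samples, k, batch_size=32, n_iters=100, seed=0):
    """Mini-batch k-means on a list of equal-length tuples; returns k centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(samples, k)]  # random initial centers
    counts = [0] * k  # number of samples each center has absorbed so far
    for _ in range(n_iters):
        batch = [rng.choice(samples) for _ in range(batch_size)]
        # Step (1): assign every sample of the mini-batch to its nearest center.
        nearest = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in batch]
        # Step (2): move each center towards its assigned samples.
        for p, c in zip(batch, nearest):
            counts[c] += 1
            lr = 1.0 / counts[c]  # assumed per-center learning rate
            centers[c] = [(1 - lr) * m + lr * x for m, x in zip(centers[c], p)]
    return centers


# Toy data: two well-separated one-dimensional groups.
samples = [(0.01 * i,) for i in range(50)] + [(10.0 + 0.01 * i,) for i in range(50)]
centers = sorted(c[0] for c in mini_batch_kmeans(samples, k=2, seed=1))
```

Because each iteration touches only `batch_size` samples, the cost per iteration does not grow with the size of the dataset.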
The mini-batch k-means algorithm converges faster than the k-means algorithm while keeping nearly the same clustering quality. K-means clustering has been used in some detectors to obtain the initial sizes of candidate boxes. In order to automatically find good prior anchors, YOLO9000 [3] ran k-means clustering directly on the bounding boxes of the training data. In this paper, we obtain the adaptive scales and aspect ratios of anchors with an optimized mini-batch k-means algorithm, which is easy to implement and can effectively improve detection performance.

Figure 2 shows the whole architecture of the presented EFPN, which is built on Faster FPN [11] and improves it in several aspects. First, we design the multi-branched dilated bottleneck (MBDB) module in the lateral connections to capture more semantic information. Then, we devise an attention pathway to better locate the objects. Finally, an augmented bottom-up pathway is added to make shallow-layer information easier to propagate and to further improve performance. On the whole, the proposed EFPN consists of a bottom-up pathway, lateral connections, a top-down pathway, an attention pathway and an augmented bottom-up pathway. The details are described in the following subsections. Following the EFPN, the RPN is executed at each level of the EFPN output to produce the object proposals. Unlike previous methods that use the RoI Pooling operator [2], we adopt the RoI Align operator proposed in Mask R-CNN [30] to extract RoI features. The final detection results are obtained by further precise location regression and fine classification.

Bottom-up Pathway
We adopt the aggregated residual transformations for deep neural networks (ResNeXt-101 32 × 8d) [31] as the backbone of the bottom-up pathway. Due to its superior performance in the field of image processing, it is widely used in many object detectors [4]. The backbone usually has many layers that generate feature maps with the same spatial size, and we define these layers as one stage. ResNeXt-101 contains five stages. As can be seen from Figure 2, we only use stage1, stage2, stage3 and stage4 of ResNeXt-101 in our backbone and keep these stages in their original form. The outputs of the last residual block of stage2, stage3 and stage4 are denoted as {C2, C3, C4}, with strides of {4, 8, 16} pixels with respect to the input image. They are extracted to construct the feature pyramid. We do not use the output of stage1 in the pyramid, because it is memory-consuming. The reasons that we do not use stage5 in the EFPN are as follows. For one thing, traditional backbone networks with a large down-sampling factor bring a larger effective receptive field, which is beneficial to image classification but may damage the localization ability of the detector. Thus, stage5, with a stride of 32, is of little use for pinpointing larger objects or for adding semantic information about smaller objects that may already have disappeared in this layer. For another, with the proposed MBDB module (described in the lateral connections below), we have enlarged the receptive field to obtain richer semantic information based on stage4. Thus, since stage5 is of little use to our neural network model, we discard it to save memory and computation.

Lateral Connections with MBDB Module
Considering the huge scale variations of aerial object instances, a single receptive field may not be able to effectively learn all situations. Different dilation rates yield receptive fields of different scales without pooling and with the same amount of computation. Thus, for the lateral connections, the proposed EFPN employs the low-complexity multi-branched dilated bottleneck (MBDB) module to capture more semantic information. The details of the MBDB module are shown in Figure 3. The MBDB module combines multiple branches of dilated convolution layers with different dilation rates to obtain feature maps with more details at all scales. The MBDB first reduces the channel dimension with a 1 × 1 convolution layer. Then the outputs are divided equally among three branches, each of which is a 3 × 3 dilated convolution layer; the dilation rates of the three branches are 3, 2 and 1, respectively. Finally, we append a 1 × 1 convolution on the combined feature maps of the three branches to produce the final feature map with 256 channels. We have experimented with more dilated convolution layers and observed only marginally better results. Thus, in order to achieve a nearly optimal effect without introducing too many parameters, we choose this three-branch MBDB module. At each level of the feature pyramid, the presented EFPN keeps the feature map at high spatial resolution while retaining a large receptive field due to the added MBDB module, and thus has a better ability to capture semantic information.
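To make the effect of the three dilation rates concrete, the standard relation between kernel size $k$, dilation rate $d$ and the effective kernel extent $k_{\mathrm{eff}}$ can be evaluated for the three branches. This worked example is an illustration added here under the usual definition of dilated convolution; it is not a formula stated in the original text.

```latex
k_{\mathrm{eff}} = k + (k - 1)(d - 1)
\quad\Rightarrow\quad
k_{\mathrm{eff}}\big|_{k=3,\,d=1} = 3,\qquad
k_{\mathrm{eff}}\big|_{k=3,\,d=2} = 5,\qquad
k_{\mathrm{eff}}\big|_{k=3,\,d=3} = 7
```

Each branch therefore still uses 3 × 3 = 9 weights per input-output channel pair, while the three branches cover extents of 3 × 3, 5 × 5 and 7 × 7 pixels on the same feature map.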

Top-down Pathway
For the top-down pathway, we first attach an MBDB module to C4 to generate the coarsest-resolution map P4. Then, the spatial resolution of the produced feature map P4 is up-sampled by a factor of 2. Finally, we merge the up-sampled map with the corresponding bottom-up map, which has passed through an MBDB module as the lateral connection. Note that we use nearest-neighbor up-sampling for simplicity and element-wise addition for merging. This process is repeated until the finest-resolution map is produced. The final set of feature maps is denoted as {P2, P3, P4}, which have the same spatial sizes as {C2, C3, C4}, respectively.
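The merging step described above (nearest-neighbor up-sampling by a factor of 2 followed by element-wise addition) can be sketched on a single-channel map as follows. The helper names and the toy values are assumptions made for this example, and the lateral map is taken as given, i.e., the MBDB module is assumed to have been applied already.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour up-sampling of a 2-D map (list of rows) by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat every column twice
        out.append(wide)
        out.append(list(wide))  # repeat every row twice
    return out


def merge_top_down(top, lateral):
    """Up-sample the coarser map and add it element-wise to the lateral map."""
    up = upsample_nearest_2x(top)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, lateral)]


p4 = [[1, 2], [3, 4]]                       # toy coarse map (2 x 2)
lateral3 = [[10] * 4 for _ in range(4)]     # toy lateral map (4 x 4)
p3 = merge_top_down(p4, lateral3)
```

In a real network the same operation is applied independently to each of the 256 channels.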

Attention Pathway
The designed MBDB module in the lateral connections can capture much more detailed information. Unfortunately, some local information may be lost by the dilated convolution, which is not beneficial for locating the objects. Thus, to better locate the objects and improve detection accuracy, we further design an attention pathway with an attention module. Since the designed attention module further refines the feature map, the network performs better and is more robust to noisy input. In addition, beyond the previous works [32,33], the proposed attention module can better fuse the global and the detail information at the pixel level through the novel concatenation. The attention module mainly consists of two parts: a channel attention block (CAB) and a spatial attention block (SAB). The details of each block are illustrated in Figure 4, and the whole attention process of the CAB is calculated as follows:

$$F' = \delta\big(F \oplus (W_c(F) \otimes F)\big) \tag{1}$$

where $\delta$ is the Rectified Linear Unit (ReLU) [34] activation function; $\otimes$ is the element-wise product; $\oplus$ is element-wise addition; $W_c(F) \in \mathbb{R}^{C \times 1 \times 1}$ is the channel attention weight; $F \in \mathbb{R}^{C \times H \times W}$ represents the input feature map and $F'$ represents the output feature map of the CAB. Concretely, in the CAB, the spatial dimension of the input feature is first compressed by max-pooling and average-pooling simultaneously. Then the generated max-pooling features $F^{c}_{max} \in \mathbb{R}^{C \times 1 \times 1}$ and average-pooling features $F^{c}_{avg} \in \mathbb{R}^{C \times 1 \times 1}$ are fed into two weight-shared fully connected layers. The size of the hidden activation layer is set to $\mathbb{R}^{C/r_0 \times 1 \times 1}$ to reduce the parameter overhead, and it is followed by a ReLU. The reduction ratio $r_0$ is set to 16. $F^{c}_{max}$ and $F^{c}_{avg}$ are computed as follows:

$$F^{c}_{max}(l) = \max\{F_l(i,j) \mid 0 < i \le H,\ 0 < j \le W\}, \quad 0 < l \le C \tag{2}$$

$$F^{c}_{avg}(l) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_l(i,j), \quad 0 < l \le C \tag{3}$$

where $H$, $W$, $C$ and $l$ are the height, the width, the number of channels and the index of the $l$-th channel of the feature maps, respectively. The output feature vectors of the weight-shared fully connected layers are merged via element-wise addition.

Finally, the merged vector passes through a sigmoid function to produce the channel attention weight $W_c(F)$, which can be summarized as:

$$W_c(F) = \sigma\big(W_1\,\delta(W_0 F^{c}_{max}) \oplus W_1\,\delta(W_0 F^{c}_{avg})\big) \tag{4}$$

where $\sigma$ denotes the sigmoid function; $\delta$ is the ReLU activation function; $W_0 \in \mathbb{R}^{C/r_0 \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r_0}$ are shared by the max-pooling and average-pooling inputs.

In addition, the whole attention process of the SAB is calculated as follows:

$$F'' = \delta\big(F' \oplus (W_s(F') \otimes F')\big) \tag{5}$$

where $W_s(F') \in \mathbb{R}^{1 \times H \times W}$ is the spatial attention weight; $F''$ represents the output feature map of the SAB, which is also the final refined output. Firstly, in the SAB, the feature maps $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ are produced by max-pooling and average-pooling along the channel axis, respectively. They are calculated by:

$$F^{s}_{max}(i,j) = \max\{F'_l(i,j) \mid 0 < l \le C\}, \quad 0 < i \le H,\ 0 < j \le W \tag{6}$$

$$F^{s}_{avg}(i,j) = \frac{1}{C} \sum_{l=1}^{C} F'_l(i,j), \quad 0 < i \le H,\ 0 < j \le W \tag{7}$$

Then, these two maps are concatenated and a convolution layer is applied to reduce the dimension. Finally, a sigmoid function is applied to generate the spatial attention weight $W_s(F') \in \mathbb{R}^{1 \times H \times W}$. In short, the spatial attention weight is computed as:

$$W_s(F') = \sigma\big(f^{3 \times 3}([F^{s}_{max};\ F^{s}_{avg}])\big) \tag{8}$$

where $\sigma$ represents the sigmoid function and $f^{3 \times 3}$ is the convolution operation with a filter size of 3 × 3.
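The computation of the channel attention weight described above (spatial max-pooling and average-pooling, a shared two-layer fully connected network with a ReLU in between, element-wise addition and a sigmoid) can be sketched in plain Python as follows. This is only a toy illustration: the function name, the two-channel input, the hand-picked weights and the reduction ratio of 2 are assumptions for the example, whereas the paper uses a reduction ratio of 16.

```python
import math


def channel_attention_weight(feature, w0, w1):
    """feature: list of C channels, each a list of H rows of W values.
    w0: (C/r) x C matrix, w1: C x (C/r) matrix, shared by both pooled vectors."""
    # Compress the spatial dimension of every channel.
    f_max = [max(max(row) for row in ch) for ch in feature]
    f_avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

    def shared_mlp(vec):
        # First layer followed by ReLU, then second layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

    # Merge both branches by element-wise addition, then apply the sigmoid.
    merged = [a + b for a, b in zip(shared_mlp(f_max), shared_mlp(f_avg))]
    return [1.0 / (1.0 + math.exp(-z)) for z in merged]


feature = [[[0, 2], [2, 4]], [[1, 1], [1, 1]]]  # toy input with C = 2, H = W = 2
w0 = [[1.0, 1.0]]                               # assumed weights, shape 1 x 2
w1 = [[1.0], [-1.0]]                            # assumed weights, shape 2 x 1
weights = channel_attention_weight(feature, w0, w1)
```

The result contains one value in (0, 1) per channel, which is then broadcast over the spatial positions of that channel when it is multiplied with the feature map.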

Augmented Bottom-up Pathway
Generally, low-level features are advantageous for obtaining accurate localization information. However, in the bottom-up pathway of the backbone there is a long path, passing through about 100 layers, from the shallow-level to the high-level features. Thus, to reduce the loss of information during transmission and to strengthen the precise position signals existing in the shallow layers, an augmented bottom-up pathway that consists of several layers is adopted in the proposed framework. Figure 2 shows the designed augmented bottom-up pathway, which produces the new feature map Mi+1 from a higher-resolution feature map Mi, a coarser map Pi+1 and Ci+1. Note that M2 is produced only from P2 and C2, and that the feature maps used in this structure always have 256 channels. The details are as follows. Firstly, a 3 × 3 convolution layer with stride 2 is applied to each feature map Mi to reduce its spatial size and obtain a down-sampled map. Each of these convolutional layers is followed by a ReLU. Then the down-sampled map from Mi, the feature map Pi+1 after an MBDB module, and Ci+1 after a 1 × 1 convolution layer are added to produce the informative Mi+1. This process is repeated until M4 is reached. Finally, to reduce the aliasing effect, a 3 × 3 convolution layer is applied on each merged map to produce the final feature maps {M2, M3, M4}.

Adaptive Scale Training Strategy
In remote sensing, when there are many large images, we may split them into small chips to reduce the computational and memory cost. Generally, a constant sub-size of the sub-images and a constant overlap (G) are set, and the original image is divided into smaller sub-images from left to right and from top to bottom. An example of the split process is depicted in Figure 5a. In Figure 5a, the green box denotes one object in the image; h and w denote the height and width of this object, respectively. When the height and width of the object are less than the sub-size (such as 800 or 1000 pixels), the object can be divided into up to four parts, and each part lies in a different sub-image (denoted by the pink, blue, yellow and red dotted boxes, respectively). We call these parts sub-objects, and two adjacent parts intersect each other. The overlap of two sub-objects is the same as that of the sub-images and is denoted by G.
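The split process described above can be sketched as a sliding window with a constant sub-size and overlap. The function names, the default overlap of 200 pixels and the handling of the image border (the last window is placed flush with the border) are assumptions made for this example and are not specified in the text.

```python
def split_origins(length, sub_size, overlap):
    """Start offsets of the sub-images along one image axis."""
    if length <= sub_size:
        return [0]
    stride = sub_size - overlap
    origins = list(range(0, length - sub_size, stride))
    origins.append(length - sub_size)  # assumed: last window flush with the border
    return origins


def split_windows(width, height, sub_size=800, overlap=200):
    """Sub-image boxes (x0, y0, x1, y1), ordered left to right, top to bottom."""
    return [
        (x, y, min(x + sub_size, width), min(y + sub_size, height))
        for y in split_origins(height, sub_size, overlap)
        for x in split_origins(width, sub_size, overlap)
    ]


windows = split_windows(2000, 1400)
```

For a 2000 × 1400 image this yields three columns and two rows of 800 × 800 sub-images, with neighbouring windows sharing 200 pixels.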
In this work, we propose an adaptive scale training strategy: an adaptive adjustment rate is used to resize the original images before they are divided into smaller sub-images, so as to keep large objects intact after cutting and to reduce the number of difficult samples for large targets. The definition and relative quantity of difficult samples have a great effect on the performance of the neural network model. When a certain category contains many difficult samples in the training data, it is difficult for the model to learn the characteristics of this category and to identify it accurately. We use the ratio of the sub-object region to the original object region to define whether an object is a difficult sample. If this ratio is lower than a threshold $T$ (we generally set it to 0.7), the sub-object in the sub-image is difficult to detect and we call it a difficult sample; otherwise we call it a simple sample. The object denoted by the green box in Figure 5a is enlarged in Figure 5b. Figure 5b indicates that the four sub-objects can be represented by the pink box A1, the blue box A2, the yellow box A3 and the red box A4, respectively. In addition, x and y denote the width and height of the upper-left sub-object, respectively. Taking A1 as the upper-left, A2 as the upper-right, A3 as the lower-left and A4 as the lower-right sub-object, the areas of these four sub-objects can be expressed by the following formulas:

$$S_{A_1} = x\,y \tag{9}$$

$$S_{A_2} = (w - x + G)\,y \tag{10}$$

$$S_{A_3} = x\,(h - y + G) \tag{11}$$

$$S_{A_4} = (w - x + G)(h - y + G) \tag{12}$$

If the maximum ratio of these four sub-objects to the original object is not lower than the threshold mentioned above, the object is easy to detect and is called a simple sample. Thus, the condition that this object is a simple sample can be expressed by the following inequality:

$$\frac{\max(S_{A_1}, S_{A_2}, S_{A_3}, S_{A_4})}{w\,h} \ge T \tag{13}$$

The areas of these four sub-objects ($S_{A_1}, S_{A_2}, S_{A_3}, S_{A_4}$) depend on where the object is cut, so $\max(S_{A_1}, S_{A_2}, S_{A_3}, S_{A_4})$ is uncertain. Nevertheless, if the minimum of this maximum reaches the threshold, the condition holds in all cases. The maximum is smallest when $x = (w + G)/2$ and $y = (h + G)/2$, so its minimum is:

$$\min \max(S_{A_1}, S_{A_2}, S_{A_3}, S_{A_4}) = \frac{(w + G)(h + G)}{4} \tag{14}$$

Thus, formula (13) can be expressed as:

$$\frac{(w + G)(h + G)}{4\,w\,h} \ge T \tag{15}$$

Therefore, for one image, we know the width and height of the object and we can set proper values of the overlap $G$ and the threshold $T$; the adaptive adjustment rate $r$ can then be calculated by applying condition (15) to the object resized to $rw \times rh$:

$$\frac{(r\,w + G)(r\,h + G)}{4\,r^{2}\,w\,h} \ge T \tag{16}$$

From the mathematical reasoning, this adaptive adjustment rate is also applicable to other cases not discussed here. The adaptive adjustment rate ($r$) is fixed before image cutting and is determined by the width ($w$), the height ($h$), the overlap ($G$) and the threshold ($T$). The threshold $T$ on the ratio of the sub-object to the original object is generally set to 0.7 to judge whether an object is a difficult sample. In general, once $T$ is given empirically, the adaptive adjustment rate ($r$) can be derived from the width ($w$), the height ($h$) and the overlap ($G$). Using the proposed adaptive adjustment rate to resize the original image before dividing it into smaller sub-images ensures that most of the sub-objects are simple samples and improves the recognition ability of the neural network.
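One possible way to turn the reasoning above into a number is sketched below. It rests on two assumptions that should be checked against the original formulas: in the worst-case cut the largest sub-object has area (w + G)(h + G) / 4, and resizing by r scales w and h while the overlap G (in pixels) stays fixed. Under these assumptions the condition becomes the quadratic inequality (4T - 1)wh r² - G(w + h) r - G² ≤ 0, whose positive root is the largest admissible rate. The function names are hypothetical.

```python
import math


def worst_case_ratio(w, h, overlap):
    """Largest sub-object area over object area in the worst-case cut (assumed model)."""
    return (w + overlap) * (h + overlap) / (4.0 * w * h)


def max_adjustment_rate(w, h, overlap, threshold=0.7):
    """Largest r such that the resized object (r*w, r*h) still meets the threshold."""
    a = (4.0 * threshold - 1.0) * w * h
    if a <= 0:
        return math.inf  # threshold <= 0.25: the condition holds at any scale
    b = overlap * (w + h)
    c = overlap ** 2
    # Positive root of a*r^2 - b*r - c = 0.
    return (b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)


# Example: a 1000 x 1000 pixel object with an assumed overlap of 200 pixels.
r = max_adjustment_rate(1000, 1000, overlap=200, threshold=0.7)
ratio_after_resize = worst_case_ratio(r * 1000, r * 1000, 200)
```

In this example r is about 0.297, i.e., the object would be shrunk to roughly 297 pixels per side so that even the worst-case cut keeps 70% of it in one sub-image.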

Adaptive Anchors
The aspect ratio is the ratio of the width of an anchor to its height. When an anchor is square, its scale is the side length of the anchor. In practical applications, we may need to detect objects with special shapes, such as bridges and harbors in remote sensing images. In this case, the default initialization of the anchor box sizes affects the accuracy of the final trained model, and we need to generate the anchor box sizes from our own data instead of using the default values.
For the aspect ratios of anchors, when the amount of data is very large, as in some remote sensing image detection tasks, directly applying the k-means algorithm to cluster the aspect ratios of the training data is very time-consuming and might not be necessary. To solve this problem, we replace the k-means algorithm with the mini-batch k-means algorithm to reduce the computation time. However, two main problems still prevent an exact clustering result. (1) Because the mini-batch k-means algorithm is sensitive to the selection of the initial centroids, the results of different clustering runs may differ and may not be precise enough. (2) The number of cluster centers must be specified in advance, and different numbers of cluster centers lead to very different results. A manually assigned number of cluster centers, which here is also the number of adaptive aspect ratios of anchors, may not be the best choice.
To address these problems, we first randomly initialize the cluster centroids many times and calculate the value of the loss function each time. The value of the loss function is the average Euclidean distance between every data sample and its closest cluster center. Then, the cluster centroids corresponding to the minimum value of the loss function are taken as the clustering result. The values of the cluster centroids are the adaptive aspect ratios of anchors. The loss function is expressed as follows:

$$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\| \tag{17}$$

where $m$ represents the number of data samples; $x^{(i)}$ is one sample and $\mu_{c^{(i)}}$ is the closest cluster center of this sample; $K$ represents the number of cluster centers, a pre-defined hyperparameter that we vary from two to eight. Finally, the Elbow Method [35] is used to determine the appropriate number of cluster centers $K$, which here is also the number of adaptive aspect ratios of anchors; it gives a good tradeoff between high recall and the complexity of the neural network model.
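The restart procedure described above can be sketched as follows: cluster several times from random initial centroids, evaluate the average distance of every sample to its closest center, and keep the run with the smallest value. For brevity the sketch uses plain one-dimensional k-means iterations instead of mini-batch k-means, and all function names, the number of restarts and the toy aspect ratios are assumptions made for this example.

```python
import random


def average_distance(samples, centers):
    """Average distance between each sample and its closest cluster center."""
    return sum(min(abs(x - c) for c in centers) for x in samples) / len(samples)


def kmeans_1d(samples, k, rng, n_iters=20):
    """Plain k-means on scalars (stand-in for the mini-batch variant)."""
    centers = rng.sample(samples, k)
    for _ in range(n_iters):
        groups = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Keep the old center if a cluster received no samples.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def best_of_restarts(samples, k, n_restarts=10, seed=0):
    """Run the clustering several times and keep the centers with the lowest loss."""
    rng = random.Random(seed)
    runs = [kmeans_1d(samples, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda centers: average_distance(samples, centers))


# Toy aspect ratios forming three groups around 0.5, 1 and 3.
ratios = [0.5, 0.48, 0.52, 1.0, 0.98, 1.02, 3.0, 2.9, 3.1]
losses = {k: average_distance(ratios, best_of_restarts(ratios, k)) for k in range(2, 6)}
```

Plotting `losses` against K and looking for the point where the curve flattens is the usual way to apply the Elbow Method; on this toy data the loss drops sharply from K = 2 to K = 3 and changes little afterwards.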
For the scales of anchors, we use the mini-batch k-means algorithm with the Intersection over Union (IoU) distance [3] instead of the aforementioned Euclidean distance to obtain the adaptive settings. As noted in [3], if standard k-means with Euclidean distance is used directly, larger boxes generate more error than smaller boxes, whereas the IoU distance is independent of the box size. The IoU distance is defined as:

$$D(\text{box}, \text{centroid}) = 1 - \mathrm{IoU}(\text{box}, \text{centroid}).$$
We ran mini-batch k-means for various values of K. For each K, as in the procedure for the adaptive aspect ratios, we randomly initialize the cluster centroids many times and calculate the average IoU distance each time. Then, we select the cluster centroids corresponding to the minimum average IoU distance. Given the various values of K and the corresponding minimum average IoU distances, the Elbow Method [35] is again used to determine the appropriate K. Here, the cluster centroids represent the widths and heights of the prior anchors, and the anchor scales are obtained as the side length of the square that has the same area as each prior anchor.
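The IoU distance and the conversion from a clustered box to an anchor scale can be illustrated with a short sketch. It assumes that each box and centroid is a (width, height) pair and that the two are aligned at a common corner when their IoU is computed, which is the usual convention for anchor clustering but is not spelled out in the text. The helper names are hypothetical.

```python
import math

def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at a common corner (assumption)."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def iou_distance(box, centroid):
    """D(box, centroid) = 1 - IoU(box, centroid)."""
    return 1.0 - iou_wh(box, centroid)

def mean_iou_distance(boxes, centroids):
    """Average distance of every box to its closest centroid (the value to minimize)."""
    return sum(min(iou_distance(b, c) for c in centroids) for b in boxes) / len(boxes)

def anchor_scale(centroid):
    """Side of the square with the same area as the (width, height) centroid."""
    return math.sqrt(centroid[0] * centroid[1])
```

Because the IoU is a ratio of areas, scaling a box and a centroid by the same factor leaves the distance unchanged, which is the size independence that motivates replacing the Euclidean distance.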
Furthermore, the mini-batch k-means algorithm is simple and easy to implement, and it also greatly improves the detection performance. Hence, it can be further explored and applied to other parameters of the neural network model in future work.

Dataset and Experimental Settings
To verify the effectiveness of the presented method, we conduct comparative experiments on three public aerial datasets: DOTA-v1.5 [19], NWPU VHR-10 [20] and RSOD [21]. The dataset descriptions, implementation details and evaluation criteria are discussed in this section. To ensure that the distributions of the training data and test data roughly match, 1/2 of the DOTA-v1.5 dataset is randomly selected as the training set, 1/6 as the validation set and 1/3 as the test set. In addition, Table 1 summarizes the training parts of the DOTA-v1.5 and DOTA-v1.0 datasets. Note that the image counts in this table do not add up to the actual number of training images, because one image may contain more than one category. As shown in Table 1, the object instances added in DOTA-v1.5 are mainly small vehicles and large vehicles, and the container crane category is new. Therefore, the neural network model needs a stronger capability for detecting small objects and greater robustness in detecting multi-scale objects.

NWPU VHR-10 Dataset
The NWPU VHR-10 dataset [20] consists of 800 images (about 1000 × 1000 pixels), of which 650 are positive images that contain objects and 150 are negative images without objects. It includes ten categories: Airplane, Ship (SH), Storage tank (ST), Baseball diamond (BD), Tennis court (TC), Basketball court (BC), Ground track field (GTF), Harbor (HR), Bridge (BR) and Vehicle. In this paper, the dataset is randomly split into a training set, a validation set and a test set in the proportions of 20%, 20% and 60%.

RSOD Dataset
The RSOD dataset [21] has 936 annotated images, containing 4993 aircraft, 191 playgrounds, 180 overpasses and 1586 oil tanks. In this paper, the dataset is randomly split into a training set, a validation set and a test set in the proportions of 25%, 25% and 50%.
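The random splits used for these datasets can be reproduced in spirit with a few lines of code. The sketch below is an assumption about the mechanics only: the seed, the rounding rule and the function name `split_dataset` are hypothetical, and the paper does not state how its splits were drawn.

```python
import random

def split_dataset(items, fractions, seed=0):
    """Shuffle `items` and cut them into consecutive parts sized by `fractions`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    parts, start = [], 0
    for fraction in fractions[:-1]:
        end = start + round(fraction * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes the remainder
    return parts
```

For example, `split_dataset(image_ids, [0.25, 0.25, 0.5])` would give the training, validation and test parts for RSOD, and `[0.2, 0.2, 0.6]` the corresponding parts for NWPU VHR-10.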

Implementation Details
The experiments are performed on the Detectron [36] platform. The proposed detector was trained on an Nvidia GTX 1080Ti GPU with 11 GB of memory and optimized with synchronous stochastic gradient descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9. Each mini-batch contains only one image. For the DOTA-v1.5 dataset, because of the high resolution of the images, we first cropped them into 1024 × 1024 sub-images with an overlap of 200 pixels. Then, for these sub-images and for the images of the other two datasets, whose resolution is relatively low (about 1000 × 1000), the short edge was resized to 800 pixels and the long edge was limited to 1000 pixels. The proposed network was trained for a total of 180k iterations on the DOTA-v1.5 dataset. The learning rate was 0.001 for the first 140k iterations and was then reduced by a factor of 10 every 20k iterations. For the other two datasets, we used the same step-decay policy but with a total of only 45k iterations: the learning rate was 0.001 for the first 30k iterations and was reduced by a factor of 10 for the remaining 15k iterations. All the experiments were initialized with weights pre-trained on Common Objects in Context (COCO) [37]. No data augmentation other than random image flipping was used during training. For ROI generation, we first selected the 10,000 proposals with the highest scores and then kept at most 2000 ROIs after the non-maximum suppression (NMS) procedure. Furthermore, Group Normalization (GN) [38] and ROI-Align [30] were used in the proposed EFPN.
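The preprocessing and learning-rate settings above can be made concrete with a small sketch. It is not the Detectron code itself: the function names are hypothetical, the handling of the last chip at an image border is an assumption (the text only gives the chip size and overlap), and the resize rule follows the common convention of scaling the short edge to the target unless the long edge would exceed its cap.

```python
def chip_origins(length, chip=1024, overlap=200):
    """Start offsets of chips along one image axis.

    Assumption: the last chip is shifted back so that it ends at the border.
    """
    if length <= chip:
        return [0]
    stride = chip - overlap
    origins = list(range(0, length - chip, stride))
    origins.append(length - chip)
    return origins

def resize_scale(height, width, short=800, long_cap=1000):
    """Scale factor that brings the short edge to `short` without the
    long edge exceeding `long_cap`."""
    scale = short / min(height, width)
    if scale * max(height, width) > long_cap:
        scale = long_cap / max(height, width)
    return scale

def step_lr(iteration, base_lr=0.001, first_drop=140_000, step=20_000, gamma=0.1):
    """Step decay: constant until `first_drop`, then multiplied by `gamma`
    at `first_drop` and again every `step` iterations."""
    if iteration < first_drop:
        return base_lr
    drops = 1 + (iteration - first_drop) // step
    return base_lr * gamma ** drops
```

With the defaults, `step_lr` follows the 180k-iteration DOTA-v1.5 schedule; calling it with `first_drop=30_000` and `step=15_000` gives the 45k-iteration schedule used for the other two datasets.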

Evaluation Criteria
In the experiments, we use the precision-recall curve (PRC) and the average precision (AP) as the evaluation criteria. The PRC depicts the relationship between the precision and the recall, which are formulated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. In general, a detector that maintains high precision as the recall increases is considered to perform well. The area under the PRC is the AP, i.e., the precision averaged over all recall values from 0 to 1. The mean average precision (mAP) is the mean of the AP values over all categories. The higher the AP, the better the performance of the detector. In addition, we evaluate the detection of small, medium and large objects, whose sizes range from 1 to 50 pixels, from 50 to 300 pixels and over 300 pixels, respectively. Averaging the per-category AP values at each scale gives the mean average precision of that scale, and the results for small, medium and large objects are denoted by APS, APM and APL, respectively.
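As an illustration of these criteria, the sketch below computes precision and recall from the counts and approximates the AP as the area under the precision-recall curve. It assumes that the detections of one class have already been sorted by descending score and matched to the ground truth, and it applies the monotone precision envelope commonly used in PASCAL VOC-style evaluation; the exact protocol of each benchmark may differ, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(is_true_positive, num_ground_truth):
    """Area under the PR curve for one class.

    `is_true_positive` lists, for detections sorted by descending score,
    whether each one matched a ground-truth object.
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    # Assumption: VOC-style envelope, precision made non-increasing in recall.
    for i in range(len(points) - 2, -1, -1):
        points[i] = (points[i][0], max(points[i][1], points[i + 1][1]))
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

The mAP is then the mean of `average_precision` over all categories, and the scale-specific values are obtained in the same way after restricting the objects to one size range.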

Ablation for EFPN
To verify the effectiveness of each part of the presented EFPN, we compare the performance changes when the MBDB module, the attention pathway (AP) and the augmented bottom-up pathway (ABUP) are separately added to the baseline FPN on the DOTA-v1.5 validation set. The first four rows of Table 2 show the comparison results. The combination strategies are FPN with the multi-branched dilated bottleneck module (FPN+MBDB), FPN with the attention pathway (FPN+AP) and FPN with the augmented bottom-up pathway (FPN+ABUP). Compared with FPN, all combinations yield better results, increasing the mAP by 2.46%, 2.87% and 2.18%, respectively. The EFPN achieves the best result with an mAP of 74.67%. In remote sensing images, objects have vastly different scales and the scale AP of some categories is small, so the average scale AP (APS, APM, APL) over all categories is generally smaller than the mAP. As shown in Table 2, the APS of each combination (FPN+MBDB, FPN+AP, FPN+ABUP) improves relative to FPN. In addition, we compare the parameters (Params), computational cost (FLOPs) and average run time per image of the baseline FPN, the proposed EFPN and each combination (FPN+MBDB, FPN+AP, FPN+ABUP). Table 3 shows the detailed comparison results. As described in Section 3.1.1, we discard stage 5 of the backbone in the EFPN to save memory because it is of little use to our neural network model. Thus, the EFPN and all combinations (FPN+MBDB, FPN+AP, FPN+ABUP) contain fewer parameters than the traditional FPN. Because of some extra operations, the floating-point operations (FLOPs) still increase for the final EFPN, but the average run time per image changes only slightly. The precision-recall curves of 5 object classes from Table 2 are shown in Figure 6. These 5 categories are selected from all 16 categories because the differences in performance between the network architectures are clearly visible for them.
Recall measures the ability to detect more objects, whereas precision measures the ratio of correct detections to all detections. Therefore, the higher the recall at which the curve begins to decrease sharply, the better the detection performance for that class. Because of the high similarity between the object and the background and the lack of training samples, container cranes are poorly recognized by the FPN, and the proposed method improves their detection to a certain extent. For each combination (FPN+MBDB, FPN+AP, FPN+ABUP), both small objects (such as small vehicles) and large objects (such as ground track fields) obtain better detection results. As shown in Figure 6e, for the presented EFPN, the precision-recall curves of most object classes begin to decline sharply only when the recall exceeds 0.8. This is because the proposed EFPN has a stronger semantic information extraction ability and good detection performance.

Ablation for the Adaptive Scale Training Strategy and Anchors
To assess the validity of the adaptive scale training strategy and the adaptive anchors described in Section 3.2, we compare the baseline FPN with FPN+AS and FPN+AA on the DOTA-v1.5 validation set, where AS and AA denote the adaptive scale training strategy and the adaptive anchors, respectively. In addition, we tested three combinations with EFPN: EFPN+AA, EFPN+AS and EFPN++ (EFPN+AA+AS). The comparison results are presented in Table 4. Table 4 shows that adding the proposed methods improves the detection results to varying degrees. Compared with FPN, FPN+AS and FPN+AA increase the total mAP by 2.36% and 2.17%, respectively. Among the combinations, EFPN++ achieves the highest mAP of 77.17%, which is 1.55% and 1.47% higher than EFPN+AS and EFPN+AA, respectively. In addition, compared with FPN and EFPN, the APL of both FPN+AS and EFPN+AS improves because the proposed adaptive scale training strategy promotes the detection of large objects. The precision-recall curves of the methods in Table 4 over the 5 classes are shown in Figure 7. From Figure 7a and Figure 7b, we can also see that the proposed FPN+AS improves the detection of large objects, such as ground track field and soccer ball field. Moreover, Figure 7a and Figure 7c show that the proposed FPN+AA improves the detection of objects with special shapes, such as bridge and harbor.

Results on DOTA-v1.5 Dataset
To compare with existing advanced methods, we reimplemented RetinaNet [23], Faster R-CNN [2] and FPN [11] on the DOTA-v1.5 dataset. All of them are applicable to multi-category object detection. To ensure the accuracy and fairness of the experimental results, all experimental data and parameter settings are kept strictly consistent. Table 5 displays the comparison results, which were obtained by submitting the predictions on the test set images to the official DOTA-v1.5 evaluation server.
As shown in Table 5, our method achieves the best performance, exceeding RetinaNet, Faster R-CNN and FPN by 18.33%, 13.78% and 7.97% in mAP, respectively. For small objects, such as small vehicles, our method outperforms the FPN by 16.49% in AP owing to its stronger information extraction ability. The detection results of EFPN++ for each class on the DOTA-v1.5 test dataset are shown in Figure 8.

Results on NWPU VHR-10 Dataset
We further compare the presented method with related advanced methods on the NWPU VHR-10 dataset [20], and the comparison results are displayed in Table 6. Table 6 shows that the presented method obtains better performance, exceeding Faster R-CNN [2] by 13.2% in mAP. In addition, our method outperforms the two advanced methods R-FCN [26] and Deformable R-FCN [39], increasing the mAP by 9.8% and 7.5%, respectively.

Results on RSOD Dataset
We also verify the effectiveness of the presented method against existing methods on the RSOD dataset [21], and Table 7 shows the comparison results. Tayara et al. [40] proposed a uniform one-stage model for object detection in aerial images and produced relatively competitive results with an mAP of 94.19%. Overall, Table 7 shows that the proposed method achieves better performance than the state-of-the-art methods.

Table 5. Comparison results on the DOTA-v1.5 test set. The abbreviated category names follow [19] and are described in detail in Section 4.1.1. The bold numbers represent the best detection results.

Discussion
Extensive experimental results demonstrate that the presented method achieves excellent detection performance on multiple remote sensing datasets. The advantages of the proposed EFPN are as follows: (1) In remote sensing images, numerous small-scale objects may be around or below 10 pixels in size, and when they vanish in the deep layers, their context cues disappear with them. Therefore, simply using the traditional feature pyramid structure can no longer improve performance in this case. Like the Feature Pyramid Network, the proposed EFPN also predicts small objects in the shallower layers. The difference is that, through the proposed MBDB module and the added augmented bottom-up pathway, the EFPN detects small objects better owing to its stronger semantic information extraction capability; (2) Since large-scale objects are usually produced and predicted in deeper layers, the boundaries of these objects might be too fuzzy to obtain an accurate regression. However, the proposed EFPN retains a high spatial resolution and has a larger receptive field in deeper layers, so it is more powerful in finding more ground-truth large objects and locating their boundaries.
Although the improvement is obvious, small objects that are particularly similar to the background, such as container cranes, are still poorly recognized. In addition, Table 1 shows that the number of container crane samples is very small. Figure 9 compares one detection result with its ground truth for the container crane. We can clearly see that there are some false alarms, caused by the high similarity between the object and the background and by the lack of training samples. In future work, we will consider optimizing our network for a stronger feature extraction ability and adopting better sample balancing and data augmentation strategies to further improve the detection performance, especially for small-scale objects against complex backgrounds.

Conclusions
In this paper, considering the huge scale variations of the object instances in remote sensing images, we proposed an Extended Feature Pyramid Network (EFPN), which has a stronger ability to capture semantic information, to detect multi-scale targets, especially dense small targets. Through a careful design, the proposed EFPN has fewer parameters than the original FPN and achieves better detection results. The ablation study demonstrated the performance improvement brought by each component of the overall architecture. When preprocessing images and objects to an appropriate size for training, we also proposed an adaptive scale training strategy that helps the neural network better learn the features of objects at different scales. In addition, because of the huge differences between object shapes in remote sensing images, we presented a novel clustering method to obtain the adaptive scales and aspect ratios of anchors and further improve the detection performance. Extensive experiments were performed on the open-source DOTA-v1.5, NWPU VHR-10 and RSOD datasets, and the results indicate that the presented method outperforms the state-of-the-art methods in mAP for both small objects and large objects.