A Single Shot Framework with Multi-Scale Feature Fusion for Geospatial Object Detection

With the rapid advances in remote-sensing technologies and the growing number of satellite images, fast and effective object detection plays an important role in understanding and analyzing image information, which could be further applied to civilian and military fields. Recently, object detection methods based on region-based convolutional neural networks have shown excellent performance. However, these two-stage methods contain separate region proposal generation and object detection procedures, resulting in low computation speed. Because of the expensive manual costs, well-annotated aerial images are scarce, which also limits the progress of geospatial object detection in remote sensing. In this paper, on the one hand, we construct and release a large-scale remote-sensing dataset for geospatial object detection (RSD-GOD) that consists of 5 different categories with 18,187 annotated images and 40,990 instances. On the other hand, we design a single shot detection framework with multi-scale feature fusion. The feature maps from different layers are fused together through up-sampling and concatenation blocks to predict the detection results. High-level features with semantic information and low-level features with fine details are fully explored for detection tasks, especially for small objects. Meanwhile, a soft non-maximum suppression strategy is used to select the final detection results. Extensive experiments have been conducted on two datasets to evaluate the designed network. Results show that the proposed approach achieves good detection performance, obtaining a mean average precision of 89.0% on the newly constructed RSD-GOD dataset and 83.8% on the Northwestern Polytechnical University very high spatial resolution-10 (NWPU VHR-10) dataset at 18 frames per second (FPS) on an NVIDIA GTX-1080Ti GPU.


Introduction
Geospatial object detection uses high-resolution remote-sensing images to generate bounding boxes and the corresponding classification scores, which is an important part of image analysis and understanding. Automatic and efficient object detection in satellite images has many applications in both military and civilian areas, such as airplane detection [1] and vehicle detection [2][3][4]. Although numerous methods have been put forward, there are still some challenges to be solved in geospatial object detection. Firstly, the quantity and quality of remote-sensing images have grown rapidly, which demands fast and effective approaches to real-time object localization. Secondly, high-resolution satellite images are different from the traditional digital images captured in ordinary life. Remote-sensing images are taken from the upper airspace, causing a downward perspective with orientation variations. Moreover, changing illumination, unusual aspect ratios, dense scenes and complex backgrounds make geospatial object detection more challenging [5]. Lastly, compared with the existing large-scale natural image datasets, there are few well-annotated satellite images, and producing them requires expensive labor and plenty of time. Most existing geospatial datasets focus on one object category, such as the Aircraft data set [6], the Aerial-Vehicle data set [7], and High Resolution Ship Collections 2016 (HRSC2016) [8] for ship detection. In contrast, although the Northwestern Polytechnical University very high spatial resolution-10 (NWPU VHR-10) data set [9] contains ten different geospatial object classes, it has only about 3600 object instances in total, which is insufficient. Considering the application prospects and the above challenges, our contributions to geospatial object detection are significant.
Traditional object detection methods focus on the feature extraction and classification problem [10]. Feature descriptors construct a comprehensive feature representation from the raw images, such as local binary patterns (LBP) [11], histogram of oriented gradients (HOG) [12], bag-of-words (BoW) [13] and texture-based features [14,15]. Supervised or weakly supervised learning algorithms are then employed to train the object detection model using the extracted features [16,17]. Three different features, LBP, HOG and Haar-like, are applied for training a car classifier on aerial images [4]. A deformable part-based model is trained on multi-scale HOG feature pyramids, which shows effectiveness in object detection with remote-sensing imagery [18]. To address the challenge of detecting geospatial objects with complex shapes, the BoW model with sparse coding is presented as the information representation [19]. In another detection framework, a new rotation-invariant HOG feature is proposed [20] for targets with complex shapes. Various machine learning algorithms are applied to predict the object category based on the feature representation. The support vector machine (SVM) has been widely used and performs well in many geospatial object detection applications [21], such as airplane detection [18] and ship detection [22]. For better detection of multi-class geospatial objects, a part detector composed of a set of linear SVMs is proposed, which demonstrates strong discriminant ability [9]. The adaptive boosting (AdaBoost) algorithm combines a series of weak classifiers to obtain a strong classifier, and has played an important role in vehicle detection [23], ship detection [24] and airport runway detection [15]. In conclusion, these machine learning methods for classifying object categories and locating the objects' bounding boxes mainly rely on designed features, which require human prior knowledge. Although the above approaches have demonstrated impressive performance, designing discriminative feature descriptors by hand is still a challenge in specific geospatial object detection with remote-sensing images.
Recently, with the rapid development of deep learning, the convolutional neural network (CNN) has proven to be successful in detecting objects. Instead of relying on handcrafted features, a CNN architecture has a powerful ability to learn feature representations. Generally, there are two classical families of CNN-based object detection: region-based methods and single shot methods. The region-based CNN (R-CNN) model [25] applies a CNN to obtain the feature representation of proposal regions, which are then classified into object categories with an SVM classifier. For better computational efficiency, feature extraction, object classification and bounding box regression are unified in the Fast R-CNN [26] detection framework. Because region proposal generation with selective search methods [27] is time-consuming, a Region Proposal Network (RPN) is proposed to generate detection proposals. Faster R-CNN [28] merges the RPN and Fast R-CNN into an end-to-end architecture by sharing convolutional features, which demonstrates faster computation speed as well as effective detection results. On the other hand, single shot methods regard object detection as a regression problem that directly determines target localization and the corresponding class confidence, such as You Only Look Once (YOLO) [29], YOLO9000 [30], Single Shot MultiBox Detector (SSD) [31] and Region-based Fully Convolutional Networks (R-FCN) [32]. These single shot models are faster than region-based approaches while retaining high detection accuracy. Specifically, for small object detection, a multi-scale deconvolution fusion module [33] is designed to generate multiple features. Feature maps from different layers are combined through a deconvolution module and element-wise fusion methods [34]. The improved YOLOv3 [35] makes predictions at three different convolution layers. By merging low-level features with high-level semantic information, a stronger feature representation is obtained to achieve better object detection performance, especially for small targets.
In geospatial object detection using remote-sensing images, CNNs have also been widely applied [6,36,37,38]. A singular value decomposition (SVD) network inspired by the CNN structure is designed for ship detection in spaceborne optical images [39]. To explore the semantic and spatial information in remote-sensing images, a dense feature pyramid network with rotation anchors is proposed [40]. For synthetic aperture radar (SAR) ship detection, a contextual region-based CNN with multilayer fusion is employed [41]. To address object rotation variations in satellite images, a rotation-invariant CNN (RICNN) is presented by adding a rotation-invariant layer and defining a new objective function [42]. Because of the scarcity of manually annotated satellite images, a Faster R-CNN pre-trained on the large-scale ImageNet dataset is transferred to multi-class geospatial object detection [5]. Considering the imbalanced number of target and background samples, a hard example mining technique is implemented to improve the efficiency of the training process and the detection accuracy [43,44]. There are many small and dense targets in remote-sensing images, which are hard to detect. To address this issue, feature maps from different layers with various receptive fields are used to detect geospatial targets [45]. The multi-scale feature maps from different CNN layers make a significant contribution to detecting multi-scale objects, especially small objects [46]. These multi-layer features are aggregated into a single high-level feature map through the transfer connection block [47]. Although the CNN-based approaches have proven to be successful and effective in detecting geospatial objects such as ships, airplanes, and vehicles, these models still have limitations. Multiple down-sampling layers in the basic CNN generate high-level features with global semantic information, which also means losing many local details. The size of the feature maps after multiple down-sampling steps is 1/16 of the input image. Small objects that span only a few pixels are hard to detect accurately. Another problem that object detection methods struggle with is target diversity. Due to the multiple resolutions of remote-sensing images and the differences between object categories, it is also important to improve the generalization ability of CNN-based detection models.
To tackle the above issues, a multi-scale feature fusion detector is proposed in this paper. In contrast to region-based CNN models, our work is motivated by the SSD and YOLO approaches [30,33,34,48]. SSD predicts bounding box locations and classifies object categories from multiple feature maps in different layers. The feature maps with different resolutions in SSD make predictions separately. In order to aggregate low-level and high-level features, we implement a feature fusion module that concatenates multi-scale feature maps. The low-level features with more accurate details and the high-level features with semantic information are fused together to make the final object predictions. Instead of the greedy non-maximum suppression (NMS), a soft-NMS strategy [49] is applied to improve detection performance. Lastly, we also construct a large-scale remote-sensing dataset for geospatial object detection (RSD-GOD) with 40,990 well-annotated instances. There are a total of 5 object categories in the RSD-GOD: airport, plane, helicopter, warship, and oiltank. The constructed RSD-GOD remote-sensing dataset is open and available to the community, and can be found at: https://github.com/ZhuangShuoH/geospatial-object-detection.
The main contributions of our work are summarized as follows: (1) We produce and release a large-scale RSD-GOD with handcrafted annotations, which can be used for further geospatial object detection development, especially in military applications. (2) We apply a single shot detection framework with a multi-scale feature fusion module for detection on three different scales. The feature maps from different layers are merged to make object predictions, which means more abundant information is explored together. The proposed method achieves a good tradeoff between detection accuracy and computation efficiency. In addition, the designed network is effective at detecting small targets. (3) The soft-NMS algorithm is applied by reassigning a decayed score to each neighboring bounding box, which improves the detection performance on dense objects.
The rest of this paper is organized as follows. Section 2 presents the large-scale dataset of RSD-GOD and the main framework of the feature fusion network. Section 3 shows the analysis and discussion of the experimental results. Finally, conclusions are drawn in Section 4.

Materials and Methods
2.1. Annotation and Construction of Remote-Sensing Dataset for Geospatial Object Detection (RSD-GOD)

Category Selection and Image Collection
We review recent research work on geospatial object detection, which mainly focuses on ship, plane and vehicle targets. Considering the practical applications, especially in the military field, five categories are selected to be annotated: plane, helicopter, oiltank, airport and warship. Finally, we construct a large-scale remote-sensing dataset for geospatial object detection, which contains a total of 18,187 color images with multiple resolutions from multiple platforms like Google Earth. There are 40,990 well-annotated instances in the dataset. The width of each image is mostly about 300~600 pixels. To increase the diversity of samples, we collect these remote-sensing images from different places at different times. The horizontal bounding box (HBB) annotation method is widely used in natural object detection, denoted as (xmin, ymin, xmax, ymax). To suit the transfer learning of object detection algorithms, we adopt the HBB-based labeling method for the selected geospatial targets.

Dataset Analysis and Division
Some examples of images and the corresponding annotated bounding boxes are shown in Figure 1. It is found that the RSD-GOD has three properties. First, geospatial objects have rich background information, such as different weather conditions, high illumination, low illumination and other background clutter. Second, these remote-sensing images are collected with multiple resolutions and viewpoints, which means multiple scales and angles of the same object. Third, there are some dense objects like planes and warships. It is a great challenge for the existing object detection algorithms to deal with the complexity of the annotated samples.
We further analyze the constructed RSD-GOD. According to the different sites of the remote-sensing images, the dataset is divided into two parts. One is for training, the other is for testing. To adjust hyper-parameters of the proposed model in the training process, 30% of the training samples can be randomly selected as a validation dataset. The number of instances in different categories from the three sets is shown in Table 1. We conduct the statistical analysis of RSD-GOD from two points: area and aspect ratio of the bounding box. The area of the bounding box is divided into five levels: extra-small ($S_b < 16^2$ pixels), small ($16^2 < S_b < 32^2$ pixels), middle ($32^2 < S_b < 64^2$ pixels), large ($64^2 < S_b < 96^2$ pixels), and extra-large ($S_b > 96^2$ pixels), where $S_b$ is the number of pixels in each bounding box. As shown in Figure 2a, most of the bounding boxes are in the middle and large scales. Specifically, the number of extra-small and small instances is around 4500. This quantity is adequate for training a deep learning-based model and is important in practical detection applications of geospatial objects with small size. The aspect ratio of the bounding box is also divided into five levels, and over 87% of the instances are distributed in 0.5~2. These instances with various aspect ratios strengthen the diversity of RSD-GOD. The distribution of aspect ratio is similar to real scenes, which can provide essential information for anchor-based models.
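The five area levels above can be sketched as a small binning helper. This is only an illustration: the function names are hypothetical, boxes whose area falls exactly on a boundary are assigned to the upper level (the text leaves the boundaries open), and the aspect ratio is taken as height divided by width, which is an assumption inferred from the wide/tall levels used later in the evaluation.

```python
# Upper bounds (in pixels) of the first four area levels from the text;
# anything at or above the last bound is treated as extra-large.
AREA_LEVELS = [
    ("extra-small", 16 ** 2),
    ("small", 32 ** 2),
    ("middle", 64 ** 2),
    ("large", 96 ** 2),
]


def area_level(xmin, ymin, xmax, ymax):
    """Return the area level of an HBB given as (xmin, ymin, xmax, ymax)."""
    area = (xmax - xmin) * (ymax - ymin)
    for name, upper in AREA_LEVELS:
        if area < upper:
            return name
    return "extra-large"


def aspect_ratio(xmin, ymin, xmax, ymax):
    """Height / width of the box (assumed convention, not stated in the text)."""
    return (ymax - ymin) / (xmax - xmin)


if __name__ == "__main__":
    print(area_level(0, 0, 50, 50))    # a 50 x 50 box
    print(aspect_ratio(0, 0, 40, 20))  # a box twice as wide as it is tall
```

Counting the levels over all annotations would give the kind of histogram summarized in Figure 2.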

Single Shot Framework with Multi-Scale Feature Fusion
Our proposed framework is derived from YOLO and SSD, which predict bounding boxes and the corresponding object categories in a single-shot network. Following the trend in computer vision towards deeper networks, a deeper base CNN named Darknet-53 is applied to extract features. Considering the challenge of detecting small objects, a multi-scale feature fusion technique is applied. The anchor design from Faster R-CNN is used to predict object bounding boxes. Furthermore, the k-means clustering method is applied to the set of training bounding boxes to obtain anchor priors.
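A minimal sketch of how anchor priors might be obtained by k-means clustering is given below. The text does not specify the distance measure; this sketch assumes the 1 − IoU distance between (width, height) pairs popularized by YOLO9000, and the function names, seed and iteration limit are hypothetical.

```python
import random


def wh_iou(box, centroid):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor priors.

    Each box is assigned to the centroid with the highest IoU, which is
    equivalent to the smallest (1 - IoU) distance (assumed metric).
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                # Keep the old centroid if no box was assigned to it.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Sort by area so that small anchors come first.
    return sorted(centroids, key=lambda c: c[0] * c[1])


if __name__ == "__main__":
    sizes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (90, 45)]
    print(kmeans_anchors(sizes, k=2))
```

With B = 3 anchors at each of three scales, such a procedure would be run with k = 9 on the training boxes and the sorted result split across the scales; that split is likewise an assumption.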

Darknet and Single Shot Framework
Base feature extractor. A deeper neural network has stronger feature learning and generalization abilities. Generally speaking, ResNet-101 as a base feature extractor performs better than the Visual Geometry Group (VGG) model in a detection framework. In the proposed single-shot object detection framework, we use a deeper network named Darknet-53 as the feature extractor, as shown in Figure 3. Except for the last fully connected layer, there are 53 convolutional layers without any pooling layer. Similar to VGG-16, 3 × 3 filters are mostly used. Instead of max-pooling or average-pooling, the size of the feature map is decreased by a factor of 2 through adjusting the convolutional stride. To make the training of the deep network easier, Darknet-53 adopts residual blocks. Each residual block contains 1 × 1 and 3 × 3 convolutional filters, and 23 residual blocks (1 + 2 + 8 + 8 + 4) are finally used. When optimizing a very deep network, it is important to control overfitting and convergence during the training process. To address this problem, without using the dropout technique, a batch normalization (BN) [50] operation is applied after each convolutional layer in the whole framework. The Leaky ReLU is used as the activation function in each convolutional layer.
Multi-scale feature fusion detector. Most object detection models extract features only at the top-most layer of a base CNN, including Faster R-CNN [28] and YOLO [30]. Although these methods show powerful detection performance, they do not utilize local detailed information. The feature maps from different layers contain different object information, such as high-level semantic features and low-level fine details. To make full use of the abundant information of the whole feature extractor, multi-scale features are fused to predict bounding boxes. This multi-scale feature fusion detector has a strong feature representation capacity to cover geospatial objects with different scales and shapes.
As shown in Figure 4, three convolutional layers at different scales of Darknet-53 are used to make predictions. To make first-scale predictions, we add layer conv_6 after the top-most convolutional layer conv_5_2, which is full of high-level context and semantic information. There are two feature fusion modules to combine shallow features. In feature fusion module 1, conv_6 is up-sampled and then merged with conv_4 through a concatenation operation to make second-scale predictions. In feature fusion module 2, the output of the upper fusion module is up-sampled and then merged with conv_3 through the same concatenation method to make third-scale predictions. The fused features from different levels are mainly responsible for detecting small objects (area smaller than 32 × 32).
Multi-scale feature fusion module. The specific details of the feature fusion module are described in Figure 5. The dimension of the feature maps is first reduced with a 1 × 1 convolutional kernel. High-level feature maps are up-sampled after the 1 × 1 convolution to the same size as the low-level feature maps. The dimension of the low-level features is 2 times higher than that of the high-level features, which reflects the importance of fine details. Considering the different feature dimensions of the two scales, concatenation is applied to merge these features. Features extracted from one scale, two scales and three scales are used to generate three predictions. For each scale prediction, 7 convolutional layers are added using 1 × 1 and 3 × 3 convolutional kernels. Every pixel in the feature map corresponds to N prediction scores, which will be explained in the next part.
Anchor priors and predictions. The model is unstable, especially during early training iterations, when the locations of bounding boxes are directly predicted. Motivated by Faster R-CNN, anchors are introduced and applied to predict bounding boxes in our method. In our designed network, three kinds of feature maps with different sizes are obtained after the convolutional down-sampling layers and the multi-scale feature fusion module: 13 × 13, 26 × 26 and 52 × 52, each of which is referred to as an S × S grid (S = 13, 26, 52). B anchor priors are generated and the corresponding B bounding boxes are predicted at each grid cell. During the training process, the proposed network directly outputs 5 values $t_x$, $t_y$, $t_w$, $t_h$, $t_o$. The final location of the predicted bounding box can be obtained from the anchor priors' size and the network outputs. As a result, the final prediction and object confidence of bounding boxes at each cell can be calculated as follows.
As shown in Figure 6, the center location of the bounding box ($b_x$, $b_y$) is determined by the grid cell offsets ($c_x$, $c_y$) and the sigmoid activation of the location coordinates ($t_x$, $t_y$), where ($c_x$, $c_y$) denotes the offsets from the top left corner of the original image to the current grid cell. The width and height of the anchor priors are denoted as ($p_w$, $p_h$). $p_o$ is the confidence score of the object probability. $\sigma$ stands for the sigmoid function, which limits the activations of $t_x$ and $t_y$ to the range (0, 1). By applying the sigmoid function to normalize the predicted $t_x$, $t_y$, $t_o$, the model is more stable during training.
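The decoding step described above can be sketched in a few lines. The center and confidence terms follow the description in the text (sigmoid of $t_x$, $t_y$, $t_o$ plus the cell offsets); the exponential scaling of the anchor width and height by $t_w$, $t_h$ is an assumption taken from the usual YOLO convention, since the equations themselves are not reproduced here. The function name and argument layout are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Decode raw network outputs into a box, in grid-cell units.

    t     = (tx, ty, tw, th, to): raw outputs for one anchor
    cell  = (cx, cy): offsets of the grid cell from the top-left corner
    prior = (pw, ph): width and height of the anchor prior
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx    # center x stays inside the current cell
    by = sigmoid(ty) + cy    # center y stays inside the current cell
    bw = pw * math.exp(tw)   # assumed YOLO-style scaling of the prior width
    bh = ph * math.exp(th)   # assumed YOLO-style scaling of the prior height
    po = sigmoid(to)         # object confidence in (0, 1)
    return bx, by, bw, bh, po


if __name__ == "__main__":
    print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0)))
```

With all raw outputs at zero, the box sits at the center of its cell with exactly the prior's size, which is why anchor-relative prediction is stable early in training.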
In addition to the four coordinates and one object confidence, each grid cell also predicts C class probabilities for each bounding box. The dimension of the network output tensor is S × S × N, where N = (5 + C) × B; S = 13, 26, 52; C = 5; B = 3 in our experiments.
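As a quick check of the output dimensions, the short sketch below evaluates N and the three tensor shapes for the stated values of S, C and B; the helper names are hypothetical.

```python
def output_shapes(num_classes=5, anchors_per_cell=3, grid_sizes=(13, 26, 52)):
    """Return the (S, S, N) shape of each prediction tensor, N = (5 + C) * B."""
    n = (5 + num_classes) * anchors_per_cell
    return [(s, s, n) for s in grid_sizes]


def total_boxes(anchors_per_cell=3, grid_sizes=(13, 26, 52)):
    """Total number of candidate boxes predicted over all three scales."""
    return sum(s * s * anchors_per_cell for s in grid_sizes)


if __name__ == "__main__":
    print(output_shapes())  # three shapes with N = 30 channels
    print(total_boxes())    # number of candidates passed on to suppression
```

With C = 5 and B = 3, each cell carries N = 30 values, and the three scales together yield 10,647 candidate boxes per image.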

Loss Function
The training objective loss is defined with localization loss (loc), confidence loss (conf) and classification loss (cla). We use squared error loss to compute the localization loss and the confidence loss. For the multi-class problem, the softmax function is applied, and the categorical cross-entropy is computed to obtain the classification loss. The overall loss function is a weighted sum of these terms, in which $\lambda_{coord}$, $\lambda_{obj}$, $\lambda_{noobj}$, and $\lambda_{cla}$ are scaling factors that weight the localization loss, confidence loss and classification loss, and $P_{obj}$ indicates whether an object exists in the anchor box. The predicted bounding box without object is penalized more. In the experiments, we set $\lambda_{coord} = 1$, $\lambda_{obj} = 5$, $\lambda_{noobj} = 1$, and $\lambda_{cla} = 1$.
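The following is a rough sketch of how such a loss could be assembled, not the authors' exact formulation, which is not reproduced in the text. It assumes squared error on the four box coordinates and on the confidence (target 1 for anchors containing an object, 0 otherwise) and softmax cross-entropy on the class logits; the data layout and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_loss(preds, targets, lam_coord=1.0, lam_obj=5.0,
                   lam_noobj=1.0, lam_cla=1.0):
    """Weighted sum of localization, confidence and classification losses.

    preds:   list of {"box": (x, y, w, h), "conf": float, "logits": [C floats]}
    targets: list of {"obj": bool, "box": (x, y, w, h), "cls": int}
    """
    loc = conf = cla = 0.0
    for p, t in zip(preds, targets):
        if t["obj"]:
            # Squared error on the coordinates of anchors that contain an object.
            loc += sum((a - b) ** 2 for a, b in zip(p["box"], t["box"]))
            conf += lam_obj * (p["conf"] - 1.0) ** 2
            # Categorical cross-entropy on the softmax class probabilities.
            cla += -math.log(softmax(p["logits"])[t["cls"]])
        else:
            conf += lam_noobj * p["conf"] ** 2
    return lam_coord * loc + conf + lam_cla * cla


if __name__ == "__main__":
    preds = [
        {"box": (0.5, 0.5, 1.0, 1.0), "conf": 1.0, "logits": [0.0, 0.0]},
        {"box": (0.0, 0.0, 0.0, 0.0), "conf": 0.5, "logits": [0.0, 0.0]},
    ]
    targets = [
        {"obj": True, "box": (0.5, 0.5, 1.0, 1.0), "cls": 0},
        {"obj": False, "box": (0.0, 0.0, 0.0, 0.0), "cls": 0},
    ]
    print(detection_loss(preds, targets))
```

The default weights mirror the values reported for the experiments; how the confidence target is defined for matched anchors (1 versus the IoU with the ground truth) is a design choice this sketch does not settle.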

Soft Non-Maximum Suppression
Our proposed one-stage detection method generates a large number of cluttered and repetitive bounding boxes in the final three prediction modules. NMS is an integral component of the object detection framework that selects the final detections from a set of location candidates, which effectively improves detection performance. The traditional NMS ranks location candidates according to their classification scores. If there is a high overlap between two boxes, the bounding box with the lower score is removed. In our constructed RSD-GOD dataset, there are some dense objects such as warships and small planes. This hard NMS might miss some neighboring detections whose classification scores are lower. Instead of removing the location candidates directly, soft-NMS reassigns each neighboring bounding box a new classification score $s_i$. Here $b_i$ denotes the i-th bounding box in the location candidates and $b_M$ is the bounding box with the maximum score. If the IoU between $b_i$ and $b_M$ is larger than a threshold T, a decayed score is given to $b_i$ through a penalty function. The reassigned score is associated with the overlap between the two boxes. When the IoU is low, these two candidate detections have a high probability of both being true positives. Thus, $\mathrm{iou}(b_i, b_M)$ should have some effect on $s_i$. Specifically, $s_i$ remains unchanged when the overlap is zero. To address this point, a Gaussian penalty function is considered.
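A compact sketch of soft-NMS with a Gaussian penalty is shown below. It follows the commonly published form of the algorithm [49], in which each remaining score is multiplied by exp(−IoU²/σ); the values of `sigma` and `score_thresh` are illustrative assumptions, not settings reported in this paper.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of deleting boxes.

    Returns a list of (box, score) pairs in the order they were selected.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the candidate with the maximum current score (b_M).
        m = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(m)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in candidates:
            overlap = iou(box, best_box)
            # Zero overlap leaves the score unchanged; high overlap decays it.
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_thresh:
                rescored.append((box, score))
        candidates = rescored
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (100, 100, 110, 110)]
    scores = [0.9, 0.8, 0.7]
    for box, score in soft_nms(boxes, scores):
        print(box, round(score, 3))
```

In this example the second box overlaps the first heavily, so its score is reduced but the box is retained, whereas hard NMS with a typical threshold would have discarded it.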

Results and Discussion
To evaluate the performance of the proposed single-shot object detection approach on remote-sensing images, we compare it with several existing methods. The experimental settings are described concisely in this section, including datasets, evaluation metrics, and compared methods. Then, the quantitative and qualitative analyses are discussed.

Dataset
For reliable evaluation and verification of the proposed method, two datasets are used in our experiments. The first one is the RSD-GOD geospatial dataset, which is introduced in detail in Section 2.1. The RSD-GOD is a challenging 5-class object detection dataset that contains 18,187 images with more than 40,000 instances. We divide the whole RSD-GOD dataset into three parts: 35% for training, 15% for validation, and 50% for testing. Specifically, the location sources of the remote-sensing images differ between the training/validation sets and the testing set.
The second one is the NWPU VHR-10 dataset, which contains 10 geospatial object classes. There are two image subsets in the NWPU VHR-10 dataset: a positive set including 650 annotated images and a negative set including 150 images without any targets of the given 10 categories. In our experiments, the positive set is divided into 20% for training, 20% for validation and 60% for testing, which corresponds to 130, 130 and 390 images respectively.

Evaluation Metrics
To quantitatively evaluate the performance of the proposed framework, the average precision (AP) and the precision-recall curve (PRC) are adopted, which are two standard and widely used evaluation metrics in object detection tasks. True positives, false positives, and false negatives are denoted as TP, FP and FN. A predicted bounding box is considered to be a TP if the IoU between the predicted bounding box and the ground truth is larger than 0.5. Otherwise, it is considered an FP. An FN denotes an annotated object that has no predicted bounding box. Specifically, TP means the correct retrieval of an object. The precision measures the proportion of detections that are TP, and the recall measures the fraction of ground-truth annotations that are correctly detected. The precision and recall are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (13)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (14)$$
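To make the metrics concrete, the sketch below computes precision and recall from raw counts and an AP value from a ranked list of detections. The text does not state which AP variant is used; this example assumes the all-point interpolated area under the PRC, and it takes the TP/FP flag of each detection (IoU larger than 0.5 with an unmatched ground truth) as given. Function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_gt):
    """Area under the interpolated precision-recall curve.

    scored_hits: list of (score, is_tp) pairs, one per detection
    num_gt:      number of ground-truth objects of this class
    """
    ranked = sorted(scored_hits, key=lambda x: x[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection in rank order
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this or any higher recall.
        max_prec = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * max_prec
        prev_recall = recall
    return ap


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=2))
    print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

The mean AP reported in the comparisons is then the average of the per-class AP values.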

Quantitative Comparisons
The three methods mentioned above are compared with our proposed geospatial object detection framework. Quantitative comparison results are shown in Table 2, including the AP values of the five target categories and the mean AP as an overall assessment. For better visualization and comparison, the PRCs are also displayed in Figure 7. Faster R-CNN has the highest AP value for airport. However, as shown in Figure 7, Faster R-CNN has lower recalls for oiltank and plane than the other methods, which results in the lowest AP value. This may be because Faster R-CNN selected only about 200 regions of interest to be classified in our experiments. Besides, the features extracted by Faster R-CNN carry more high-level semantic information, which is weak at identifying small objects such as the plane. In contrast to the two-stage detection method, one-stage methods such as SSD and YOLO2 predict a large number of bounding box candidates. Compared with YOLO2, which uses the last convolutional layer to make predictions, SSD makes detections at different layers with feature maps of different scales. As a result, SSD obtains a reasonable recall for each class and has a higher mean AP value than YOLO2.
As shown in Table 2 and Figure 7, our method achieves the best mean AP value of 87.9%. Apart from the airport, the proposed approach obtains the highest AP values. Compared with SSD, the proposed network gains 5.1%, 5.3%, 7.8%, 2.2% and 3.8% for airport, helicopter, plane, oiltank and warship, respectively. The proposed method obtains a 4.8% gain in terms of mean AP, which demonstrates the effectiveness of our multi-scale feature fusion detector. It can be inferred that the proposed feature fusion modules play an important role in improving detection performance, especially for small geospatial targets. With the implementation of soft-NMS, the proposed method achieves a better performance. As can be seen, the soft-NMS algorithm improves the recalls of the warship and plane. The overall mean AP value increases from 87.9% to 89.0%, showing that the soft weighting function can improve the detection performance for neighboring objects.
Considering the object size and ratio analyzed in Section 2.2.1, we further evaluate the proposed method by calculating AP values for various object sizes and ratios. In contrast to the previous five levels of bounding box area, we regard the extra-small and small levels as one small level ($S_a$), the middle level as the medium level ($M_a$), and the large and extra-large levels as one large level ($L_a$). The number of instances in the RSD-GOD testing dataset and the AP values of the different categories at the three size levels are shown in Table 3. The AP value becomes larger as the area of the bounding box increases. Furthermore, the proposed method shows good detection performance on the helicopter and plane at the small size level. Similarly, we reallocate the ratio of the bounding box into three levels: wide level ($W_r$, 0 < ratio ≤ 0.5), medium level ($M_r$, 0.5 < ratio ≤ 1), and tall level ($T_r$, 1 < ratio). The number of instances in the RSD-GOD testing dataset and the AP values of the different categories at the three ratio levels are shown in Table 4. Compared with the wide or tall level, the medium ratio level mostly achieves the highest AP values. Because of the special shape of the warship, its ratios are mostly distributed in the wide and tall levels, which leads to a higher AP value at the wide level. We can also infer that a better AP value is obtained when the corresponding number of instances is larger. This is because more bounding box instances are available to learn the network parameters in the training process.
For a better understanding, a number of detection results using the proposed method with soft-NMS are shown in Figure 8. Each target class has four samples that contain various scales, shapes, resolutions and complex backgrounds. The detection results of the different categories are represented with bounding boxes in different colors. Our method performs effectively in detecting geospatial objects in remote-sensing images.
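The three ratio levels used for Table 4 can be written down as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the ratio is assumed here to be height divided by width (so that a "wide" box has a ratio of at most 0.5), which this section does not state explicitly.

```python
from collections import Counter


def ratio_level(width, height):
    """Map a bounding box to the wide / medium / tall level.

    Assumption: ratio = height / width, so wide boxes have small ratios.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = height / width
    if ratio <= 0.5:
        return "W_r"  # wide level: 0 < ratio <= 0.5
    if ratio <= 1.0:
        return "M_r"  # medium level: 0.5 < ratio <= 1
    return "T_r"      # tall level: 1 < ratio


def count_levels(boxes):
    """Count instances per ratio level for a list of (width, height) pairs."""
    return Counter(ratio_level(w, h) for w, h in boxes)
```

Per-level AP values would then be computed by evaluating the detector only on the ground-truth instances that fall into each level.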
As depicted in Figure 8, the proposed method successfully detects most of the objects. Although the airport has various sizes and shapes, our approach is able to extract valid features such as a single racetrack or crossed runways and is robust in detecting them. Specifically, the airport is covered well by the predicted bounding boxes. For closely aligned objects, especially small helicopters, planes and oiltanks, the detection results are promising. There are only a small number of false alarms. For example, a bounding box candidate containing two planes and background is regarded as a plane target. In the complex conditions of changing illumination, object shadows, viewpoint variations, blurred targets, varying scales and densely distributed groups, the proposed approach is shown to be effective in predicting satisfactory object bounding boxes.
To further demonstrate the detection performance of the proposed network, the qualitative results of our approach and the three compared methods are shown in Figure 9. The proposed method performs better than the other three detection frameworks. Compared with YOLO2 and SSD, Faster R-CNN performs well in detecting large-scale warships. SSD and Faster R-CNN miss a good number of targets when detecting small objects, such as oiltank and plane. Relatively speaking, the detection results for the helicopter and oiltank demonstrate that our approach has a strong ability to predict most of the true bounding boxes except for a few missing objects. Actually, our method will generate some false negatives, as shown in Figure 9d (3). This is due to the large number of anchor priors, which cause multiple bounding box candidates in neighboring regions. Moreover, our approach performs better than the comparison methods in detecting densely distributed warships.
To improve detection performance, our proposed method applies the soft-NMS algorithm. Figure 10 shows the detection results using NMS and soft-NMS. Soft-NMS clearly recalls more targets. When the IoU value between two bounding boxes of different objects is large, soft-NMS gives one of them a decayed score instead of removing the bounding box. This soft-NMS strategy effectively helps to improve performance in detecting neighboring targets without increasing computational complexity.
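The score decay that soft-NMS applies to overlapping boxes can be sketched in a few lines. The version below is a minimal illustration, not the implementation used in the experiments: it assumes a Gaussian weighting function, and the values of `sigma` and `score_thresh` as well as the function names are hypothetical choices made for this example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS (assumed variant); returns (box, score) pairs."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of the remaining boxes instead of deleting them.
        rescored = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            new_score = score * math.exp(-(overlap ** 2) / sigma)
            if new_score >= score_thresh:
                rescored.append((box, new_score))
        candidates = rescored
    return kept
```

With standard NMS at an IoU threshold of 0.5, a neighbor that overlaps the selected box by about 0.8 would be discarded outright; here it survives with a reduced score, which is how closely packed targets can still be recalled.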

Results on NWPU VHR-10 Dataset
In order to further evaluate the effectiveness and generalization ability of our designed multi-scale feature fusion network, we also train a detector on the NWPU VHR-10 dataset. The quantitative results of the different methods are shown in Table 5, including the AP values of the 10 categories and the mean AP. For a more comprehensive evaluation, we add the collection of part detectors (COPD) [9], a rotation-invariant CNN (RICNN) [42] model and R-P-Faster R-CNN [5] as comparisons. COPD and RICNN are both rotation-invariant frameworks with an SVM classifier for geospatial object detection. The difference is that COPD uses hand-crafted features while RICNN applies features learned by a CNN. The features extracted by the CNN show a better representation ability for detecting objects: compared with COPD, the mean AP value of RICNN increases by 18%. Faster R-CNN and R-P-Faster R-CNN integrate the region proposal network and the classification procedure by sharing the convolutional weights. Compared with RICNN, Faster R-CNN significantly improves the AP values for the airplane, baseball diamond, tennis court, basketball court and ground track field. Although SSD obtains the highest AP values for the airplane, ship and baseball diamond, it performs poorly in detecting the harbor and bridge. Compared with the traditional and CNN-based methods, the proposed approach has the best performance, with a mean AP value of 82.9%. With the application of the soft-NMS algorithm, our network performs better, achieving a gain of nearly 1% in terms of mean AP. Our method obtains the best detection results on the storage tank, tennis court, harbor, bridge and vehicle, showing that the proposed multi-scale feature fusion network is effective and robust in detecting objects with a small size, a high aspect ratio or variable shapes.
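For readers who want to reproduce per-class AP and mean AP figures of the kind reported in Table 5, the following sketch shows one common way to compute them from precision-recall points. It assumes all-point interpolation of the precision-recall curve, which may differ from the exact protocol used for the tables here; the function names and the sample numbers in the usage note are purely illustrative.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    `recalls` must be sorted in ascending order and aligned with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangular areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values given as a {class_name: ap} dict."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

As a toy example with made-up numbers, `average_precision([0.5, 1.0], [1.0, 0.5])` evaluates to 0.75, and averaging two classes with AP 0.5 and 1.0 also gives a mean AP of 0.75.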

Efficiency Analysis of Proposed Model
To verify the efficiency of our approach, the running time of the different methods is evaluated. Table 6 shows the average running time for testing one image. RICNN has the lowest computational efficiency due to its multiple detection stages. Compared with two-stage detection frameworks such as Faster R-CNN and R-P-Faster R-CNN, the single-shot networks have a fast inference speed. SSD and YOLO2 require less computing time than our method. However, considering the tradeoff between speed and detection performance, the proposed approach achieves effective detections with a fast running time of 0.057 s per image. With the help of a suitable GPU, our proposed multi-scale feature fusion framework can achieve promising detection results with high computational efficiency, and is able to detect geospatial objects in real time.
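The relation between the per-image running time and the frame rate is a simple reciprocal, and an average running time can be measured with a loop over test images. The sketch below is a hedged illustration: `detect` stands for any detector callable, and both function names are hypothetical rather than part of the evaluated code.

```python
import time


def seconds_to_fps(seconds_per_image):
    """Convert an average per-image running time into frames per second."""
    return 1.0 / seconds_per_image


def average_runtime(detect, images, clock=time.perf_counter):
    """Average wall-clock seconds that `detect` spends on one image."""
    start = clock()
    for image in images:
        detect(image)
    return (clock() - start) / len(images)
```

For the reported 0.057 s per image, `seconds_to_fps(0.057)` gives roughly 17.5, in line with the approximately 18 FPS quoted for the proposed network.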

Conclusions
In this paper, we first construct a novel remote-sensing dataset named RSD-GOD, especially for military object detection. Second, a single-shot geospatial object detection framework based on multi-scale feature fusion modules is proposed. Feature maps from different layers are merged through up-sampling and concatenation operations, which finally generates pyramid feature maps. These fused features predict bounding box candidates at three scales. With the multi-scale feature fusion modules, the proposed detector achieves an effective performance. We can draw the following conclusions from the experimental results on the RSD-GOD and NWPU VHR-10 datasets: (1) The proposed method demonstrates better detection performance than existing approaches. Specifically, our single-shot detection network achieves a good tradeoff between detection accuracy and computational efficiency. (2) The multi-scale feature fusion modules make full use of local details and high-level semantic information, which gives a strong feature representation ability for detecting small objects. (3) The soft-NMS algorithm improves the detection performance when targets are densely distributed. In future work, we will focus on generating more accurate anchor box candidates and designing more powerful matching strategies for the training process.

Figure 1. Example images and annotated bounding boxes of the remote-sensing dataset for geospatial object detection (RSD-GOD). There are 5 classical geospatial categories, including airport, helicopter, oiltank, plane, and warship. Different object categories are indicated by different color rectangles.

Figure 2. Statistical information of the constructed RSD-GOD. Statistical results of the training-validation, testing and the entire dataset are depicted as bars with different colors. (a) Number of instances with different area of the bounding box in different datasets; (b) number of instances with different aspect ratio of the bounding box in different datasets.

Figure 3. Darknet-53: the base network to extract features in the single-shot object detection framework.

Figure 4. Multi-scale feature fusion detector. Darknet-53 is the base feature extractor. Three predictions are generated at three different scales.

Figure 5. Multi-scale feature fusion module. Feature maps from different layers are merged through up-sampling and concatenation operations, which then predict object detections. Each convolutional layer is followed by a batch normalization (BN) layer and a Leaky ReLU layer. The stride of convolution is 1 (s1).
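The up-sampling and concatenation step of the fusion module can be illustrated without any deep learning library. The sketch below is only a schematic: it assumes nearest-neighbour up-sampling by a factor of 2 and omits the convolution, BN and Leaky ReLU layers entirely. Feature maps are plain nested lists indexed as `[channel][row][col]`, and the function names are hypothetical.

```python
def upsample2x(feature):
    """Nearest-neighbour 2x up-sampling of a [channel][row][col] nested list."""
    out = []
    for channel in feature:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]  # repeat columns
            rows.append(wide)
            rows.append(list(wide))                            # repeat the row
        out.append(rows)
    return out


def fuse(deep, shallow):
    """Up-sample the deeper map and concatenate it with the shallower one.

    The result stacks the channels of both maps at the shallow resolution.
    """
    up = upsample2x(deep)
    same_height = len(up[0]) == len(shallow[0])
    same_width = len(up[0][0]) == len(shallow[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes must match after up-sampling")
    return up + shallow
```

For example, fusing a 1-channel 2×2 deep map with a 2-channel 4×4 shallow map yields a 3-channel 4×4 map, which combines coarse semantic responses with finer spatial detail.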

Figure 6. Anchor priors and location prediction. The framework directly generates 4 coordinates $t_x$, $t_y$, $t_w$, $t_h$. The center location of the final bounding box $(b_x, b_y)$ is determined by the grid cell offsets $(c_x, c_y)$ and the sigmoid activation of the location coordinates $(t_x, t_y)$, where $(c_x, c_y)$ denotes the offsets from the top left corner of the original image to the current grid cell. The width and height of the anchor priors are denoted as $(p_w, p_h)$. $\sigma$ stands for the sigmoid function, which maps $t_x$ and $t_y$ into the range (0, 1).
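The decoding described in the Figure 6 caption can be sketched as follows. The center terms follow the caption directly (sigmoid of the predicted coordinate plus the grid cell offset). The exponential scaling of the anchor prior for width and height is an assumption borrowed from the usual YOLO-style parameterization, since the caption does not spell it out; all values are taken to be in grid-cell units, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw predictions into a box (b_x, b_y, b_w, b_h).

    t     -- (t_x, t_y, t_w, t_h) raw network outputs
    cell  -- (c_x, c_y) offsets of the current grid cell
    prior -- (p_w, p_h) width and height of the anchor prior
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the current cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # assumed: prior scaled exponentially
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h
```

With all four raw outputs equal to zero, the decoded box is centered in the middle of its grid cell and has exactly the size of the anchor prior.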

Figure 7. The precision-recall curves of the proposed approach and the other comparison methods.

Figure 8. Example images and detection results on the RSD-GOD dataset using the proposed approach.

Figure 9. Detection results on the RSD-GOD dataset with the proposed approach and the other three comparison methods.

Figure 10. Detection results on the constructed RSD-GOD dataset when using non-maximum suppression (NMS) or soft-NMS in the proposed network.

Table 1. The number of instances in three sets.

Table 2. The average precision (AP) values of compared object detection methods on RSD-GOD dataset. Methods: Faster R-CNN, SSD, YOLO2, Proposed, Proposed (Soft NMS).

Table 3. The AP values of different object sizes on RSD-GOD testing dataset.

Table 4. The AP values of different object ratios on RSD-GOD testing dataset.

Table 5. The AP values of compared object detection methods on NWPU VHR-10 dataset.

Table 6. The average testing time of compared object detection methods.