A Novel Multi-Model Decision Fusion Network for Object Detection in Remote Sensing Images

Object detection in optical remote sensing images is still a challenging task because of the complexity of the images. The diversity and complexity of geospatial object appearance and the insufficient understanding of geospatial object spatial structure information remain open problems. In this paper, we propose a novel multi-model decision fusion framework which takes contextual information and multi-region features into account to address these problems. First, a contextual information fusion sub-network is designed to fuse both local contextual features and object-object relationship contextual features, so as to deal with the diversity and complexity of geospatial object appearance. Second, a part-based multi-region fusion sub-network is constructed to merge multiple parts of an object, obtaining more spatial structure information about the object and thereby addressing the insufficient understanding of geospatial object spatial structure. Finally, a decision fusion is made over all sub-networks to improve the stability and robustness of the model and achieve better detection performance. The experimental results on a publicly available ten-class data set show that the proposed method is effective for geospatial object detection.


Introduction
Nowadays, optical remote sensing images with high spatial resolution are obtained conveniently due to the significant progress in remote sensing technology, which leads to a wide range of applications such as land planning, disaster control, urban monitoring, and traffic planning [1][2][3][4]. As one of the most fundamental and challenging tasks required for understanding remote sensing images, object detection has gained increasing attention in recent years. To deal with a variety of problems faced in optical remote sensing image object detection, numerous approaches have been proposed [5,6]. A thorough review of object detection in optical remote sensing images can be found in [7].
A common approach to object detection is to extract features. The quality of the extracted features is critical, as it directly affects the final detection result. A powerful feature representation makes an object more discriminative and its location more explicit, which makes the object easier to detect; conversely, an insufficient representation results in inaccurate detection. It is therefore important to choose an appropriate feature extraction method for object detection in remote sensing images. Currently, because of their ability to generate powerful feature representations directly from raw image pixels, deep learning methods, especially CNN-based ones [4,[8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25], are recognized as the predominant techniques for extracting features in object detection. Therefore, we select a CNN-based approach to extract features for object detection in optical remote sensing images.
Object detection in remote sensing images becomes more complicated because of the diversity of illumination intensities, noise interference, and the influence of weather. At present, there are still many problems to be solved, such as the diversity and complexity of geospatial object appearance and the insufficient understanding of geospatial object spatial structure information.
In the field of optical remote sensing images, many object detection algorithms pay attention only to the features of the objects themselves [16,17,26]. However, due to the diversity and complexity of geospatial object appearance, relying solely on the characteristics of an object itself often cannot identify the object effectively, and may even cause mis-detection between two objects which belong to different classes but look very similar. For instance, recognizing a storage tank only through its own features may be difficult, as its appearance is just circular, and a bridge is often mistaken for part of a road (as shown in Figure 1). In such cases, auxiliary information can effectively help detect objects, and contextual information is a natural choice. Some existing works [18,20,27] take local contextual information into account and obtain good performance. For example, the work in [20] used features surrounding the regions of interest, thus alleviating false detection caused by object appearance ambiguity. Although those methods yield good results, deficiencies remain. Relationships among objects also play an important role in improving detection performance. Therefore, in addition to local contextual information, the proposed method takes object-object relationship contextual information into consideration.
The spatial structure of geospatial objects plays an important role in recognizing them. Optical remote sensing images with high spatial resolutions always contain abundant spatial structure information about objects. Therefore, deeply investigating the structural information about objects can lead to good detection results, and it is necessary to design an object detector that effectively alleviates the insufficient understanding of geospatial object spatial structure. Each part of a geospatial object provides many local visual properties and much geometric information about the object, and paying attention to the various parts of an object helps us understand more details of its spatial structure. Many part-based models [28][29][30][31][32] concentrate on using the various parts of objects to improve detection performance. For example, Zhang et al. [28] proposed a generic discriminative part-based model (GDPBM), which divides a geospatial object with arbitrary orientation into several parts to achieve good performance for object detection in optical remote sensing images. Unlike the previous part-based approaches [28][29][30][31][32], which use traditional features such as the histogram of oriented gradients (HOG) [33], the proposed method applies a CNN-based technique to extract high-level features for better feature representation. In addition, parts of objects are easier to obtain and process in the proposed approach.
In this paper, we propose a novel multi-model decision fusion framework for object detection in remote sensing images. Aiming at the diversity and complexity of geospatial object appearance, we build a sub-network that fuses local contextual information and object-object relationship contextual information. To address the insufficient understanding of geospatial object spatial structure, we construct a part-based multi-region feature fusion sub-network. Furthermore, unlike many methods that use just a single model, we make a decision fusion over several models for better stability and robustness. To implement the multi-model decision fusion strategy, in addition to the above two sub-networks, we also fuse a baseline sub-network based on the Faster R-CNN model.
In summary, the major contributions of this paper are presented as follows.
(1) We propose a local contextual information and object-object relationship contextual information fusion sub-network based on the gated recurrent unit (GRU) to form discriminative feature representations, which can effectively recognize objects and reduce false detection between different types of objects with similar appearance. To the best of our knowledge, this is the first time object-object relationship contextual information has been introduced in the field of remote sensing image object detection.
(2) We propose a new part-based multi-region feature fusion sub-network to investigate more details of objects, which can diversify object features and enrich semantic information.
(3) We propose a multi-model decision fusion strategy to fuse the detection results of the three sub-networks, which can improve the stability and robustness of the model and obtain better algorithm performance.
The remainder of this paper is organized as follows. The second section gives a brief review of the related work on geospatial object detection, contextual information fusion, and the RoIAlign layer. In the third section, we introduce the proposed method in detail. The details of our experiments and results are presented in the fourth section. The last section concludes this paper with a discussion of the results.

Geospatial Object Detection
In the past decades, research in the field of remote sensing image object detection has made breakthrough progress. Many object detection algorithms have been proposed to address various problems [17,20,34]. For example, Cheng et al. [17] proposed a novel and effective approach to learn a rotation-invariant CNN (RICNN) model for addressing the problem of object rotation variations, achieved by introducing and learning a new rotation-invariant layer on the basis of existing CNN frameworks. Han et al. [34] combined weakly supervised learning (WSL) and high-level feature learning to tackle the problems of manual annotation and insufficiently powerful descriptors. Li et al. [20] put forward a novel region proposal network (RPN) including multiangle, multiscale, and multiaspect-ratio anchors to address the problem of geospatial object rotation variations, and also proposed a double-channel feature fusion network which can learn local and contextual properties to deal with the geospatial object appearance ambiguity issue.
Low-level features are often used for image analysis [35], and employing extracted low-level features of objects for object detection has been a very common approach. Such low-level features include the scale-invariant feature transform (SIFT) [3,34,36], the histogram of oriented gradients (HOG) [5,6,33], the bag-of-words (BoW) model [37][38][39], saliency [40,41], etc. For example, Tuermer et al. [5] used the HOG feature and disparity maps to detect airborne vehicles in dense urban areas. Shi et al. [6] developed a circle frequency-HOG feature for ship detection by combining circle frequency features with HOG features. Han et al. [40] proposed to detect multiple-class geospatial objects by integrating visual saliency modeling and the discriminative learning of sparse coding. Although these low-level features show impressive success in some specific object detection tasks, they have certain limitations because they do not represent the high-level semantic information required for identifying objects, especially when visual recognition tasks become more challenging.
Currently, deep convolutional neural network (CNN) models are widely used in the field of visual recognition [42][43][44], including object detection, owing to the powerful ability of CNNs to capture both low-level and high-level features. The region-based convolutional neural network (R-CNN) [8] is considered a milestone among CNN-based object detection approaches and achieves superior performance. Subsequently, many advanced object detection algorithms for natural images, such as Fast R-CNN [9], Faster R-CNN [10], YOLO [11], SSD [12], and Mask R-CNN [13], were proposed and yielded brilliant results. However, the aforementioned models cannot be directly utilized for geospatial object detection, because the properties of remote sensing images and natural images differ, and the direct application of those models to remote sensing images is not optimal. Researchers have done a great deal of work applying CNN-based models to detect geospatial objects in remote sensing images and achieved remarkable results [4,[15][16][17][18][19][20][21][22][23][24][25]45]. For example, the work in [4] utilized a hyperregion proposal network (HRPN) and a cascade of boosted classifiers to detect vehicles in remote sensing images. Long et al. [16] proposed a new object localization framework based on convolutional neural networks to efficiently achieve the generalizability of the features used to describe geospatial objects and obtained accurate object locations. Yang et al. [21] constructed a Markov random field (MRF)-fully convolutional network to detect airplanes.

Contextual Information Fusion
Contextual information is advantageous to various visual recognition tasks [18,20,27,[46][47][48][49][50][51][52][53], such as object detection. For example, in order to promote object detection performance, the work in [48] developed a novel object detection model, the attention to context convolution neural network (AC-CNN), by incorporating global and local contextual information into the region-based CNN detection framework. Bell et al. [49] presented the Inside-Outside Net (ION) to exploit information both inside and outside the regions of interest, integrating the contextual information outside the regions of interest using spatial recurrent neural networks. Furthermore, some recent works [50][51][52] proposed new architectures to investigate the contextual information about object-object relationships for better object detection performance. In the field of remote sensing images, the work in [20] fused local and contextual features to address the problem of object appearance ambiguity in object detection. Considering that appearance alone is not enough to distinguish oil tanks from a complex background, Zhang et al. [27] applied trained CNN models to extract contextual features, which makes oil tanks easier to recognize. Xiao et al. [18] fused auxiliary features both within and surrounding the regions of interest to represent the complementary information of each region proposal for airport detection, effectively alleviating detection problems caused by the diversity of illumination intensities in remote sensing images. Motivated by those models, we believe that local contextual information and object-object relationship context are very useful for object detection in optical remote sensing images. It is necessary to remember the features of the object itself before incorporating contextual information; this process of merging messages matches the memory characteristics of gated recurrent units (GRU) [54]. Therefore, we use GRUs to fuse the two types of features.
Next, we introduce how the j-th hidden unit in a GRU cell works. First, the reset gate r_j is obtained by

r_j = σ([W_r x + U_r h_{t-1}]_j),

where σ is the logistic sigmoid function, [·]_j indicates the j-th element of a vector, x is the input, h_{t-1} denotes the previous hidden state, and both W_r and U_r are learnable weight matrices. Similarly, the update gate z_j is calculated by

z_j = σ([W_z x + U_z h_{t-1}]_j).

The candidate activation h̃_j is computed as

h̃_j = φ([W x + U (r ⊙ h_{t-1})]_j),

and the actual activation of the proposed unit h_j is then calculated by

h_j^t = z_j h_j^{t-1} + (1 − z_j) h̃_j^t,

where φ denotes the tanh activation function and ⊙ indicates element-wise multiplication; W and U are learned weight matrices. As described in [54], the reset gate r effectively allows the hidden state to drop any information that is later found to be irrelevant, which provides a more compact information representation. On the other hand, the update gate z controls how much information from the previous hidden state carries over to the current hidden state. More details about the GRU can be seen in Figure 2.
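As a concrete illustration, the gate equations above can be sketched in a few lines of numpy. The weights here are random placeholders for illustration only, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU step following the gate equations of [54]."""
    r = sigmoid(Wr @ x + Ur @ h_prev)           # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)           # update gate
    h_cand = np.tanh(W @ x + U @ (r * h_prev))  # candidate activation
    return z * h_prev + (1.0 - z) * h_cand      # new hidden state

rng = np.random.default_rng(0)
d = 8
x, h_prev = rng.standard_normal(d), rng.standard_normal(d)
mats = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]
h = gru_step(x, h_prev, *mats)
print(h.shape)  # (8,)
```

In the sub-network, x would be a contextual feature vector and h_prev the processed feature of the original proposal box.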
An illustration of a gated recurrent unit (GRU) [54]. The update gate z selects whether the hidden state h_t is to be updated with a new candidate hidden state h̃. The reset gate r decides whether the previous hidden state h_{t-1} is ignored.

The RoIAlign Layer
RoIAlign [13] is based on RoIPooling [10]. As we know, RoIPooling performs two quantizations, first quantizing a floating-number RoI to the discrete granularity of the feature map and then subdividing the quantized RoI into spatial bins which are themselves quantized. Unlike RoIPooling, RoIAlign avoids any quantization of the RoI boundaries or bins. In the execution of RoIAlign, bilinear interpolation [55] is exploited to calculate the exact values of the input features at four regularly sampled locations in each RoI bin. The result after bilinear interpolation is aggregated by average pooling.
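A minimal sketch of the bilinear sampling at the heart of RoIAlign, averaging four regularly sampled points inside one bin without quantizing the bin boundaries. The single-channel toy feature map and the quarter-point sampling layout are illustrative assumptions:

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feature map `feat` (H x W) at a
    continuous location (y, x) -- the core operation of RoIAlign."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align_bin(feat, y0, x0, y1, x1):
    """Average four regularly sampled points inside one RoI bin
    (no quantization of the bin boundaries)."""
    ys = [y0 + (y1 - y0) * f for f in (0.25, 0.75)]
    xs = [x0 + (x1 - x0) * f for f in (0.25, 0.75)]
    return np.mean([bilinear(feat, y, x) for y in ys for x in xs])

feat = np.arange(16, dtype=float).reshape(4, 4)
print(roi_align_bin(feat, 0.5, 0.5, 2.5, 2.5))  # 7.5
```

A full RoIAlign layer would apply this per bin and per channel over the pooled output grid.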

Proposed Framework
The flowchart of the proposed object detection method is shown in Figure 3. The framework is based on the VGG16 model [56] and the popular Faster R-CNN detection framework [10]. First, given a remote sensing image, we employ part of VGG16 to extract object features and use the region proposal network (RPN) to generate region proposals. Unlike Faster R-CNN, which uses a RoI pooling layer to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent, we apply the RoIAlign layer proposed in Mask R-CNN. RoI pooling introduces misalignments between the RoIs and the extracted features; RoIAlign addresses these quantization-induced misalignments, thereby enhancing the ability to detect small and dense objects. Second, motivated by the work in [51] and to adapt to remote sensing images with complex backgrounds, we extract both local contextual information and object-object relationship contextual information and fuse them with a GRU. The fused feature is then employed to obtain the classification and regression results of the contextual information fusion sub-network. Next, we divide the object in each candidate region generated by the RPN into several parts and utilize the RoIAlign layer to pool each part. All parts are merged to gain better feature representations for detecting objects. After that, we perform classification and regression to obtain the results of the part-based multi-region sub-network. Finally, having separately obtained the results of the contextual information fusion sub-network, the part-based multi-region fusion sub-network, and the baseline sub-network, we execute a decision fusion on those results to acquire the final detection result, which we call multi-model decision fusion. Each component of the proposed framework is described as follows.

Local Contextual Information and Object-Object Relationship Contextual Information Fusion Sub-Network
Many works show the effectiveness of investigating features surrounding the regions of interest or relationships among objects [20,51]. Therefore, for object detection in remote sensing images, inspired by the work in [51], we construct our local contextual information and object-object relationship contextual information fusion sub-network. Unlike [51], which uses global contextual information from the entire image, we employ local contextual features around objects. For some objects in remote sensing images, distant scenes are more diverse, resulting in unstable contexts which are likely to act as noise that harms the detection result; that is why we choose to exploit local contextual information for geospatial object detection. In addition, we replace RoI pooling with RoIAlign because there exist many dense and small objects in remote sensing images. The features to be fused in the sub-network consist of three parts: local contextual information, features from the original candidate regions, and object-object relationship contextual information.
First, in the conv5 layer, we extract features from the original proposal boxes and from boxes enlarged to 1.8× the original proposal boxes; the features in the 1.8× boxes are used as local contextual information. The RoIAlign layer and the fully connected layer act on the two types of features in succession. Second, we build relationships among objects (as illustrated in Figure 4); the process is the same as in [51]. Let V represent the collection of candidate boxes generated by the RPN, with v_i denoting the i-th candidate box. We calculate the relationship between v_i and v_j by

e_{j→i} = relu(W_p r_{j,i}) · tanh(W_v [f_j, f_i]),

where e_{j→i} represents the influence of v_j on v_i and is a scalar weight, and W_p and W_v are learned weight matrices. The visual relationship vector [f_j, f_i] is formed by concatenating the visual features of the two boxes, and r_{j,i} is a spatial relation vector built from the box geometry, where (x_i, y_i) is the center of RoI b_i, w_i and h_i are the width and height of b_i, and s_i is the area of b_i. The final object-object relationship contextual information m_i is calculated by

m_i = max_{j∈V} e_{j→i} f_j,

that is, we choose the box which has the greatest impact on v_i as the final relationship contextual message to be integrated. Then, we exploit GRUs to merge the three features gained in the previous operations, taking the processed features from the original proposal boxes as the initial hidden states, and feeding both the relationship contexts and the processed local contextual features (from the 1.8× boxes) as inputs to two GRUs. Afterwards, we average the outputs of the two GRUs and denote the final feature as C. Finally, we apply C to obtain the class scores S_C and the predicted boxes R_C.
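The relationship computation can be sketched as follows. The exact composition of the spatial relation vector and the weight shapes are illustrative assumptions based on our reading of [51] (random weights, not trained ones):

```python
import numpy as np

def relation_weights(feats, boxes, Wp, Wv):
    """Scalar influence e_{j->i} of box j on box i: a relu over a
    spatial relation term times a tanh over concatenated visual
    features. Boxes are (cx, cy, w, h)."""
    n = len(boxes)
    e = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            xi, yi, wi, hi = boxes[i]
            xj, yj, wj, hj = boxes[j]
            # spatial relation vector from centre offsets, sizes, areas
            r = np.array([xj - xi, yj - yi, wj, hj, wj * hj, wi * hi])
            vis = np.concatenate([feats[j], feats[i]])
            e[j, i] = np.maximum(Wp @ r, 0.0) * np.tanh(Wv @ vis)
    return e

def relation_context(feats, e):
    """m_i: message from the single box with the greatest influence."""
    j_best = np.argmax(e, axis=0)
    return feats[j_best]

rng = np.random.default_rng(1)
feats = rng.standard_normal((3, 4))
boxes = [(10, 10, 4, 4), (12, 11, 3, 3), (40, 40, 5, 5)]
Wp = rng.standard_normal(6) * 0.1
Wv = rng.standard_normal(8) * 0.1
e = relation_weights(feats, boxes, Wp, Wv)
m = relation_context(feats, e)
print(m.shape)  # (3, 4)
```

Each m_i is then fed, together with the local contextual feature, into the GRU fusion described above.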

An illustration of building object-object relationships. The process is the same as in [51]. For object v_i, the message m_{1→i} from object v_1 to object v_i is controlled by e_{1→i}.
For large optical remote sensing images, it is necessary to use object-object relationship contextual information within meaningful limited regions rather than across entire images, because the effect of such contextual information on the detection result is negligible when the distance between two objects is too large. The images used in this paper are 400 × 400 pixels, just like limited regions cropped from large remote sensing images; therefore we can obtain object-object relationship contextual information over the entire images.

Part-Based Multi-Region Fusion Sub-Network
For a specific object proposal, paying attention to each part of the object within it helps to obtain much useful spatial structure information, and thus more semantic information, for better object detection performance. We use multiple parts of each object to acquire more local visual properties and geometric information, providing an enhanced feature representation.
The parts used include the original proposal box, its left half, its right half, its upper half, its bottom half, and the inner part obtained by scaling the proposal box by a factor of 0.7 (see Figure 5). First, we obtain these parts of each candidate region produced by the RPN and perform the RoIAlign operation on each. Second, we concatenate the pooled features along the channel axis. Then, a 1 × 1 convolution is applied to reduce the dimension of the concatenated feature, adapting it to the input shape of the fully connected layer. The feature is then fed into a fully connected layer to generate the final feature representation with more semantic information, which we denote as P. Finally, we utilize P to obtain the class scores S_P and the predicted boxes R_P.
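The six regions per proposal can be generated with simple box arithmetic. A minimal sketch, using an (x0, y0, x1, y1) corner convention that is our assumption here:

```python
def part_boxes(x0, y0, x1, y1, inner_scale=0.7):
    """The six regions used per proposal: whole box, left/right/
    upper/bottom halves, and an inner box scaled by `inner_scale`."""
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    iw, ih = w * inner_scale / 2.0, h * inner_scale / 2.0
    return [
        (x0, y0, x1, y1),                       # original proposal
        (x0, y0, cx, y1),                       # left half
        (cx, y0, x1, y1),                       # right half
        (x0, y0, x1, cy),                       # upper half
        (x0, cy, x1, y1),                       # bottom half
        (cx - iw, cy - ih, cx + iw, cy + ih),   # inner 0.7x box
    ]

print(part_boxes(0, 0, 10, 20))
```

Each of the returned regions is then pooled with RoIAlign before the channel-wise concatenation and the 1 × 1 convolution.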

Multi-Model Decision Fusion Strategy
The multi-model decision fusion strategy, relying on several detection results, is more robust than a single model, which may cause much false detection. In addition to the contextual information fusion sub-network and the part-based multi-region fusion sub-network, we also utilize a baseline sub-network that uses only the original proposal regions for object detection. In the baseline sub-network, we perform the RoIAlign operation in the same way as in the two aforementioned sub-networks. Then we employ a fully connected layer to obtain the final feature, denoted as B. Finally, we use B to obtain the class scores S_B and the predicted boxes R_B.
After obtaining the three types of class scores S_C, S_P, S_B and predicted boxes R_C, R_P, R_B, we make a decision fusion on them. The decision fusion ratio of S_C, S_P, and S_B is 2:1:1, as is that of R_C, R_P, and R_B, which provides better detection results in our experiments. Then, we use a softmax layer to obtain the final class labels of all predicted boxes. The loss function employed in this paper is the same as that in Faster R-CNN [10].
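A sketch of the 2:1:1 weighted decision fusion followed by a softmax. Normalizing by the weight sum is our assumption; the paper states only the ratio:

```python
import numpy as np

def decision_fusion(S_C, S_P, S_B, R_C, R_P, R_B, w=(2.0, 1.0, 1.0)):
    """Weighted fusion of the three sub-networks' class scores and
    box predictions at the 2:1:1 ratio."""
    ws = sum(w)
    S = (w[0] * S_C + w[1] * S_P + w[2] * S_B) / ws
    R = (w[0] * R_C + w[1] * R_P + w[2] * R_B) / ws
    return S, R

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# toy two-class scores for one box from each sub-network
S_C = np.array([[2.0, 0.0]])
S_P = np.array([[1.0, 1.0]])
S_B = np.array([[1.0, 1.0]])
R = np.zeros((1, 4))  # dummy box regressions
S, _ = decision_fusion(S_C, S_P, S_B, R, R, R)
print(softmax(S).argmax())  # 0
```

The higher weight on the contextual sub-network means its (confident) prediction dominates the fused score here.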

Experiments and Results
In this part, we first introduce the data set and evaluation metrics used for the experiments. Then, we describe the implementation details and parameter settings of the proposed method. The results and some comparisons with other methods are discussed afterward. The models were trained on a computer with two Intel Xeon E5-2630 v4 CPUs and two NVIDIA GeForce GTX 1080 GPUs. The operating system and deep learning platform used were Ubuntu 16.04 and TensorFlow 1.3.0, respectively.

Data Set
We evaluate the performance of the proposed object detection method on a publicly available data set: the NWPU VHR-10-v2 data set [20]. The data set stems from the positive image set of the original NWPU VHR-10 data set [31] and still contains ten classes of geospatial objects: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. There are 1172 images (400 × 400 pixels) in the data set. The data set is challenging because the objects are multi-category and multi-scale and the backgrounds are complex. In all experiments, the training and test data are the same as in [20]: 879 remote sensing images (75% of the data set) for training and 293 images for testing.

Evaluation Metrics
Here, we evaluate the performance of object detection methods using two standard and widely used measures described in [7], namely the precision-recall curve (PRC) and average precision (AP).

Precision-Recall Curve (PRC)
The Precision metric measures the fraction of detections that are true positives, and the Recall metric measures the fraction of positives that are correctly recognized. With the numbers of true positives, false positives, and false negatives denoted as TP, FP, and FN, respectively, the Precision and Recall metrics are obtained by

Precision = TP / (TP + FP), Recall = TP / (TP + FN).

The PRC metric is based on the overlapping area between a detection and the ground truth object. A detection is considered a true positive if the intersection over union (IoU) between the detection and the ground truth box exceeds a predetermined threshold; otherwise, the detection is marked as a false positive. Moreover, if several detections overlap with the same ground truth bounding box, only one is regarded as a true positive and the others are labeled as false positives. The intersection over union is formulated as

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt),

where B_p is the predicted bounding box and B_gt is the ground truth bounding box.
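The two metrics and the IoU test can be implemented directly:

```python
def iou(a, b):
    """Intersection over union of two boxes in (x0, y0, x1, y1) form."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143
print(precision_recall(8, 2, 4))        # (0.8, 0.666...)
```

With the 0.5 IoU threshold used in this paper, the example overlap above would be counted as a false positive.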

Average Precision (AP)
The AP calculates the average value of Precision over the interval from Recall = 0 to Recall = 1, namely the area under the PRC. Therefore, the higher the AP value, the better the performance.
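A common step-wise implementation of AP as the area under the precision-recall curve. The monotone-envelope interpolation is the usual detection-benchmark convention, which we assume here; the paper does not specify its exact integration scheme:

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the PRC, integrated over recall in [0, 1]."""
    # add sentinel points, then make precision non-increasing in recall
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum the rectangle areas where recall actually changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(average_precision(np.array([1.0, 1.0]), np.array([0.5, 1.0])))  # 1.0
```

A perfect detector (precision 1 at every recall level) yields AP = 1, matching the "higher is better" reading above.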

Implementation Details and Parameter Settings
The proposed model is based on the successful VGG16 network [56], pretrained on ImageNet [57]. To augment the training data, we flip all training images horizontally. For training our model, we utilize stochastic gradient descent with 0.9 momentum. The learning rate is initialized to 0.001 for 20k iterations; we then continue training for 10k iterations with a learning rate of 0.0001. The last fully connected layers for classification and bounding box regression are randomly initialized from zero-mean Gaussian distributions with a standard deviation of 0.001, while the other fully connected layers and the 1 × 1 convolutional layer use a standard deviation of 0.01. Biases are initialized to 0. For training the RPN, each mini-batch arises from a single image containing many positive and negative example anchors, and we randomly sample 128 anchors per image to calculate the loss function of a mini-batch. The sampled positive and negative anchors have a ratio of up to 1:1; if there are fewer than 64 positive samples in an image, we pad the mini-batch with negative ones. The entire model is trained end-to-end. Furthermore, we consider a detection correct if the IoU between the predicted bounding box and the ground truth bounding box exceeds 0.5; otherwise, the detection is considered a false positive. At test time, we employ Soft-NMS to reduce redundancy for better detection performance.
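The anchor sampling scheme described above (128 anchors per image, up to a 1:1 positive:negative ratio, padding with negatives when positives are scarce) can be sketched as:

```python
import random

def sample_anchors(pos, neg, batch=128):
    """Sample a mini-batch of anchor indices with an up-to-1:1 ratio,
    padding with negatives when fewer than batch/2 positives exist."""
    n_pos = min(len(pos), batch // 2)
    chosen_pos = random.sample(pos, n_pos)
    n_neg = min(len(neg), batch - n_pos)   # pad with negatives
    chosen_neg = random.sample(neg, n_neg)
    return chosen_pos, chosen_neg

p, n = sample_anchors(list(range(20)), list(range(1000)))
print(len(p), len(n))  # 20 108
```

With only 20 positive anchors available, the mini-batch is filled to 128 with 108 negatives, matching the padding rule above.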

Evaluation of Local Contextual Information and Object-Object Relationship Contextual Information Fusion Sub-Network
To evaluate the efficiency of our local contextual information and object-object relationship contextual information fusion sub-network, we designed a basic set of experiments. First, we run the standard Faster R-CNN model as a benchmark. Then, on the basis of the baseline sub-network, we incorporate the proposed sub-network, which fuses both local contextual information and object-object relationship contextual information. In the experiments, we find that using features extracted from boxes 1.8× the size of the original proposal boxes as local contextual features leads to better detection performance. In the field of remote sensing image object detection, some works [18,20,27] take local contextual information into account and thereby obtain good results. However, object-object relationship contextual information has not previously been shown to be beneficial for detecting geospatial objects. To illustrate its usefulness, we implement an experiment in which we incorporate a sub-network containing only local contextual information into the baseline sub-network. The detailed experimental results are summarized in Table 1. As shown in Table 1, adding the local contextual information and object-object relationship contextual information fusion sub-network yields an improvement of 4.24 percentage points in mean average precision (mAP) over the Faster R-CNN baseline network. This validates that our fusion sub-network has a strong ability to represent discriminative features of geospatial objects, providing useful contextual cues for better detection performance. In addition, Table 1 shows that the mAP improves from 92.42% (using only local contextual information) to 94.04% (using both local contextual information and object-object relationship contextual information), demonstrating that the object-object relationship contextual information plays an important role in achieving better detection performance for geospatial object detection. Furthermore, we execute an experiment to illustrate that local contextual information is more useful than global contextual information from the entire image in remote sensing image object detection. In this experiment, we replace local contextual information with global contextual information from the entire remote sensing image in the overall proposed framework. The results are shown in Table 1. In terms of mAP over all ten object categories, applying local contextual information outperforms the use of global contextual information from the entire image by 2.4%. This demonstrates that the use of local contextual information is critical, leading to better detection results than using global contextual information from the entire remote sensing image.

Evaluation of Part-Based Multi-Region Fusion Network
To verify that the part-based multi-region fusion sub-network has a positive effect on geospatial object detection, we compared the overall proposed model (including the part-based multi-region fusion sub-network) with the previous variant in which the framework only merges the baseline sub-network with the local contextual information and object-object relationship contextual information fusion sub-network. As can be seen from Table 1, incorporating the part-based multi-region fusion sub-network offers a further performance increase of 1.0 percentage point. This demonstrates that fusing multiple parts of each geospatial object can uncover more spatial structural information about the objects, which helps to diversify object features and enrich semantic information for forming a powerful feature representation.

Comparisons with Other Detection Methods
We compared the proposed approach with five state-of-the-art methods: the collection of part detectors (COPD) [31], a CNN model transferred from AlexNet [58], the rotation-invariant convolutional neural network (RICNN) [17], the rotation-insensitive and context-augmented object detector (RICAOD) [20], and Faster R-CNN [10]. In the implementation of the ten-class object detection task, the COPD is made up of 45 seed-based part detectors. Each part detector is a linear support vector machine (SVM) classifier corresponding to a particular viewpoint of an object class, so the collection provides a solution for rotation-invariant detection of multi-class objects. Exploited as a common CNN feature extractor, the transferred CNN model has shown great success in PASCAL Visual Object Classes object detection. To deal with object rotation variations, the RICNN introduces and learns a new rotation-invariant layer on top of the existing AlexNet architecture. The RICAOD uses multi-angle anchors for rotation-invariant object detection and combines local and contextual features to address the problem of appearance ambiguity. The quantitative comparison results of the six methods are shown in Table 3 and Figure 6, which report the AP values and PRCs, respectively. As can be observed in Table 3, in terms of mean AP over all ten object categories, the proposed approach outperforms the COPD method [31], the transferred CNN method [58], the RICNN method [17], the RICAOD method [20], and the Faster R-CNN method [10] by 40.15%, 35.43%, 21.93%, 7.92%, and 5.24%, respectively. In addition, we also obtain good detection accuracy in each category, with especially high AP values for airplane, storage tank, basketball court, ground track field, and harbor. These results demonstrate that the proposed method achieves much better performance than the existing state-of-the-art methods. Table 3 also shows the average running time per image for the six approaches. We can observe that the proposed method costs less computation time than all the other methods except Faster R-CNN.
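For reference, the per-category AP values in Table 3 summarize precision-recall curves such as those in Figure 6. A common choice in detection benchmarks is the 11-point interpolated AP from PASCAL VOC; the paper does not state which AP variant it uses, so the sketch below is an assumption rather than the authors' exact metric:

```python
import numpy as np

def voc_ap_11point(recall, precision):
    """11-point interpolated average precision (PASCAL VOC style).

    recall, precision: 1-D arrays of matched recall/precision values.
    At each of 11 recall thresholds, take the maximum precision
    achieved at any recall >= that threshold, then average.
    """
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```

The mean AP (mAP) reported in the tables is then the average of these per-category AP values over the ten classes.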
These results can be explained as follows. First, owing to the contextual features, which combine local contextual features and object-object relationship contextual features, the proposed method obtains a discriminative feature representation that effectively recognizes objects despite the diversity and complexity of their appearance, such as storage tanks and bridges. Second, the part-based multi-region fusion sub-network provides more spatial structure information about objects, so more semantic information can be obtained to enhance the feature representation. Third, the multi-model decision fusion strategy makes the algorithm more robust and improves detection performance, because it effectively operates three different single CNN-based models, each of which generates representative characteristics describing the object.
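The multi-model decision fusion described above can be sketched as a weighted combination of the class scores produced by the three sub-networks. The exact fusion rule and the example weights below are assumptions; the paper searches over 25 decision fusion ratios (Table 2) to pick the best combination:

```python
import numpy as np

def fuse_scores(scores_baseline, scores_context, scores_parts,
                weights=(0.4, 0.3, 0.3)):
    """Fuse per-class scores from three sub-networks by weighted averaging.

    Each scores_* argument is an array-like of class scores for one proposal;
    weights is one fusion ratio (an assumed example, not the paper's values).
    """
    wb, wc, wp = weights
    return (wb * np.asarray(scores_baseline, dtype=float)
            + wc * np.asarray(scores_context, dtype=float)
            + wp * np.asarray(scores_parts, dtype=float))
```

In practice, one would evaluate a grid of candidate weight triples on a validation set (as in Table 2) and keep the ratio with the highest mAP.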
Figure 7 shows a number of geospatial object detection results. The green boxes denote true positives, the red boxes denote false positives, and the yellow boxes indicate false negatives.

Conclusions
In this paper, we proposed a multi-model decision fusion framework for geospatial object detection. The framework combines a contextual information fusion sub-network, a part-based multi-region fusion sub-network, and a baseline sub-network to recognize and locate geospatial objects. The final detection results are obtained by performing decision fusion on the outputs of the three sub-networks. The proposed model achieves remarkable performance on the publicly available NWPU VHR-10-v2 data set. The experiments show that: (1) local contextual information and object-object relationship contextual information are beneficial for recognizing objects effectively and for alleviating mis-detections between different types of objects with similar appearance; (2) the part-based multi-region fusion sub-network provides more details of objects, alleviating the insufficient understanding of geospatial object spatial structure information; (3) the multi-model decision fusion strategy leads to a more stable and robust model and better algorithm performance; (4) the proposed framework produces more accurate object detection results than previous methods. In future work, we will continue to improve the proposed framework for better detection performance. Many fine details of small objects are lost due to pooling, which can make those objects impossible to identify; we will therefore consider using features from lower convolutional layers. In addition, we will consider designing an operator to obtain more accurate localization of detected objects.

Figure 1. Examples that are difficult to detect. (Left) Using only the appearance features in the red rectangle, just a circle, it is hard to identify the storage tank. (Right) The bridge and the road are easily confused.

Figure 3. The proposed framework, which is made up of four parts: (1) a contextual information fusion sub-network; (2) a part-based multi-region fusion sub-network; (3) a baseline sub-network; (4) a final multi-model decision fusion part.

Figure 5. Illustration of object parts used in the proposed framework. (a) Original candidate boxes. (b) Left-half part of candidate boxes. (c) Right-half part of candidate boxes. (d) Inner part obtained by scaling candidate boxes by a factor of 0.7. (e) Up-half part of candidate boxes. (f) Bottom-half part of candidate boxes.

Figure 6. Precision-recall curves (PRCs) of the proposed method and the other state-of-the-art methods for the airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle classes.

Figure 7. Some object detection results obtained using the proposed method. The true positives, false positives, and false negatives are denoted by green, red, and yellow rectangles, respectively.

Table 1. Detection results using different sub-networks. C-Gl: incorporate the contextual information fusion sub-network containing only global contextual information for the entire image. C-Lo: incorporate the contextual information fusion sub-network containing only local contextual information. C-Re: incorporate the contextual information fusion sub-network containing only object-object relationship contextual information. P: incorporate the part-based multi-region fusion sub-network.

Table 2. Comparison of detection results for 25 different decision fusion ratios.

Table 3. Comparison of detection results for the six different methods.