Aircraft Detection in High Spatial Resolution Remote Sensing Images Combining Multi-Angle Features Driven and Majority Voting CNN

Abstract: Aircraft are an important means of transportation and weaponry, and detecting them from remote sensing images is crucial in both civil and military fields. However, effective aircraft detection remains difficult due to the diversity of aircraft poses, sizes, and positions and the variety of objects in the image. At present, target detection methods based on convolutional neural networks (CNNs) lack sufficient extraction of remote sensing image information and post-processing of detection results, which leads to high missed detection and false alarm rates when facing complex and dense targets. To address these problems, we propose a target detection model based on Faster R-CNN that combines a multi-angle features driven strategy with a majority voting strategy. Specifically, we designed a multi-angle transformation module that transforms the input image to realize multi-angle feature extraction of the targets in the image. In addition, we added a majority voting mechanism at the end of the model to process the results of the multi-angle feature extraction. The average precision (AP) of this method reaches 94.82% and 95.25% on the public and private datasets, respectively, which is 6.81% and 8.98% higher than that of Faster R-CNN. The experimental results show that the method can detect aircraft effectively, obtaining better performance than mature target detection networks.


Introduction
Remote sensing images are a crucial and indispensable resource widely used in civil and military fields [1,2]. As one of the most important tasks and research hotspots in remote sensing image interpretation, object detection has attracted the attention of academia and industry as the spatial resolution of remote sensing images increases and the information contained in the images becomes richer. Aircraft, as an important means of transportation and weaponry, are among the most important targets in the field of object detection. The accurate detection of aircraft has crucial practical significance and military value [3]. Therefore, aircraft detection from remote sensing images has become a focus of attention.
With the increasing capacity of computer data processing, deep learning methods based on convolutional neural networks have made remarkable achievements in speech recognition, computer vision, autonomous driving, and other fields [4][5][6][7]. Compared with traditional methods, deep learning methods based on convolutional neural networks can extract higher-level features with richer semantic information and stronger robustness and generalization ability from samples in a data-driven way. In recent years, convolutional neural networks with excellent feature expression abilities have been widely used in image classification [8,9], object detection [10,11], and semantic segmentation [12][13][14], and object detection networks based on deep convolutional neural networks perform well on natural images. In view of this success, some researchers have tried to apply convolutional neural networks to aircraft detection in remote sensing images. To improve the detection accuracy of aircraft in remote sensing images, Hu et al. [15] used a saliency detection algorithm to reduce the number of proposal boxes, obtained target position information with a saliency algorithm based on a background prior, and finally used a deep convolutional neural network to determine the category and fine-tune the bounding boxes of the objects. Shi et al. [16] proposed an aircraft detection model called DPANet based on deconvolution and position attention to extract the external and internal structural features of aircraft. In addition, to address the multi-angle distribution of aircraft in remote sensing images, rotation-invariant detection networks based on convolutional neural networks have been proposed and widely used for detecting aircraft targets in remote sensing images [17,18].
Although applying convolutional neural networks to aircraft detection in remote sensing images can improve detection accuracy to a certain extent, two challenges remain in target detection for remote sensing images. Firstly, unlike natural images, remote sensing images contain more complex objects and are captured from a special imaging angle that records mostly the top-down appearance of objects, which results in similarity (e.g., spectral or geometric similarity) between the targets and other objects. In addition, remote sensing imaging is susceptible to interference from the atmosphere, electromagnetic waves, and so on, which makes object detection in remote sensing images difficult [19,20]. Secondly, the network cannot avoid information loss during feature extraction, so there is often a high rate of missed detection and false alarm when detecting complex, dense, small, or weak objects [21]. In general, all the above factors make it difficult to extract information from remote sensing images, resulting in a decrease in the accuracy of target detection. At present, mature object detection networks, such as R-CNN, SSD, and YOLO, have greatly improved detection accuracy, but in the face of the above difficulties, they still cannot avoid missed detections and false alarms [22][23][24].
To solve the above problems, we propose a simple, effective, and more universal multi-angle features driven method. By adding a multi-angle transformation module, the features of an object can be extracted from multiple angles to reduce the missed detection rate of the model. In addition, aiming at the common problem of false detections in existing object detection networks, we propose a detection-box post-processing method based on a majority voting strategy. By post-processing the detection results, we can further judge whether the detection boxes contain target objects, thereby reducing the possibility of misjudgment and improving the overall performance of the model. The experimental results show that the proposed method achieves better performance than existing object detection networks on both the public and private datasets.

One-Stage Target Detection Algorithm Based on Convolution Neural Network
The one-stage object detection algorithm based on convolutional neural networks has attracted increasing attention because of its simple structure, high computational efficiency, and high detection accuracy [25]. It regards object detection as a regression problem and uses mature convolutional neural networks, such as VGG [26] and ResNet [8], as the backbone to determine the object category and location. Representative one-stage object detection networks include SSD [27], YOLO [11], and methods improved from them, such as DSSD [28], YOLO-V2 [29], and YOLO-V3 [30]. Although the one-stage method has advantages in computation speed, its accuracy is usually lower than that of the two-stage method, and serious missed detections occur when facing remote sensing images covering a large area. Therefore, few scholars apply it directly to large-scale remote sensing image target detection.

Two-Stage Target Detection Algorithm Based on Convolution Neural Network
Compared with one-stage object detection methods, two-stage methods based on convolutional neural networks have higher accuracy and better performance in locating and recognizing targets, so they are widely used in the object detection of remote sensing images. The typical two-stage object detection networks are the R-CNN [31] series, such as R-CNN, Fast R-CNN [10], and Faster R-CNN [32]. In addition, some scholars, inspired by these ideas, put forward their own methods to detect objects in remote sensing images. Wu et al. [33] used the Edgeboxes algorithm to generate region proposals and then used convolutional neural networks to perform feature extraction and classification. Similarly, Yang et al. [34] proposed a "region proposal-classification-accurate object localization" framework for detecting objects in remote sensing images. However, all the above methods suffer from redundant candidate regions. To solve this problem, Liu et al. [35] proposed an aircraft detection method based on corner clustering and convolutional neural networks, which used the mean-shift clustering algorithm to generate candidate regions from the corner points detected in a binary image and then utilized a CNN to determine the target category in each candidate region. Although two-stage methods can improve the accuracy of object detection, the existing ones ignore the differences among features extracted after image rotation, so the extracted features are insufficient, which easily leads to false or missed detections. Therefore, how to use the object detection network to extract image information more comprehensively has become one of the future research directions.

Improvement on Mature Target Detection Networks
At present, in addition to directly applying mature target detection networks to remote sensing images, another idea is to improve the performance of the existing mature target detection networks, as summarized in Table 1. For example, adding a DepthFire module reduces the amount of calculation and improves processing efficiency, achieving real-time detection but with lower overall accuracy than two-stage target detection; Qu et al. [40] combined dilated convolution and feature fusion; and Yin et al. [41] designed an encoding-decoding module to detect small objects, achieving real-time, high-precision detection but with low utilization of image information.

By improving mature target detection networks, the existing achievements can be fully utilized. However, although improvements such as fine-tuning network parameters and structure have raised target detection performance, most of them optimize the feature extraction process or target certain object types. In the face of small, weak, and dense targets, missed detections and false detections still occur because of the low utilization rate of image information and the insufficient extraction of potential target features [21]. Therefore, based on the idea of making full use of remote sensing image information and extracting potential target characteristics as much as possible, we propose a target detection model based on multi-angle features driven and majority voting strategies.

Methods
The target detection model proposed in this paper consists of two modules: multi-angle features driven and majority voting. The multi-angle features driven module includes three parts: multi-angle transformation, the feature pyramid network (FPN) [21], and Faster R-CNN. The multi-angle transformation part is used to transform the images. The feature pyramid network is embedded in the backbone of Faster R-CNN to extract multi-scale features of the targets. The majority voting strategy is used to process the detection results of the multi-angle features driven strategy. In addition, ResNet achieves high accuracy and can effectively solve the optimization and training problems that arise as the number of layers deepens; therefore, ResNet-50 was utilized as the backbone network. We describe these two modules in detail below.

Aircraft Detection Based on Multi-Angle Features Driven Strategy
A remote sensing image rotated to a different angle becomes a new image compared with the original one. Therefore, the extracted features differ when the image is input to the network at different angles. The reasons for the feature differences caused by rotation are analyzed as follows: (1) In forward propagation, a neural network loses information through operations such as convolution and pooling [8,44], and the loss becomes more serious as the propagation depth increases. Therefore, when the image is input to the network at different angles, each orientation loses different information, which causes differences in the final features.
(2) For convolutional neural networks, the extracted features are closely related to the stride and the convolution kernel of the network. For an image input to the network at different angles, the extracted features differ even though the stride and convolution kernel are the same, because the image content is oriented differently relative to the convolution direction.
As shown in Figure 1, for a 15 × 15 image, the stride and padding of the network are set to 2 and 3, respectively, and the convolution kernel size is set to 7. The feature map extracted by the convolution process is different from that extracted after a 270° rotation of the image. Therefore, the multi-angle features driven strategy can effectively compensate for the differences in the extracted features, make better use of the information in remote sensing images, and reduce the missed and false detections caused by the loss of information.
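This effect can be reproduced with a toy NumPy convolution. The sketch below is our own illustration of the Figure 1 setting (15 × 15 image, 7 × 7 kernel, stride 2, padding 3), not the paper's code; the random image and kernel are arbitrary:

```python
import numpy as np

def conv2d(img, kernel, stride=2, pad=3):
    """Plain single-channel convolution (cross-correlation) with zero padding."""
    img = np.pad(img, pad)
    k = kernel.shape[0]
    h = (img.shape[0] - k) // stride + 1
    w = (img.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((15, 15))   # the 15 x 15 image of Figure 1
kernel = rng.random((7, 7))    # 7 x 7 kernel, stride 2, padding 3

feat_original = conv2d(image, kernel)
feat_rotated = conv2d(np.rot90(image, k=3), kernel)  # image rotated by 270°

# Rotating the second feature map back does NOT reproduce the first one:
# the convolution samples different pixel neighbourhoods in each orientation.
diff = np.abs(np.rot90(feat_rotated, k=1) - feat_original).max()
print(diff > 1e-6)  # True: the two feature maps differ
```

The difference vanishes only for kernels that are themselves rotation-symmetric, which learned convolution kernels generally are not.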
A rotation-invariant detection network is usually capable of identifying targets distributed at different angles in the image by adding a rotation-invariant layer, using a rotation-invariance regularization constraint, or directly using data augmentation [17,45,46].
The aircraft targets in the image are mostly distributed at random angles, which makes the network rotation-invariant to a certain extent through training on the dataset. Therefore, the multi-angle features driven strategy proposed in this paper does not focus on rotation invariance but emphasizes the use of the differences caused by image rotation: it performs multiple feature extractions at different angles on the input image. The purpose is to reduce the loss of target features and improve the performance of the model. Figure 2 shows the process of the multi-angle features driven strategy. In order to extract multi-angle features, we designed the multi-angle transformation module. It performs multi-angle (0°, 90°, 180°, 270°, and mirror image) transformations on the remote sensing image and inputs each transformed image into the trained target detection network for prediction. In addition, in order to reduce the loss of small-target information caused by the increase of network depth and to improve the overall detection performance of the network, we used a feature pyramid network to fuse the multi-scale features of the remote sensing image. The combination of the multi-angle features driven strategy and the feature pyramid network can effectively use the information in the image. To handle the inconsistent orientations of the detection boxes after the multi-angle transformation, we used a mapping function to map the detection boxes back to the corresponding positions in the original image, and these mapped boxes are then used in the subsequent majority voting process.
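The mapping step can be sketched as follows. This is a minimal illustration, not the paper's mapping function; the coordinate conventions (top-left origin, clockwise rotations, horizontal mirror) are our own assumptions:

```python
def transform_box(box, angle, w, h):
    """Transform a box (x1, y1, x2, y2) the same way a clockwise rotation
    by `angle` degrees (or a horizontal 'mirror') transforms a w x h image."""
    x1, y1, x2, y2 = box
    if angle == 0:
        return box
    if angle == 90:        # point map: (x, y) -> (h - y, x)
        return (h - y2, x1, h - y1, x2)
    if angle == 180:       # point map: (x, y) -> (w - x, h - y)
        return (w - x2, h - y2, w - x1, h - y1)
    if angle == 270:       # point map: (x, y) -> (y, w - x)
        return (y1, w - x2, y2, w - x1)
    if angle == "mirror":  # point map: (x, y) -> (w - x, y)
        return (w - x2, y1, w - x1, y2)
    raise ValueError(angle)

def map_box_back(box, angle, w, h):
    """Inverse mapping: bring a detection found on the transformed image
    back to the coordinates of the original w x h image."""
    u1, v1, u2, v2 = box
    if angle == 0:
        return box
    if angle == 90:
        return (v1, h - u2, v2, h - u1)
    if angle == 180:
        return (w - u2, h - v2, w - u1, h - v1)
    if angle == 270:
        return (w - v2, u1, w - v1, u2)
    if angle == "mirror":
        return (w - u2, v1, w - u1, v2)
    raise ValueError(angle)

# Round trip: a box survives transform + map-back unchanged.
box, w, h = (10, 20, 50, 80), 200, 100
for angle in (0, 90, 180, 270, "mirror"):
    assert map_box_back(transform_box(box, angle, w, h), angle, w, h) == box
```

After mapping, detections from all five views live in one coordinate system, which is what the majority voting step operates on.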

Detection Boxes Processing Based on Majority Voting Strategy
For the existing target detection networks, the final detection result is similar to a "one-shot decision". In other words, the preliminary detection results of the network are the final result, lacking the process of judgement for the detection results. Besides, although the strategy of multi-angle features driven can reduce the loss of information, it also causes the accumulation of wrong information. All the above reasons will lead to the high false alarm rate of the detection results, reducing the final target detection accuracy. Thus, a post-processing method of box detection called majority voting was proposed on the basis of the result of multi-angle features driven strategy. It was achieved by stacking the detection results after multi-angle feature extraction and voting on the stacking results of the detection boxes at each position.
We believe that when the number of votes n is in the majority (n ≥ 3), it can be determined that there is a positive sample at the position of the detection box. However, because of the limitations of the network structure, a positive sample may also exist in a detection box with a minority of votes (1 ≤ n ≤ 2). Therefore, in order to further improve the detection accuracy and the overall performance of the network, a simple binary classification network was designed in this paper to judge whether a positive sample exists in the detection boxes with a minority of votes. The binary classification network is described in detail in Section 2.2.3.
In the final step, in order to obtain the final results, the Intersection over Union (IOU) [47] index was utilized to remove redundant detection boxes. IOU is usually used to measure the degree of overlap between a detection box and the ground-truth box; here, it represents the degree of overlap between multiple detection boxes at the same position. IOU is calculated as shown in Formula (1):

IOU = Area∩ / Area∪ (1)

where Area∩ represents the intersection of the detection boxes and Area∪ represents their union. For multiple detection boxes at the same position, only one detection box is saved, and the redundant boxes are deleted when the IOU is higher than 0.5. The processing flow is shown in Figure 3.
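The voting and redundancy-removal steps can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; grouping boxes into IOU clusters and exposing the binary classifier as a `classifier` callback are our own simplifying assumptions:

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2), Formula (1)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def majority_vote(detections_per_view, accept_votes=3, iou_thr=0.5, classifier=None):
    """Stack detections from the five transformed views (boxes already mapped
    back to original coordinates) and vote at each position. Boxes with
    >= accept_votes votes are kept; boxes with 1-2 votes are passed to the
    binary classification network (here an arbitrary callback)."""
    clusters = []  # each entry: [representative box, vote count]
    for boxes in detections_per_view:
        for box in boxes:
            for cluster in clusters:
                if iou(box, cluster[0]) > iou_thr:  # same position: one more vote
                    cluster[1] += 1
                    break
            else:
                clusters.append([box, 1])
    kept = []
    for box, votes in clusters:
        if votes >= accept_votes:
            kept.append(box)   # majority of votes: accept directly
        elif classifier is not None and classifier(box):
            kept.append(box)   # minority of votes: binary classifier decides
    return kept

# One box detected (slightly shifted) in four views, a spurious box in one view.
views = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [(0, 1, 10, 11)],
         [(0, 0, 10, 10), (50, 50, 60, 60)], []]
print(len(majority_vote(views, classifier=lambda b: False)))  # 1
```

Keeping one representative box per cluster also performs the redundancy removal of Figure 3, since overlapping boxes (IOU above the threshold) collapse into a single output.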


The Binary Classification Network
Considering the hardware and time cost, the binary classification network used in this paper adopted the simplified VGG-16 network. The standard VGG-16 network requires the size of the input image to be 224 × 224. However, according to the size of the detection boxes, the size of the input image was adjusted from 224 × 224 to 64 × 64. Meanwhile, in order to make the network more lightweight, we deleted the last three convolution layers and pooling layer in the original model and adjusted the input dimension of the full connection layer from 7 × 7 × 512 to 4 × 4 × 512. By modifying the structure of VGG-16 network, the amount of training parameters of the whole network is greatly reduced, which makes it more suitable for the binary classification task in this study. Finally, for the selection of the loss function and optimizer of the binary classification network, we used the CrossEntropy Loss and Adam optimizer of the original network. The specific structure of binary classification network used in this paper is shown in Figure 4.
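The simplified network can be sketched in PyTorch as below. The layer widths follow standard VGG-16 with the last three convolution layers and the final pooling layer removed, so a 64 × 64 input yields a 4 × 4 × 512 feature map as stated above; the hidden size of the fully connected layer is our own assumption, since the paper does not specify it:

```python
import torch
import torch.nn as nn

class SimplifiedVGG16(nn.Module):
    """Sketch of the binary classifier: VGG-16 without its last three conv
    layers and last pooling layer; 64 x 64 input -> 4 x 4 x 512 features."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Standard VGG-16 config truncated after the fourth pooling stage.
        cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M"]
        layers, in_ch = [], 3
        for v in cfg:
            if v == "M":
                layers.append(nn.MaxPool2d(2, 2))
            else:
                layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 4 * 512, 1024),  # hidden size 1024 is a hypothetical choice
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimplifiedVGG16()
logits = model(torch.zeros(1, 3, 64, 64))   # shape: (1, 2)

# Loss and optimizer as stated in the paper.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

Truncating after the fourth pooling stage keeps four 2 × 2 poolings, i.e. a downsampling factor of 16, which is exactly why a 64 × 64 input produces the 4 × 4 × 512 tensor entering the fully connected layer.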


Comprehensive Accuracy Evaluation Method
In order to evaluate the effectiveness of the proposed method, the AP value used by Pascal VOC 2012 was introduced into our experiments as the performance evaluation index of object detection, which can effectively evaluate the performance of the network in a certain category. In general, the higher the AP value, the better the performance of the network in that category [48]. The AP value can be obtained by calculating the area under the smooth curve formed by a series of Recall and Precision values. Precision, Recall, and AP are calculated as follows:

Precision = TP / (TP + FP) (2)

Recall = TP / (TP + FN) (3)

AP = ∫₀¹ P(R) dR (4)

where TP denotes the number of true positives, FP denotes the number of false positives, FN denotes the number of false negatives, and P(R) denotes the Precision value corresponding to the Recall value R. At the same time, in order to study the time consumption of the model, the Average Time index was employed to evaluate the detection speed of the model. The Average Time refers to the time from the input of an image to the output of the final result, including the time spent on pre-processing, network inference, and post-processing. The Average Time is calculated as follows:

Average Time = (1/n) Σₖ₌₁ⁿ (ETₖ − STₖ) (5)

where n is the number of images in the test dataset, ETₖ is the output time of the k-th result, and STₖ is the input time of the k-th image.
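The metrics above translate directly into code. The sketch below is our own illustration: the AP integral is approximated by simple rectangles over increasing recall, whereas Pascal VOC additionally smooths precision with a running maximum before integrating (that step is omitted here):

```python
def precision_recall(tp, fp, fn):
    """Precision and Recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, summed as rectangles over
    increasing recall values (simplified; no VOC precision envelope)."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def average_time(input_times, output_times):
    """Mean per-image wall time: (1/n) * sum_k (ET_k - ST_k)."""
    return sum(et - st for st, et in zip(input_times, output_times)) / len(input_times)

print(precision_recall(8, 2, 2))  # (0.8, 0.8)
```

For example, 8 true positives with 2 false positives and 2 false negatives give Precision = Recall = 0.8.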

Experiments
All the experiments in this article were performed on the Windows 10 system, and PyTorch was employed as the deep learning framework. The hardware configuration consisted of an Intel Core i5-9400F CPU, 16 GB of RAM, and an Nvidia GeForce RTX 2080Ti GPU with 11 GB of memory.

Object Detection Datasets
In order to verify the effectiveness of the model, we evaluated its performance on three remote sensing image datasets, namely RSOD [49], DIOR [50], and a private dataset, referred to as Datasets I-III, respectively. Some samples of the datasets are shown in Figure 5.
The RSOD (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-, accessed on 23 April 2021) dataset contains four typical target categories of remote sensing images. It has 976 images captured by Google Earth from airports around the world, with manually annotated target locations and attributes. We only used the images of aircraft as our training dataset. The DIOR (http://www.escience.cn/people/gongcheng/DIOR.html, accessed on 23 April 2021) dataset is a large public dataset proposed by Northwestern Polytechnical University in 2018, which contains 23,463 images and 20 land-cover categories, including about 1200 images of aircraft; we randomly selected 300 of them as our test dataset. The object instances in the DIOR dataset cover a wide range of spatial resolutions and exhibit high inter-class similarity and intra-class diversity. Moreover, the images were obtained under different weather, season, and imaging conditions, so the dataset can serve as a test set to better verify the robustness of the network.
In addition, we collected 25 images with different sizes and spatial resolutions from Google Earth, containing a total of 1064 aircraft. The coverage of the images in this dataset is relatively large, and the aircraft appear as small and weak targets, so it can be used to verify the generalization and practicability of the model. The basic information of the different image datasets is shown in Table 2. The binary classification dataset adopted in this paper consists of 1200 image blocks with different sizes and resolutions arbitrarily cropped from the RSOD dataset, including 800 positive samples and 400 negative samples. Research shows that data augmentation enables the network to fully learn the variations of objects and enhances its ability to recognize complex changes. Therefore, we expanded the dataset from 1200 to 4700 samples by means of rotation and cropping, including 3000 positive samples and 1700 negative samples. We randomly selected 80% of the samples as the training set and 20% as the test set. Partial samples of the dataset are shown in Figure 6. Sample sizes in the dataset range from 30 × 30 pixels to 100 × 100 pixels.

Training of the Network
In this paper, network training was divided into two parts. The first part is the training of the target detection network. Based on the pre-training weights provided by the official, we embedded the feature pyramid network into the backbone of Faster R-CNN and used the training dataset for transfer learning. The second part is the training of binary classification network. The dataset was trained based on the simplified VGG-16 network, and the binary classification network was saved as a ".pkl" file for parsing and calling by the model.

Parameters Setting
Effective training has an extremely important influence on the performance of the network, so some training parameters need to be clearly stated. For training the target detection network, the learning rate, which controls the learning progress, was set to 0.005, and the number of epochs and the batch size were 100 and 5, respectively. In addition, as a crucial part of detection box post-processing, the parameter settings and the training result of the binary classification network directly affect the final accuracy. Therefore, for training the binary classification network, the learning rate was set to 0.0001, while the number of epochs and the batch size were 50 and 32, respectively.

Experimental Results
In order to evaluate the effectiveness of the model proposed in this paper, we demonstrated the detection results on different datasets. We used the trained model on the test datasets, which include a total of 325 images and 5007 aircraft. Figure 7 shows part of the detection results on the test datasets obtained with the proposed model. It can be seen from the figures that the model can detect targets effectively even for large-scale images or small, dense targets. In addition, we used the comprehensive accuracy evaluation method proposed in Section 2.2.4 to quantitatively evaluate the detection results; the results are shown in Table 3.
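The AP metric used throughout the quantitative evaluation can be computed from the ranked detections in the usual way. The following is a generic all-point interpolated AP sketch, not the paper's exact evaluation code.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """All-point interpolated AP, as commonly used for detector evaluation.

    `scores`/`is_tp`: confidence and true-positive flag per detection;
    `n_gt`: number of ground-truth aircraft.
    """
    order = np.argsort(scores)[::-1]
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Make precision monotonically decreasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Toy example: 3 detections, 2 ground-truth aircraft.
print(round(average_precision([0.9, 0.8, 0.7], [1, 0, 1], 2), 3))  # → 0.833
```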

Comparison with the Advanced Models
In Tables 4 and 5 and Figure 8, we compare the results of our method on Dataset II and Dataset III with the performance of state-of-the-art target detection networks, including SSD300, YOLOv4 and Faster R-CNN. As we can see from Tables 4 and 5 and Figure 8, the method combining the multi-angle features driven strategy and the majority voting strategy proposed in this paper has the best performance in terms of AP. The AP of the proposed model is 94.82% on Dataset II, better than that of the most advanced target detection networks at present. On Dataset III, the AP reaches 95.25%. In addition, on Dataset II, with a small image range and clear targets, all models achieved good performance. However, on Dataset III, with a large image range and small, weak targets, the performance of the one-stage target detection networks dropped considerably, while that of the two-stage networks changed relatively little. The performance of our proposed method combining the multi-angle features driven strategy and the majority voting strategy remained stable, with an AP of 95.25%.

The Effectiveness of Multi-Angle Features Driven Strategy
To verify the effectiveness of the multi-angle features driven strategy, we evaluated the model on Dataset II and Dataset III without the multi-angle transformation module. As we can see from Tables 6 and 7, the AP of the model with the multi-angle features driven strategy is significantly higher than that of the model without it, reaching 93.09% and 94.51%, respectively, an improvement of 5.08% and 8.24% over the model lacking the strategy. Figure 9 shows some samples comparing the results of the model with and without the multi-angle features driven strategy. As can be seen from the figures, although targets are missed due to the large image range or the small, dense aircraft in the image, the missed aircraft are detected again through the multi-angle features driven strategy, which reduces the missed detection rate of the network.
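The multi-angle strategy requires mapping boxes detected in each rotated view back to the original image's coordinate frame before they can be compared. A minimal sketch of that coordinate bookkeeping, assuming 90-degree rotations in the numpy `rot90` (counter-clockwise) convention; the function names are illustrative, not the paper's implementation:

```python
def rotate_point_ccw(x, y, h, w):
    """One 90-deg CCW rotation (numpy rot90 convention) of a pixel
    coordinate in an h x w image; returns the new (x, y) in the w x h image."""
    return y, w - 1 - x

def box_to_original(box, k, H, W):
    """Map an axis-aligned box detected in an image rotated by k*90 deg CCW
    back to the original H x W image; box = (x1, y1, x2, y2)."""
    k = k % 4
    h, w = (W, H) if k % 2 else (H, W)   # dimensions of the rotated frame
    corners = [(box[0], box[1]), (box[2], box[3])]
    for _ in range((4 - k) % 4):         # undo by completing a full turn
        corners = [rotate_point_ccw(x, y, h, w) for x, y in corners]
        h, w = w, h
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return min(xs), min(ys), max(xs), max(ys)

# A corner pixel of a 4 x 6 image, seen in the 90-deg-rotated view at (0, 0),
# maps back to (5, 0) in the original frame:
print(box_to_original((0, 0, 0, 0), 1, 4, 6))  # → (5, 0, 5, 0)
```

Once all views' boxes live in the same frame, the voting step can match them by overlap.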

Therefore, it can be proved that the multi-angle features driven strategy can effectively use the differences caused by image rotation, reduce the loss of information to a certain extent, and has practical significance for improving the detection accuracy of targets in remote sensing images.

The Effectiveness of Majority Voting Strategy
In this part, we compared the detection results with and without the majority voting strategy to verify its effectiveness. The experimental results are shown in Tables 8 and 9. It can be seen from Tables 8 and 9 that the model with the majority voting strategy has the best performance on both datasets. Compared with the model without it, the AP of our model increased by 1.73% and 0.74% on Dataset II and Dataset III, respectively. At the same time, we can see from Figure 10 that the false detections of aircraft targets are eliminated to a certain extent by the majority voting processing, which reduces the false alarm rate of the model.
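The voting idea can be sketched as follows: boxes from the differently rotated views (already mapped to a common frame) are grouped by IoU overlap, and a detection is kept only when a majority of views agree. This is a simplified illustration; the paper additionally re-checks minority ("inferior") boxes with the binary classification network, which is omitted here.

```python
def iou(a, b):
    """Intersection over union of boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def majority_vote(per_angle_boxes, iou_thr=0.5):
    """Keep a detection only if boxes from a strict majority of the rotated
    views agree on it (greedy IoU grouping); agreeing boxes are averaged."""
    n_angles = len(per_angle_boxes)
    groups = []  # each group: list of (angle_idx, box)
    for angle, boxes in enumerate(per_angle_boxes):
        for box in boxes:
            for g in groups:
                if iou(g[0][1], box) >= iou_thr:
                    g.append((angle, box))
                    break
            else:
                groups.append([(angle, box)])
    kept = []
    for g in groups:
        votes = len({angle for angle, _ in g})
        if votes > n_angles / 2:          # strict majority of views
            coords = [b for _, b in g]
            kept.append(tuple(sum(c[i] for c in coords) / len(coords)
                              for i in range(4)))
    return kept

# Three views: the first box is confirmed by all views, the second by one only.
views = [
    [(10, 10, 50, 50), (200, 200, 240, 240)],
    [(12, 11, 52, 49)],
    [(9, 10, 51, 50)],
]
print(majority_vote(views))  # only the box seen in all three views survives
```

The grouping-and-averaging step also serves as de-duplication, since overlapping boxes from different views collapse into a single output box.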

Therefore, the combination of multi-angle features driven and majority voting strategies proposed in this paper can effectively improve the overall detection performance of the model.

The Limitation of the Model
Although the model proposed in this paper performs well in terms of AP, it also has certain limitations. As we can see from Tables 4 and 5, our model improves AP, but its time consumption increases accordingly. The main reasons are as follows: (1) Compared with single feature extraction, we perform multiple feature extractions on the image to reduce the loss of information, but this also increases the complexity of the model, resulting in extra time consumption.
(2) In the post-processing of the detection boxes, the voting and de-duplication of the detection boxes must be completed through multiple loops, which is another crucial reason for the increased time consumption of the model.
Besides, through further exploration of the missed detections and false detections that still exist in the detection results, we summarize the reasons as follows: (1) In network training, the hyperparameters, such as the learning rate and batch size, cannot be selected optimally, which is one of the reasons for missed detections.
(2) Although the multi-angle features driven strategy can reduce the loss of target features to a certain extent, it still cannot fully compensate for the deficiencies of the network structure.
(3) The false alarm rate has been reduced to a certain extent through the majority voting strategy, but the accuracy of the binary classification network in judging the targets in the inferior boxes is limited, which is one of the reasons why false alarms still exist.
Therefore, in future work we will pay more attention to the further simplification and optimization of the model structure and the voting algorithm.

Conclusions
Most target detection models based on convolutional neural networks cannot fully extract features from remote sensing images, and the detection results are mostly left unprocessed, which easily leads to missed detections and false detections of targets. In this paper, we presented a multi-angle features driven method and a majority voting strategy to adequately extract features from high-resolution remote sensing images and to optimize the target detection results. Combining these methods, an aircraft detection model for high-resolution remote sensing images was proposed. Experimental results showed that the model can greatly reduce the missed detection rate and false alarm rate in target detection.
Several groups of comparative experiments verified that the performance of the proposed model is clearly better than that of existing target detection networks. Although the model consumes more time than existing target detection networks, this is acceptable in practical applications given the improvement in accuracy brought by the proposed model and the continuing development of computer hardware. In addition, it should be noted that this paper mainly addresses aircraft detection, and the proposed method is theoretically applicable to other kinds of targets in remote sensing images. In the future, we will also conduct detection experiments for other target types.