Non-locally Enhanced Feature Fusion Network for Aircraft Recognition in Remote Sensing Images

Abstract: Aircraft recognition has great application value, but aircraft in remote sensing images suffer from low resolution, poor contrast, poor sharpness, and a lack of detail caused by the vertical view, which makes aircraft recognition very difficult. The fine-grained recognition of aircraft is even more challenging when there are many kinds of aircraft and the differences between them are subtle. In this paper, we propose a non-locally enhanced feature fusion network (NLFFNet) that attempts to make full use of the features from discriminative parts of aircraft. First, exploiting the long-distance self-correlation in aircraft images, we adopt a non-locally enhanced operation that guides the network to pay more attention to the discriminating areas and enhances the features beneficial to classification. Second, we propose a part-level feature fusion mechanism (PFF), which crops 5 parts of the aircraft on the shared feature maps, extracts the subtle features inside the parts through the part full connection layer (PFC), and fuses the features of these parts through the combined full connection layer (CFC). In addition, by adopting an improved loss function, we increase the weight of hard examples in the loss while reducing the weight of excessively hard examples, which improves the overall recognition ability of the network. The dataset includes 47 categories of aircraft, many of which belong to the same family and differ only slightly in appearance. Our method achieves 89.12% accuracy on the test dataset, which demonstrates its effectiveness.


Introduction
With the development of space technology, the remote sensing image has become an effective means to survey and monitor resources, environment, urban layout, and traffic facilities, playing an increasingly important role in these fields. As a subtask of remote sensing image processing, aircraft recognition is of great practical demand and application value. In our study, a remote sensing dataset with 47 types of aircraft is collected from GoogleEarth, and many types belong to the same family with very slight differences between them. Therefore, our research is a fine-grained recognition task of aircraft in remote sensing images, which is very challenging.
On the one hand, aircraft recognition in remote sensing images is more difficult than in ordinary optical images. Generally, remote sensing images of aircraft are acquired at different times and on different platforms, and the light condition, atmospheric transparency, and sensor performance will cause great differences in the imaging effect. Compared with ordinary optical images, remote sensing images have their own unique characteristics which cause difficulties in image recognition: (1) The size of aircraft is generally tens of meters, so the aircraft has only a few pixels in the remote sensing image structural sparse restricted Boltzmann machine (RBM). An et al. [12] propose a new idea to address the aircraft type recognition problem by the aircraft's landmark detection, and use a convolutional neural network called the vanilla network for all landmark regressions. Zuo et al. [20] use a convolutional neural network (CNN) for semantic segmentation, and then put the segmented aircraft mask into the classification algorithm. Zhang et al. [2] train a conditional generation of adversarial network from which the multi-scale characteristics of aircraft can be extracted. Fu et al. [4] adopt a multi-class activation map to locate the aircraft in the image and use a mask filter to eliminate interference in the original image.
In the field of image processing, self-similarity in an image has received growing interest. Perceptual grouping follows principles such as proximity and similarity to extract groups from initial primitives (such as edges and curves), organize them into sets that have similar "perceptual" content, and use the sets for recognition. Kim et al. [21] propose a hierarchical approach to extracting lines and polygons in digital images based on perceptual grouping. Randall et al. [22] propose a Hierarchical Cluster Model to extract object symmetries from a digital image. Michaelsen et al. [23] propose a method for building recognition in high-resolution SAR images based on perceptual grouping, which makes use of symmetry and repetitive similar structures in remote sensing images. The bag of visual words model is another way to exploit similarity; it clusters similar visual descriptors together to form a visual vocabulary. Csurka et al. [24] assign patch descriptors to a set of predetermined clusters, construct a bag of keypoints that counts the number of patches assigned to each cluster, and then train multi-class classifiers using the bags of keypoints as feature vectors. Batista et al. [25] propose a technique based on a bag-of-keypoints representation to identify images containing buildings in the APM photographic collection. The self-correlation matrix of the image is a simple and straightforward way to model the correlation between long-distance pixels. Des et al. [26] first propose a non-local denoising algorithm based on image self-correlation. References [27][28][29] use non-local means algorithms to remove noise in remote sensing images such as hyperspectral images and radar images.
Human visual recognition suggests that, when the differences between categories are very subtle, we first need to locate some important parts of the object and then carefully observe the subtle features inside those parts. Zhang et al. [30] utilize the Selective Search [31] algorithm and R-CNN [32] to locate the head and body of birds in the sub-classification task of 200 species of birds [33], but the Selective Search algorithm consumes a lot of computational resources. Huang et al. [34] adopt a fully convolutional network to locate key points on the bird, take a 6 x 6 region as the part of interest, and propose a two-stream classification network to encode object-level and part-level cues simultaneously. Zhou et al. [35] propose a generic technique called class activation mapping (CAM), which enables a CNN to locate distinguishing or informative areas in an image without using any bounding box annotations. Peng et al. [36] obtain the saliency map by CAM and use it as the object-level attention; object-part spatial constraints are then used to select discriminative parts from the candidate parts.
One of the problems with existing deep-learning-based recognition methods for aircraft in remote sensing images is that the long-distance correlation in aircraft images is not properly utilized to enhance the distinguishing features. These methods use a CNN for feature extraction, but a CNN has a limited receptive field: it can only utilize local information within a certain range, cannot establish relationships between long-distance pixels, and cannot comprehensively utilize global information. In fact, the long-distance correlation in aircraft images is very obvious, for example in significant object edges, symmetrical wings, and recurring engines. Fine-grained classification especially requires attention to certain structures or details of the aircraft. However, because of adverse factors such as low resolution, poor contrast, and poor sharpness, some structures are easily neglected by a CNN. If similar structures exist elsewhere in the image, the correlation between them can be used to enhance the structures that are useful for classification, suppress interfering and irrelevant information, and reduce the recognition error rate. Although perceptual grouping and the bag of visual words model are both effective ways to group or cluster similar structures, their representation ability is not as good as that of a deep neural network, and it is difficult for them to adjust the weights of the clusters for fine-grained classification. References [27][28][29] use the non-local means algorithm to remove noise in remote sensing images, but self-correlation is used there only for filtering and denoising, as a preprocessing step, and is not integrated into the training process of a deep neural network. Wang et al. [37] introduce the non-local operation into deep neural networks for the first time and design a residual non-local module for the video classification task, which makes up for the deficiency of CNNs in global information perception. Li et al. [38] use the non-local operation to remove superimposed raindrops from images and achieve a good effect. In this paper, the non-local operation is introduced into a neural network for fine-grained classification of aircraft in remote sensing images, so that the network can make full use of the redundant patterns in the aircraft image and learn the long-distance correlation of aircraft. Guided by the loss function, the network gradually focuses on the structures and details that are beneficial to classification and suppresses other useless features during iterative training.
Another problem existing in recognition methods for aircraft in remote sensing images based on deep learning is that they all treat the aircraft as a whole to extract features, instead of examining detailed features inside its parts. They distinguish only a few types of aircraft, and the differences between categories are relatively obvious. Therefore, the aircraft can be classified correctly by only the overall features, and there is no need for further feature extraction and fusion of the internal details of the parts. However, there are 47 kinds of aircraft in our dataset, especially different subcategories in the same series, many of which are only slightly different from each other, so we must locate the parts of aircraft and distinguish internal details inside the parts subtly.
Although Zhang et al. [30], Huang et al. [34], Zhou et al. [35], and Peng et al. [36] propose methods for part localization, these methods are not suitable for our classification task of aircraft in remote sensing images. Firstly, their part localization and cropping methods do not take full advantage of the geometric features of aircraft. Unsupervised localization methods [35,36] generate fuzzy and irregular part boundaries that easily contain irrelevant areas, and their positioning accuracy is poor. The supervised localization method [34] adopts a part cropping strategy that is not suitable for aircraft: it crops a box centered on the key point, which makes the bounding box relatively loose and may include some background. Secondly, when fusing the features of each part, some methods [30,36] put each part into a separate network for feature extraction without sharing the feature extractor, while others such as the method in [34] share the feature extractor but give each part no unique parameters, so the parts cannot be adjusted individually. In this paper, we address these problems as follows. When locating the parts, we use the geometric features of the aircraft to align its posture and adopt a reasonable cropping strategy that directly crops 5 geometric regions based on key points, so the parts are localized accurately and efficiently. When fusing the features of each part, we first put the whole image into the shared feature extractor, then crop the corresponding feature sub-maps of each part on the output of the feature extractor, and then add a part full connection layer (PFC) after the feature sub-maps of each part to learn the detailed features inside the part. This design not only shares the feature extractor but also keeps the flexibility of each part to adjust independently, which improves the network's ability to extract detailed features.
In this paper, a complete classification framework for aircraft is constructed based on deep learning. Firstly, we adopt a CNN feature extractor to obtain the feature map of the original image. In view of the symmetry, repetitive structure, and obvious geometric shape of the aircraft, we insert a non-locally enhanced module into the feature extractor, which utilizes the self-correlation operation to enhance the features that are effective for aircraft classification. On this basis, we use key points to crop 5 parts of the aircraft on the feature map, extract the detailed features of each part with the part full connection layer (PFC), and then integrate the features of all parts with the combined full connection layer (CFC) to complete the final classification. The main contributions of this paper include:
(1) As far as we know, our proposed NLFFNet is the first work that attempts to incorporate non-local operations into a neural-network-based remote sensing image processing task. It obtains a global receptive field through the self-correlation algorithm and guides the network to pay more attention to the discriminating structures or details, so as to enhance the features that are effective for classification.
(2) We propose an efficient method for part localization. According to the appearance characteristics of the aircraft, we develop a reasonable cropping strategy, based on which we utilize 5 key points to generate the part masks, and then crop 5 parts by using these masks. On one hand, the acquired parts provide the prerequisite for the subsequent extraction of subtle features inside each part; on the other hand, these parts combine together to form the mask of the whole aircraft, which can eliminate the interference of irrelevant backgrounds.
(3) We realize an efficient part-level feature fusion mechanism. We obtain the feature map of the original image with a shared feature extractor and crop the corresponding feature sub-maps of each part. The part full connection layer (PFC) is then used to extract the detailed features inside each part, and the combined full connection layer (CFC) fuses the features of all parts. In this way, we not only share the feature extractor but also keep the flexibility of each part to adjust independently, which greatly enhances the ability to recognize subtle differences.
(4) By adopting the improved loss function, we increase the weight of hard examples in the loss function and reduce the weight of examples that are too hard to be recognized, such as outliers, so as to improve the overall recognition performance of the network.

Proposed Method
As described in Figure 1, the fine-grained aircraft recognition framework consists of four parts: the part detection sub-network, the feature extraction sub-network, the feature fusion sub-network, and the non-locally enhanced module inserted into the feature extraction sub-network.
The workflow is as follows: (1) we get an image (denoted as Image I) from the dataset, in which the nose of the aircraft may be oriented in any direction. We feed Image I into the part detection sub-network and get 5 key points of the aircraft; (2) we utilize the detected key points to correct the posture of the aircraft in Image I (Image I is rotated accordingly), and generate 5 part bounding boxes according to the strategy described in Section 2.2.1; (3) we feed the rotated Image I into the feature extractor and get the feature maps of the whole image; (4) we map the part bounding boxes generated in Step 2 to the feature maps generated in Step 3 and get the corresponding feature sub-maps of each part; (5) we further extract detailed features of each part and then integrate these features, as described in Section 2.2.2.
In addition, according to the long-distance correlation of aircraft image, a non-locally enhanced module is inserted into the feature extractor, which utilizes self-correlation calculation to improve the features that are beneficial to classification.

Long-distance Correlation of Aircraft in Remote Sensing Images
The remote sensing image dataset of the aircraft is collected from GoogleEarth. Because the images are acquired at different times and with different equipment, many of them are greatly affected by the external environment. Our dataset therefore reflects the real situation of aircraft in remote sensing images.
It can be seen from Figure 2 that the image of the aircraft has obvious geometric features and symmetrical structures, and that there are many redundant patterns, such as repeated engines and loads. There is therefore strong correlation between long-distance pixels, and this characteristic is of great value for aircraft recognition. At the same time, we find that some structures or details are difficult to observe because of low resolution, poor contrast, and poor sharpness (as shown in Figure 2). These structures or details may be exactly what distinguishes one aircraft type from another. If similar structures happen to exist elsewhere, the correlation between them can guide the neural network to pay more attention to these structures and enhance these effective structures and details.

Principle of Non-locally Enhanced Operation
In our processing framework, we utilize the front part of VGG19 [15] as the feature extractor. The input image size is 224 × 224 × 3; after multiple convolution and pooling operations, the extractor outputs a feature map of size 14 × 14 × 512. The convolution operation in the extractor is a local operation: the output value at each position is obtained by convolving the kernel with the local pixels. Convolution assumes that adjacent pixels are strongly correlated, while pixels that are far apart are only weakly correlated. However, there is obvious long-distance correlation in remote sensing images of aircraft, so a non-locally enhanced operation should be adopted in the aircraft recognition task.
Non-locally enhanced operation is essentially a self-attention mechanism. Its basic principle is to construct the model's long-distance dependence through a triple (key, query, value): the attention weight is obtained from the dot product between key and query, and the weight is then multiplied by the value to get the final output. It is formalized as follows:
$$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j)$$
and
$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)}$$
where $x$ and $y$ denote the input and output, $i$ and $j$ are pixel coordinates, and $j$ ranges over all coordinates in the image. $C(x)$ is a normalization constant, $f(\cdot)$ is a two-input function that constructs the correlation between point $i$ and point $j$, and $g(\cdot)$ is a single-input function that calculates the influence of point $j$ on point $i$. $x_i$ corresponds to the query in the triple, $x_j$ corresponds to the key, and $g(x_j)$ corresponds to the value, while $\theta$ and $\phi$ denote the embeddings of query and key respectively. The summation aggregates the influence of all other pixels on $x_i$ (the query). As shown in Figure 3, key1 and key2 have a high similarity to the query, while key3 and key4 have a low similarity to the query. The weight factors $f$(query, key1) and $f$(query, key2) are correspondingly high, so the output value at the coordinate of the query point is enhanced by the contribution of high-correlation pixels such as key1 and key2.
Figure 3. Schematic diagram of the correlation between query and key.
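The operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the exponential (embedded Gaussian) form of f, linear embeddings for θ, φ and g, and hypothetical weight names and dimensions.

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """Non-local operation on a flattened feature map (illustrative sketch).

    x: (N, C) array, one row per spatial position (N = H * W).
    w_theta, w_phi, w_g: (C, D) projection matrices (hypothetical names).
    """
    query = x @ w_theta                  # theta(x_i)
    key = x @ w_phi                      # phi(x_j)
    value = x @ w_g                      # g(x_j)
    logits = query @ key.T               # theta(x_i)^T phi(x_j), shape (N, N)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    f = np.exp(logits)                   # f(x_i, x_j)
    weights = f / f.sum(axis=1, keepdims=True)    # division by C(x)
    return weights @ value               # y_i = sum_j weights_ij * g(x_j)

rng = np.random.default_rng(0)
x = rng.normal(size=(14 * 14, 512))      # a 14 x 14 x 512 feature map, flattened
w_theta, w_phi, w_g = (rng.normal(scale=0.01, size=(512, 256)) for _ in range(3))
y = non_local(x, w_theta, w_phi, w_g)
print(y.shape)  # (196, 256)
```

Each output row is a weighted average of the value vectors at all positions, so positions that resemble the query contribute most, regardless of how far away they are.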

The Realization of Non-locally Enhanced Module
The left middle part of Figure 1 briefly depicts the non-locally enhanced module. The input of this module can be any feature layer in the neural network, and the output has exactly the same size as the input, so the module can be inserted at any position in the network without changing the original network. The module adopts a residual structure, which makes it easier for the gradient of the self-correlation operation to propagate through the network. Figure 4 shows the internal calculation process of the non-locally enhanced operation. Before the correlation coefficient $f(x_i, x_j)$ is calculated, three embeddings are computed, which gives the module one more chance to adjust and enables the network to enhance or suppress features more flexibly. For a large feature map, the correlation operation leads to large-scale matrix calculation. As a compromise, the feature map can be sliced, each slice can be processed separately, and the results can then be combined, as shown in Figure 5. In Section 3.5, we need to perform the non-locally enhanced operation on the feature map output by conv1_2, but the size of this feature map is 112 × 112 and the computational cost is too large. We adopt the following strategy: firstly, divide the 112 × 112 feature map into four sub-maps of size 56 × 56; secondly, perform the non-locally enhanced operation on each sub-map (refer to Figure 4), where the matrix calculation cost of each sub-map is 1/16 of that of the 112 × 112 feature map, so the total cost of the four sub-maps is reduced to a quarter; thirdly, combine the four outputs of the non-locally enhanced operation into a whole feature map of size 112 × 112.
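The cost figures quoted above follow from counting pairwise correlations. The short check below assumes that the cost is dominated by the N × N correlation matrix, where N is the number of spatial positions; the function name is hypothetical.

```python
def affinity_cost(height, width):
    """Number of pairwise correlations f(x_i, x_j) for one feature map."""
    n = height * width
    return n * n

full = affinity_cost(112, 112)
per_tile = affinity_cost(56, 56)
print(per_tile / full)        # 0.0625, i.e. 1/16 per sub-map
print(4 * per_tile / full)    # 0.25, i.e. one quarter for all four sub-maps
```

The saving comes at a price: positions in different sub-maps no longer interact, so correlations are only captured within each 56 × 56 tile.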

Part-level Feature Fusion
A general classification network directly attaches several full connection layers and a softmax layer after the convolution layers to obtain the probability of each category. In this way, all the pixels in the feature maps of the input image are treated equally. When there are large differences between classes, the network can distinguish the categories correctly, but this approach is not adequate for classification tasks with slight differences between sub-classes, because the details are easily overwhelmed by irrelevant or distracting information. We need to help the network locate the target and its parts, and then look for details in these regions. In our method, we generate the bounding boxes of the parts from key points, extract the subtle features inside each part to improve the network's ability to distinguish subtle features, and then fuse these features for classification. Aircraft are symmetrical in structure and have obvious key points in appearance, so the corresponding parts can be cropped according to the key points. Compared with part detection based on proposal boxes, the method based on key points has the obvious advantage of reducing the search space and the computation. According to the characteristics of the aircraft, five key points are designed, located at the nose, fuselage, empennage, left-wing and right-wing. We realize key point detection by adopting the method proposed in Reference [39]: we build a simple and efficient key point detection network based on ResNet50 and 3 deconvolution layers, and use the Gaussian heatmap of the key point coordinates as the supervision signal of the network.
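A Gaussian heatmap target for one key point can be generated as in the sketch below. The heatmap resolution and the standard deviation `sigma` are illustrative assumptions; the text does not state the values used.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Heatmap with a Gaussian bump centered on the key point (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

heatmap = gaussian_heatmap(56, 56, cx=20, cy=30)
# The predicted key point is read back as the location of the maximum.
peak = tuple(int(v) for v in np.unravel_index(heatmap.argmax(), heatmap.shape))
print(peak)  # (30, 20): row (y) first, then column (x)
```

With five key points, the detection network would regress five such maps, one per key point.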
According to the appearance characteristics of the aircraft, we develop a set of strategies as the cropping rules for the parts. We first correct the aircraft's posture: we take the line between the nose and the fuselage as the datum line and rotate the image so that this line is perpendicular to the X-axis.
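The posture correction can be sketched as a rotation computed from the nose and fuselage key points. The sketch assumes image coordinates with y growing downward and assumes that the nose should point upward after rotation; the function names are hypothetical.

```python
import math

def upright_angle(nose, fuselage):
    """Angle (radians) that maps the fuselage->nose vector onto 'straight up'."""
    dx = nose[0] - fuselage[0]
    dy = nose[1] - fuselage[1]
    # 'Up' is the direction (0, -1), whose atan2 angle is -pi/2.
    return -math.pi / 2 - math.atan2(dy, dx)

def rotate_point(point, center, angle):
    """Rotate a point around a center with the standard 2D rotation matrix."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

nose, fuselage = (150.0, 112.0), (112.0, 112.0)   # aircraft pointing right
angle = upright_angle(nose, fuselage)
new_nose = rotate_point(nose, fuselage, angle)
print(new_nose)  # approximately (112.0, 74.0): directly above the fuselage point
```

The same rotation would be applied to the image and to all five key points so that they stay consistent.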
After posture correction, the size of the aircraft, denoted by its width $W_{object}$ and height $H_{object}$, is calculated. For the nose and fuselage parts, we offset the key point (Nose or Fuselage) upward by 5 pixels to obtain the top boundary of the bounding box and take $W_{object}/2$ and $H_{object}/2$ as the width and height of the bounding box. For the empennage, we also take $W_{object}/2$ and $H_{object}/2$ as the width and height of the part's box. We take the x coordinate of the empennage as the X-axis midpoint of the box and the y coordinate of the empennage as a reference, and let the Y-axis interval of the box extend from $3H_{object}/8$ above the reference to $H_{object}/8$ below it. This strategy takes into account that the left and right horizontal stabilizers are not on the same horizontal line as the empennage key point and that most of the tail area is skewed towards the top, so the box contains the tail area better without adding more background.
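The empennage rule can be written out as below. The sketch assumes image coordinates with y growing downward, and it assumes the lower extent is one eighth of the object height, so that the box height equals half the object height as stated; the function name and the sample numbers are hypothetical.

```python
def empennage_box(x_emp, y_emp, w_object, h_object):
    """Return (left, top, right, bottom) of the empennage bounding box."""
    left = x_emp - w_object / 4.0           # box width is w_object / 2
    right = x_emp + w_object / 4.0
    top = y_emp - 3.0 * h_object / 8.0      # 3/8 of the height above the key point
    bottom = y_emp + h_object / 8.0         # 1/8 of the height below it
    return left, top, right, bottom

print(empennage_box(112, 180, 160, 176))  # (72.0, 114.0, 152.0, 202.0)
```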
For the bounding box of the wing, we distinguish two cases according to the angle of the wing, where the angle is the one between the line from fuselage to left-wing (or right-wing) and the line from fuselage to empennage. As shown in Figure 6a-c, when the angle is less than or equal to 60 degrees, we draw a rectangular box whose diagonal is the line connecting fuselage and left-wing (or right-wing). Figure 6d-f shows the case in which the angle is greater than 60 degrees; here we use the difference between the y-coordinates of fuselage and left-wing (or right-wing) as the height of the wing (denoted as $H_{wing}$). When cropping the wing, we reserve $H_{wing}/2$ above and below the wing, so as to capture the engines when the wing is spread horizontally.
Figure 6. Two ways to crop wings.

Feature Extraction and Feature Fusion of Parts
Firstly, the feature maps of each part are obtained from the shared feature extractor based on the part bounding boxes. On this basis, we extract the subtle features inside each part with the part full connection layer (PFC), and then fuse the features of all parts with the combined full connection layer (CFC), as shown in the right part of Figure 1.
To get the features of each part, the traditional method is to crop the original image, scale the cropped part to meet the input requirements of the feature extraction network, and then obtain the corresponding feature maps of each part through several separate neural networks. However, this method would require a large amount of calculation, specifically 5 times for 5 parts.
In our method, a shared feature extractor is adopted and the cropping operation is transferred to the feature maps, so that the feature sub-maps of the five parts are obtained directly. After multiple convolution and pooling operations, the original image of size 224 × 224 is transformed into a feature map of size 14 × 14, a scaling ratio of 16:1. The coordinates of the part boxes are mapped to the feature map according to this ratio, and the five part boxes are then cropped from the shared feature map.
Different parts have different sizes and aspect ratios according to the cropping strategy. To facilitate the subsequent feature fusion, the feature maps of the cropped parts must be aligned to the same size. The size range of the cropped parts can be expressed as $P_{size} = m \times n$, $2 < m < 7$, $2 < n < 7$. We resize these different sizes to 6 × 6 and use bilinear interpolation to obtain the value of each pixel after resizing. Bilinear interpolation avoids the precision loss caused by rounding coordinates and retains the precision of the features before resizing.
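The mapping and resizing steps can be sketched together as follows. This is an illustrative version only: it assumes a (H, W, C) feature map, a stride of 16, and evenly spaced sampling points across the box, with the function name being hypothetical.

```python
import numpy as np

def crop_and_resize(fmap, box, stride=16, out_size=6):
    """Crop an image-space box from a (H, W, C) feature map and resize it.

    box: (left, top, right, bottom) in input-image pixels. Values are sampled
    bilinearly at fractional feature-map coordinates, so no box coordinate is
    rounded before interpolation.
    """
    h, w, _ = fmap.shape
    left, top, right, bottom = (v / stride for v in box)
    # Evenly spaced sampling positions, clamped to the valid index range.
    xs = np.clip(np.linspace(left, right, out_size), 0, w - 1)
    ys = np.clip(np.linspace(top, bottom, out_size), 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (xs - x0)[None, :, None]        # horizontal interpolation weights
    wy = (ys - y0)[:, None, None]        # vertical interpolation weights
    upper = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    lower = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return upper * (1 - wy) + lower * wy

rng = np.random.default_rng(0)
fmap = rng.normal(size=(14, 14, 512))
part = crop_and_resize(fmap, (72, 114, 152, 202))
print(part.shape)  # (6, 6, 512)
```

Because every part ends up as a 6 × 6 × 512 tensor, all parts can feed full connection layers of identical input size.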
We design a part full connection layer (PFC), built from two full connection layers placed after the feature maps of each part, to further extract the details inside the part. It can be formulated as
$$y_i = f(W_i x_i)$$
where $x_i$ represents the input feature of part $i$, $y_i$ represents the output feature of part $i$ in the PFC layer, $W_i$ represents the weight parameters, and $f$ represents the nonlinear activation function. Note that the $W_i$ of different parts are not shared, because we need the network to extract the characteristics of different parts separately. The PFC layer discards the irrelevant areas and focuses on the features inside the parts. Each part has its own fully connected parameter matrix, which allows the network to learn each part individually and to focus on the subtle features inside the part that deserve attention.
After the PFC layer, we adopt the combined full connection layer (CFC) to integrate the information of all parts. The outputs of the full connection layers of all parts have the same size, so they can be concatenated as the input of the CFC layer. The CFC is connected to every node in the PFC, so it can model the internal relationships between different parts. It can be formulated as
$$y = f_c\left(W_c \sum_{i} f_p(W_i x_i)\right)$$
where $x_i$ represents the input features of part $i$, $W_i$ represents the weight parameters of part $i$, $W_c$ represents the weight parameters of the CFC layer, $f_c$ and $f_p$ respectively represent the activation functions of the CFC layer and the PFC layer, $\sum$ represents the concat operation over the parts, and $y$ represents the final output feature vector of the CFC layer.
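A forward pass through the PFC and CFC layers can be sketched as below. The sketch makes several assumptions that are not from the text: ReLU as the activation for both layers, one weight matrix per part as in the formula (rather than two stacked layers), no bias terms, and small illustrative dimensions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pfc_cfc_forward(parts, part_weights, w_c):
    """parts: flattened part features x_i; part_weights: one W_i per part
    (not shared between parts); w_c: weight matrix of the CFC layer."""
    part_outputs = [relu(w_i @ x_i) for w_i, x_i in zip(part_weights, parts)]
    fused = np.concatenate(part_outputs)     # concat operation over the parts
    return relu(w_c @ fused)                 # output feature vector of the CFC

rng = np.random.default_rng(0)
in_dim, part_dim, out_dim = 6 * 6 * 8, 32, 64    # illustrative sizes only
parts = [rng.normal(size=in_dim) for _ in range(5)]
part_weights = [rng.normal(scale=0.1, size=(part_dim, in_dim)) for _ in range(5)]
w_c = rng.normal(scale=0.1, size=(out_dim, 5 * part_dim))
y = pfc_cfc_forward(parts, part_weights, w_c)
print(y.shape)  # (64,)
```

Keeping a separate `W_i` per part is what lets each part adapt independently while the convolutional features underneath remain shared.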

Hard Example Mining
As shown in Figure 7, in the fine-grained recognition task of the aircraft, there are many hard examples that the neural network finds difficult to recognize accurately. There are three main reasons: firstly, the differences between some categories are subtle; secondly, adverse factors such as poor contrast and sharpness in remote sensing images may cause images of different classes to look the same; thirdly, as some labels come from the Internet, inaccurate labels may exist in the dataset. In general, we use cross-entropy as the classification loss function, which reflects the degree of difference between the predicted value and the ground truth. The formula is as follows:
$$L_{CE} = -\sum_{i=1}^{n} p_{gt}(i) \log p_{pred}(i)$$
where $p_{gt}$ is the ground truth, $p_{pred}$ is the predicted value, and $n$ is the number of categories. The focal loss reweights each example as follows:
$$L_{FL} = -\alpha \sum_{i=1}^{n} p_{gt}(i)\,(1 - p_{gt}(i)\, p_{pred}(i))^{\gamma} \log p_{pred}(i) \tag{10}$$
where $\alpha$ and $\gamma$ are hyperparameters. The focal loss thus has two hyperparameters that require careful adjustment. Moreover, it only adjusts the weight according to the difference between the predicted value and the ground truth; it does not reflect the proportion of hard examples in the mini-batch and does not adapt to changes in the data distribution.
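The effect of the reweighting can be seen in a small numerical comparison. The sketch assumes the usual focal loss form with a scaling factor `alpha` and a focusing exponent `gamma`; the default values and the sample probabilities are illustrative and not taken from the paper.

```python
import math

def cross_entropy(p_gt, p_pred):
    """Cross-entropy for a one-hot ground truth and a predicted distribution."""
    return -sum(g * math.log(p) for g, p in zip(p_gt, p_pred) if g > 0)

def focal_loss(p_gt, p_pred, alpha=0.25, gamma=2.0):
    """Cross-entropy scaled by alpha * (1 - p_gt * p_pred) ** gamma."""
    return -sum(alpha * g * (1 - g * p) ** gamma * math.log(p)
                for g, p in zip(p_gt, p_pred) if g > 0)

p_gt = [0.0, 1.0, 0.0]
easy = [0.05, 0.90, 0.05]     # confident and correct prediction
hard = [0.50, 0.30, 0.20]     # wrong prediction

ce_ratio = cross_entropy(p_gt, hard) / cross_entropy(p_gt, easy)
fl_ratio = focal_loss(p_gt, hard) / focal_loss(p_gt, easy)
print(ce_ratio < fl_ratio)  # True: the hard example dominates more under focal loss
```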
In Reference [40], a gradient harmonizing mechanism (GHM) is proposed to counter the imbalance between examples. This method uses the distribution of the gradient norm to reflect the imbalance between hard and easy examples, and the gradient norm is defined as
$$g = |p_{gt} - p_{pred}| \tag{11}$$
which represents the difference between the predicted value and the ground truth, with a value range between 0 and 1. The gradient density function is then used to represent the distribution of the gradient norm:
$$GD(g) = \frac{1}{l_{\epsilon}(g)} \sum_{k=1}^{N} \delta_{\epsilon}(g_k, g)$$
where $N$ is the total number of examples, $g_k$ is the gradient norm of the $k$-th example, $\epsilon$ denotes a neighborhood of $g$, $\delta_{\epsilon}(g_k, g)$ indicates whether $g_k$ lies in this neighborhood, and $l_{\epsilon}(g)$ is the length of the neighborhood.
As training iterates, a large number of gradient norms $g$ become concentrated near zero, leaving only a small number of gradient norms away from zero, which correspond to hard examples. Although the contribution of an easy example to the gradient is smaller than that of a hard example, the total contribution of a large number of easy examples can exceed the contribution of a small number of hard examples, and the training process becomes inefficient. On the other hand, when the network converges, there may be some excessively hard examples whose $g$ value is relatively large, and the density of these $g$ values is slightly higher than that of the normal hard examples (because, as training iterates, the $g$ values of the normal hard examples move towards the neighborhood of zero). These excessively hard examples can be considered outliers because they persist even when the model converges. Because the gradient of outliers may be quite different from that of other common examples, it may affect the stability of the model. If excessive attention is paid to these abnormal examples, as focal loss does, the parameter adjustment will be too large and these outliers will be over-fitted, while the fit to the other, normal examples will be damaged.
Based on the above analysis, the reciprocal of the gradient density can be used as the loss weight factor of the corresponding example, which can be formulated as follows:
$$\beta_i = \frac{N}{GD(g_i)}$$
$$L_{GHM-C} = \frac{1}{N} \sum_{i=1}^{N} \beta_i L_{CE}(i)$$
where $\beta_i$ is the weight of the $i$-th example, $L_{CE}(i)$ is its cross-entropy loss, and $N$ is the total number of examples, which plays the role of normalization. The loss function $L_{GHM-C}$ not only enhances the influence of hard examples but also restrains the influence of outliers on the loss function. Experimental results show that this loss function ensures the stability of the model and leads to a better classification result.
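The weighting can be illustrated on a toy batch. The sketch below assumes a neighborhood of width `eps` centered on g and clipped to [0, 1]; the value of `eps`, the function names, and the sample gradient norms are illustrative assumptions.

```python
def gradient_density(g, norms, eps=0.1):
    """Number of gradient norms near g, divided by the neighborhood length."""
    lo, hi = g - eps / 2, g + eps / 2
    count = sum(1 for g_k in norms if lo <= g_k < hi)
    length = min(hi, 1.0) - max(lo, 0.0)
    return count / length

def ghm_weights(norms, eps=0.1):
    """Weight of each example: N divided by the gradient density at its norm."""
    n = len(norms)
    return [n / gradient_density(g, norms, eps) for g in norms]

def ghm_c_loss(ce_losses, norms, eps=0.1):
    """Average of the per-example cross-entropy losses, scaled by the weights."""
    weights = ghm_weights(norms, eps)
    return sum(w * l for w, l in zip(weights, ce_losses)) / len(norms)

# 90 easy examples, 4 ordinary hard examples, 6 outliers with g close to 1.
norms = [0.02] * 90 + [0.5] * 4 + [0.98] * 6
weights = ghm_weights(norms)
easy_w, hard_w, outlier_w = weights[0], weights[90], weights[-1]
print(easy_w < outlier_w < hard_w)  # True
```

In this toy batch the densely populated easy region is down-weighted the most, the cluster of outliers is down-weighted relative to the sparse hard examples, and the hard examples receive the largest weight.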

Dataset
Currently, major public remote sensing image datasets [41,42] contain only a few categories with large interclass variance, such as ship, tank, harbor, plane, forest, and building. The categories in these datasets are easy to distinguish from one another, so the datasets are suitable only for common classification problems and cannot be used to study the fine-grained classification of aircraft.
To study the fine-grained recognition of aircraft, we collected an aircraft dataset from GoogleEarth. We classify the aircraft according to their specific type, rather than roughly dividing them into, for example, passenger aircraft, transport aircraft, and training aircraft. The dataset includes 47 types of aircraft, which, as far as we know, makes it the dataset with the most aircraft categories in remote sensing images. Each type of aircraft has about 17 images, and each image is scaled to the size of 224 × 224. Sixty percent of the images are used as the training set and forty percent as the test set, and the images in the test set never appear in the training set. We use Labelme (a database and web-based tool for image annotation described in Reference [43]) to mark the type information and key points of each image. Each original image is marked with five key points, which are located on the nose, fuselage, tail, left-wing, and right-wing. Due to the small number of images, the original images need to be augmented to improve the generalization performance of the model. We mirror the images and also apply random translations in the upper, lower, left, and right directions. During data augmentation, the coordinates of the key points are transformed accordingly so that the key points remain correct after augmentation. The dataset will be made public in the future; please contact the corresponding author about its availability.
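The coordinate transformation of key points during augmentation can be illustrated with a short sketch. It is not the authors' code. It assumes that an image is a list of pixel rows, that key points are stored as (x, y) pixel coordinates in a dictionary, and that the left-wing and right-wing labels are swapped after mirroring so that they still match the aircraft's geometry. The swap and all names here are assumptions.

```python
def mirror(image, keypoints, swap_wings=True):
    """Flip an image left to right and move its key points accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    moved = {name: (width - 1 - x, y) for name, (x, y) in keypoints.items()}
    if swap_wings:
        # Assumption: after mirroring, the wing labels exchange their roles.
        moved["left_wing"], moved["right_wing"] = (
            moved["right_wing"],
            moved["left_wing"],
        )
    return flipped, moved


def translate(image, keypoints, dx, dy, fill=0):
    """Shift an image by (dx, dy) pixels and shift its key points with it."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                shifted[ny][nx] = image[y][x]
    moved = {name: (x + dx, y + dy) for name, (x, y) in keypoints.items()}
    return shifted, moved
```

A random translation would call `translate` with offsets drawn at random; the offset range is not stated in the text, so it is left to the caller here.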

Implementation Details
The NLFFNet network is built on TensorFlow 1.10 and is trained and tested on Ubuntu 16.04 with an NVIDIA 1080Ti GPU which has 12 GB of memory.
We use layers conv1_1 to conv5_4 of the VGG19 network as the feature extractor. Due to the small amount of data, training from scratch with randomly initialized parameters is likely to cause over-fitting. Therefore, we use parameters pre-trained on ImageNet as the initial values of the feature extractor, and the other network parameters use Xavier initialization. We train the network using the SGD optimizer with a mini-batch size of 32 and evaluate the performance of the model with top-1 accuracy as the metric. Cross-entropy is adopted as the baseline loss function, and we compare it with the focal loss and the GHM-C loss.
The learning rate is a very important parameter in the training process. If it is too large, the network is prone to gradient explosion or may not reach the optimal classification result. If it is too small, the network is optimized too slowly. In setting the learning rate, we adopt two basic strategies: (1) Different learning rates are adopted for different parameters. Parameters without pre-training have higher learning rates, while the parameters of the feature extractor, which are loaded from the ImageNet pre-trained model, have a lower learning rate. (2) The cyclical learning rate strategy proposed in Reference [44] is adopted to improve the convergence speed without decreasing the classification accuracy.
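The two strategies can be combined as in the following sketch. It assumes a triangular schedule, which is one common form of cyclical learning rate; the text does not state which form, bounds, or cycle length were used, so every numeric value and name below is hypothetical.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Rise linearly from base_lr to max_lr over half_cycle steps, then fall back."""
    position = step % (2 * half_cycle)
    if position <= half_cycle:
        fraction = position / half_cycle
    else:
        fraction = 2 - position / half_cycle
    return base_lr + (max_lr - base_lr) * fraction


def group_learning_rates(step, half_cycle=100):
    """Per-group learning rates; the bounds are hypothetical placeholders."""
    return {
        # Pre-trained feature extractor: lower learning rate.
        "feature_extractor": triangular_lr(step, 1e-5, 1e-4, half_cycle),
        # Newly initialized layers: higher learning rate.
        "new_layers": triangular_lr(step, 1e-4, 1e-3, half_cycle),
    }
```

At every step the newly initialized layers receive a larger learning rate than the pre-trained feature extractor, and both rates follow the same cycle.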

The Results of the Proposed Method
To evaluate our proposed method, we conduct a series of comparative experiments with other methods. In addition, to observe the separate impact of the non-locally enhanced module and the part-level feature fusion (PFF) method on the classification results, we conduct an ablation experiment.
First, we train some classic CNN networks, including AlexNet, VGG, and ResNet. Due to the small amount of data, we fine-tune pre-trained models to ensure rapid convergence. The results in Table 1 show that these fine-tuned classic networks can already extract a lot of useful information from remote sensing images of aircraft, and effective classification results can be obtained with simple training. In addition, we compare with the image segmentation method in Reference [20]. We first train an aircraft segmentation model, then align the direction and recognize the type of the aircraft. The benefit of the approach in Reference [20] is clear: it separates the aircraft from the background, reducing the impact of clutter on the ground, and Table 1 shows that this method performs better than ordinary CNN networks.
Our proposed method has two main improvements: first, we conduct part localization and part-level feature fusion according to the geometric features of the aircraft; second, we insert a non-locally enhanced module into the feature extractor to enhance the features beneficial to classification. To check how much of a performance boost these two improvements bring to the network, we conduct an ablation experiment. First, we add a non-locally enhanced module to the feature extractor. According to our experiments, it is better to put this module in a shallow layer of the network (between conv2_2 and conv3_1). Therefore, in Table 1, we record the classification accuracy when the non-locally enhanced module is inserted between conv2_2 and conv3_1 of the feature extractor. In addition, we build a part-level feature fusion network based on key points without a non-locally enhanced module, to observe the classification improvement brought by the feature fusion method alone. Finally, the non-locally enhanced operation and the feature fusion method are assembled to build a complete fine-grained aircraft classification network called NLFFNet.
Because the proposed method uses an additional subnet for key point detection and for feature fusion of the selected parts, in addition to the baseline feature extractor, it needs more computational resources. We use inference time as the measure of computational cost and evaluate the inference time of the networks on an NVIDIA 1080Ti GPU. The inference time of the baseline feature extractor is 0.11 s, while that of our proposed method is 0.48 s, so our method has a higher computational cost (roughly 4.4 times that of the baseline).
The comparison of the accuracy of each method is shown in Table 1, where "Extractor" refers to the feature extractor, "non-local" refers to the non-locally enhanced module, and "PFF" refers to the part-level feature fusion mechanism.
Table 1. Comparison results of the proposed method.

The Influence of PFF
To observe the influence of part feature fusion on the network, we conduct a heatmap experiment following Reference [45]. The steps are as follows:
(1) First, we select the feature maps we are interested in, such as the feature maps obtained after the last convolution (with a size of 14 × 14 × 512; the following steps assume this size).
(2) The 512 feature maps influence the softmax layer to different degrees, and the weight of each feature map can be calculated by backpropagation. We select the node with the largest softmax value (corresponding to the category with the highest confidence), calculate its gradient with respect to the selected feature maps by backpropagation, and take the mean value of the gradient of each feature map as the weight of that feature map.
(3) We multiply each feature map by its weight to get weighted feature maps with a shape of 14 × 14 × 512, calculate the mean value along the third dimension to get a map with a shape of 14 × 14, and then perform ReLU activation and normalization.
(4) The heatmap is resized to the shape of the original image, and a weighted sum of the heatmap and the original image is calculated, which is convenient for observation and analysis.
The heatmap reflects the importance of each image region to the classification result; in other words, it shows which regions the network mainly attends to when it produces a given classification result.
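Steps (2) and (3) can be sketched in a few lines. This is a simplified illustration rather than the code used for the figures: it assumes that the gradients of the top softmax score with respect to the feature maps have already been obtained from the deep learning framework, that arrays are nested lists indexed as [row][column][channel], and that normalization means dividing by the maximum value. The function name is illustrative.

```python
def grad_cam(feature_maps, gradients):
    """Build a normalized heatmap from feature maps and their gradients."""
    height = len(feature_maps)
    width = len(feature_maps[0])
    channels = len(feature_maps[0][0])

    # Step (2): the weight of a channel is the spatial mean of its gradient.
    weights = [
        sum(gradients[y][x][c] for y in range(height) for x in range(width))
        / (height * width)
        for c in range(channels)
    ]

    # Step (3): weight each map, average over channels, then apply ReLU.
    cam = [
        [
            max(
                0.0,
                sum(weights[c] * feature_maps[y][x][c] for c in range(channels))
                / channels,
            )
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Assumed normalization: scale so that the largest value becomes 1.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

For the 14 × 14 × 512 feature maps discussed above, the result is a 14 × 14 map that would then be resized to the input resolution as in step (4).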
Figure 8e-i show the heatmaps of the nose, fuselage, empennage, left-wing, and right-wing, respectively. The heatmaps of the parts are then assembled according to their positions in the original image to obtain the concatenated heatmap shown in Figure 8c. Based on the concatenated heatmap, a threshold is set, and heat values higher than the threshold are truncated. Then, all heat values are normalized, and the area beyond the threshold is displayed in a unique brown color, so that the background color of the original image is not confused with the color of the heatmap, as shown in Figure 8d. It should be noted that, to facilitate observation, the heatmap is superimposed on the original image. During the superimposition, the pixel values of the original image are multiplied by a weighting factor of 0.4, while the heatmap is multiplied by a weighting factor of 0.6. In addition, the heatmaps of the five parts overlap in a few areas when stitched, so the values in the overlapped areas are larger than the real values. The visualization of the heatmap is generated automatically by calling the applyColorMap function in OpenCV, which produces a smooth color scheme based on the maximum and minimum values. As a result, the color of each part (as shown in (e)-(i)) is not completely consistent with the color of the corresponding part in sub-diagram (c). The aircraft in Figure 8 is misidentified before the part-feature fusion method is adopted and is correctly recognized after it is used. By comparing (b) and (c) in Figure 8, it can be seen that the part-box excludes the interference in the upper right corner.
In addition, by comparing (h) and box b2 in (b), it can be seen that the network pays more attention to the surface of the wing without the part-box, while the focused area clearly shifts to the payload in front of the wing when the part-box is added, so the network can extract more detailed information conducive to classification. Similarly, by comparing (i) and box b3 in (b), it can be seen that the payload of the right-wing receives little attention without the part-box, and the attention paid to the right-wing and its payload increases clearly after the part-box is added.
The aircraft in Figure 9 is misidentified before the part-feature fusion method is adopted and is correctly recognized after it is used. By comparing (b) and (c) in Figure 9, it can be seen that the part-box excludes the interference on the left side from the attention of the network. In addition, it can be seen from (a) that the left engine and the payload under the left-wing are not obvious due to poor contrast and sharpness. As shown in box b2 of sub-diagram (b), the network does not pay attention to the left engine and the payload under the left-wing when there is no part-box, but in sub-diagram (h), the network clearly pays attention to these details.
From the analysis above, the part-level feature fusion appears to help in two ways: (1) All part-boxes stitched together form a mask of the target, which segments the target from the whole image and makes the network focus on the target itself, without interference from irrelevant objects and background outside the target. (2) The part full connection layer (PFC) allows the network to learn the details inside each part and to better distinguish the nuances between the subclasses.
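The post-processing used for the figures in this subsection (truncation at a threshold, normalization, and superimposition with weights 0.4 and 0.6) can be sketched as follows. The sketch works on single-channel values for brevity and assumes min-max normalization. The color mapping is omitted, and the function names are illustrative.

```python
def truncate_and_normalize(heatmap, threshold):
    """Clip heat values at the threshold, then rescale them to [0, 1]."""
    clipped = [[min(value, threshold) for value in row] for row in heatmap]
    lowest = min(min(row) for row in clipped)
    highest = max(max(row) for row in clipped)
    span = highest - lowest
    if span == 0:
        return [[0.0 for _ in row] for row in clipped]
    # Assumption: min-max normalization.
    return [[(value - lowest) / span for value in row] for row in clipped]


def superimpose(image, heat, image_weight=0.4, heat_weight=0.6):
    """Blend an image with a heat layer of the same shape and value range."""
    return [
        [image_weight * p + heat_weight * h for p, h in zip(image_row, heat_row)]
        for image_row, heat_row in zip(image, heat)
    ]
```

With OpenCV, the same blend would typically be applied to the color-mapped heatmap and the original image, both as three-channel arrays.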

The Influence of Non-locally Enhanced Operation
We insert the non-locally enhanced module at different positions of the feature extractor, train each variant separately, and find that placing it after conv2_2 gives the best result, as shown in Table 2. To observe the influence of the non-locally enhanced operation, we also draw heatmaps based on the principle of Grad-CAM [45]. Our non-locally enhanced module is inserted between conv2_2 and conv3_1, so we draw the heatmaps corresponding to conv2_2 and conv3_1. We also draw the heatmap corresponding to conv5_4 to observe the influence of the non-locally enhanced operation on the final output of the feature extractor.
The aircraft Type41 in Figure 10 is misidentified without the non-locally enhanced module and is correctly recognized after the module is added. In Figure 10, the comparison of (b) and (f), as well as the comparison of (c) and (g), shows that the non-locally enhanced module suppresses the interference of irrelevant objects on the ground. According to the comparison of the green boxes in (d) and (h), the non-locally enhanced module makes the network pay more attention to the canards of Type41 (canards are small delta wings on either side of the cockpit), which are a significant feature that distinguishes Type41 from Type40 (this image happened to be misidentified as Type40 without the non-locally enhanced module).
Figure 10. Heatmaps of aircraft Type41 before and after adding a non-locally enhanced module. The first row shows the results without the non-locally enhanced module, and the second row shows the results after inserting this module. From left to right are the original image and the heatmaps of conv2_2, conv3_1, and conv5_4, respectively.
The aircraft Type40 in Figure 11 is misidentified without the non-locally enhanced module and is correctly recognized after the module is added. The comparison of (b) and (f) in Figure 11 shows that this module suppresses the interference caused by patches on the ground. The comparison of (c) and (g) in Figure 11 shows that the module also suppresses the interference brought by the ground line.
Figure 11. Heatmaps of aircraft Type40 before and after adding a non-locally enhanced module. The first row shows the results without the non-locally enhanced module, and the second row shows the results after inserting this module. From left to right are the original image and the heatmaps of conv2_2, conv3_1, and conv5_4, respectively.
A comprehensive comparison of Figures 10 and 11 shows that: (1) With the addition of a non-locally enhanced module, the focused area on the conv3_1 heatmap is more accurate and concentrated than without the module, indicating that the non-locally enhanced operation can guide the network to focus on effective details and ignore useless features. (2) With the addition of a non-locally enhanced module, the heatmap of conv2_2 changes significantly compared with that without the module, indicating that all parameters in the neural network are interrelated, and the non-locally enhanced module influences not only the subsequent feature maps but also the feature maps before the module. (3) In the high-level semantic feature maps of conv5_4, the heatmap with the non-locally enhanced module is significantly more focused on the aircraft itself and rarely diffuses to irrelevant areas such as the ground, indicating that the effect of the non-locally enhanced module in shallow layers is effectively transferred to the high-level semantics and improves the final representation and classification ability of the feature extractor.

The Comparative Experiment of Loss Functions
In the training process of NLFFNet, we take cross-entropy as the baseline and compare three different loss functions, aiming at online mining of hard examples by the loss function. The comparison results are shown in Table 3.
According to the formula of focal loss, there are two hyperparameters that can be adjusted. γ is a scalar, which is easy to adjust, while α is a vector of length 47 whose entries correspond to the difficulty of the 47 types of aircraft. Setting α carefully depends heavily on manual experience, so it is very difficult to determine appropriate values. Therefore, only γ is adjusted in this experiment, and α is set to a vector of all ones.
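For reference, a minimal sketch of the focal loss for one example is given below, written in its standard form with a per-class weight α and a focusing parameter γ. The default γ of 2.0 is only a placeholder, since the values tried in this experiment are not listed here, and passing no α corresponds to the all-ones setting described above.

```python
import math


def focal_loss(probs, label, gamma=2.0, alpha=None):
    """Focal loss of one example given its softmax output and true label."""
    p_t = max(probs[label], 1e-12)  # guard against log(0)
    # alpha=None mimics a vector of all ones.
    alpha_t = 1.0 if alpha is None else alpha[label]
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α of all ones, the expression reduces to the ordinary cross-entropy; larger γ shrinks the loss of well-classified examples more strongly than that of hard ones.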
According to the formula of the GHM-C loss, we must calculate the density of the gradient norm. Because the gradient norm g is continuously distributed between 0 and 1, we first discretize it. In our implementation, we set $\epsilon = 0.05$, divide the range from 0 to 1 into 20 intervals, and count the number of g values in each interval, thus generating 20 density values $(GD_1, GD_2, \ldots, GD_j, \ldots, GD_{20})$. If the value of $g_i$ falls in the k-th interval, we approximate $GD(g_i) = GD_k$.
By observing the dataset carefully, we found a suspicious image of Type1 in the training and test sets, which may be labeled incorrectly, as shown in Figures 12-14. In Figure 13a, the image is recognized as Type1 when cross-entropy is applied and as Type27 when the GHM-C loss is applied, which indicates that the GHM-C loss regards it as an outlier and suppresses its contribution to the loss function.
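The discretization of the gradient norm described above for the GHM-C loss can be sketched as follows. The sketch assumes that the density of an interval is its count divided by the interval length (that is, the count multiplied by the number of intervals) and that g = 1 is assigned to the last interval. The names are illustrative.

```python
def binned_gradient_density(norms, num_bins=20):
    """Approximate GD(g_i) by the density of the interval that contains g_i."""
    counts = [0] * num_bins
    indices = []
    for g in norms:
        # Interval index of g; g = 1.0 is placed in the last interval.
        k = min(int(g * num_bins), num_bins - 1)
        counts[k] += 1
        indices.append(k)
    # Count divided by the interval length 1 / num_bins.
    return [counts[k] * num_bins for k in indices]


def binned_ghm_weights(norms, num_bins=20):
    """Loss weights N / GD(g_i) based on the binned density."""
    n = len(norms)
    return [n / d for d in binned_gradient_density(norms, num_bins)]
```

With 20 intervals, each interval has length 0.05, which matches the setting described above; each example then looks up the density of its own interval instead of counting neighbors individually.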

Discussion
General image classification methods based on deep learning usually first obtain the feature maps of the original image as a whole through a CNN and then classify the image according to the feature maps. However, for the aircraft classification task studied in this paper, the sub-categories are highly similar, so the differences in the corresponding feature maps are very subtle and the general methods are not adequate. We try to solve this problem from two aspects: one is to enhance the structures and details beneficial to classification through the non-locally enhanced operation; the other is to locate and discover the details through feature extraction and feature fusion of parts. Alternatively, we can try to map the original image into a particular feature space in which the similarity between examples of the same category increases (or the distance between them decreases), while the similarity between examples of different categories decreases (or the distance between them increases), so as to improve the clustering performance and the discriminative ability of the classifier. This is exactly the approach of metric learning.
Deep metric learning combines the feature representation ability of deep learning with the similarity characterization ability of metric learning and realizes the mapping from original input to semantic output in an end-to-end manner, and it has made important progress in several visual tasks. General deep metric learning includes two aspects: the first is encoding the original data into feature vectors with a neural network; the second is using a loss function to compare the similarity of a group of feature vectors. Classical metric learning loss functions include Triplet loss, N-pair loss, and Angular loss, which are used to distinguish examples with small differences, such as those in face datasets. In the future, we will use deep metric learning methods to study the classification of aircraft in remote sensing images and further improve the discrimination ability of the network.

Conclusions
In this paper, a non-locally enhanced feature fusion network is designed for a remote sensing image dataset with 47 categories of aircraft. First, we insert a non-locally enhanced module into the feature extractor to utilize global information and overcome the limitation of the CNN's receptive field, guide the network to focus on discriminative regions, and enhance features beneficial to classification. Second, we crop 5 aircraft parts on the shared feature maps based on key points, then extract and fuse the features of these parts through the part full connection layer (PFC) and the combined full connection layer (CFC), which extracts the subtle features inside the parts and also acts like a mask of the aircraft, excluding background interference from the network's attention. In the experiments, we analyze in detail the influence of the non-locally enhanced operation and the part-feature fusion method using heatmaps and verify the improvement brought by our method through comparative experiments. On top of the combination of the non-locally enhanced operation and part-feature fusion, an improved loss function is introduced to mine hard examples online. On this challenging dataset, our method finally achieves an accuracy of 89.12%.