FGATR-Net: Automatic Network Architecture Design for Fine-Grained Aircraft Type Recognition in Remote Sensing Images

Abstract: Fine-grained aircraft type recognition in remote sensing images, which aims to distinguish different types within the same parent category of aircraft, is quite a significant task. In recent decades, with the development of deep learning, the solution scheme for this problem has shifted from handcrafted feature design to model architecture design. Although great progress has been achieved, this paradigm generally requires strong expert knowledge and rich expert experience. It is still extremely laborious work, and the level of automation is relatively low. In this paper, inspired by Neural Architecture Search (NAS), we explore a novel differentiable automatic architecture design framework for fine-grained aircraft type recognition in remote sensing images. In our framework, the search process is divided into several phases. The network architecture deepens at each phase while the number of candidate functions gradually decreases. To achieve this, we adopt different pruning strategies. Then, the network architecture is determined through a potentiality judgment after an architecture heating process. This approach can not only search deeper networks but also reduce the computational complexity, especially for the relatively large size of remote sensing images. When all differentiable search phases are finished, the searched model, called Fine-Grained Aircraft Type Recognition Net (FGATR-Net), is obtained. Compared with previous NAS methods, ours is more suitable for relatively large and complex remote sensing images. Experiments on Multitype Aircraft Remote Sensing Images (MTARSI) and Aircraft17 validate that FGATR-Net possesses a strong capability of feature extraction and feature representation. Besides, it is also compact, i.e., its parameter quantity is relatively small. This powerfully indicates the feasibility and effectiveness of the proposed automatic network architecture design method.


Introduction
With the great progress of remote sensing imaging, there have been significant improvements in remote sensing images both in quantity and quality, which effectively propels the development of remote sensing. Evolutionary Algorithm (EA), one popular NAS search strategy, draws inspiration from natural selection in biology: genotypes with poor performance are eliminated immediately, while those performing well are retained, and crossover and mutation are then conducted among them. Different from RL and EA, which both adopt a discrete optimization procedure, the Gradient-Based method utilizes a novel relaxation programme to transform the search space into a continuous one. By means of this technique, the architecture search process becomes differentiable, and the gradient descent algorithm can be applied to it. Moreover, the computational consumption is relatively low compared with the former two search strategies. For these reasons, the Gradient-Based method is becoming widely popular. As for the application of NAS in remote sensing, some pioneers have already made explorations. Chen et al. [25] leverage the Gradient-Based method to deal with the HyperSpectral Image (HSI) classification task. Zhang et al. [26] propose an efficient search method to achieve an automatic network design procedure for semantic segmentation in high-resolution remote sensing images, which transfers NAS to a more advanced visual task. In this paper, we make a first attempt at a novel differentiable automatic network architecture construction framework to achieve another shift in design pattern for fine-grained aircraft type recognition in remote sensing images. The difference between the previous NAS framework for natural scene images and ours is illustrated in Figure 2. To be specific, we utilize a differentiable approach in which architecture parameters and weight parameters are optimized by turns during the search process.
Due to the relatively large size of remote sensing images (32 × 32 pixels in CIFAR-10 [27], versus 156 × 156 in Aircraft17 [14] and 256 × 256 in MTARSI [28]), we search the network architecture in a growing way to avoid the CUDA out-of-memory trouble caused by a direct search. Meanwhile, we also utilize pruning to cut off some weak connections at each search phase. Then, a potentiality judgment decides the network architecture after an architecture heating process. When all the search phases are finished, the Fine-Grained Aircraft Type Recognition Net (FGATR-Net) is obtained. Ultimately, we train FGATR-Net on the target dataset from scratch. This design procedure is automatic and does not heavily rely on expert knowledge and artificial experience. In addition, a series of experiments shows that FGATR-Net outperforms all baseline models on the fine-grained aircraft type recognition task, not only in accuracy but also in lightweight design, especially in comparison with the well-behaved EfficientNet [29], which was also obtained by a search approach. This strongly indicates that the proposed framework is a feasible and effective method. In summary, the main contributions of this paper are listed as follows:
1. A differentiable automatic network architecture design paradigm for fine-grained recognition in remote sensing images is explored, to the best of our knowledge, for the first time.
2. Considering the relatively large size of remote sensing images, the network architecture deepens gradually during the search process. Meanwhile, some unimportant edges are removed through different pruning strategies as the number of network layers increases, making the network more compact.
3. In order to discriminate which architecture has more potential, we adopt a potentiality judgment to determine the network architecture after an architecture heating process.
4. Experimental results on two challenging fine-grained aircraft type recognition datasets show that FGATR-Net is able to achieve the highest accuracy with much fewer parameters. This strongly confirms the feasibility and effectiveness of the proposed method.

The rest of this article is organized as follows. Section 2 describes the proposed automatic architecture design framework in detail. In Section 3, the datasets, evaluation metrics and implementation details are stated. The experimental results that demonstrate the feasibility and effectiveness of our method are provided in Section 4. Subsequently, we discuss the proposed method in Section 5. Ultimately, conclusions are drawn in Section 6, together with the plan for future work.

Methodology
The overview of our search framework can be seen in Figure 3. The role of the stem is to expand the number of channels and downsample the input remote sensing images. As the depth of the architecture grows gradually (here, m < n), some unimportant edges (indicated by dotted lines) are pruned. Potentiality judgment gives a preliminary evaluation of two kinds of pruning strategies (i.e., the greedy strategy and the ε-greedy strategy). In the last search phase, we adopt only the greedy strategy, for there are only very few options left. After obtaining the ultimate architecture, we train it on the target dataset from scratch.

Differentiable Automatic Network Architecture Design
Popular Convolutional Neural Network (CNN) architectures generally contain many blocks, which consist of some common components such as convolution and pooling. A component can be seen as a mathematical function f that maps a feature map from one high-dimensional space into another. In the NAS framework, we collect some candidate functions and organize them into a search space F. The purpose is to find the optimal combination and connection relationship of these functions within a block. Then, these blocks are stacked layer by layer to construct the whole network. Different from RL-based [18,19] or EA-based [20,21] methods, the differentiable approach is more efficient and easier to implement.

Block Representation as a DAG
We utilize a Directed Acyclic Graph (DAG) composed of M ordered nodes to represent a block. Each node in a block represents a feature map in a high-dimensional space, and each edge of the DAG can be considered a component in the search space. We first assume that each block has two different input nodes P_1, P_2 and a single output node P_M. Hence, there are M − 3 middle nodes in total, that is, P_3, P_4, ..., P_{M−1}. The two input nodes are taken from the outputs of the two predecessor blocks. A middle node can be computed as

P_j = Σ_{i<j} Σ_{f∈F} [f is selected on edge (i, j)] f(P_i),

where i, j separately represent node indices in the DAG and [·] is the Iverson bracket: if the expression inside the bracket is true, the bracket value is 1, and vice versa. The output node is the concatenation of all non-input middle nodes:

P_M = concat(P_3, P_4, ..., P_{M−1}),

where concat is the abbreviation of concatenation.
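The node computations above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: small NumPy vectors stand in for feature maps, and `edges` maps each kept edge (i, j) to its already-selected function, so the Iverson bracket reduces to membership in the edge set.

```python
import numpy as np

def block_forward(inputs, edges, M=7):
    """Toy forward pass through a DAG block.

    inputs : (x1, x2) feature vectors for the input nodes P_1, P_2.
    edges  : dict mapping (i, j) -> chosen function f, meaning node j
             receives f(P_i). Node indices are 1-based, as in the text.
    """
    nodes = {1: inputs[0], 2: inputs[1]}
    # Middle nodes P_3 .. P_{M-1}: sum of the selected functions applied
    # to predecessor nodes (only kept edges contribute).
    for j in range(3, M):
        nodes[j] = sum(f(nodes[i]) for (i, jj), f in edges.items() if jj == j)
    # Output node P_M: concatenation of all non-input middle nodes.
    nodes[M] = np.concatenate([nodes[j] for j in range(3, M)])
    return nodes[M]
```

For example, with M = 5 there are two middle nodes (P_3, P_4) and the output concatenates both of them.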

Architecture Parameters Relaxation
The component selection procedure is discrete, i.e., a candidate is either selected or discarded. Consequently, it cannot be optimized by the Back Propagation algorithm [30]. In order to make this process continuous, we apply a softmax over all candidate functions between two nodes in the DAG to relax the choice. This step can be described as

õ^(i,j)(x) = Σ_{f∈F} ( exp(α_f^(i,j)) / Σ_{f′∈F} exp(α_{f′}^(i,j)) ) f(x),

where α is the architecture parameter, which determines the importance of a component, and õ^(i,j) is the weighted sum of all components from node i to node j. Through this approach, the architecture parameters are transformed into weights in (0, 1), making the optimization process feasible. When the architecture search is complete, the component with the maximum architecture parameter between node i and node j is retained:

f^(i,j) = argmax_{f∈F} α_f^(i,j).

After that, to prevent excessive connections, the in-degree of node j is constrained to two. This means that only the two most likely connections are preserved and the others are discarded.
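A minimal sketch of the relaxation and the subsequent discretization for a single edge (function names are ours; softmax over the α values provides the mixing weights):

```python
import numpy as np

def mixed_edge(alphas, funcs, x):
    """Continuous relaxation: weighted sum of all candidate functions on one
    edge, with softmax(alphas) as mixing weights (each weight lies in (0, 1))."""
    w = np.exp(alphas - np.max(alphas))  # subtract max for numerical stability
    w = w / w.sum()
    return sum(wi * f(x) for wi, f in zip(w, funcs))

def discretize(alphas, funcs):
    """After search: keep only the candidate with the largest alpha."""
    return funcs[int(np.argmax(alphas))]
```

When one α dominates, the mixed edge behaves almost exactly like the single function that discretization would keep.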
The purpose of our method is to search for two types of blocks, namely, the normal block and the reduction block. For the normal block, the stride of all components is 1 so that the size of the feature map is maintained, while the stride in the reduction block is 2, enabling the searched model to downsample feature maps. Hence, the architecture parameters can be divided into two groups: α_normal and α_reduction. It is noteworthy that all normal blocks and all reduction blocks share the same α_normal and α_reduction, respectively.

Optimization Policy
The optimization problem seems simpler after relaxation. However, it should not be ignored that the architecture parameters α and the weight parameters ω are coupled together: the loss function is determined by both α and ω. Aiming to decouple these two kinds of parameters, we leverage a bilevel optimization policy [31,32] to optimize them separately. Here, we use L_train to denote the training loss and L_val the validation loss. The optimization process can be expressed as

min_α L_val(ω*(α), α)
s.t. ω*(α) = argmin_ω L_train(ω, α),

where ω* represents the optimal weight parameter, which minimizes the training loss. Through alternate optimization, the best architecture parameter α* and weight parameter ω* can be obtained.
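The alternating (first-order) scheme can be illustrated on a toy scalar model y = ω·α, where ω is updated on the training loss and α on the validation loss. The losses, learning rates, and targets here are illustrative assumptions, not the paper's actual objective:

```python
def bilevel_search(steps=200, lr_w=0.1, lr_a=0.05):
    """First-order alternating optimization on a toy model y = w * a.

    w (weight) descends the training loss L_train = (w*a - 2)^2,
    a (architecture) descends the validation loss L_val = (w*a - 2)^2,
    mirroring the alternate-update structure of the bilevel scheme.
    """
    w, a = 0.5, 0.5
    for _ in range(steps):
        # inner step: weight parameters on the training loss
        w -= lr_w * 2 * (w * a - 2) * a
        # outer step: architecture parameters on the validation loss
        a -= lr_a * 2 * (w * a - 2) * w
    return w, a
```

After a few hundred alternating steps the product w·a converges to the shared optimum, showing how the two parameter groups can be optimized by turns without solving the inner problem exactly.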
Then the entire neural network can be constructed by decoding the best architecture parameter α*. Next, the network can be trained from scratch following common practice [9,10].

Model Pruning Strategy
Although the gradient descent algorithm can be applied with the aid of relaxation and bilevel optimization, the computational complexity is quite high because the weight parameters of all components need to be optimized simultaneously. The computational burden is not obvious on a dataset with relatively small images such as CIFAR-10 [27], whose images are just 32 × 32 pixels. Nevertheless, for large images, especially remote sensing images whose height and width commonly reach hundreds of pixels, the computation can become unbearable. Therefore, we adopt a dynamic method to construct the whole network: as the network deepens, components with poor performance are excluded during the search phase. The specific growth process and pruning strategy are described below in detail.

Network Layers Growth
In general, the network in the search process is relatively shallow compared with the ultimate architecture, due to the limitation of computation capability. However, according to some current studies [9,33,34], the depth of a network has a significant impact on its performance: an architecture that is excellent in the search stage may perform poorly in the evaluation stage.
Therefore, in order to improve this situation, we set the depth of the network in the search stage equal to that of the ultimate architecture. Considering the computational overhead, however, we adopt a gradual growth approach rather than a direct one. Every time the network becomes deeper, some unimportant candidate functions in the search space are discarded through pruning; the pruning strategy is stated in Section 2.2.2. This pattern not only offers the possibility of searching deeper architectures, but also relieves the computational burden brought by optimizing their architecture parameters and weight parameters.

Greedy Strategy and ε-Greedy Strategy
Regarding the pruning strategy, greedy pruning is the more basic method, which can be expressed as

P_prune(f^(i,j)) = [f = argmin_{f′∈F} α_{f′}^(i,j)],

where P_prune(f^(i,j)) is the pruning probability of component f between node i and node j, i.e., the weakest component is pruned with probability 1. The pruned architecture can obtain the maximum reward under all known conditions by exploiting existing knowledge. Intuitively, this pruning strategy is also very reliable.
Nevertheless, the greedy pruning strategy ignores the role of exploration to a certain degree. Typically, exploration and exploitation are two contradictory aspects of decision making. The reward the algorithm obtains will not grow if it only exploits existing knowledge; on the other hand, exploring blindly increases the risk. This is the exploration-exploitation dilemma [35]. Yet, the ε-greedy strategy can alleviate this contradiction to a certain degree by setting a small ε. In this strategy, any candidate function may be pruned with probability ε; otherwise, the weakest edge is removed. It can be written as

P_prune(f^(i,j)) = ε/|F| + (1 − ε) [f = argmin_{f′∈F} α_{f′}^(i,j)],

where |F| denotes the number of candidate functions in the predefined search space. It is also worth mentioning that the value of ε is not fixed: it gradually decreases as the pruning process proceeds. Theoretically, this jittered pruning approach can cover all candidate functions as far as possible, so exploration is more adequate while exploitation is still taken into account. Besides, this strategy is easy to implement and brings no extra complex computation.
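The two strategies can be sketched as follows (an illustrative Python sketch with names of our own; the index with the smallest α stands for the weakest candidate on an edge, and we assume the decay coefficient `t` simply multiplies ε after each phase):

```python
import random

def greedy_prune(alphas):
    """Exploitation only: remove the candidate with the smallest alpha."""
    return min(range(len(alphas)), key=lambda k: alphas[k])

def eps_greedy_prune(alphas, eps, rng=random):
    """With probability eps prune a uniformly random candidate (exploration),
    otherwise fall back to the greedy choice (exploitation)."""
    if rng.random() < eps:
        return rng.randrange(len(alphas))
    return greedy_prune(alphas)

def decay(eps, t=0.9):
    """eps shrinks each phase, so exploration gradually fades."""
    return eps * t
```

With ε = 0, the ε-greedy rule reduces exactly to greedy pruning, which matches the paper's choice of the plain greedy strategy in the final phase when few options remain.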

Potentiality Judgment
In practice, when decisions are infrequent, it is difficult to distinguish the behavior of the two strategies, for both of them are locally optimal methods. Meanwhile, we also cannot examine their performance from a global perspective because of the limitation of computation resources. Hence, in order to balance the stability of the greedy strategy and the exploration of the ε-greedy strategy, we add a simple validation before making the final decision. The original architecture is pruned by these two pruning strategies respectively after one search phase; we call the results A_g and A_ε. Next, they are preliminarily trained for some epochs and evaluated on the validation set, which we call architecture heating. We take the best performance of each pruned architecture (i.e., recognition accuracy) as the evaluation metric of the two pruning strategies, and the final decision is based on their behavior.
All in all, our search framework for FGATR-Net can be briefly summarized in Algorithm 1. After obtaining the final architecture A, we train it from scratch as a common neural network; through a sufficient training procedure, it converges on the target dataset.

Algorithm 1: Search framework for FGATR-Net.
1: Initialize the architecture A with search space F, architecture parameters α, weight parameters ω and exploration rate ε.
2: while the last search phase is not reached do
3:   Deepen A to the depth of the current phase.
4:   for epoch ∈ [1, E_search] do
5:     Optimize weight parameters of A: ω ← ω − ξ_A ∇_ω L_{D_A}(ω, α)
6:     if epoch > E_train then
7:       Optimize architecture parameters of A: α ← α − ξ_B ∇_α L_{D_B}(ω, α)
8:     end if
9:   end for
10:  A_g, F_g ← greedy pruning strategy for A and F.
11:  A_ε, F_ε ← ε-greedy pruning strategy for A and F with ε.
12:  for i ∈ [1, E_heating] do  // architecture heating process
13:    Performance_g ← optimize and evaluate A_g.
14:    Performance_ε ← optimize and evaluate A_ε.
15:  end for
16:  if Performance_g ≤ Performance_ε then
17:    A = A_ε and F = F_ε.
18:  else
19:    A = A_g and F = F_g.
20:  end if
21:  ε = t * ε.
22: end while
23: Obtain the searched FGATR-Net A.

Dataset
We conduct experiments on two challenging fine-grained aircraft type recognition datasets. The details of these two datasets are depicted as follows.

MTARSI
Multitype Aircraft Remote Sensing Images (MTARSI) [28] is a large-scale fine-grained aircraft type classification dataset. It was carefully annotated under the guidance of 7 experts in remote sensing interpretation, so the dataset possesses high authority. MTARSI contains 9385 remote sensing images of varying width and height extracted from Google Earth, with spatial resolutions ranging from 1 m to 0.3 m. In total, the dataset covers 20 types of aircraft, and the number of images per type ranges from 230 to 846. Moreover, MTARSI has abundant multitemporal information: images were collected at different times, which enriches the intraclass variation and brings more difficulty to the aircraft type recognition task. Besides, within each aircraft type the exact models are quite different, even though the appearances of different types are similar; this increased interclass similarity makes it difficult for a recognition algorithm to distinguish different types. Some aircraft examples from MTARSI are shown in Figure 4. As for the split of training and test sets, our approach is consistent with [28]: four-fifths of all samples are selected for training and the rest are used for evaluation.

Aircraft17
Aircraft17 [14] is a challenging fine-grained aircraft recognition dataset. It consists of 1945 optical remote sensing images collected from Google Earth. In total, there are 17 different types of aircraft located at different airports around the world. The size of all these remote sensing images is 156 × 156 pixels and the spatial resolution is about 0.5 m. This dataset also contains multitemporal information: the 982 remote sensing images used for training are selected from odd years, while the 963 images for testing are from even years. In addition, the distribution of samples across types is quite unbalanced: the number of training samples per type ranges from 30 to 60, and the number of test samples varies from 21 to 60. Figure 5 shows some examples of aircraft types in this dataset. It is worth noting that we do not adopt a great deal of extra data augmentation to enlarge this dataset as in [14], which expands the original dataset 56 times; we believe that approach is not conducive to verifying the generalization performance of the searched model. Therefore, we leverage only the original dataset to evaluate our method in the experiments of this paper.

Evaluation Metrics
In order to judge the performance of the various models quantitatively, we utilize Overall Accuracy (OA), a basic and widely used indicator:

OA = N_corr / N,

where N_corr is the number of correctly classified samples and N is the total number of samples. Generally, a model with a higher OA has better recognition performance. With regard to lightweight behavior, we adopt the network parameter quantity as the evaluation metric. For a convolution layer, it can be expressed as

Param(conv) = k_h * k_w * C_in * C_out + C_out, (11)

where k_h and k_w denote the height and width of the convolution kernel respectively, and C_in and C_out represent the number of input and output channels separately. Notably, if there is no bias, the last term in Equation (11) can be ignored. In addition, for a fully connected (fc) layer or linear layer, it can be written as

Param(fc) = T_in * T_out + T_out, (12)

where T_in and T_out respectively indicate the number of input and output neurons. Similarly, the last term can be removed if the bias is not considered.
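These metrics translate directly into code. A small sketch (the function names are ours) that computes OA and the per-layer parameter counts:

```python
def overall_accuracy(n_correct, n_total):
    """OA = N_corr / N."""
    return n_correct / n_total

def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Parameter count of a convolution layer: k_h * k_w * C_in weights per
    output channel, plus one bias per output channel if present."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

def fc_params(t_in, t_out, bias=True):
    """Parameter count of a fully connected layer: T_in * T_out weights,
    plus one bias per output neuron if present."""
    return t_in * t_out + (t_out if bias else 0)
```

For example, a 3 × 3 convolution taking 64 channels to 128 channels with bias contributes 3 * 3 * 64 * 128 + 128 = 73,856 parameters.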

Implementation Details
For the architecture search, we first evenly divide the initial training set into D_A and D_B (i.e., the number of images in D_A equals that in D_B). Eight candidate functions are collected as the search space: 3 × 3 max pooling, 3 × 3 average pooling, skip connection, 3 × 3 depthwise separable convolution, 5 × 5 depthwise separable convolution, 3 × 3 dilated separable convolution, 5 × 5 dilated separable convolution and zero. In total, we set 3 phases, which contain 5, 8 and 10 blocks respectively. Meanwhile, the numbers of pruned functions in the three phases are 3, 2 and 2. The value of ε is initialized to 0.1 with a decay coefficient of 0.9. After pruning, the architecture is heated for 5 epochs.
We adopt the Stochastic Gradient Descent (SGD) optimizer [36] with a momentum of 0.9 to optimize the weight parameters. The initial learning rate is 0.025 and it is decayed with a cosine schedule. Moreover, to prevent over-fitting, we set the weight decay to 10^−3. For the optimization of architecture parameters, we leverage the Adam optimizer [37] with a learning rate of 6 × 10^−4 and a weight decay of 10^−3. In each phase, we first fine-tune only the weight parameters for 10 epochs so that the model obtains an adequate learning process; after that, the architecture parameters and weight parameters are jointly trained for 15 epochs. When the optimal architecture is obtained, we train it from scratch on the target dataset. Figure 6 shows the structure of the normal block and reduction block searched on the MTARSI dataset. The scales are more abundant in the normal block, while the reduction block has a preference for convolutions with a large receptive field. Meanwhile, perhaps for the purpose of reducing computation, we also observe that depthwise separable convolutions are frequently selected on most edges of these two kinds of blocks.
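The cosine decay of the SGD learning rate can be sketched as follows. This is the standard cosine-annealing shape with the paper's initial rate of 0.025; the minimum rate of 0 and the absence of warm restarts are our assumptions:

```python
import math

def cosine_lr(epoch, total_epochs, lr_init=0.025, lr_min=0.0):
    """Cosine-annealed learning rate: starts at lr_init, decays smoothly
    to lr_min by the final epoch."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))
```

The rate starts at 0.025, halves by the midpoint of training, and reaches the minimum at the final epoch.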

Results on MTARSI
In order to confirm the performance of FGATR-Net on the MTARSI dataset, we compare it with some popular neural networks. All results are listed in Table 1; the confidence interval at the 95% confidence level is ±0.14. FGATR-Net outperforms the other manual models with much fewer parameters. Here, all baseline models load weights pretrained on ImageNet [38], while FGATR-Net is trained from scratch. FGATR-Net also compares favorably with EfficientNet [29], an excellent baseline model obtained by a search approach, not only in recognition performance but also in lightweight design: OA increases by about four percentage points, yet the parameters are reduced by almost a half. As a result, we can draw an important conclusion: the proposed automatic design pattern provides more flexible connection selection for the neural network. In this approach, there is no need to follow regular topology rules that involve many artificial factors. Therefore, the topology diversity of the neural network is increased, which is conducive to exploring the optimal architecture. Figure 7 shows some Class Activation Map (CAM) [39] examples on the MTARSI dataset. CAM is a visualization method that reflects the regions a neural network attends to most, which in turn affect its final decision. A red region indicates a high degree of attention, while the model pays relatively low attention to blue regions. We can observe that, compared with EfficientNet, FGATR-Net manages to capture the key region whether the scale of the aircraft in the image is large or small. For instance, Figure 7a,c each present a large aircraft, while a small aircraft is shown in Figure 7e. EfficientNet only focuses on small-scale aircraft; on the contrary, FGATR-Net is able to cover all of them. We infer that this phenomenon is caused by the large difference between remote sensing images and natural scene images.
EfficientNet is more suitable for processing natural scene images than remote sensing images. Moreover, FGATR-Net is robust to the brightness of remote sensing images, as in Figure 7c,e. Even more importantly, FGATR-Net searched on MTARSI can still better express the outline of the aircraft: the boundary information and details are more obvious. This is powerful support for the feature extraction and feature expression capability of the proposed FGATR-Net.

Results on Aircraft17
Architectures searched for the Aircraft17 dataset by the proposed method are shown in Figure 8. The search procedure took about two hours on three RTX 2080Ti GPUs with a batch size of 12. From the result of the normal block, we can find many types of convolutions, which helps the neural network extract multi-scale information. For the reduction block, the 5 × 5 dilated convolution is preferred, which indicates that the network has a strong demand for a relatively large receptive field. With regard to the quantitative experiments, we list the results in Table 2; the confidence interval at the 95% confidence level is ±0.42. As can be seen, FGATR-Net is superior to all baseline models and achieves the highest OA. As before, the baseline networks all load pretrained models, while FGATR-Net is trained on the Aircraft17 dataset from scratch. This suggests that FGATR-Net performs well at feature extraction. Moreover, it also performs well in terms of lightweight design: FGATR-Net contains only 1.86 MB of parameters, about 25% fewer than ShuffleNetV2 [11], a famous lightweight neural network. This strongly indicates that the architecture is relatively compact. Here, it is remarkable that the architecture design is an automatic procedure that does not need much expert knowledge or experience; yet promising results can still be obtained from this automatic framework. In addition to the quantitative results above, the visualization results, i.e., CAMs, on the Aircraft17 dataset are presented in Figure 9. We compare the best-performing baseline model, DenseNet, and the proposed FGATR-Net. We find that the salient regions obtained by DenseNet, namely the deep red parts in the CAMs, either have an obvious offset from the aircraft or cover the aircraft area excessively.
However, the corresponding CAMs of FGATR-Net almost always manage to cover the aircraft area in the remote sensing images. This suggests that the proposed FGATR-Net is able to effectively extract the vital information in the main region, which verifies the effectiveness of our method.

Discussion
The experimental results show that the proposed FGATR-Net is a very competitive model. It is able to extract the key features in remote sensing images and express them well. From the perspective of both the OA metric and the parameter quantity, it achieves State-Of-The-Art (SOTA) performance on two quite challenging datasets, MTARSI and Aircraft17. Besides, it is worth noting that our network architecture design is an automatic approach that does not heavily rely on expert knowledge and artificial experience. Our method provides a new pattern for fine-grained visual classification in remote sensing images. Table 3 lists the pruning strategies selected at different phases on Aircraft17 and MTARSI respectively. We observe that the greedy strategy and the ε-greedy strategy are selected successively during the search on the Aircraft17 dataset. We think that the network architecture is relatively shallow at the beginning of the search process and the combinations of candidate functions are limited, so exploitation is favored. However, as the network architecture deepens while the search space still contains many elements, the combinations become complicated, and exploration probably becomes the better choice. On the contrary, for MTARSI, only the ε-greedy strategy is selected. This is most likely caused by the relatively large scale of the dataset (9385 remote sensing images): the information obtained during search is not sufficient, making exploration more effective. From the confusion matrix results in Figure 10, we observe that most types of aircraft can be correctly distinguished on the Aircraft17 dataset. Their recognition metrics are satisfactory in principle, especially for type 6, whose accuracy even reaches 100%. Nevertheless, the proposed method can still be further improved. We notice that the recognition performance on type 16 is a huge drag on the OA metric for both the best baseline model (i.e., DenseNet) and the proposed FGATR-Net.
In the DenseNet model, type 16 is most likely to be misidentified as type 15, while FGATR-Net tends to recognize this type as type 10 or type 11. As shown in Figure 11, there is a big difference in aircraft type between type 16 and type 15, yet the backgrounds of the two types are quite similar: the background of these images is green and the airport runways are light grey. This demonstrates that background is an important influence factor for DenseNet. On the other hand, FGATR-Net overcomes this interference: the proportion misclassified as type 15 is not very large. However, FGATR-Net performs poorly on similar aircraft types: both type 16 and type 10 have four engines, while type 16 and type 11 both have swept-back wings and relatively long fuselages. In practical applications, especially in the military field, a neural network must be robust enough to resist external interference such as that mentioned above and avoid being deluded. Yet, for the proposed FGATR-Net, there is still room for improvement in this respect. Also, for edge computing devices, the hardware likely does not perform as well as under experimental conditions; how to maintain good behavior on real-world devices is also an important consideration for automatic architecture design. Consequently, we will pay more attention to these problems in future research.

Conclusions
In this article, a novel automatic architecture design framework for remote sensing fine-grained aircraft type recognition is explored for the first time. In this approach, the search process is divided into several phases. The network architecture deepens at each phase while the number of candidate functions decreases gradually. To achieve this, we adopt different pruning strategies. Then, the network architecture is determined through a potentiality judgment after an architecture heating process. This approach can not only search deeper networks but also reduce the computational complexity, especially for the relatively large size of remote sensing images. When the search process is complete, FGATR-Net is obtained, and we then train it on the target dataset from scratch. Experimental results on two challenging datasets, MTARSI and Aircraft17, show that FGATR-Net achieves the highest accuracy, i.e., 93.76% and 81.72% respectively, with much fewer parameters (2.33 MB and 1.86 MB respectively) compared with popular baseline models, which verifies that FGATR-Net possesses a strong capability of feature extraction and feature representation. Furthermore, it powerfully indicates the feasibility and effectiveness of the proposed automatic architecture design method. As for future work, we will continue to concentrate on automatic and lightweight network design for remote sensing fine-grained aircraft type recognition, and attempt to improve search efficiency while reducing computational complexity.