RS-DARTS: A Convolutional Neural Architecture Search for Remote Sensing Image Scene Classification

Abstract: Due to the superiority of convolutional neural networks (CNNs), many deep learning methods have been used in image classification. The enormous difference between natural images and remote sensing images makes it difficult to directly utilize or modify existing CNN models for remote sensing scene classification tasks. In this article, a new paradigm is proposed that can automatically design a suitable CNN architecture for scene classification. A more efficient search framework, RS-DARTS, is adopted to find the optimal network architecture. The framework has two phases. In the search phase, some new strategies are presented that make the calculation process smoother and better distinguish the optimal operation from the others. In addition, noise is added to suppress skip connections in order to close the gap between the training and validation processes and ensure classification accuracy. Moreover, a small part of the neural network is sampled to reduce the redundancy in exploring the network space and speed up the search process. In the evaluation phase, the optimal cell architecture is stacked to construct the final network. Extensive experiments demonstrated the validity of the search strategy and the impressive classification performance of RS-DARTS on four public benchmark datasets. The proposed method was more effective than manually designed CNN models and other neural architecture search methods. In particular, in terms of search cost, RS-DARTS consumed less time than other NAS methods.

Nevertheless, designing a neural network architecture directly for remote sensing images requires expert knowledge and considerable time; trial and error is a time-consuming process. Most recently, there has been a focus on automatically designing CNN models for image tasks by machine, which can save time and manpower compared to the manual design of neural architectures. Neural architecture search (NAS) [23], a rapidly developing research direction in automated machine learning, aims to automatically design a CNN model with good performance and high accuracy. Early NAS methods achieved impressive empirical performance in various tasks, such as image classification, but these approaches are still time-consuming and computationally expensive. In addition, most current NAS methods operate on natural images, and few works in the literature have addressed high-spatial-resolution aerial image tasks, because these tasks require many computing resources in the search phase. Gradient-based NAS, proposed in recent years, has reduced time consumption and computational cost, making it possible to apply gradient-based NAS to remote sensing images with few computational resources.
In this paper, we analyze the limitations of high-level features extracted from pretrained models and confirm the effectiveness of large-scale datasets for training NAS methods. To address these constraints of pretrained methods and create a model suitable for aerial images, a new paradigm is proposed to automatically design a CNN model for remote sensing scene classification. The contributions of this paper are summarized as follows.
(1) We present a novel framework called RS-DARTS and use it to search for optimal cells, stacking them into a new CNN model to improve remote sensing scene classification performance. The method can automatically design a CNN model better suited to aerial images, which addresses the problems faced by existing handcrafted CNNs designed for natural images. It can also handle the collapse issue in the search phase of neural architecture search methods applied to remote sensing images.
(2) Some efficient architecture regularization schemes are proposed to improve the efficiency of the search process and reduce the advantage of skip connections to avoid model collapse. New strategies are presented to ensure a high correlation between the search and evaluation phases. In addition, noise is added to suppress skip connections and ensure classification accuracy.
(3) To reduce the consumption of computing resources in the search phase, we sample the neural architecture in a particular proportion to speed up the search. Compared with previous methods, our method needs less time in the search phase while still obtaining better classification accuracy.
(4) The effectiveness of the proposed RS-DARTS framework is demonstrated on four public benchmark datasets. Extensive experiments reveal that the final discovered optimal CNN model achieves better classification than fully trained and pretrained CNN models. Moreover, the framework performs better than other NAS methods (including NAS methods applied to natural images and to remote sensing images) in terms of search efficiency and time consumption. By comparison, RS-DARTS achieved state-of-the-art accuracy in remote sensing image scene classification and reduced the search time by nearly a factor of five compared to DARTS.
The remainder of this article is organized as follows. Section 2 discusses and summarizes CNN models and the development of NAS frameworks. Section 3 describes the principle of the differentiable architecture search method and then presents our approach. Sections 4 and 5 describe the datasets, experimental setup, and classification results. Finally, the conclusion is given in Section 6.

Related Works
In past years, CNN-based methods have become the most popular approach, with a powerful ability to extract features from aerial images. A suitable CNN architecture determines the performance of scene classification. Nevertheless, designing a new suitable model from scratch is a time-consuming process. NAS, as a framework for automatically designing model architectures, has gained wide attention in recent years and has achieved impressive performance in many image tasks, including scene classification. This section provides a brief introduction to CNN-based and NAS methods and how they are applied to remote sensing scene classification.

CNN-Based Methods
CNN has shown astounding performance in image understanding tasks due to the advantage of automatic feature extraction.
Some methods directly use existing CNN models (pretrained, using CNNs as feature extractors). For instance, Fang et al. [24] proposed a MACP-based classification framework consisting of three steps. First, a pretrained CNN model was used to extract multilayer feature maps. Then, the feature maps were stacked, and a covariance matrix was calculated for the stacked features. Finally, the extracted covariance matrices were used as features for classification by a support vector machine. In reference [25], the authors proposed a novel dataset, namely, NWPU-RESISC45, and several representative methods, including pretrained AlexNet, GoogLeNet, and VGGNet-16, were evaluated on the dataset without any data augmentation technique. Zheng et al. [13] extracted CNN activations from the last convolutional layer of a pretrained CNN, performed multiscale pooling on these activations, and combined them with the Fisher vector [26] approach to build a holistic representation. The work [27] utilized pretrained CNNs such as VGG to extract dense convolutional features, flattened these features into vectors to generate visual words, and finally used these visual words for classification.
Some methods train a new CNN model from scratch for the scene classification task. For instance, Liu et al. [28] introduced a triplet network from scratch, where the network used weakly labeled images as network inputs. Besides, a new loss function was constructed to improve classification accuracy. F. Zhang et al. [29] proposed a gradient-boosting random convolutional network framework and introduced a multiclass softmax into the framework for scene classification. The paper [30] presented a fast regional-based CNN to detect ships from remote sensing images. It adopted an effective target detection framework and improved the structure of its original CNN. The results showed that training the model from scratch made the feature extraction procedure more effective.
In recent years, some new strategies have been proposed. These methods aim to address several special issues, such as spatial features, limited data, and interclass similarity [31]. For instance, [32] imposed a metric learning regularization term on CNN features, which can boost remote sensing image scene classification performance. Zhu et al. [33] proposed an attention-based deep feature fusion framework. The deep features derived from original images and attention maps were fused by multiplicative fusion, improving the ability to distinguish scenes of repeated texture and salient regions. Li et al. [34] raised the level of model learning from the sample to the task by organizing training in a meta way. The method learned a metric space that could help the model classify remote sensing scenes. The work in [35] developed a deep few-shot learning method based on prototypical deep neural networks combined with a SqueezeNet pretrained CNN for image embedding.
Most of these methods based on CNNs are designed directly for natural images. Unfortunately, the significant differences between natural and aerial images limit the performance of these methods when dealing with remote sensing scene classification directly. In other words, using existing CNNs or modifying modules in the CNN is not the optimal approach to further improve the classification performance.

NAS Methods
Neural architecture search as an automatic framework for designing the model architecture has attracted much attention. NAS aims to find an optimal architecture in a predefined search space and has a better performance in image classification in the validation phase. Although earlier NAS approaches can simplify the process of designing CNN models, reinforcement learning-based (RL-based) NAS [36] and evolutionary algorithm-based (EA-based) NAS [37] are still time-consuming and computationally expensive. In the search phase, these methods require many computing resources and consume hundreds of GPU days. In recent years, the demand for computing resources has decreased for NAS [38]. In [39], DARTS (Differentiable Architecture Search) used a new strategy to search the architecture over a continuous domain, which relaxed the search space to be continued and allowed the efficient search of the architecture by gradient descent. DARTS only consumed a few GPU days in the search phase and obtained a high classification accuracy in CIFAR10 and ImageNet. However, the DARTS performance often collapsed due to overfitting in the search phase, which resulted in a large gap between training and validation error [40]. To avoid these problems, researchers have made the following attempts.
GPAS [41] and Auto-RSISC [42] were based on a gradient descent framework to solve remote sensing scene classification. These methods aim to find more suitable convolution network models. GPAS used a greedy and progressive search strategy to strengthen the correlation between the search and evaluation phases [41]. Auto-RSISC sampled the neural architecture in a particular proportion to reduce the redundancy in the search space [42].
Although GPAS and Auto-RSISC have applied NAS to remote sensing scene classification, there are still some snags. GPAS considers the problem of model collapse, but it requires many computational resources during the training process. Auto-RSISC reduces the redundancy in the search space by sampling the neural architecture in a certain proportion. However, it limits the performance of the model and reduces the diversity of the model structure. To solve these drawbacks, a novel gradient descent-based paradigm method is proposed to design a suitable architecture for scene classification. Using the collaboration mechanism, the binarization of structural parameters and adding noise to the skip connection guarantee a much higher correlation between the search and evaluation phases. Then, we sample the neural architecture in a particular proportion to reduce the redundancy in the search space. Compared with GPAS and Auto-RSISC, the RS-DARTS method not only produces a state-of-the-art performance in remote sensing scene classification but also reduces GPU days in the search phase.

The Proposed Method
In this section, the differentiable architecture search method (DARTS) is introduced, and its limitations are analyzed [41]. Then, the proposed search framework is described, and some rules are introduced to strengthen the correlation between the search and evaluation phases. In addition, noise is added to alleviate the collapse in the search phase, and a sampling rule is proposed to reduce redundancy in the search phase. The overall framework of the algorithm in this work is illustrated in Figure 1. As an example, we investigate how information is propagated to node x_3. Three symbols appear during the search phase, namely σ, L_{0-1}, and E: σ represents the sigmoid function, L_{0-1} represents the 0-1 loss function, and E represents edge normalization. These functions and symbols are explained in Section 3.2. To determine the calculation results, we only sample a subset, 1/K, of the channels and connect them to the next stage, so that memory consumption is reduced by K times [42,43]. During sampling, σ and L_{0-1} are used to make the calculation smoother and to distinguish candidate operations more easily. Meanwhile, noise is added to the skip connection to reduce its competitiveness with other operations. Then, to minimize the uncertainty incurred by sampling, we use E for normalization.

Preliminary: DARTS
As a gradient-based approach, DARTS is much simpler than other search frameworks and can produce high-performance architectures in many tasks. Compared with RL-based and EA-based NAS methods, DARTS does not use controllers [36,44], hypernetworks [45], or performance predictors [46]. The gradient descent mechanism allows DARTS to find suitable network architectures within a few GPU days.
Following these works [33,40,41], DARTS first searches for an optimal computation cell in the search phase. The searched optimal cells are stacked to form a convolutional network or recursively connected to form a recurrent network. A cell, as a directed acyclic graph (DAG), consists of an ordered sequence of N nodes. Each node x_i is a latent representation, which represents a feature map in a convolutional neural network. Each directed edge (i, j) in the DAG represents a candidate computational operation o^{(i,j)} that transforms x_i. In DARTS, each cell has two input nodes and one output node, and each intermediate node is calculated from its predecessor nodes [39]. Node x_j is obtained via Equation (1):

x_j = Σ_{i<j} o^{(i,j)}(x_i), (1)
where o^{(i,j)} represents the candidate computing operation (e.g., convolution, max pooling, zero) for edge (i, j). To make the search space continuous, DARTS uses the softmax function to relax the categorical choice of a particular operation. The softmax relaxation is calculated as follows:

\bar{o}^{(i,j)}(x) = Σ_{o∈O} [exp(α_o^{(i,j)}) / Σ_{o'∈O} exp(α_{o'}^{(i,j)})] o(x), (2)
where the operation mixing weights for edge (i, j) are parameterized by a vector α^{(i,j)} of dimension |O|. The task of architecture search thus reduces to learning a set of continuous variables α = {α^{(i,j)}}. After the relaxation, the architecture α and the weights w within all the mixed operations (e.g., the weights of the convolution filters) are jointly learned. L_train and L_val denote the training and validation losses in the search phase; both are determined by the architecture parameters α and the network weights w. The goal of architecture search is to find α* that minimizes the validation loss L_val(w*, α*), where the weights w* associated with the architecture are obtained by minimizing the training loss, w* = argmin_w L_train(w, α*) [31]. DARTS uses a bilevel optimization approach to realize this goal, shown in Equations (3) and (4), where α is the upper-level variable and w is the lower-level variable:

min_α L_val(w*(α), α), (3)
s.t. w*(α) = argmin_w L_train(w, α). (4)
Bilevel optimization is more complex than other optimization approaches and requires considerable computational resources. DARTS applies an approximate scheme to solve the problem:

∇_α L_val(w*(α), α) ≈ ∇_α L_val(w − ξ ∇_w L_train(w, α), α), (5)

where w denotes the current weights maintained by the algorithm, and ξ is the learning rate for a step of the inner optimization. If w has reached a local optimum of the inner problem, namely ∇_w L_train(w, α) = 0, Equation (5) simplifies to ∇_α L_val(w, α). In other words, we update the parameters alternately and finally achieve convergence: we first update α and use it to update the network weights w; then, the new network weights w* are used to update the architecture weights α. This approach has been applied in many works, such as meta-learning for model migration [47], gradient-based hyperparameter fine-tuning [48], and unrolled generative adversarial networks [49].
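The continuous relaxation and mixed operation above can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not the actual DARTS implementation: the candidate operations are toy 1-D stand-ins for convolutions and pooling, and the bilevel optimization loop is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 1-D stand-ins for the candidate operations on one edge.
ops = [
    lambda x: x,                  # identity (skip connection)
    lambda x: np.zeros_like(x),   # "none" (zero) operation
    lambda x: 0.5 * x,            # stand-in for a parameterized op
]

def mixed_op(x, alpha):
    """Equation (2)-style relaxation: weight every candidate op by
    softmax(alpha), so the categorical choice of one operation becomes
    a differentiable function of alpha."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.ones(4)
alpha = np.array([2.0, 0.0, 0.0])  # architecture parameters for this edge
y = mixed_op(x, alpha)             # a convex combination of the op outputs
```

In the bilevel scheme, alpha would then be updated by gradient descent on the validation loss while the operation weights are updated on the training loss, alternating between the two.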
Although DARTS dramatically reduces the search time, some problems remain. First, the optimal normal cell searched by DARTS involves many skip connections, making the selected architecture shallow: a shallow network has fewer learnable parameters than a deep one and thus weaker expressive power, which leads to poor performance. Second, the redundant network architecture space causes heavy memory and computation overheads, a problem exacerbated when processing high-resolution remote sensing images. These problems prevent the search process from using a larger batch size to either speed up the search or obtain higher stability [43]. A novel search framework is proposed to address these drawbacks, which is more efficient and better suited to remote sensing image tasks. The details of the presented framework are given in Section 3.2.

Collaboration Mechanism and Binarization of Structural Parameters
In the search phase of DARTS, the skip connection is similar to the residual connection of ResNet [50]. It can help the framework obtain superior accuracy in the search phase. However, the weight of the skip connection becomes large when the number of search epochs is large. Thus, the number of skip connections increases in the selected architecture, which can cause collapse in the search phase. Meanwhile, the softmax function is based on exclusive competition, which reinforces the unfair competitive advantage of the skip connection [51].
To solve the unfair competition between skip connections and other operations, we use a cooperative mechanism to limit it. The sigmoid function σ(·) is used in place of the softmax function to calculate the weight of each α_o^{(i,j)}, which allows each operation to be selected independently, without competition. Equation (2) is modified as follows:

\bar{o}^{(i,j)}(x) = Σ_{o∈O} σ(α_o^{(i,j)}) o(x). (6)

At the same time, DARTS suffers from discrepancies when discretizing the continuous encoding [39]. In the search phase of DARTS, the structure parameters take values in the range 0.1–0.3. This range is too narrow to distinguish between good and bad candidate operations [51]. For instance, consider an edge of the cell with weights [0.1310, 0.1193, 0.1164, 0.1368, 0.1247, 0.1205, 0.1304, 0.1210]; these values are very close to each other. The highest value is 0.1368 and the next highest is 0.1310, so it is hard to say that an operation weighted by 0.1368 is better than one weighted by 0.1310. To solve this problem, the 0-1 loss function is proposed to restrict the values produced by the sigmoid function so that the structure weights tend toward either 0 or 1. Therefore, when selecting the final operation, we choose the operation with a weight of 1. If multiple weights in a set equal 1, these operations are tried and the most profitable one is selected; the processing is similar to DARTS, where the two operations with the highest weights are selected [39]. The 0-1 loss function is expressed as follows.
L_{0-1}(α) = −(1/N) Σ_{i=1}^{N} (σ(α_i) − 0.5)^2, (8)

where Equation (8) acts like an L2-norm penalty: it pushes the operation weights toward 0 or 1 and helps distinguish between good and bad operations. A control variable w_{0-1} is added to control the strength of the 0-1 loss. The final loss function is as follows:

L = L_val + w_{0-1} L_{0-1}. (9)

Then, Equation (3) is modified to Equation (10):

min_α L_val(w*(α), α) + w_{0-1} L_{0-1}(α). (10)
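The sigmoid weighting and the 0-1 loss can be illustrated with a short sketch. The loss formula follows the Fair DARTS-style zero-one loss cited above; the example edge weights and the control-variable value are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def zero_one_loss(alpha):
    """0-1 loss: the leading minus sign means minimizing it pushes each
    sigmoid(alpha) away from 0.5, i.e. toward 0 or 1, so good and bad
    operations separate clearly."""
    s = sigmoid(alpha)
    return -np.mean((s - 0.5) ** 2)

w_01 = 10.0  # illustrative value for the control variable w_{0-1}

def total_loss(val_loss, alpha):
    """Validation loss plus the weighted 0-1 term."""
    return val_loss + w_01 * zero_one_loss(alpha)

# Weights that are hard to rank vs. weights that are well separated.
close = np.array([0.131, 0.119, 0.116, 0.137])
separated = np.array([5.0, -5.0, -5.0, 5.0])
assert zero_one_loss(separated) < zero_one_loss(close)
```

The assertion at the end checks the intended effect: well-separated architecture weights attain a lower 0-1 loss than nearly indistinguishable ones, so gradient descent on the total loss drives the weights apart.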

Adding Noise in Skip-Connection
Using a collaboration mechanism alone is not sufficient to solve the unfair competition in the search phase; the collapse of the model still occurs when searching computation cells. Therefore, unbiased random noise is applied to the output of skip connections [52]. This not only suppresses unfair competition but also helps the training of deep models [53]. Thus, a small, unbiased noise with zero mean and small variance is introduced.
As random noise ε is added to the output of the skip connection, with α_skip denoting the structural weight of the skip connection, the loss function for the skip connection can be written as

L_val(σ(α_skip)(x + ε)), (11)

where L_val(·) represents the validation loss function and σ(·) represents the sigmoid function used to calculate the weight. If the noise is much smaller than the output values, we obtain Equation (12):

L_val(σ(α_skip)(x + ε)) ≈ L_val(σ(α_skip) x). (12)

In the noisy scenario, the derivative with respect to the skip connection weight is expressed in Equation (13):

∂L_val/∂α_skip = (∂L_val/∂\bar{x}) (x + ε) σ'(α_skip), (13)

where \bar{x} = σ(α_skip)(x + ε) denotes the skip connection output.
As mentioned above, if ε ≪ x, there is no effect on the computed output. In this work, Gaussian noise is used to attenuate the unfair competition of the skip connection. Equation (6) is then modified to Equation (14):

\bar{o}^{(i,j)}(x) = Σ_{o≠skip} σ(α_o^{(i,j)}) o(x) + σ(α_skip^{(i,j)})(x + ε). (14)
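A minimal sketch of the noisy skip connection, assuming zero-mean Gaussian noise with a small standard deviation (the value 1e-3 below is illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def noisy_skip(x, alpha_skip, sigma=1e-3):
    """Skip-connection output with small, unbiased Gaussian noise.
    Because the noise has zero mean and tiny variance, the forward
    value is essentially unchanged, but the gradient with respect to
    alpha_skip is perturbed, weakening the skip connection's unfair
    advantage during search."""
    eps = rng.normal(0.0, sigma, size=x.shape)
    return sigmoid(alpha_skip) * (x + eps)

x = np.ones(1000)
clean = sigmoid(0.3) * x
noisy = noisy_skip(x, 0.3)
# eps << x, so the computed output is nearly identical to the clean one.
assert np.abs(noisy - clean).max() < 1e-2
```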

Sample 1/K of All Channels into Mixed Computation
Despite its sophisticated design, DARTS still has a spatial redundancy problem in the search phase and suffers from heavy memory and computation overheads [48]. To solve this problem, we randomly sample a subset of channels while bypassing the rest directly via a shortcut [43]. This avoids sending all channels into the operation selection: the computation on the subset serves as an approximate surrogate for the calculation on all channels. It yields a tremendous reduction in memory and computation costs and helps avoid getting stuck in local optima.
Using this strategy can significantly increase the batch size and speed up training. Specifically, as only 1/K of the channels are randomly sampled for operation selection, the memory burden is reduced by almost K times [42,43]. This rule allows a K-times-larger batch size during training, which not only speeds up the network search but also makes the process more stable, particularly for large-scale datasets. The parameter S^{(i,j)} is introduced to define whether a channel is sampled: a selected channel is marked as 1 and an unselected channel as 0. Therefore, Equation (14) with channel sampling is expressed as follows:

f^{(i,j)}(x_i; S^{(i,j)}) = Σ_{o∈O} σ(α_o^{(i,j)}) o(S^{(i,j)} · x_i) + (1 − S^{(i,j)}) · x_i, (15)

where S^{(i,j)} · x_i represents the selected channels and (1 − S^{(i,j)}) · x_i represents the unselected channels. However, the sampling strategy can cause undesired fluctuation in the resultant network architecture [43]. To alleviate this problem, we introduce edge normalization, and the computation of x_j becomes:

x_j = Σ_{i<j} [exp(β^{(i,j)}) / Σ_{i'<j} exp(β^{(i',j)})] f^{(i,j)}(x_i), (16)

where β^{(i,j)} represents the normalization weight on edge (i, j). The parameters α^{(i,j)} and β^{(i,j)} jointly decide the connectivity of edge (i, j). The modified optimization process, codenamed RS-DARTS, is shown in Algorithm 1. Note that the value of ξ is assigned the learning rate of the optimizer of the network weights w.
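The partial-channel sampling and edge normalization can be sketched as follows. This is an illustrative NumPy version with toy stand-in operations, not the actual RS-DARTS code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def partial_channel_mixed_op(x, alpha, ops, K=4):
    """Equation (15)-style mixing: only 1/K of the channels go through
    the sigmoid-weighted candidate ops; the rest bypass via a shortcut."""
    C = x.shape[0]
    mask = np.zeros(C, dtype=bool)
    mask[rng.choice(C, C // K, replace=False)] = True  # sampled subset S
    w = sigmoid(alpha)
    mixed = sum(wi * op(x[mask]) for wi, op in zip(w, ops))
    out = np.empty_like(x)
    out[mask], out[~mask] = mixed, x[~mask]            # bypassed channels unchanged
    return out

def edge_normalized_sum(features, beta):
    """Equation (16)-style edge normalization: weight each incoming edge
    by softmax(beta) to damp the fluctuation from channel sampling."""
    w = softmax(beta)
    return sum(wi * f for wi, f in zip(w, features))

ops = [lambda x: x, lambda x: 0.5 * x]   # toy stand-ins for candidate ops
x = np.ones(16)
y = partial_channel_mixed_op(x, np.array([0.0, 0.0]), ops, K=4)
```

With K = 4, only 4 of the 16 channels pass through the mixed operation; the other 12 reach the output untouched, which is what cuts the memory cost.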

Experiments
In this section, the experimental setup for evaluating the proposed method is described. First, Section 4.1 introduces the datasets used. Then, in Section 4.2, the metric for quantitative evaluation is described. In Section 4.3, a new large-scale dataset is constructed by merging existing datasets, and the reasons for merging are explained. Finally, the implementation of the proposed method is shown in Section 4.4.

Datasets Description
In the experiments, four remote sensing scene datasets, namely, AID [54], NWPU-RESISC45 (NWPU) [25], RSI-CB [55], and PatternNet [56] (these data resources can be obtained from https://captain-whu.github.io/AID/AIDscene.html, http://www.escience.cn/people/JunweiHan/NWPU-RESISC45.html, https://github.com/lehaifeng/RSI-CB and https://sites.google.com/view/zhouwx/dataset, accessed on 19 November 2020, respectively), are used to validate the performance of the proposed method. Since these datasets are collected from different satellite sensors, they show rich diversity in, for example, image size and resolution. The description of these datasets is shown in Table 1. The table shows that the NWPU dataset has the largest total number of images, while PatternNet has the largest number of images per class, with 800. In addition, Figure 2 shows some examples from the different datasets. The first column is randomly sampled from AID; the second column is from NWPU-RESISC45; the third and last columns are sampled from PatternNet and RSI-CB, respectively.

A Metric for Evaluation
In this paper, to assess the classification accuracy of the proposed method, overall accuracy (OA) is employed as a criterion. The OA is defined as follows.
OA = (1/N) Σ_{i=1}^{N} a_i, (17)

where a_i represents the accuracy of the i-th recorded result, N is the total number of recorded results, and each result is computed over a batch of n images.
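In the common case where each recorded result is simply correct or incorrect per image, OA reduces to the fraction of correctly classified images. A minimal sketch:

```python
import numpy as np

def overall_accuracy(predictions, labels):
    """OA: the number of correctly classified images divided by the
    total number of classified images."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float((predictions == labels).mean())

# Three of four toy predictions match the ground-truth labels.
assert overall_accuracy([0, 1, 2, 2], [0, 1, 2, 1]) == 0.75
```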

Prepared Data Sets
As mentioned in Section 4.1, NWPU has the largest total number of images, but in a single class, PatternNet has the largest number of images. These two datasets can be used as training sets in the search phase of RS-DARTS. Forty percent of the sample data are selected as the training set in the search phase to obtain the final network. The network is trained on the four benchmark datasets. The classification results are shown in Table 2, where the results of the final network from the NWPU-based search achieve 86.77%, 90.06%, 99.02%, and 98.93%, respectively. However, the accuracy of the PatternNet-based search is better than that of the NWPU-based search. This illustrates that the amount of data in a single category directly affects the performance of the searched final CNN models. A large dataset can help find a suitable and robust CNN model for remote sensing image scene classification. However, compared with the ImageNet dataset, the AID, NWPU, and other datasets are smaller. In addition, remote sensing images from different datasets are collected by different sensors at different surface locations, which leads to some differences between these datasets [11]. Therefore, we use the following rules to merge the four large benchmark remote sensing scene datasets into a large dataset called Large-RS, which consists of 96,252 images in 80 classes. As the merged dataset and the GPAS model [11] are not open source, we merged Large-RS using the same rules and divided it according to the same scale. The rules of merging are as follows.
(1) Since these datasets come from different institutions, the same scene may be labelled with different names in different datasets. A uniform class-naming rule is therefore applied.
(2) We reclassified and merged some of the images that contained overlapping scene content but did not belong to the same category.
(3) Due to some ambiguous category definitions or a small number of images in different datasets, we directly removed these category data.
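The merging rules above amount to a label-harmonization step. The sketch below illustrates that step; all class names, the synonym map, and the removal set are hypothetical placeholders, not the actual Large-RS categories:

```python
# Hypothetical synonym map and removal set (illustrative only).
SYNONYMS = {
    "parking_lot": "parking",
    "parkinglot": "parking",
}
REMOVED = {"ambiguous_class"}  # dropped per rule (3)

def harmonize(label):
    """Rules (1)-(3): uniform naming, synonym merging, removal."""
    label = label.lower().replace(" ", "_")
    label = SYNONYMS.get(label, label)
    return None if label in REMOVED else label

def merge_datasets(datasets):
    """Merge (image_id, label) pairs from several source datasets into
    one list with harmonized class names, skipping removed classes."""
    merged = []
    for source, samples in datasets.items():
        for image_id, label in samples:
            new_label = harmonize(label)
            if new_label is not None:
                merged.append((f"{source}/{image_id}", new_label))
    return merged
```

For example, "Parking Lot" from one dataset and "parkinglot" from another would both map to a single "parking" class, while images in removed categories are dropped.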

Implementation Details
To evaluate the effectiveness of the presented method, the experiment is split into two parts. In the search phase, the optimal cell architecture is searched by the proposed RS-DARTS on the Large-RS dataset, which is divided into three subsets by stratified sampling, i.e., 40% (60%) of the samples for training, 20% for validation, and 40% (20%) for testing. In the evaluation phase, the final CNN model is constructed from the optimal cell. Then, the final network is fully trained from scratch on the four benchmark datasets (AID, NWPU, RSI-CB, and PatternNet). The hardware and software environments are shown in Table 3.
Table 3. The hardware and software of the experimental environment.

GPU: NVIDIA Tesla V100; framework: PyTorch 1.6.0; language: Python 3.7.7.

In the search phase, we predetermine the search space O, which contains 3 × 3 and 5 × 5 separable convolutions, 3 × 3 and 5 × 5 dilated separable convolutions, 3 × 3 max pooling, 3 × 3 average pooling, identity, and none (zero). All training images are resized to 224 × 224 pixels to capture more image information. Several data augmentation techniques are used to avoid overfitting and ensure the robustness of the searched architectures, including random cropping, rotation and flipping, and CutMix. The hyperparameters are shown in Table 4.
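The eight candidate operations and the size of the resulting architecture-parameter space can be sketched as follows. The four-intermediate-node cell is the standard DARTS setting and is assumed here; the operation names follow common DARTS-style codebases:

```python
# The eight candidate operations of the predefined search space O.
SEARCH_SPACE = [
    "sep_conv_3x3", "sep_conv_5x5",      # separable convolutions
    "dil_conv_3x3", "dil_conv_5x5",      # dilated separable convolutions
    "max_pool_3x3", "avg_pool_3x3",      # pooling
    "skip_connect",                      # identity
    "none",                              # zero operation
]

def num_architecture_params(num_intermediate=4, num_ops=len(SEARCH_SPACE)):
    """Each intermediate node i receives edges from the two cell inputs
    plus all earlier intermediate nodes; every edge carries one alpha
    per candidate operation."""
    edges = sum(2 + i for i in range(num_intermediate))
    return edges * num_ops
```

Under these assumptions, a cell has 14 edges and therefore 14 × 8 architecture parameters to learn per cell type.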
In the evaluation phase, the optimal searched cell is stacked to construct the final CNN model, which is trained on the four benchmark datasets. We train the final optimal model from scratch for 500 epochs to ensure convergence. In this phase, we also employ CutMix and Cutout regularization to augment the number of images. The hyperparameters are shown in Table 5. Except for the hyperparameter settings mentioned in that table, all hyperparameters are the same as in Table 4.

Results and Analysis
In order to verify the validity of RS-DARTS, the experimental results of the proposed method are analyzed. First, the performance of RS-DARTS is contrasted with classical CNN models. Next, the proposed RS-DARTS is compared with four state-of-the-art NAS methods to confirm its efficiency and robustness. Finally, the architecture search visual results are presented.

Compared with CNN Models
In the experiments, to validate the classification performance of the final searched network, it is compared with classical CNN models (i.e., fully trained and pretrained) on remote sensing images. The final searched network is stacked from the best searched cell. The classical CNN models include VGG-16 [57], GoogLeNet [58], and ResNet-50 [50], which are often used as feature extractors in various CNN-based approaches to remote sensing scene classification. For the training datasets, the proportions of training samples in the AID, NWPU, RSI-CB, and PatternNet datasets are set to 50%, 60%, 60%, and 50%, respectively. The final classification results are illustrated in Table 6, where the fifth to seventh rows present the classical CNN models with pretrained results. For a fair comparison, these pretrained models are initialized on ImageNet and fine-tuned on the target datasets. All other rows are randomly initialized and trained from scratch.
In Table 6, for the AID dataset, the accuracy of the fully trained method is 93.37%, and that of the pretrained method is 93.90%, while the classification accuracy of the final network searched by RS-DARTS reaches 94.14%. Compared with the fully trained and pretrained methods, our proposed approach thus achieves the highest classification accuracy. For the NWPU, RSI-CB, and PatternNet datasets, the accuracy of the proposed RS-DARTS is improved by 6%, 2%, and 2% over the fully trained models, and by 1% over the pretrained classical models. In the experiments, in the case of fewer training samples (40% training samples), the final network obtained from RS-DARTS has slightly lower accuracy than other pretrained CNN models on the AID dataset but shows a significant improvement in accuracy on the other datasets. This reveals that the proposed NAS-based method can help improve remote sensing scene classification performance, and that the pretrained classical CNN approach is not optimal for remote sensing scene classification. Meanwhile, the classification accuracy for the RSI-CB and PatternNet datasets is higher than that for the AID and NWPU datasets. The reason is that the amount of data per category in the former is relatively larger, making it easier for the CNN model to identify the image category. This is why merging the Large-RS dataset could help the final CNN model achieve better scene classification accuracy.

Comparison with Other NAS Methods
In the second experiment, the efficiency and robustness of RS-DARTS are verified by comparison with four state-of-the-art gradient-based NAS methods: DARTS [39], PC-DARTS [43], Fair DARTS [51], and GPAS [41]. For these selected NAS methods, the configuration is the same as in previous works [35,37,46,48]. The merged dataset Large-RS (40% training samples) is used for training in the search phase. The OA, the search cost (in GPU days), and the number of parameters are used as criteria to judge the effectiveness of these NAS methods and RS-DARTS; among them, the number of parameters determines the size of the final CNN model. Table 7 presents the OA results. Comparing the classification performance of the final CNN models produced by the NAS methods on the four datasets, the proposed RS-DARTS achieves the highest accuracy. For the NWPU dataset, RS-DARTS reaches 93.56%, which is much better than the other gradient-based NAS methods; for the other datasets, RS-DARTS likewise achieves state-of-the-art accuracy. Compared with GPAS [41], the proposed RS-DARTS exhibits a 1% improvement on the NWPU, RSI-CB, and PatternNet datasets. Although the OA of RS-DARTS is lower than that of GPAS on the AID dataset, the proposed method achieves state-of-the-art accuracy on the other datasets. These results demonstrate that the proposed rules prevent the model from collapsing during the search phase and maintain the depth of the final network, which guarantees that the final network can address remote sensing scene classification. Table 8 compares the numbers of parameters. For the AID dataset, DARTS produces a model with 2.3M parameters, while PC-DARTS and GPAS each produce 3.6M. The proposed RS-DARTS yields 3.3M parameters, a reduction of 0.3M compared with PC-DARTS and GPAS.
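The overall accuracy (OA) used as a criterion above is simply the fraction of correctly classified test samples. A minimal sketch (the toy predictions are illustrative, not actual experiment outputs):

```python
def overall_accuracy(predictions, labels):
    """OA = number of correctly classified samples / total samples."""
    assert len(predictions) == len(labels) and labels
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 3 of 4 predictions are correct, so OA = 0.75.
oa = overall_accuracy(["airport", "beach", "beach", "farm"],
                      ["airport", "beach", "farm", "farm"])
```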
This may be because limiting the advantage of skip connections causes other, parameter-bearing operations to be selected, which increases the model parameters; it shows the effectiveness of our restriction on skip connections. Although more parameters are generated than with DARTS, RS-DARTS spends significantly less time (0.83 GPU days) than DARTS (4 GPU days) in the search phase. The search cost results are shown in Figure 3. RS-DARTS costs only 0.83 GPU days to find a competitive cell architecture, significantly reducing the search time compared to DARTS (4 GPU days) and GPAS (1.8 GPU days). Although the number of parameters of the final CNN model obtained by RS-DARTS increases by 1M, the search cost is lower and the method guarantees higher classification accuracy.

Searched Cell Results
In a NAS method, the architecture of the searched cell determines the effectiveness of the final CNN model. However, the aggregation of skip connections in normal cells can cause the model to collapse in the search phase [44]. To demonstrate the ability of RS-DARTS to control skip connections, in this section RS-DARTS, DARTS, and PC-DARTS are used to search for the optimal cell architecture on the Large-RS dataset, and the resulting optimal cell architectures are compared with each other. GPAS is also compared with the proposed RS-DARTS method.
The detailed search results for DARTS and PC-DARTS are shown in Figure 4. The normal cell architectures shown in Figure 4a,c,e are obtained from DARTS, PC-DARTS, and RS-DARTS, respectively. These results reveal that the normal cell searched by RS-DARTS tends to preserve deep operations (e.g., convolutions) and pooling layers rather than shallow connections (e.g., skip connections). This means that the normal cells from RS-DARTS contain more learnable parameters; thus, the final CNN model searched by RS-DARTS has a stronger expressive ability. In other words, the normal cell searched by RS-DARTS guarantees the depth of the final network, which leads to better performance. For the reduction cells (Figure 4b,d,f), the reduction cell searched by RS-DARTS retains skip connections among many deep connections, which helps mitigate the gradient explosion and vanishing gradient problems in deeper networks [50].
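The tendency described above can be checked mechanically by counting the operation types in a searched cell. The sketch below assumes a DARTS-style genotype encoding, where each intermediate node keeps its two chosen (operation, input-node) edges; the example genotype is hypothetical, not the actual cell searched by RS-DARTS.

```python
from collections import Counter

# Hypothetical DARTS-style normal cell: a list of (operation, input_node)
# pairs, two edges per intermediate node.
normal_cell = [
    ("sep_conv_3x3", 0), ("sep_conv_3x3", 1),
    ("sep_conv_5x5", 0), ("dil_conv_3x3", 2),
    ("max_pool_3x3", 1), ("sep_conv_3x3", 3),
    ("skip_connect", 0), ("sep_conv_3x3", 4),
]

def op_histogram(cell):
    """Count how often each candidate operation was selected in a cell."""
    return Counter(op for op, _ in cell)

hist = op_histogram(normal_cell)
```

A cell dominated by `skip_connect` entries in such a histogram has few learnable parameters, which is the collapse symptom the method aims to avoid.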
To further validate the effectiveness of RS-DARTS, GPAS and Auto-RSISC are compared with our proposed method. The normal cell architectures searched by GPAS and Auto-RSISC are shown in Figures 5a and 6a. The normal cell searched by RS-DARTS (Figure 4e) further reduces the number of skip connections, which ensures the depth and expressiveness of the final network. For the reduction cell (Figures 4f, 5b and 6b), RS-DARTS preserves skip connections. These results illustrate that the proposed method not only suppresses the accumulation of skip connections in the normal cell but also ensures the presence of skip connections in the reduction cell, making network training more stable.
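The skip-connection suppression discussed in this section works by adding zero-mean noise to the skip connection's output during the search, so its contribution to the architecture gradient becomes less reliable and its unfair advantage over parameterized operations weakens. The following is a scalar toy sketch of such a noisy mixed operation, not the paper's implementation; the operation names, noise level, and weighting scheme are illustrative assumptions.

```python
import math
import random

OPS = ["skip_connect", "sep_conv_3x3", "max_pool_3x3"]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(outputs, alphas, noise_std=0.1, rng=None):
    """Weighted sum of candidate-operation outputs (scalar toy).
    Zero-mean Gaussian noise is injected into the skip connection's
    output during the search to suppress its unfair advantage."""
    rng = rng or random.Random(0)
    weights = softmax(alphas)  # architecture parameters -> mixing weights
    total = 0.0
    for op, out, w in zip(OPS, outputs, weights):
        if op == "skip_connect":
            out = out + rng.gauss(0.0, noise_std)
        total += w * out
    return total

# Toy usage: three candidate outputs with equal architecture weights.
y = mixed_op([1.0, 0.5, 0.2], [0.0, 0.0, 0.0])
```

Because the noise has zero mean, it averages out over training steps and does not bias the expected output, but it perturbs individual gradient estimates flowing through the skip connection.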

Conclusions
In this paper, we summarize the limitations of features extracted from pretrained CNN models and investigate the performance of neural architecture search for remote sensing scene classification. We identify the drawbacks of GPAS and Auto-RSISC and propose a novel NAS framework for this task. Our framework uses a collaboration mechanism and binarization of the structural parameters as search strategies with gradient-based optimization. These rules alleviate the unfair competition between skip connections and other operations. However, since this strategy does not entirely avoid the problem, we add a simple method that injects noise into the skip connection to suppress the unfair competition further. To speed up the search and reduce memory and computation costs, we use a random sampling method to select feature channels. Moreover, to make the search model more generalizable, we merge a large-scale scene dataset similar to VHRRS, namely Large-RS, and use the combined dataset for training and validation during the search.
Extensive experiments demonstrate the efficiency of the RS-DARTS framework and the impressive classification performance of the searched CNN architectures on four public benchmark datasets, showing that the framework is practical for remote sensing scene images. In future work, we will apply the method to more remote sensing image problems. We believe that NAS methods will provide new ideas for model design in remote sensing imagery.
Funding: This research received no external funding.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.