A Method for Gradient Differentiable Network Architecture Search by Selecting and Clustering Candidate Operations

The current evolution of deep learning requires further optimization in terms of accuracy and time. AutoML is an area that could provide solutions to these new requirements, and neural architecture search (NAS) is one of its fields. DARTS is a widely used NAS approach based on gradient descent; however, it has some drawbacks. In this study, we attempted to overcome some of the drawbacks of DARTS by improving the accuracy and decreasing the search cost. The DARTS algorithm uses a mixed operation that combines all operations in the search space. The architecture parameter of each operation comprising a mixed operation is trained using gradient descent, and the operation with the largest architecture parameter is selected. The use of a mixed operation causes a problem that we call vote dispersion: similar operations divide the architecture parameter weight among themselves during gradient descent; thus, there are cases where the most important operation is disregarded. In this selection process, vote dispersion causes DARTS performance to degrade. To cope with this problem, we propose a new algorithm based on DARTS called DG-DARTS, which introduces two search stages and applies clustering of operations. In summary, DG-DARTS achieves an error rate of 2.51% on the CIFAR10 dataset, and its search cost is 0.2 GPU days because the search space of the second stage is reduced by half. The speed-up factor of DG-DARTS over DARTS is 6.82, which indicates that the search cost of DG-DARTS is only 13% that of DARTS.


Introduction
Neural architecture search (NAS) is a technique that automatically finds the optimal model architecture, and it is a field of automatic machine learning (AutoML), which has recently been gaining attention. NAS enables automatic design of the model architecture rather than manual tuning of the hyperparameters of models. NASNet [1], which laid the foundation of NAS, searches for an architecture with 500 GPUs over four days (1800 GPU-days) among 10^15 possible architectures. Because of this cost, subsequent NAS algorithms have evolved to reduce the search space and to change the criteria for selecting operations in order to decrease the search cost and increase performance.
Early NAS algorithms, including reinforcement learning [2] and evolutionary algorithms [1], had huge search costs of hundreds to thousands of GPU days, whereas the gradient-descent-based DARTS [3] has shown remarkable accuracy at a cost of only four GPU days on a single GPU. DARTS uses the idea of a cell-based architecture defined in NASNet [1] to find the best network architecture. The final architecture found by DARTS is a stack of cells with the same structure, each of which is composed of multiple nodes. The nodes are connected by operations selected from the candidate operations. DARTS aims to determine the shape of the directed acyclic graph (DAG) of the nodes composing the cell architecture, and to select the operation used on each edge. Mixed operations, which include every candidate operation, are generated and used on the edges connecting the nodes. As training progresses, the architecture parameter of each operation comprising a mixed operation is updated. When the training is completed, the operation with the largest architecture parameter is selected on each edge, and the cell architecture is thereby decided. The stack of these cells comprises the final model.
The current DARTS experiences the problem of vote dispersion, as described in Section 3.1. In this study, our goal was to solve the vote dispersion problem in DARTS in order to reduce the error of the final model and decrease the GPU cost. Therefore, we propose differential group-differentiable architecture search (DG-DARTS) in this paper.
The rest of this article is organized as follows: Section 2 describes related works, especially focusing on DARTS. Section 3 provides a description of DARTS and discusses previous works, along with their relationships. In Section 4, we present the experimental environment details and results. Section 5 discusses the major benefits of our research in comparison to existing results. Finally, in Section 6, we conclude the paper.

Related Works
With the rapid development of deep learning, the area of AutoML has emerged. Neural architecture search (NAS), which is the automated design of artificial neural networks, is replacing the manual design of neural networks for solving desired tasks. Early NAS work mainly employed reinforcement learning [1,2,4], evolutionary algorithms [5], and Bayesian optimization [6].
DARTS [3], which is based on gradient descent, is one of the bases for our research. In DARTS, the model weights w and the architecture parameters α are updated alternately using gradient descent; however, DARTS has several drawbacks owing to its training method, and studies on overcoming these limitations are ongoing. P-DARTS [7] searches architectures progressively in three stages while increasing the number of cells in order to solve the depth gap problem. The depth gap problem occurs because DARTS derives a 20-cell model from an 8-cell search network. PC-DARTS [8] decreases the memory burden by partially connecting the channels in the network; this allows the batch size to be increased fourfold, thus decreasing the search cost to one-quarter. DARTS+ [9] prevents the collapse phenomenon, in which the number of skip connections increases dramatically, by analyzing the relationship between the number of skip connections and the final architecture performance. Specifically, the number of skip connections and the number of training epochs are reduced using methods such as early stopping.
Fair DARTS [10] applies methods such as changing the activation and loss functions to prevent the over-selection of skip connections, which results from their unfair advantage in an exclusive competition. StacNAS [11] attempts to address the bi-level optimization problem of DARTS. Before training, the feature maps of the candidate operations are acquired by creating a feature-map-focused model, and then the correlation coefficients among the feature maps of the operations are computed to group the operations. Subsequently, the representative operations of each group and the operations of the winning group are used to train the model weights and architecture parameters.
The abovementioned related works can be divided into two branches according to how they overcome the shortcomings of DARTS: adopting a new architecture search algorithm [7,8,9] or changing the activation or loss functions of DARTS [10,11]. Our method belongs to the former branch, and it reuses most of the features of DARTS, such as the search space, methodology, functions, and hyperparameters.
We focused on the stage division of existing DARTS and operation selection in order to find a solution to the vote dispersion problem and to obtain reliable results with minimum changes to the existing DARTS methodology.
Unlike StacNAS, which changes the search space in advance, DG-DARTS aims to create a new search space during training, starting from the NASNet-based search space [1]. Through this approach, we can solve the vote dispersion problem by grouping operations for a new dataset and a new set of candidate operations. By solving the vote dispersion problem, we can avoid eliminating required operations. In other words, the required types of operations are selected using the weight sum per group, even though similar types of operations exist in the search space.

Vote Dispersion Problem
For architecture search, DARTS [3] generates a search network with a stack of eight cells in which every operation in the search space is combined, and this combination of operations is the mixed operation. There can be multiple instances of mixed operations in DARTS. Using the search network, the architecture parameter α of the mixed operations changes as the training of the dataset progresses. When the training is complete, the operation with the largest α is selected among k operations in the search space, where k is the number of predefined operations; k = 8 in this study. In other words, the operation of the cells is decided with the largest α.
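The mixed operation and the final selection described above can be sketched in a few lines. This is a minimal illustration, not the DARTS implementation: the three candidate operations are hypothetical scalar functions standing in for convolutions and poolings, and the α values are made up for the example.

```python
import math

# Hypothetical stand-ins for candidate operations acting on a scalar feature;
# real DARTS operations are convolutions, poolings, and skip connections.
CANDIDATE_OPS = {
    "skip_connect": lambda x: x,
    "double": lambda x: 2.0 * x,
    "negate": lambda x: -x,
}


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_operation(x, alpha):
    """Softmax-weighted sum of every candidate operation applied to x."""
    names = list(CANDIDATE_OPS)
    weights = softmax([alpha[name] for name in names])
    return sum(w * CANDIDATE_OPS[name](x) for w, name in zip(weights, names))


def select_operation(alpha):
    """Discretize an edge by keeping the operation with the largest alpha."""
    return max(alpha, key=alpha.get)


# Illustrative architecture parameters for a single edge (assumed values).
alpha = {"skip_connect": 0.1, "double": 0.7, "negate": -0.3}
print(mixed_operation(1.0, alpha))
print(select_operation(alpha))
```

During the search, every edge evaluates all candidates through `mixed_operation`; only at the end is each edge reduced to a single operation by `select_operation`.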
For the sake of the reader's convenience, Algorithm 1 for DARTS is cited from [3].
A negative phenomenon can occur: the weight of the appropriate operation can be dispersed, and an irrelevant operation can be selected for an edge. This occurs because an invalid operation can end up with a higher weight than an adequate operation when several similar, adequate operations are all considered important and the weight is divided among them. In this paper, we refer to this phenomenon as the vote dispersion problem.
Definition 1. Vote Dispersion: Votes for the weight of meaningful but similar operations happen to be dispersed, and the weight of a meaningful operation becomes lower than that of a meaningless operation.
Thus, the possibility exists that meaningless operations can be selected because of the vote dispersion problem.
Many NAS algorithms, including NASNet [1], explore a search space to determine the appropriate operations composing the cells. Examples of such operations are convolution, pooling, and skip connections. The search space contains groups of operations with similar computation outputs, such as {max pool, average pool}. Vote dispersion can occur under such conditions, and DARTS cannot avoid this problem because it uses the same search space as NASNet. In this study, our goals were to solve the vote dispersion problem experienced by DARTS, to increase the performance of the final architecture, and to decrease the search cost. In DG-DARTS, a new operation search space is created for stage 2 from the clustering results.

DG-DARTS Method
• In DG-DARTS, the total number of epochs is divided in two, and each half forms one of the two search stages.

16: Update architecture α by descending ∇_α L_val(w, α)
17: Update weights w by descending ∇_w L_train(w, α)
18: epoch = epoch + 1
19: end while
20: Derive the final architecture based on the learned α

An example of a vote dispersion problem and its resolution by DG-DARTS is as follows. Suppose that, on the edge between nodes 1 and 2, a convolutional operation minimizes the loss function. In this case, as training proceeds, the share of the architecture parameter of the convolutional operation under the SoftMax function increases, as described in Equation (2), while the shares of the remaining operations decrease.

\bar{o}_{i,j}(x_i) = \sum_{o \in O} \frac{\exp(\alpha_{i,j}^{o})}{\sum_{o' \in O} \exp(\alpha_{i,j}^{o'})} \, o(x_i) \qquad (2)

where x_i is the output of node i and α_{i,j} is the architecture parameter of the edge between nodes i and j.
If there are four types of convolution operations, the four operations split the largest weight among themselves; thus, the weight of each individual convolution operation can end up lower than that of an unrelated operation. Therefore, other operations, such as skip connections, can be selected as the final operation. Such problems become critical once operations in the search space of DARTS [3] are added or changed. For example, once a good and meaningful operation found in another study is added to the existing search space, the newly added operation takes its own share of the weights in the existing search space. If DARTS is applied to a new dataset and a new search space is composed, the ratio of the operation groups needs to be tuned by analyzing the relations among the operations; otherwise, a vote dispersion problem may occur. To resolve the vote dispersion problem, DG-DARTS was constructed in this study.
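A small numeric sketch can make the four-convolution case concrete. The α values below are illustrative assumptions, not measured values, and the names `conv_a` to `conv_d` are hypothetical placeholders for four similar convolution operations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of name -> logit."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Assumed alphas: four similar convolutions are equally useful, and a
# skip connection is only slightly ahead of any single one of them.
alpha = {
    "conv_a": 1.0,
    "conv_b": 1.0,
    "conv_c": 1.0,
    "conv_d": 1.0,
    "skip_connect": 1.2,
}

weights = softmax(alpha)

# Per-operation selection, as in DARTS: the single largest weight wins.
winner = max(weights, key=weights.get)

# Group view: the combined share of the four convolutions.
conv_share = sum(w for name, w in weights.items() if name.startswith("conv"))

print(winner)
print(round(conv_share, 3), round(weights["skip_connect"], 3))
```

In this toy setting the convolution group holds well over half of the total weight, yet the per-operation argmax picks `skip_connect`, which is the situation that Definition 1 calls vote dispersion.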

Clustering Criteria: Gradient of Architecture Parameter
Unlike previous works, DG-DARTS uses the gradient, i.e., the derivative, of the architecture parameter α to determine the relationship between operations in the search space. DG-DARTS repeats the training epochs, and α is updated in each epoch. Figures 3 and 4 show the gradient of the weight over the epochs. The gradient of α can be used as a hint to determine the relationship between operations because it varies with the training epochs. For example, if one specific edge requires a 5 × 5 filter for the convolution operation, both dil_conv_5 × 5 and sep_conv_5 × 5 gain higher weights in the search space of DARTS, while the weights of the other operations decrease. Thus, the Elkan K-means clustering algorithm [12] is introduced to cluster operations based on the gradient of the weight. Since each operation has a series of gradient changes of α with respect to the epochs, the operations other than none are treated as labels, and their gradients are treated as data. After clustering, the weight sum of each cluster is used to add the clusters with large weights to the search space O′ for use in the next stage. If the number of operations for the next stage is insufficient, the higher-ranked operations from the second-ranked cluster are also added to the search space. With this new search space O′, which is a subset of O, stage 2 is processed in the same manner as stage 1.
Here, we present a detailed example. Table 1 shows the detailed weights of the clusters for the seventh mixed operation after stage 1. According to the methods described in Section 3.2.1, the cluster numbers, operation names, architecture parameters, and the sum of the architecture parameters for each cluster are shown. In the case of DARTS, dil_conv_5 × 5 with 0.1059 is selected, and the architecture search is completed, because none, which has the largest architecture parameter, is excluded by the policy of DARTS.
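The clustering step can be sketched as follows. Two things are assumptions here: the gradient traces are invented for illustration, and the code uses a plain Lloyd-style K-means with deterministic farthest-point seeding instead of the Elkan variant [12]. Elkan's method is an acceleration of the same iteration, so for a given initialization it yields the same partition.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=300, tol=1e-9):
    """Plain Lloyd-style K-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from existing ones.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members of each cluster.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])
        shift = max(squared_distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return labels


# Assumed per-epoch gradients of alpha for one mixed operation; "none" is
# left out because it is not clustered.
gradient_traces = {
    "max_pool_3x3": [-0.02, -0.03, -0.02, -0.03],
    "avg_pool_3x3": [-0.03, -0.02, -0.03, -0.02],
    "skip_connect": [-0.02, -0.02, -0.03, -0.03],
    "sep_conv_3x3": [0.04, 0.02, 0.00, -0.02],
    "sep_conv_5x5": [0.04, 0.03, 0.00, -0.02],
    "dil_conv_3x3": [0.00, 0.02, 0.04, 0.05],
    "dil_conv_5x5": [0.01, 0.02, 0.04, 0.05],
}

names = list(gradient_traces)
labels = kmeans([gradient_traces[name] for name in names], k=3)
clusters = dict(zip(names, labels))
print(clusters)
```

With these invented traces, operations whose α moves in the same direction over the epochs fall into the same cluster, which is the property the method relies on.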
In the case of DG-DARTS, half of the operations from the search space of stage 1 are selected, and a new search space is generated for stage 2. In other words, according to the policy of DG-DARTS, four of the eight operations in the search space are selected: none, dil_conv_5 × 5, sep_conv_5 × 5, and dil_conv_3 × 3. Without grouping of the operations, sep_conv_3 × 3 might be selected because it has the third-ranked weight; in contrast, grouping of the operations discards sep_conv_3 × 3 from the search space for stage 2 according to the sum of each cluster. As mentioned in Section 3.2.1, none is excluded from clustering but is included in the selection process of the new search space.
Figure 5 shows a visualization of Figure 3b according to the clusters, along with the PCA results. The operations in the clusters are as follows: Cluster 1: max_pool_3 × 3, avg_pool_3 × 3, skip_connect; Cluster 2: sep_conv_3 × 3, sep_conv_5 × 5; Cluster 3: dil_conv_3 × 3, dil_conv_5 × 5. The PCA results show that similar operations are grouped together, in agreement with the K-means results.
Figure 6 shows a visualization of Figure 4b according to the clusters, along with the PCA results. The operations in the clusters are as follows: Cluster 1: max_pool_3 × 3, avg_pool_3 × 3, skip_connect; Cluster 2: sep_conv_5 × 5, dil_conv_3 × 3, dil_conv_5 × 5; Cluster 3: sep_conv_3 × 3. Figure 6 shows that sep_conv_3 × 3 is an outlying operation, as clearly observed in Figure 6c,d. Operation dil_conv_3 × 3 has a lower α, but its cluster has a larger sum of α; thus, it is included in the new search space per the DG-DARTS policy.
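The selection of the stage 2 search space can be sketched as below. The α values are invented so that they reproduce the ordering described in the text (none largest, sep_conv_3x3 third-ranked); they are not the values of Table 1. How none enters the selection is also an assumption: here it is treated as a group of its own that competes by its α.

```python
def select_stage2_ops(alpha, clusters, target_size):
    """Pick operations cluster by cluster, ranked by the sum of alpha."""
    groups = {}
    for op in alpha:
        # Assumption: an unclustered operation ("none") forms its own group.
        label = clusters.get(op, op)
        groups.setdefault(label, []).append(op)
    ranked = sorted(
        groups.values(),
        key=lambda ops: sum(alpha[op] for op in ops),
        reverse=True,
    )
    selected = []
    for ops in ranked:
        # Within a cluster, higher-alpha operations are taken first, so a
        # partially used cluster contributes its higher-ranked operations.
        for op in sorted(ops, key=alpha.get, reverse=True):
            if len(selected) < target_size:
                selected.append(op)
    return selected


# Assumed softmax weights for one mixed operation after stage 1.
alpha = {
    "none": 0.30,
    "dil_conv_5x5": 0.15,
    "sep_conv_3x3": 0.13,
    "sep_conv_5x5": 0.12,
    "dil_conv_3x3": 0.10,
    "max_pool_3x3": 0.08,
    "avg_pool_3x3": 0.07,
    "skip_connect": 0.05,
}

# Cluster assignment following the second example in the text.
clusters = {
    "max_pool_3x3": 1, "avg_pool_3x3": 1, "skip_connect": 1,
    "sep_conv_5x5": 2, "dil_conv_3x3": 2, "dil_conv_5x5": 2,
    "sep_conv_3x3": 3,
}

stage2_ops = select_stage2_ops(alpha, clusters, target_size=4)
top4_without_grouping = sorted(alpha, key=alpha.get, reverse=True)[:4]
print(stage2_ops)
print(top4_without_grouping)
```

Ranking by cluster sum keeps dil_conv_3x3 and drops sep_conv_3x3, whereas ranking individual weights would do the opposite.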

Benefit: Regularization and Search Cost Decrease
The major advantage of DG-DARTS is twofold. The first is the regularization of a specific operation, and the second is the reduced search cost. The benefits of DG-DARTS are obtained from the two-stage architecture search and operation clustering.
DARTS [3] shows a negative phenomenon, known as collapse [9], in which the number of skip connections increases dramatically as the epochs progress. Empirically, once the number of skip connections is greater than three, the performance of the final architecture decreases significantly [7,9]. Recent DARTS-based approaches have focused on solving the weight monopoly problem of the skip connection [7,8,10,11,13]. P-DARTS [7] applies drop-out at the level of operations to regulate how much the skip connection is trained, and the number of skip connections in the final architecture is manually limited to two. Fair DARTS [10] uses a sigmoid activation function instead of SoftMax to overcome the unfair competition that leads to the over-selection of skip connections. PC-DARTS [8] regularizes weight-free operations, such as skip connection and max-pooling, using edge normalization. DARTS+ [9] prevents excessive skip connections, based on the observation that a final architecture with two skip connections produces the best performance. To prevent additional selection of skip connections, DARTS+ limits the number of skip connections or performs an early stop using the architecture parameter; here, DG-DARTS uses two stages of 25 epochs each, which produces an effect similar to the early stopping of DARTS+.

Relationship to Previous Work
Several features of previous studies inspired this study. DG-DARTS is similar in concept to StacNAS [11] in that similar types of operations are grouped. StacNAS clusters operations based on feature maps, which are calculated for the candidate operations on the dataset before the architecture search. From the four clusters, one representative operation was selected, and the none operation was added. Five operations, max_pool_3 × 3, skip_connect, sep_conv_3 × 3, dil_conv_3 × 3, and none, compose the search space for stage 1, and the selected operations are used in stage 2. Several studies have focused on the feature map of the operations to improve the search space, and all of them require pre-calculation to obtain the feature map [10,11,13].
In comparison, DG-DARTS does not require pre-calculation of the feature map; therefore, DG-DARTS requires no additional calculation time. Both the relationships between the positions of the operations and the relationships of the operations inside the cell are considered because clustering in DG-DARTS is performed for every edge, i.e., for every mixed operation. Therefore, the dynamic nature of DG-DARTS helps obtain dynamic, situation-oriented clusters rather than predefined ones. For example, mixed operation number 0 may be clustered by operation type (sep_conv versus dil_conv), whereas mixed operation number 13 may be clustered by filter size (3 × 3 versus 5 × 5).
Dividing the search into stages to reduce the search space is one of the ideas of P-DARTS [7]. P-DARTS has three stages, each with 25 epochs, and the number of cells in the search network increases with each stage, thus solving the depth-gap problem. In this process, a negative phenomenon can occur in which the skip connection is over-selected, eventually leading to poor performance of the model. To regularize this phenomenon, P-DARTS introduces drop-out at the level of operations and limits the number of skip connections to two when selecting the final architecture. DG-DARTS regularizes skip connections through operation clustering, so collapse [9] can be prevented without additional work.

Experimental Environment and Data Set
CIFAR10 [14] was the base dataset used in our experiment. The CIFAR10 dataset contains 60,000 images spanning 10 categories with a resolution of 32 × 32 pixels. Of these images, 50,000 are designated for training and 10,000 for testing. In this study, 25,000 of the training images were used for network weight training and the other 25,000 for architecture parameter training.
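The split of the training set described above can be sketched as follows. Only the 25,000/25,000 sizes come from the text; the random shuffle, the seed, and the function name `split_training_indices` are assumptions made for illustration.

```python
import random

NUM_TRAIN_IMAGES = 50_000  # size of the CIFAR10 training set


def split_training_indices(num_images, seed=0):
    """Split image indices into two disjoint halves.

    The first half is meant for training the network weights w and the
    second half for training the architecture parameters alpha.
    """
    indices = list(range(num_images))
    # Assumption: the halves are drawn by a seeded random shuffle.
    random.Random(seed).shuffle(indices)
    half = num_images // 2
    return indices[:half], indices[half:]


weight_indices, alpha_indices = split_training_indices(NUM_TRAIN_IMAGES)
print(len(weight_indices), len(alpha_indices))
```

Keeping the two halves disjoint means that the architecture parameters are updated on images the weights were not fitted to.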

Implementation Detail
The architecture parameter α is the selection criterion. Every parameter used in this experiment was the same as that of DARTS, except for the batch size and epochs. The search space O is also the same as that of DARTS; in other words, DG-DARTS uses eight operations (none, max_pool_3 × 3, avg_pool_3 × 3, skip_connect, sep_conv_3 × 3, sep_conv_5 × 5, dil_conv_3 × 3, and dil_conv_5 × 5), which are obtained from the operation search space of NASNet [1]. The batch size was set to 96. The two stages of 25 epochs each give a total of 50 epochs, which is the same as the 50 epochs of a single search in DARTS. The search network initially had 16 channels and eight cells, consisting of six normal cells and two reduction cells. To train the network weights, the stochastic gradient descent (SGD) optimizer [15] was used with an initial learning rate of 0.025, a momentum of 0.9, and a weight decay of 3 × 10^−4. To train the architecture parameters, the Adam optimizer [16] was used with an initial learning rate of 3 × 10^−4, momentum of (0.5, 0.999), and a weight decay of 0.001.
Clustering is performed after stage 1 is finished and is used to solve the vote dispersion problem. The Elkan K-means clustering algorithm [12] was used to cluster operations with an initial iterator of 30, a maximum iterator of 300, and a tolerance of 10^−4. With these clustering parameters, clustering was performed for 1 to 4 clusters. Table 2 shows the search cell performance for each number of clusters. When the number of clusters was four, too many skip connections were selected, so the number of skip connections was set to two. In the other cases, with 1 to 3 clusters, skip connections were well regulated. As shown in Table 3, the best case is found when the number of clusters is three. The architecture search consumed 0.22 GPU days (5.35 h) on one Tesla P100 GPU.
Based on the number of clusters, the final model performance can be described as follows. One cluster means that all operations are in the same cluster, which gives the same result as DARTS searching the architecture twice with 25 epochs each.

As the number of clusters increases, the final model performance improves, and we found the best performance when the number of clusters was three; with more clusters, the result approaches the same structure as DARTS, with unregulated operations.
For example, seven clusters of operations create the same structure as that of DARTS without the effect of clustering.
With four clusters, as we experienced, skip_connect is not well regulated; thus, too many skip_connect operations were selected, which caused the collapse phenomenon [9]. It is clear that models with five skip_connect operations out of eight will perform poorly, so two skip_connect operations were selected manually, as is done in P-DARTS [7].
This model from DG-DARTS showed a test error of 2.71%, whereas the three-cluster model achieved a test error of 2.51%, which is the best result.
We provide another result of our experiment for better verification of our strategy. Table 4 shows the effect of α. In our research, cluster 2, which has the largest sum of α, was selected for the next stage. Instead of choosing operation sep_conv_3 × 3 from cluster 2, operations from clusters 1 and 3 were chosen to prove our strategy. Operation skip_connect from cluster 1 had an accuracy of 97.06%, and operation dil_conv_3 × 3 from cluster 3 had an accuracy of 96.94%. Operation sep_conv_3 × 3 from cluster 2 achieved the best accuracy of 97.49%, as intended by our strategy. An example cell architecture found by DG-DARTS for the CIFAR10 dataset is shown in Figure 7.

Discussion
Here, we discuss the outcome of the evaluation of the architecture. The evaluation results are summarized in Table 2. DG-DARTS has a test error of 2.51% on the CIFAR10 [14] dataset. ProxylessNAS [19] uses a different search space from NASNet [1] and incurs a large GPU cost. StacNAS [11] requires pre-calculation and uses a search network of 17 cells on a high-performance GPU, whereas DARTS [3] and DG-DARTS have an eight-cell structure. In the case of P-DARTS [7], the number of cells is increased over three stages (for example, 5, 11, and 17), and additional regularization is required to limit the skip connections. In contrast, DG-DARTS uses a search space and architecture of the same size as DARTS. In terms of calculation time, with clustering, DG-DARTS consumes 0.22 GPU days, without additional work or methodology. In other words, compared to DARTS, we applied minimal changes in DG-DARTS. Thus, DG-DARTS has sufficient potential for the NAS of AutoML, with a test error of 2.51%. This is due to the reduction in total computation achieved by solving the vote dispersion problem.

Conclusions
In this study, we solved one of the possible problems faced by DARTS [3], namely vote dispersion. With the proposed DG-DARTS, the total amount of computation is reduced compared to DARTS, which significantly lowers the GPU days while reasonably increasing accuracy. The vote dispersion problem, which is latent in the DARTS methodology, is solved by grouping the operations of the search space based on the gradient of the architecture parameter α over the training epochs. Through the weights of the grouped operations and the resulting selections, useful operations that are discarded by DARTS survive and are used. With this AutoML approach, without manually changing the size of the search network, the test accuracy is increased to 97.49% on the CIFAR10 dataset [14]. For the same number of training epochs, the search cost is lower than that required for DARTS because DG-DARTS uses fewer operations in stage 2. In summary, DG-DARTS required 0.22 GPU days in comparison to the 1.5 GPU days required for DARTS, representing a roughly seven-fold increase in speed.
Our future research will include experimental results on different datasets, such as CIFAR100 [14] and ImageNet [24], to verify the effect of DG-DARTS on such datasets. In addition, we will apply DG-DARTS to other types of models, such as graph convolution [25] and RNN [3], to determine its effectiveness for search spaces with other types of operations. The application domains of DG-DARTS include natural language processing (NLP), EdgeML, and so on, with the concise models generated by DG-DARTS. For example, machine learning on edge devices may require simpler models given the restrictions of computational resources and communication bandwidth.
Even though ARM Cortex processors, which are usually used for edge devices, have relatively high computation power, it is still impossible for edge devices to match high-performance GPUs; therefore, restrictions on computational power are still a problem for machine learning on edge devices. Additionally, low-power and low-bandwidth network technologies for IoT devices, such as LoRa, place another restriction on communication between edge devices. The simpler models generated by DG-DARTS provide one solution to these environmental restrictions.
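The headline figures reported in this paper can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are chosen for illustration.

```python
# Search costs in GPU days, as reported in the text.
dg_darts_gpu_days = 0.22
darts_gpu_days = 1.5

# Speed-up of DG-DARTS over DARTS.
speed_up = darts_gpu_days / dg_darts_gpu_days

# Relative search cost, with the measured value and with the value
# rounded to 0.2 GPU days as quoted in the abstract.
cost_ratio_measured = dg_darts_gpu_days / darts_gpu_days
cost_ratio_rounded = 0.2 / darts_gpu_days

# Test accuracy corresponding to the reported test error of 2.51%.
test_accuracy = 100.0 - 2.51

print(round(speed_up, 2))
print(round(cost_ratio_measured * 100, 1), round(cost_ratio_rounded * 100, 1))
print(round(test_accuracy, 2))
```

The speed-up of 6.82 follows from the measured cost of 0.22 GPU days, while the 13% relative cost quoted in the abstract corresponds to the rounded cost of 0.2 GPU days; with 0.22 GPU days the ratio is closer to 15%.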