NDARTS: A Differentiable Architecture Search Based on the Neumann Series



Introduction
Neural networks have seen great success in many different areas due to their powerful feature extraction ability, including machine translation [1,2], image recognition [3,4], and object detection [5,6]. Despite their success, neural networks remain hard to design, requiring substantial expert knowledge and computational time [7,8]. Manually designing the structure of a neural network is a trial-and-error process that is time-consuming, labor-intensive, and demands a large amount of computational resources. Recently, there has been growing interest in Neural Architecture Search (NAS) [9-11], which aims to automate the neural architecture design process. NAS can be divided into three parts: the search space, the search strategy, and the performance estimation strategy. The search space defines which architectures can be represented in principle. The search strategy details how to explore the search space. The performance estimation strategy defines how to judge which architectures perform well.
Many different search strategies can be used to explore the space of neural architectures, including random search, reinforcement learning (RL) [12-14], evolutionary algorithms (EA) [15-19], Bayesian optimization (BO) [20-22], and gradient-based methods [23-25]. In RL-based methods, the choice of a component of the architecture is regarded as an action. A sequence of actions defines the architecture of a neural network, whose validation set accuracy is used as the reward. In the original paper [10], the REINFORCE algorithm was used to estimate the parameters of a recurrent neural network (RNN), which represents a policy to generate a sequence of symbols (actions) specifying the structure of a CNN; the reward function was the classification accuracy on the validation set of the CNN generated from this sequence. Zoph et al. [12] extended this by using a more structured search space, in which the CNN was defined in terms of a series of stacked "cells".
An alternative to RL is to use an EA. In [15], an aging mechanism was introduced, making the method more inclined to choose younger and better-performing structures during the evolution. This ensures diversity and the survival of the fittest in the evolutionary process, which is called aging evolution. In EA-based methods, the search is performed through mutations and recombinations of architectural components, and the architectures with better performance are selected to continue evolving. Most Bayesian optimization methods use tree-based models and Monte Carlo Tree Search to explore the architecture space effectively.
Despite the ability of these methods to learn network structures that outperform manually designed architectures, they are often plagued by high computational complexity and long search times. Additionally, due to the discrete nature of their search space, these methods can only be optimized indirectly. As a result, the entire network search stage resembles a black-box optimization process, which necessitates the evaluation of a considerable number of networks. This inefficiency leads to a significant waste of both time and computational resources.
Rather than conducting a search over a discrete set of candidate architectures, gradient-based methods convert the discrete search into an optimization problem in a continuous space. This transformation allows gradient descent to explore architectures efficiently within the continuous search space.
In contrast to RL and EAs, gradient-based search methods operate within a continuous space to seek out architectures, thereby enhancing the overall efficiency of the process. Cai et al. [26] proposed the ProxylessNAS method for different tasks and neural network structures, using fully parameterized hyper-networks and binary neural network path structures to reduce hardware computing resource consumption. Zela et al. [27] proposed the R-DARTS algorithm, which improves the robustness of the DARTS algorithm through data augmentation and L2 regularization. Chen et al. proposed the SDARTS [28] and P-DARTS algorithms; the former uses random smoothing and adversarial training to improve robustness, while the latter uses search space approximation to reduce computational resource consumption and increase search stability. The SNAS algorithm proposed by Xie et al. [29] generates subnetworks through random sampling without retraining all the model parameters during the evaluation phase. Xu et al. [25] proposed the PC-DARTS algorithm, which uses channel sampling to reduce the required storage space and edge normalization to improve search stability. DARTS+ [30] prevents the collapse phenomenon, in which the number of skip connections increases dramatically, by analyzing the relationship between the number of skip connections and the final architecture's performance; the number of skip connections and the number of training epochs are reduced using methods such as early stopping. Hou et al. [31] proposed Single-DARTS, which merely uses single-level optimization, updating network weights and architecture parameters simultaneously with the same data batch. In [32], the authors proposed Self-Distillation Differentiable Neural Architecture Search (SD-DARTS) to alleviate the discretization gap, utilizing self-distillation to distill knowledge from previous steps of the supernet to guide its training in the current step, effectively reducing the sharpness of the supernet's loss and bridging the performance gap between the supernet and the optimal architecture.
In 2018, Liu et al. proposed Differentiable Architecture Search (DARTS) [23], which searches for neural network structures based on gradient descent, significantly improving the speed of neural architecture search and showing outstanding performance through continuous relaxation. Through the continuous relaxation of the architecture, the search for a neural network's architecture is transformed into a search for its weight coefficients. Because the resulting objective function is differentiable, gradient-based methods can explore such architectures efficiently. Despite the potential benefits of this approach, the DARTS algorithm has encountered several issues, such as high computational demands, performance gaps between discrete subnetworks and hyper-networks, and an unstable search process. In light of these challenges, this paper aims to enhance the efficiency and performance of the DARTS algorithm by building upon its foundations. Specifically, we propose a novel method named NDARTS, which expands the super-gradient with the Neumann series and approximates it using finitely many terms, based on the representation of the super-gradient given by the Implicit Function Theorem. Our empirical results demonstrate that NDARTS outperforms the baseline algorithm, DARTS, in terms of gradient approximation performance.
DARTS [23], based on gradient descent, is one of the foundations of our research. In DARTS, the model weights ω and the architecture parameters α are optimized alternately by gradient descent. This article makes several improvements to DARTS and achieves good results.
The main contributions of our article are as follows:
• Our proposed method, named NDARTS, utilizes the Neumann series to expand the super-gradient and approximates it with finitely many terms, based on the representation of the super-gradient given by the Implicit Function Theorem. Our experimental results demonstrate that NDARTS outperforms the baseline algorithm, DARTS, in terms of gradient approximation performance.
• We use an improved evolutionary strategy for weight optimization between architecture search steps, which suits the weight-sharing hyper-network design better than gradient methods. The training procedure uses small batches of samples to reduce the computational complexity of the training process.
The rest of this article is organized as follows: Section 2 briefly reviews the DARTS algorithm, introduces our method, which optimizes the DARTS update steps based on the Neumann series, and then presents a convergence analysis. In Section 3, we first conduct ablation experiments in the NAS-Bench-201 search space to study the influence of the parameters on the NDARTS algorithm and to determine the optimal parameters for the performance tests. Finally, the performance of the algorithm is tested on the CIFAR-10, CIFAR-100, and ImageNet datasets and compared with other NAS algorithms.

Overview: DARTS
We first briefly review DARTS, the basis of our proposed architecture search method. Following the NASNet search space [12], DARTS searches for a computation cell as the building block of the final architecture; the overall network is obtained by stacking two types of cells: a normal cell that returns a feature map of the same dimension and a reduction cell that halves the size of the input feature maps. A cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes. Each node x_i is a latent representation (e.g., a feature map in convolutional networks). The first two nodes x_0 and x_1 in each cell are defined as input nodes, and they receive the outputs of the preceding cells as input. The last node is the output of the current cell, which concatenates the outputs of all intermediate nodes: x_{N−1} = concat(x_0, ..., x_{N−2}). A directed edge e^(i,j), 0 < j < i < N − 1, indicates an operation that transforms (convolution, pooling, etc.) the feature representation. We use O to denote the set of candidate operations (e.g., convolution, max pooling, zero), where each operation represents some function o to be applied to x_i.
Each intermediate node is computed based on all of its predecessors:

x_j = Σ_{i<j} o^(i,j)(x_i).

To make the search space continuous, DARTS relaxes the categorical choice of a particular operation to a softmax over all possible operations:

ō^(i,j)(x) = Σ_{o∈O} [exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j))] · o(x),

where the operation mixing weights for a pair of nodes (i, j) are parameterized by a vector α^(i,j) of dimension |O|. That is, each edge carries a mixed operation parameterized by α^(i,j) instead of a single selected operation. After relaxation, the task of architecture search reduces to learning a set of continuous variables α = {α^(i,j)}. DARTS uses bi-level optimization to jointly learn the architecture parameters α and the weights ω within all the mixed operations (e.g., the weights of the convolution filters).
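To make the relaxation concrete, here is a minimal PyTorch-style sketch of such a mixed edge; the class name `MixedOp` and the constructor argument are illustrative and not taken from the original DARTS implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge (i, j): a softmax-weighted mixture of all candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)                     # the candidate set O
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))  # α^(i,j), one entry per operation

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)                      # continuous relaxation of the choice
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After the search, each edge is discretized by keeping the operation with the largest α entry, as in DARTS.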
DARTS alternately optimizes the architecture α and the weights ω on the validation set and the training set, respectively.
The process of DARTS is as follows:
1. Create a mixed operation ō^(i,j) parameterized by α^(i,j) for each edge (i, j), and initialize the hyper-network weights ω and the architecture weights α;
2. Update α on the validation dataset D_val by descending ∇_α L_val(ω, α), and update ω on the training dataset D_train by descending ∇_ω L_train(ω, α);
3. Stop when the termination condition is met; otherwise return to step 2.
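A minimal sketch of one alternating step, assuming `w_optimizer` holds only the weights ω and `a_optimizer` holds only the architecture parameters α (first-order variant, for brevity):

```python
def darts_search_step(model, w_optimizer, a_optimizer, train_batch, val_batch, criterion):
    """One DARTS iteration: update α on validation data, then ω on training data."""
    # Architecture step: descend ∇_α L_val(ω, α)
    x_val, y_val = val_batch
    a_optimizer.zero_grad()
    criterion(model(x_val), y_val).backward()
    a_optimizer.step()

    # Weight step: descend ∇_ω L_train(ω, α)
    x_train, y_train = train_batch
    w_optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()
    w_optimizer.step()
```

The second-order DARTS variant instead evaluates the validation gradient at the virtual weights ω − ξ∇_ω L_train(ω, α), which requires one extra inner update per architecture step.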

DARTS Optimization Algorithm Based on Neumann Series Approximation (NDARTS)

NDARTS
We perform a one-step expansion of ∇_α L_2 based on the Implicit Function Theorem (we denote by L_1 and L_2 the training loss L_train and the validation loss L_val, respectively). There has already been some research in related fields; Simonyan et al. [16] used this approach to optimize neural network architecture search algorithms.
When ω* is the optimal weight of the model obtained on the training set, the inner optimality condition ∂L_1/∂ω (ω*(α), α) = 0 holds, and the Implicit Function Theorem gives

∂ω*(α)/∂α = −[∂²L_1/∂ω∂ω]^{-1} · ∂²L_1/∂ω∂α.

Based on this consequence, the super-gradient ∇_α L_2 can be represented as:

∇_α L_2 = ∂L_2/∂α − ∂L_2/∂ω · [∂²L_1/∂ω∂ω]^{-1} · ∂²L_1/∂ω∂α.

According to the Neumann series, for a matrix A with ||I − A|| < 1, we have

A^{-1} = Σ_{k=0}^{∞} (I − A)^k.

So, we can formulate the super-gradient as:

∇_α L_2 = ∂L_2/∂α − γ · ∂L_2/∂ω · Σ_{k=0}^{∞} (I − γ ∂²L_1/∂ω∂ω)^k · ∂²L_1/∂ω∂α,

where γ is the learning rate of the inner (weight) updates. Taking the first K terms of this series as an approximation, we derive:

∇̂_α L_2 = ∂L_2/∂α − γ · ∂L_2/∂ω · Σ_{k=0}^{K} (I − γ ∂²L_1/∂ω∂ω)^k · ∂²L_1/∂ω∂α.

In summary, this article proposes the NDARTS algorithm, which involves the following steps:
1. Create a mixed operation ō^(i,j) parameterized by α^(i,j) for each edge (i, j), and initialize the hyper-network weights ω and the architecture weights α;
2. Update ω for T steps on the training dataset D_train by descending ∇_ω L_1(ω, α), then update α on the validation dataset D_val using the K-term approximation ∇̂_α L_2 (see the sketch after this list);
3. Stop when the termination condition is met; otherwise return to step 2.
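The K-term approximation above can be computed with Hessian-vector products only, without forming the Hessian explicitly. The following is a minimal PyTorch-style sketch, under the assumption that `w_params` and `alpha_params` are lists of the weight and architecture tensors; the function name is illustrative:

```python
import torch

def neumann_supergradient(train_loss, val_loss, w_params, alpha_params, gamma, K):
    """Truncated-Neumann approximation of
    ∇_α L_2 ≈ ∂L_2/∂α − γ · ∂L_2/∂ω · Σ_{k=0}^{K} (I − γ ∂²L_1/∂ω∂ω)^k · ∂²L_1/∂ω∂α."""
    # Direct term ∂L_2/∂α
    direct = torch.autograd.grad(val_loss, alpha_params,
                                 retain_graph=True, allow_unused=True)

    # v ← ∂L_2/∂ω, and the running Neumann sum p = Σ_k (I − γH)^k v
    v = list(torch.autograd.grad(val_loss, w_params, retain_graph=True))
    p = [vi.clone() for vi in v]

    grad_w = torch.autograd.grad(train_loss, w_params, create_graph=True)
    for _ in range(K):
        # Hessian-vector product Hv = (∂²L_1/∂ω∂ω) v via a second backward pass
        Hv = torch.autograd.grad(grad_w, w_params, grad_outputs=v, retain_graph=True)
        v = [vi - gamma * hvi for vi, hvi in zip(v, Hv)]
        p = [pi + vi for pi, vi in zip(p, v)]

    # Mixed second-derivative term (∂²L_1/∂α∂ω) applied to p (vector-Jacobian product)
    mixed = torch.autograd.grad(grad_w, alpha_params, grad_outputs=p, allow_unused=True)

    supergrad = []
    for d, m in zip(direct, mixed):
        d = 0.0 if d is None else d
        m = 0.0 if m is None else m
        supergrad.append(d - gamma * m)
    return supergrad
```

Each additional term of the series costs one extra backward pass, which is why larger K increases the computational cost discussed in the ablation study.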

Proof
We provide some necessary assumptions, which are relatively easy to satisfy, before proving the convergence of NDARTS: (1) the function ω: α → ω(α) is Lipschitz continuous with Lipschitz constants L_ω > 0 and L_{∇_α ω} > 0; (3) for any ω and α, L_2(ω, ·) and L_2(·, α) are bounded and Lipschitz continuous with the corresponding Lipschitz constants, where α_m and γ_{α_m} are the architecture parameters and the learning rate of the architecture parameters at the m-th iteration, respectively.
We first prove two lemmas.

Lemma 1. Assuming L_1 is twice differentiable and µ-strongly convex in the parameter ω, the difference between the approximate value ∇̂_α L_2 and the value ∇_α L_2 used in DARTS satisfies:

Lemma 2. Assuming assumptions (1)-(4) are satisfied, the gradient ∇_α L_2 is Lipschitz continuous with constant L_{∇_α L_2}.

Proof of Lemma 1. Since L_1 is µ-strongly convex and γµI ⪯ γ ∂²L_1/∂ω∂ω, the sum of the right-hand series is bounded. And, as ∂L_2/∂ω and ∂²L_1/∂α∂ω are bounded, there exist constants C¹_{L_ωα} and C²_{L_ω} such that the stated bound holds.

Proof of Lemma 2.
For the left half of the right-hand side of the equation, according to assumptions (1), (3), and (4), we obtain the corresponding bound; for the right half, according to assumptions (1) and (3), we obtain the remaining bound. In summary, Lemma 2 holds.

Proof of Theorem 1.
where, according to Lemma 2, we know that L_{∇_α L_2} exists and is bounded.
In that way, according to Lemma 1, for any ε, Equation (22) can be rewritten accordingly. In the equation, selecting a small learning rate γ_{α_m} < 1 − P yields the desired descent inequality. Since the learning rate is positive, 1 − P should also be a positive number, which can be achieved by adjusting γ and K. At this point, it can be seen from the recursive formula that, as α_m iterates, L_2 decreases; since the loss function L_2 is bounded, L_2 is convergent. Taking the difference of the recursive equation and using the boundedness of L_2, and according to the assumptions, the corresponding limit as m → ∞ exists. In summary, we have proven the convergence of NDARTS.

Evolutionary Strategy
Before the experiments, we first introduce an improved evolutionary strategy, which is more effective than gradient descent for updating the parameters of the neural network, and which we will use to update the weight parameters during the architecture search.
The basic process of the (µ + λ) evolution strategy is to establish an initial population POP_0 at the beginning of the search, which contains µ individuals. Starting from the initial population, a series of populations is computed iteratively. In each iteration, λ children are generated from the current generation POP_iter. Each descendant is generated by a three-step calculation (see the sketch after this list):
1. Select two individuals as parents for recombination from the current generation POP_iter; the choice of parents is unbiased.
2. Generate a new individual through the recombination of the selected parents.
3. Perform mutation and evaluation on the new individual.
At the end of the iteration, select the µ best individuals from the set of λ offspring and µ parents to form the new generation POP_iter+1.
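A minimal numerical sketch of the (µ + λ) strategy described above, assuming individuals are NumPy parameter vectors and `fitness` is a loss to be minimized; the arithmetic recombination and Gaussian mutation are illustrative choices:

```python
import random
import numpy as np

def mu_plus_lambda_es(fitness, init_population, mu, lam, generations, sigma=0.1):
    """(µ+λ) evolution strategy: recombine two unbiasedly chosen parents,
    mutate, evaluate, then keep the µ best of parents and offspring."""
    pop = list(init_population)                                 # POP_0 with µ individuals
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            p1, p2 = random.sample(pop, 2)                      # step 1: unbiased parent choice
            child = 0.5 * (p1 + p2)                             # step 2: recombination
            child = child + sigma * np.random.randn(*child.shape)  # step 3: mutation
            offspring.append(child)
        pop = sorted(pop + offspring, key=fitness)[:mu]         # keep the µ best individuals
    return pop[0]
```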
We use the OpenAI-ES [33] evolutionary strategy proposed by OpenAI and add to the population, as an additional individual, the candidate obtained by moving along the estimated gradient direction ω + α·(1/(nσ))·Σ_{i=1}^{n} L_i ε_i, together with its fitness. The corresponding optimization algorithm steps are as follows.
Given the parameters ω of the neural network, the loss function L, the learning rate α, and the noise standard deviation σ, the optimization of the neural network parameters based on the evolutionary strategy proceeds as follows (a code sketch is given after this list):
1. Sample n noise vectors ε_1, ..., ε_n and calculate the loss value corresponding to each perturbed parameter, L_i = L(ω + σε_i);
2. Add the estimated gradient direction ω_0 = ω + α·(1/(nσ))·Σ_{i=1}^{n} L_i ε_i and the corresponding fitness value L(ω_0) to the population;
3. Stop when the termination condition is met; otherwise return to step 1.
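A minimal sketch of one step of this procedure, following the update direction ω + α·(1/(nσ))·Σ_i L_i ε_i stated above; the function name and the NumPy representation of ω are assumptions:

```python
import numpy as np

def openai_es_step(loss, w, alpha, sigma, n):
    """One step of the improved OpenAI-ES: sample n noise vectors, evaluate the
    perturbed parameters, and form the candidate ω_0 along the estimated
    gradient direction, to be added to the population with its fitness."""
    eps = np.random.randn(n, w.size)                      # ε_i ~ N(0, I)
    L = np.array([loss(w + sigma * e) for e in eps])      # L_i = L(ω + σ ε_i)
    direction = (eps * L[:, None]).sum(axis=0) / (n * sigma)
    w0 = w + alpha * direction                            # estimated gradient direction
    return w0, loss(w0)                                   # candidate ω_0 and its fitness L(ω_0)
```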
In comparison with the gradient descent algorithm, OpenAI-ES is more broadly applicable because it avoids computing true gradients. Compared with pure Monte Carlo gradient-based optimization, it incurs higher computational costs but makes effective use of the sampled data to improve performance. By incorporating the Monte Carlo gradient candidate into the population, the basic evolutionary strategy can also achieve improved convergence speed.

Simulation Experiment
The performance testing of the NDARTS algorithm and its comparison with other algorithms will be divided into two main parts for experimentation.
First, we will conduct performance experiments of the NDARTS algorithm within the DARTS search space.Subsequently, we will compare the obtained experimental results with those of other algorithms operating within their respective search spaces.
The second experiment compares the NDARTS algorithm with other algorithms operating within the NAS-Bench-201 search space [34], using the same search space and architecture evaluation criteria. To ensure a fair comparison, experiments in both search spaces are conducted on the CIFAR-10, CIFAR-100, and ImageNet datasets. In the NAS-Bench-201 search space, the NDARTS algorithm employs the same cell structure and policy evaluation as the other algorithms, enabling a relatively fair comparison. However, the cell structure of the NAS-Bench-201 search space is relatively simple compared with the DARTS search space, which has more nodes, operation types, and cell types. Therefore, the performance of the NDARTS algorithm within the DARTS search space may better represent its optimal performance on these datasets.

Datasets
1. CIFAR-10 Dataset. The CIFAR-10 [35] dataset is one of the most popular public datasets in current neural architecture search work. CIFAR-10 is a small-scale image classification dataset proposed by Alex Krizhevsky in 2009. As shown in Figure 1, there are 10 categories of data, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each category has 6000 images, each of which is a 32 × 32 RGB image. The entire CIFAR-10 dataset consists of 60,000 images, of which 50,000 are used for training and 10,000 for testing.
2. CIFAR-100 Dataset. As shown in Figure 2, the CIFAR-100 [35] dataset is similar to the CIFAR-10 dataset, except that it has 100 classes. The 100 categories in CIFAR-100 are divided into 20 major categories. Each image comes with a "fine" label (the class it belongs to) and a "coarse" label (the major category it belongs to). During the experiments, the training set of CIFAR-10 and CIFAR-100 is randomly divided into two groups: one group is used to update the weight parameters, while the other serves as a validation set for updating the architecture parameters. This division is conducted for each category within the training set.
3. ImageNet Dataset. ImageNet [36] is an image dataset organized according to the WordNet hierarchy, where each node in the hierarchy is depicted by hundreds or thousands of images; examples from the ImageNet dataset are shown in Figure 3. At present, there is an average of over 500 images per node, with a total of more than 10 million images and 1000 recognition classes. Compared with the CIFAR-10 and CIFAR-100 datasets, ImageNet has more images, higher resolution, more categories, and more irrelevant noise and variation in the images, so the recognition difficulty far exceeds that of CIFAR-10 and CIFAR-100.

Algorithm Settings
1. When optimizing the neural network weight parameters by gradient descent, the optimization of the shallower parameters relies on layer-by-layer backpropagation starting from the output layer, while the optimization of the deeper parameters also relies on the feature data corresponding to the output values of the shallower layers. Once subnetworks with different structures are obtained through such training, it is difficult to maintain the dependency between shallow and deep layers, which can easily lead to significant deviations. During training with an evolutionary algorithm, each weight parameter within the neural network holds a unique and independent position, which allows each parameter to be trained and optimized individually to a certain extent. Different subnetworks can then obtain better evaluation results when inheriting parameters from the super-network. Therefore, when conducting experiments in the DARTS search space, the improved evolutionary strategy of Section 2.3 is chosen to optimize the network weight parameters.
2. Each epoch is trained using a random small batch of samples [37], which reduces the computational complexity of each epoch while maintaining the training effect. At the same time, the algorithm can break through the sample-size limit and be extended to large data volumes.

Search Space
1. DARTS search space.
The DARTS algorithm, as a fundamental gradient-based neural architecture search approach, contains a substantial number of nodes and operation types within its building block (cell). The resulting network structure is obtained by stacking two types of structural cells, leading to a high architectural complexity. Consequently, the DARTS search space has the potential to produce network models with superior performance. Currently, the majority of gradient-based methods are evaluated within the DARTS search space and subsequently compared with other algorithms.
The cell structure of the DARTS search space is shown in Figure 4, where c_{k−2}, c_{k−1}, and c_k represent the outputs of cells k − 2, k − 1, and k, respectively. In the k-th cell, there are four nodes between the outputs of the two preceding cells and the output of the k-th cell. Each edge represents a candidate operation; there are eight types of operations, including dilated separable convolutions of 3 × 3 and 5 × 5, depthwise separable convolutions of 3 × 3 and 5 × 5, 3 × 3 average pooling, 3 × 3 max pooling, the identity operation, and the zero operation. The zero operation indicates that there is no connection between two nodes, while the identity operation passes the data from the previous node directly to the next node. The entire network structure is composed of eight cells, which are divided into normal cells and down-sampling cells. In a down-sampling cell, the first two nodes are connected to the other nodes through pooling operations. The network consists of six normal cells and two down-sampling cells, which are located at one-third and two-thirds of the depth of the network.
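For reference, the eight candidate operations listed above can be written as a simple configuration list; the identifier strings below are illustrative, not the original implementation's keys:

```python
# The eight candidate operations of the DARTS search space described above
# (string identifiers are hypothetical, used only for illustration).
CANDIDATE_OPS = [
    "sep_conv_3x3",   # depthwise separable convolution 3x3
    "sep_conv_5x5",   # depthwise separable convolution 5x5
    "dil_conv_3x3",   # dilated separable convolution 3x3
    "dil_conv_5x5",   # dilated separable convolution 5x5
    "avg_pool_3x3",   # average pooling 3x3
    "max_pool_3x3",   # max pooling 3x3
    "skip_connect",   # identity operation
    "none",           # zero operation (no connection)
]
```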
2. NAS-Bench-201 search space. In current research, a growing number of NAS algorithms have been proposed. Despite their common theoretical groundwork, these algorithms differ significantly in many respects, including their search spaces, the training strategies used for architecture evaluation, and the methods used to split validation sets. These differences make it difficult to compare the performance of different NAS algorithms. For this reason, the creators of NAS-Bench-201 devoted substantial computational resources to traversing and evaluating the neural network architectures within a fixed, designed search space on several datasets. Subsequent experiments on the NAS-Bench-201 search space [34] can therefore obtain evaluation results through table queries without the need for retraining.

Ablation Experiment
We first conducted ablation experiments in the NAS-Bench-201 search space to investigate the impact of the parameters T, K, γ, and γ_α on the performance of NDARTS and to determine the optimal experimental parameters. The ablation experiments used CIFAR-10 as the training dataset, and a pre-trained model was used to reduce the computational cost. The effects of the different parameters were analyzed based on the results of 30 epochs.
1. T. The parameter T is the number of update steps of the weight parameter ω within each update interval of the architecture parameter α. In theory, the larger T is, the more steps ω is updated between consecutive updates of α, and the closer ω gets to the optimal value ω*. The better ω is, the better the super-network performs when evaluating architectures, and the more accurate the evaluation of the subnetwork architectures, which helps the algorithm obtain a more accurate super-gradient estimate. However, the computational cost of the algorithm also increases with T. According to the experimental results, T = 4 achieves a good balance between computational cost and model performance.
The experimental results indicate that, as T increases, the performance of the algorithm gradually improves. When T = 1, both NDARTS and DARTS optimize the weight parameter ω only once within each update interval of the architecture parameter α; the only difference is that NDARTS uses ∇_α L_2(ω − ξ∇_ω L_1(ω(α), α), α), while DARTS uses ∇_α L_2(ω(α), α) to update the parameter α.
The experimental results are shown in Figures 6 and 7. It can be seen that, even at T = 1, NDARTS finds a better architecture with faster convergence and higher stability than the benchmark algorithm DARTS. When T increases from 1 to 4, the algorithm achieves an even better convergence speed and stability.

2. K. The parameter K is the number of terms retained in the truncated approximation used in the optimization formula. The larger K is, the smaller the error between the approximate gradient ∇̂_α L_2 and the true gradient ∇_α L_2. Therefore, in theory, the performance of the neural network should improve as K increases.
From the experimental results in Figures 8 and 9, it can be seen that when K increases from 0 to 2, the accuracy of the model found by the algorithm increases from 90.36% to 93.58%, and the stability of the search increases with K. Another conclusion is that K = 2 is already large enough, so that for K ≥ 3 the algorithm performance cannot be further improved. When K increases beyond 2, the stability of the algorithm still shows an increasing trend, but at K = 2 the algorithm has already reached good stability.

3. γ and γ_α. The calculation of NDARTS is based on the approximation of ∇_α L_2 by the Neumann series, which implicitly requires the learning rate γ to be small enough that ||I − γ ∂²L_1/∂ω∂ω|| < 1. As shown in Figure 10, when γ = 0.001 and 0.005, the algorithm maintains good accuracy and stability. When γ = 0.01, the algorithm performance decreases slightly, with the accuracy dropping from 93.58% to 92.02%, while the stability remains good. When γ = 0.025, the accuracy of the algorithm is 92.17%, but the stability of the model begins to decrease. When γ = 0.05, the performance of the algorithm decreases again, with a model accuracy of 91.22%, and the stability of the algorithm is significantly reduced. In NDARTS, the parameter γ_α should also take a small value so that the contraction coefficient in the convergence analysis remains below one. As shown in Figure 11, when γ_α increases from 0.0001 to 0.001, the performance of the algorithm decreases from 93.58% to 92.04%, with a relatively small change in stability. However, when γ_α increases from 0.001 to 0.005 and 0.01, the performance and stability of the algorithm are significantly reduced. In summary, NDARTS is strongly sensitive to the parameters γ and γ_α, and choosing smaller γ and γ_α helps maintain the performance and stability of NDARTS. The dependence of NDARTS on the parameters T and K is relatively small: larger T and K can improve the convergence speed and performance of the algorithm but also increase the computational cost, so choosing appropriate T and K can reduce the computational cost while maintaining good performance.
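The sensitivity to γ is consistent with the Neumann condition ||I − γ ∂²L_1/∂ω∂ω|| < 1: the admissible γ shrinks as the largest Hessian eigenvalue grows. A toy NumPy check with a hypothetical diagonal Hessian illustrates this:

```python
import numpy as np

def neumann_converges(H, gamma):
    """Check the Neumann-series condition ||I - γH||_2 < 1 for a symmetric H."""
    eigvals = np.linalg.eigvalsh(H)
    return np.max(np.abs(1.0 - gamma * eigvals)) < 1.0

# Toy Hessian with eigenvalues between 1 and 100: here the condition holds for
# γ below 2/100 = 0.02, so 0.001, 0.005, and 0.01 pass while 0.025 and 0.05 fail.
H = np.diag(np.linspace(1.0, 100.0, 5))
for gamma in [0.001, 0.005, 0.01, 0.025, 0.05]:
    print(gamma, neumann_converges(H, gamma))
```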

Performance Experiment
The ablation experiments determined the optimal parameters of NDARTS. Subsequently, we tested the performance of NDARTS in the DARTS search space and the NAS-Bench-201 search space under the optimal parameters and compared it with other algorithms at similar model scales. Because of limited computational resources, we mainly re-ran the gradient-based algorithms ourselves; the other results are taken from the original papers and marked with * in the tables. For the algorithms we ran, the validation set accuracy is also reported as an additional reference, and the program listing is provided in Appendix A.
1. DARTS search space.
Comparing the results of the NDARTS algorithm in the DARTS search space with other NAS algorithms, the results on the CIFAR-10, CIFAR-100, and ImageNet datasets are shown in Tables 1-3, respectively.
Overall, compared with other types of methods, such as random search and RL, NDARTS shows a significant increase in performance. Methods based on random search, RL, and EA require long computation times, whereas NDARTS achieves better performance within a short period. Although the sequential-model-based PNAS [9] algorithm also requires fewer computational resources, this comes at the cost of reduced model accuracy.
Compared with the baseline algorithm DARTS, which is based on gradient descent, NDARTS improves performance while reducing the computational cost to some extent. FairDARTS [38] relaxes the choice of operations to be collaborative, letting each operation have an equal opportunity to develop its strength, but our algorithm performs better than FairDARTS. PDARTS [24] searches for the neural network architecture progressively, achieving good accuracy at a higher computational cost, while NDARTS achieves better performance at a lower computational cost. PC-DARTS [25] reduces the computational cost by sampling channels and can quickly obtain models with good performance; although NDARTS is slightly slower than PC-DARTS, it achieves better performance. In the DARTS search space, NDARTS achieved the best performance on all three datasets. The results on the CIFAR-10 dataset are shown in Table 1. We ran several of the algorithms and plotted the test set accuracy in Figure 12 (truncated to 50 epochs because of the different algorithm settings). It can be seen that the models found by gradient-based methods generally outperform those found by random search, RL, and other methods. Among the gradient-based methods, NDARTS achieved the best performance on the CIFAR-10 dataset, with a test set error of only 2.37%, which is superior to other gradient methods such as FairDARTS, PDARTS, PC-DARTS, and DARTS with 2.54%, 2.50%, 2.57%, and 2.76%, respectively. From the iteration curves, it can be seen that NDARTS, PDARTS, and PC-DARTS have similar search speeds, and that they search faster and perform better than DARTS. The results on ImageNet are shown in Table 3. When extended to big data, the gradient-based methods can no longer maintain their advantage over other methods, but the NDARTS algorithm still achieved the best performance, with top-1 and top-5 test errors of 24.3% and 7.3%, respectively.
Figures 14 and 15 show the structures of the normal cell and the reduction cell found by NDARTS in the DARTS search space. As can be seen, there are four nodes between the outputs of cells k − 1 and k − 2 and the output of cell k, with operations on each edge. The reduction cell also contains pooling operations, which reduce the height and width of the feature map by a factor of two.
2. NAS-Bench-201 search space. Tables 4-6 report the results in the NAS-Bench-201 search space. The model found by the NDARTS algorithm achieves optimal or near-optimal performance on the three datasets.
The results on the CIFAR-10 dataset are shown in Table 4, and the results of the gradient-based methods over 50 epochs are plotted in Figure 16. It can be seen that the methods based on RL and EA perform well in the NAS-Bench-201 search space. DARTS suffers severe performance degradation, while GDAS, which performed poorly in the DARTS search space, maintains good model performance; NDARTS achieves a slightly better test set accuracy of 93.67% than the 93.49% of GDAS. From Figure 16, it can be seen that DARTS has difficulty finding good architectures in the NAS-Bench-201 search space. The SETN [30] algorithm, whose initial model performance is good, has an unstable search process, while the GDAS and NDARTS algorithms can effectively search for better architectures. The results on the CIFAR-100 dataset are shown in Table 5, and the results of the gradient-based methods over 50 epochs are plotted in Figure 17. On the CIFAR-100 dataset, NDARTS achieves an accuracy of 70.91%, almost reaching the optimal performance of 71.00% achieved by the EA, while GDAS also maintains a good performance of 70.28%. However, the DARTS and ENAS [14] methods perform even worse than the random-search-based method. In Figure 17, comparing the gradient-based methods, GDAS and NDARTS have similar performance and are superior to DARTS and SETN.
The results on the ImageNet dataset are shown in Table 6, and the results of the gradient-based methods over 50 epochs are plotted in Figure 18. On the ImageNet dataset, the EA-based method achieves the best performance with an accuracy of 44.23%, while the RL-based method also achieves an accuracy of 42.23%. Among the gradient-based methods, GDAS achieves the best accuracy with 43.10%, while NDARTS performs slightly worse with 41.02%. The methods based on evolutionary algorithms and reinforcement learning can achieve good performance given a large amount of computing resources, and the NAS-Bench-201 search space uses table queries for performance evaluation, which saves a great deal of computation, especially on large datasets such as ImageNet. Therefore, the EA- and RL-based methods perform well in the NAS-Bench-201 search space on the ImageNet dataset. The gradient-based methods show significant performance differences in this setting. Specifically, as shown in Figure 18, GDAS and NDARTS perform better than SETN and DARTS. Although GDAS and NDARTS start from poor initial states, they quickly optimize towards better models, while SETN has a slower search speed and an unstable search process, and DARTS has difficulty finding better models in the NAS-Bench-201 search space. To compare the performance of NDARTS and GDAS in detail, the results of all epochs on the three datasets are plotted in Figure 19. It can be seen that NDARTS searches faster than GDAS. When approaching stability, NDARTS and GDAS search the same model for a long time but ultimately stop at different architectures; at that point, NDARTS performs better than GDAS on the CIFAR-10 and CIFAR-100 datasets, but slightly worse on ImageNet.

Experiment Results
The DARTS search space is the most fundamental search space for gradient-based neural architecture search algorithms, with two cell types, multiple nodes, and multiple operation types. When compared with other algorithms in the DARTS search space, NDARTS achieved better performance than mainstream NAS algorithms on the CIFAR-10, CIFAR-100, and ImageNet datasets.
Compared with the DARTS search space, the NAS-Bench-201 search space has fewer cell types, fewer nodes per cell, and fewer operation types. Although the performance of the searched models is relatively low, it allows the performance of different NAS algorithms to be compared under a fair evaluation strategy. When compared with other algorithms in the NAS-Bench-201 search space, the NDARTS algorithm achieved the best performance on the CIFAR-10 dataset, outperformed the other gradient-based methods, and approached the optimal evolutionary algorithm performance on the CIFAR-100 dataset. The results on the ImageNet dataset also approached the performance of the optimal evolutionary algorithm.
In summary, NDARTS maintains a small computational cost while improving performance compared with mainstream NAS algorithms. It exhibits good adaptability across different search spaces and is a highly competitive neural architecture search algorithm.

Conclusions
By approximating the iterative update formula of the DARTS algorithm with the Neumann series, we prove that the approximate iterative method remains convergent while reducing the computational complexity. When optimizing the weight parameters, an evolutionary strategy is chosen to avoid the problems caused by gradient methods during training. Simulation experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets show that NDARTS is a highly competitive neural architecture search algorithm.
Although our gradient approximation is more accurate and performs better than the DARTS algorithm, our method still has some limitations: there is a large gap between the architecture depths in the search and evaluation scenarios, and there are performance gaps between the discrete subnetworks and the hyper-network, both of which need to be improved in future work.
The search space designed in this article is based on stacking cell structures, which has certain limitations. The main reason is that the network instances obtained from the search are only applicable to a specific configuration, and their generalization ability across different configurations is weak. Therefore, for different experimental data scenarios, an NAS algorithm that can search for high-performance network architectures across configurations still needs to be developed.

Figure 3. ImageNet dataset.

Figure 4. Cell structure of the DARTS search space.

Figure 6. The impact of T on NDARTS.

Figure 8. The impact of K on NDARTS.

Figure 13. Performance of different algorithms on the CIFAR-100 dataset.

Figure 14. Normal cell found by NDARTS in the DARTS search space.

Figure 15. Reduction cell found by NDARTS in the DARTS search space.

Figure 16. Performance of different algorithms on the CIFAR-10 dataset in the NAS-Bench-201 search space.

Figure 17. Performance of different algorithms on the CIFAR-100 dataset in the NAS-Bench-201 search space.

Figure 18. Performance of different algorithms on the ImageNet dataset in the NAS-Bench-201 search space.

Figure 19. Test accuracy of NDARTS and GDAS on the three datasets.

Figure 20. Cell found by NDARTS in the NAS-Bench-201 search space. As can be seen, there is just one kind of cell in the NAS-Bench-201 search space, with only three intermediate nodes between the input and the output; the final network is obtained by stacking the searched cell.

Table 1. Results of NDARTS (based on the DARTS search space) and other NAS algorithms on the CIFAR-10 dataset.