Efficient Network Architecture Search Using Hybrid Optimizer

Manually designing a convolutional neural network (CNN) is an important deep learning method for solving the problem of image classification. However, most of the existing CNN structure designs consume a significant amount of time and computing resources. Over the years, the demand for neural architecture search (NAS) methods has been on the rise. Therefore, we propose a novel deep architecture generation model based on Aquila optimization (AO) and a genetic algorithm (GA). The main contributions of this paper are as follows: Firstly, a new encoding strategy representing the CNN coding structure is proposed, so that the evolutionary computing algorithm can be combined with CNN. Secondly, a new mechanism for updating location is proposed, which incorporates three typical operators from GA cleverly into the model we have designed so that the model can find the optimal solution in the limited search space. Thirdly, the proposed method can deal with the variable-length CNN structure by adding skip connections. Fourthly, combining traditional CNN layers and residual blocks and introducing a grouping strategy provides greater possibilities for searching for the optimal CNN structure. Additionally, we use two notable datasets, consisting of the MNIST and CIFAR-10 datasets for model evaluation. The experimental results show that our proposed model has good results in terms of search accuracy and time.


Introduction
CNN is a widely used neural network architecture which is formulated by biologists Huber and Wiesel in their early work on cat visual cortices. It can be applied to many problems in image classification and segmentation [1][2][3] and natural language processing (NLP) [4,5]. However, designing CNN structures completely manually is a complex task. For example, traditional network designs, such as LeNet [6], AlexNet [7], and ResNet [8]. These networks not only rely heavily on existing deep learning architectures when designing manually but also require the re-selection of appropriate structures and hyperparameters for specific tasks.
However, owing to the high-dimensional properties of the neural networks, neuroevolution is mainly employed for reinforcement learning of shallow networks. Stanley et al. attempted to address this question with hypercubic-enhanced topological neuroevolution (HyperNEAT) [9], which acts as an indirect coding method that can generate millions of networks, but still has a low fitness for optimal CNNs and is therefore considered more suitable as a feature extractor.
In 2016, some academics proposed an approach that would allow the network structure to be constructed not by humans, but by giving the machine some basic elements of a neural network and allowing it to automatically construct a better network. This approach is called neural architecture search (NAS) [10,11], which combines and selects the basic elements of a neural network by taking them into a limited search space and finally arriving at a network with good network performance. However, it is necessary to constantly try and combine different architectures in the process, which consumes a lot of time and requires a powerful amount of computation.
In perceptual tasks, the success of deep learning is closely connected to the automation of the feature extraction process, where hierarchical feature extractors automatically acquire the information that we need in an end-to-end manner. However, as we manually design increasingly complex neural architectures for specific tasks, our demand for architecture engineering is growing. NAS enables automated neural network architecture design and thus is the next step in the automation of machine learning. NAS can be considered as a subdomain of AutoML. Most modern deep neural network architectures are created based on human experience, and most of them require a long and tedious iterative experimentation process, while NAS enables the automatic generation of effective deep learning frameworks for specific tasks without any human intervention.
There are many methods to perform an automatic search by combining reinforcement learning, intelligent optimization algorithms, and gradient descent with CNNs. Baker et al. suggest a reinforcement learning (RL)-based metamodeling approach for automatically designing CNN architectures, called MetaQNN [12], by training an agent to sequentially determine the types of layers and their parameters. NASNet [13] uses validation accuracy as a reward signal, a recursive network of controllers optimized by policy gradients to generate the best neural structure description. EvoCNN [14] combines with the design of CNNs, which use GA for searching CNN structures with particular operations. IPPSO [15] is the first method to use particle swarm optimization (PSO) [16][17][18] in the design of CNN networks. It is inspired by the IP address setting, where each layer of the CNN is assigned an IP address. By the above setting, a new coding method is set. In addition, it uses PSO to automatically search for the optimal structure of the CNN at variable lengths without any human intervention in the process. However, this algorithm has limitations on particle length, and cannot be encoded at variable lengths, and the results are limited to three data sets. Based on the IPPSO drawbacks, psoCNN [19] proposes a new difference and velocity operator, respectively, and uses variable-length particles to explore the optimal solution of the CNN architecture. However, the main drawback is that the search space is only a combination of existing schemes and does not make maximum use of the solution space. Therefore, Lawrence et al. [20] use the group operation in psoCNN to maximize the search solution space and reduce the computational cost to a certain extent, but it is limited by the length of the particle size.
With deeper research in NAS, some studies are increasingly proved to be not perfect. Therefore, some more mature and complete solutions have been proposed in related works. ENAS [21] proposes a faster and more efficient search network structure using weight sharing, by which the time to obtain the optimal subnetwork is reduced. However, this approach may lead to weight shifts. DNS [22] reduces the drawbacks coming from ENAS by modularizing the search space. Although these above algorithms obtain better results on the corresponding data sets respectively, on the one hand, the time and space cost is relatively high, and on the other hand, there is no good way to deal with the continuous space search problem, and our algorithm solves these two problems relatively well.
Based on two metrics, classification accuracy and search speed, and finding the balance between the two metrics, We use the AO and GA to maximize the exploration of the CNN optimal structure, which we call HAGCNN. The primary contributions of this paper are as follows: We design a new binary-based particle-encoding strategy that can efficiently encode CNN structures and develop new evolutionary computation algorithms based on the new encoding strategy; 2.
A new approach to position updating is proposed, which focuses on combining the AO with GA. Crossover, Mutation, and Selection operations are added before the position update, thus allowing various directions and scales to search the solution space and increase the chances of diverse network formation; 3.
The inclusion of skip connections allows the algorithm to search for reasonable CNN architectures at variable lengths and generate new models, which break the constraint of the fixed-length encoding of traditional AO; 4.
To obtain better network performance, we introduce the residual block structure and group strategy, which enables us to explore deeper network structures sensibly.
The remainder of this paper is as follows: Section 2 provides a detailed description of the AO, GA, CNN, and residual block. Section 3 describes the proposed HAGCNN algorithm in detail. In Section 4, we introduce the experimental details we have performed. Section 5, we conclude and analyze the experimental results. Section 6 summarizes the whole paper and presents future works.

CNN Introduction
CNN [23][24][25][26] was proposed in the 1960s and discovered by Hubel and Wiesel while studying neurons in the cat brain cortex for local sensitivity and an orientation selection. CNN does not require to be manually designed during the whole-feature extraction but is processed by convolution operation directly. Deep convolutional neural networks mainly contain input layers, pooling layers, fully connected layers, convolutional layers, and an output layer. The input layer is used to receive the input data and the convolution layer is mainly used to perform feature extraction on the data. The data downsampling process is performed on the pooling layer. The fully connected layer is a deep network with the multi-Layer perceptron (MLP) [27][28][29]. Furthermore, it is mainly implemented through the activation function to the final output layer. The structure of CNN is shown in Figure 1. Deep convolutional neural networks stack various layers on top of each other and the output of each layer is calculated as the sum of the product of the input of each layer and the internal weights, all of which will be used as input to the next layer, resulting in a recognition classification result.
In training a CNN, tens of thousands of parameters may have to be processed because of the different architectures. It may take hours or even days, so it is important to develop an automatic search mechanism.

Convolution Layer Pooling Layer Pooling Layer Convolution Layer
Fully Connected Layer

Residual Block
ResNet [8] is made up of a series of residual blocks, and a residual block can be expressed by Equation (1).
The residual block is classified into two parts, which are the direct mapping and the residual section. Among them, F(X l , W l ) is the residual section that consists of two to three convolution operations.
In a convolutional network, X l may not have the same number of feature maps as X l+1 , and it is then necessary to use the convolution of 1 × 1 for dimensioning up or down, in which case the residual block representation is given in Equation (2).
where, h(X l ) = W l X, and W l is the 1 × 1 convolution operation. However, generally this is only used when lifting the dimension.

Genetic Algorithm (GA)
The genetic algorithm (GA) [30,31] is a population-based optimization search algorithm derived from natural selection. It uses the concept of survival of the fittest. It tries to exploit the most suitable individuals through iteration, and it abstracts the problem space as a population of individuals. GA converts the original group of individuals to a group of individuals where each individual stands for the solution of the problem waiting to be solved. The computation is achieved by cycling through individuals in a population using crossover, variation, and selection.
To determine whether a particular string will be involved in the reproduction process, a selection operation is proposed. Here, there are many ways to select the best chromosome, for example roulette wheel selection, rank selection, tournament selection, Boltzmann selection, steady-state selection, and elitist selection, etc.
The crossover operation is implemented mainly by combining genetic information from two or more parents to generate offspring. This includes signal point, uniform, partial matching, two-point and k-point orders, crossover with priority retention, reduction of substitution, shuffling, and cycling.
The mutation operation is used to maintain genetic diversity from one population to the next. The well-known mutation operations are simple inversions, substitutions, and hybrid mutations.

Aquila Optimizer (AO)
The Aquila optimizer (AO) [32,33] is a population-based optimizer proposed in 2021. It is inspired by Aquila's social behavior of catching prey. Similar to other population-based metaheuristics, the algorithm begins with N individuals and the initial population is X. We perform this initialization process in Equation (3) : where UB j and LB j are the upper and lower bounds of the j-th exploration space, rand is a randomly generated parameter in the range from 0 to 1, and Dim is the whole search space dimension. The algorithm is divided into two processes: exploration and exploitation, in which there are four main implementation strategies. The algorithm can determine the possibilities of transferring the exploration stage to the exploitation stage by judging the condition t ≤ ( 2 3 ) × T. where t and T stand for the current iteration and the maximum number of iterations, respectively.
In the first strategy, the Aquila discriminates the prey region and selects the best hunting area by a vertical swoop of high-altitude soaring. This strategy considers the average agent X M and the best agent X b in the process of exploration. The mathematical equation is as Equation (4): where X i (t + 1), X b (t) denote the position of the individual at the (t + 1)-th iteration number and the best solved individual obtained at the t-th iteration, respectively. X M (t) shows the average value of the current solution position at the t-th iteration, which is calculated by Equation (5). t and T indicate the current and maximum number of iterations, and rand is a random value between 0 and 1. Dim is the dimension size and N is the number of candidate solution.
In the second strategy, the individual position is updated on the basis of the Lévy flight Lévy(D) distribution and X b . A mathematical formulation of this strategy is given by Equation (6) X where X i (t + 1) is the position of the i-th individual at the (t + 1)-th iteration. Lévy(D) is the Lévy flight distribution function and D represents the dimensional space, specifically calculated by Equation (7). X R (t) is the random individual sampled in the range [1, N] at the i-th iteration. x and y are computed by Equations (9) and (10), respectively. In Equation (7), s is a constant with a value of 0.01. u and v are random numbers ranging from 0 to 1. σ is calculated by Equation (8). In Equations (9) and (10), r is calculated by Equation (11), θ is calculated by Equation (12), r 1 is used to fix the count of search cycles, whose value is taken between 1 and 20, and U is a small amount that is fixed at 0.00565. ω is a small value of 0.005, and D 1 is an integer from 1 to the length of the search space (Dim).
The third strategy is mainly based on Aquila's low flying and slow falling attacks. This action is expressed mathematically as Equation (13): where δ and α denote the adjustment parameters of the exploitation process. The fourth strategy refers to Aquila's behavior of walking and catching prey. This behavior is mathematically shown in Equation (14): where Equation (15) is the QF expression. Equations (16) and (17) represent G 1 and G 2 , respectively. Lévy(D) is the Lévy flight distribution function calculated using Equation (7). t and T present the current iteration and the maximum number of iterations respectively.

The Proposed Algorithm
This section introduces the main parts of our proposed algorithm separately, mainly including six parts: convolutional layer representation, the block structure, the grouping strategy, the position update representation, the fitness representation, and the decoding.

Convolutional Layer Representation
When we consider the design of algorithms that handle complex structures (e.g., CNN architectures), it is important to choose an appropriate encoding method to represent the CNN architecture and to avoid additional hard-coded management rules to ensure a reasonable construction of the architecture. In this paper, we directly represent each block in the CNN with a binary encoding and use evolutionary computation to update the encoding for the purpose of tuning the neural network architecture. For each training, testing, and evaluation using CNN, the model is compiled from an array containing the details of each layer in the particle. However, only different CNN structures can be found using this method. Therefore, we search the relevant hyperparameters of each layer at the same time.
We use a binary expression to represent the actual meaning of what each layer represents. Each of the different types of the layer is represented by a three-bit binary encoding, with three bits representing the eight possible options, each containing at least one activation-function layer, two convolutional layers, and one regularization layer.
Where the binary number '000' indicates that the average pooling layer is selected; '001' indicates we use the maximum pooling layer in the architecture; '010' indicates that the skip-connection layer is used; '011' indicates the selection of a convolutional layer with a convolutional kernel size of 3 × 3 and a padding value of one; '100' means a convolutional layer with a kernel size of 5 × 5 and a padding value of two; '101' shows a convolutional layer whose kernel size is 7 × 7 and padding value is four; '110' implies choosing a convolutional layer having a kernel size of 3 × 3, a padding value of two, and a dilation size of two; and '111' indicates selecting a convolutional layer with a kernel size of 5 × 5, a padding value of four, and a dilation size of two. The specific representation is shown in Table 1. The details of the operation will be shown in Algorithm 1.  for i = 1 to N do 3: Decode the individual and generate the corresponding network. 4: Train the network using the training dataset.

5:
The network is tested on the test data set, and the accuracy is returned as the fitness value. 6: for i = 1 to N do 7: Update the mean value of the current solution X M (t).
if rand ≤ 0.5 then 10: P i is performed OR operate bit by bit with P best .

11:
The results were mutated randomly. 12: Then cross operate the mutation result with P best . 13: The encoding of the particle representation is divided into n groups and a pooling layer is inserted in the last position of each group that satisfies the condition. 14: Calculate the fitness value of the new individual. If the value is better than P best , update P best with the individual. 15: else 16: Random mutation of P best into several bits. 17: The encoding of the particle representation is divided into n groups and a pooling layer is inserted in the last position of each group that satisfies the condition. 18: Calculate the fitness value of the new individual. If the value is better than P best , update P best with the individual. 19: else 20: if rand ≤ 0.5 then 21: Update the mean value of the current solution P M . 22: The optimal individual P best and the average individual P M are XORed. 23: The result of XOR is crossed with random binary coding, and the crossing position is controlled to be in the second half of the coding. 24: The encoding of the particle representation is divided into n groups and a pooling layer is inserted in the last position of each group that satisfies the condition. 25: Calculate the fitness value of the new individual. If the value is better than P best , update P best with the individual. 26: else 27: Random mutation of P best into several bits, and the number of mutations should not exceed half of the total number. Output: The global best individual P best , and the best fitness value f (P best ).
Since the particle length is fixed after initialization, to search the CNN structure at variable length and to increase the possibility of searching, this paper achieves this purpose by using the inclusion of a skip-connection layer by setting the skip-connection layer to be denoted by '010'; if this layer is selected, the output of the previous layer is copied directly to the next layer, which is equivalent to reducing the actual number of valid layers in the whole network by one layer and thus achieving a variable length search.
The number of output channels per convolutional layer is another parameter that has a large impact on the network performance. Therefore, our algorithm also includes this parameter in the search space. Similarly, we convert the number of output channels of the convolutional layers to binary and concatenate it to the convolution type. Thus, our encoding string becomes a L × (3 + X)-bit binary string, where a complete convolution block is represented by 3 + X bits, which contain three bits of binary for the convolution type and X bits for the number of output channels of the layer. Here L denotes the number of layers present in the network and X remains fixed during a search.

Block Structure
In past studies, researchers have often combined a number of fixed layers together to form block structures, and using such block structures can dramatically improve the performance of the network. In this article, we also used a similar structure. We combine the convolution layer, batch-normalization layer, and activation layer into a base convolution block. Furthermore, when encoding, we encode only the convolution block and not the separate convolution layers. When combining these layers, we consider both the convolution kernel size, stride size, padding, and dilation to ensure that the resolution of the input block is the same as the resolution of the block output. To further improve the performance of the convolution block, we incorporate the idea of a residual block, and our convolution block similarly copies the input and adds it to the output of the convolution layer. We add a 1 × 1 convolution layer for channel fusion so that the number of input channels is the same as the number of output channels for direct summation for blocks where the number of input channels is different from the number of output channels. For example, Figure 2 shows a basic block structure. An example of a network with different encoding methods after combining different block structures is shown in Figure 3. As shown in Figure 3, the numbers indicate the basic block structures, and the corresponding codes are shown after each structure.

Grouping Strategy
In order to generate better architectures, we group the searched architectures. The purpose of grouping is to reasonably intersperse the pooling layers in the network structure to make our network achieve better results. We divide all the layers into n groups, assuming that the whole network contains L layers (including skip connections), there are L/n layers in each group, and we detect whether the last layer in each group is a pooling layer to avoid adding redundant pooling layers repeatedly. Figure 4 shows an example of grouping the entire network structure, and we make sure that the last layer of each group is the pooling layer.

Location Update Representation
After representing the CNN structure with a string of binary characters, this string is used as input to the evolutionary computation algorithm formula, and several operations are performed during the bring-up process. Algorithm 1 further describes this part of the specific process.

Crossover Operation
If the current particle position is multiplied by an arbitrary constant which is more than 1 in the formula, the algorithm performs a crossover operation. Note that it is rounded upwards to obtain the final constant if the current particle position is multiplied by a fractional number. Furthermore, we assume the length of the constant part is C and the length of the non-constant part is N − C. Then, the two coded strings are first truncated according to the constant digit and we cross-patch the two intercepted strings following the behavior of the chromosome crossover. Assuming the code A is code A = [c a,1 , c a,2 , . . . , c a,i , c a,i+1 , . . . , c a,N ], and the code B is code B = [c b,1 , c b,2 , . . . , c b,i , c b,i+1 , . . . , c b,N ]. If the truncation is from the i-th bit, the results after crossing code A and code B are as follows, code A = [c a,1 , c a,2 , . . . , c a,i , b,i+1 , . . . , c b,N ], code B = [c b,1 , c b,2 , . . . , c b,i , c a,i+1 , . . . , c a,N ]. The specific operation is shown in Figure 5. Figure 5a represents the initial structure of any two individuals and Figure 5b represents the encoded structure of two individuals after the crossover operation.

Mutation Operations Mutation Operation of Random Number
If the current particle position is multiplied by a random number from zero to one, the algorithm performs a mutation operation. Here we use a random number that is subject to a uniform distribution. Furthermore, the bits 0 to dim are randomly selected for the mutation. In this process, the parameter dim represents the dimension of the algorithm. The specific mutation is that the original 0 bit becomes 1 and the 1 bit becomes 0 for the mutation operation. Assuming a code of code A = [1, 0, 1], we randomly selected 2 bits for mutation. The result may be [0, 1, 1], [1, 1, 0], or [0, 1, 0]. The specific representation is shown in Figure 6. Figure 6a represents the original structure before the mutation operations and Figure 6b represents the structure after the mutation operations, where the orange part in Figure 6b is the coding of the mutation.

Mutation Operation of Lévy Flight
If the current particle position is multiplied by a Lévy flight [34,35] operand in the formula, the CNN structure corresponding to the best CNN structure is encoded as a mutation operation. The specific variation is performed in the same way as the random number mutation operation.

Selection Operation
If two particle positions are added or subtracted in the formula, the selection operation is performed. In this case, if the two-particle positions are subtracted, the operation is the Bitwise Exclusive OR operation. If the two-particle positions are added, the operation is the Bitwise OR operation.

Bitwise Exclusive OR
The operation of the Bitwise Exclusive OR is an operation in which two numbers participating in an operation are converted to binary (0, 1) and then subjected to the Exclusive OR operation, as long as 1 is given when the numbers in the corresponding bits are not the same and 0 when they are the same. Assuming the two codes are code A = [1, 0, 0, 1, 0], code B = [0, 0, 1, 0, 0], the result is [1, 0, 1, 1, 0]. The details of the relevant operation are more prominently represented in Figure 7, where particle A and particle B represent the positions of two random particles. When the positions of the two particles are subtracted, the Bitwise Exclusive OR operation is performed, the result is shown in Figure 7b.

Fitness Evaluation
After iterations of the algorithm, the encoding will change and the changed encoding still corresponds to specific network architecture. The network uses the Adam optimizer to adjust the trainable parameters in the network and CosineAnnealing is used to adjust the learning rate. For the first 10 epochs we let the learning rate decrease from a predetermined maximum to a predetermined minimum, and for the last 10 epochs we let the learning rate increase from the minimum to the maximum. To speed up the network training, each network structure searched by the algorithm during the iterative process is set to an epoch value of one. When the iteration of the algorithm is completed, the epoch value used for training the best-obtained network structure is set to 100. The loss function used for training is set to the cross-entropy loss, and the return value of this function is used as the fitness value. The return value of this function is used as the fitness value to guide the execution of the evolutionary computation algorithm. Finally, we transfer part of the training set as input to the network, train the network, and return the fitness value.

Decoding
According to Table 1, the different binary strings can be decoded into the corresponding layer and the corresponding parameter information, and these strings represent the parameter values for this layer. After decoding all the interfaces, the final CNN structure is achieved by concatenating all the decoding layers in the best global case. By concatenating all decoding layers in the same order as the interfaces in the particle vector, the final CNN architecture can be obtained.

Experimental Preparation
Our experiments are based on the NVIDIA RTX 3090, with an Intel Xeon Gold 6226R CPU @ 2.90 GHz.

Datasets
Considering the cost of time consumption during NAS processing, and the limitation of our computing device resources, we only select datasets with small input sizes for testing. We select two basic datasets to evaluate the final results, namely MNIST and CIFAR-10 datasets.
Specifically, the MNIST [36,37] dataset contains images of handwritten numbers from 0 to 9. Each sample is a 28 × 28 pixel. A total of 60,000 training samples and 10,000 test samples are contained in the dataset.
CIFAR-10 [38,39] is a computer vision dataset for universal object recognition; it is gathered by Hinton's students Alex Krizhevsky and Ilya Sutskever. It contains 60,000 32 × 32 RGB color images with a total of 10 classifications. All photos are divided into 10 different categories, namely 'airplane', 'bird', 'cat', 'automobile', 'dog', 'frog', 'horse', 'deer', 'ship', and 'truck'. Of these, 50,000 are used for the training set and 10,000 for the test set. Specific datasets information is shown in Table 2.

Parameter Setting
The parameter setting in this paper is divided into two main parts-on the algorithm side and the training side-the details of which are given in Table 3.
Among them, the population size and the number of iterations are used for the setting of the evolutionary algorithm. The population size indicates the number of networks tested in one iteration and the number of iterations limits the maximum iterations of the algorithm. The number of iterations limits the maximum number of iterations of the algorithm.
In addition, the main parameters are batch size, dropout rate, epochs for particle evaluation, and epochs for the global best in the CNN training process. Batch size controls the amount of data fed into the model for one training. The dropout rate controls the rate at which fully connected layers randomly discard neurons in a single training session to prevent overfitting of the network. The epochs for particle evaluation are the number of times the training set is reused in a single training session. Using lower epochs for particle evaluation can significantly reduce the training time with little impact on the overall algorithm performance. The epochs for the global best is the number at which the training set is repeatedly used when training the network represented by that individual after the optimal individual has been found. Therefore, moderate values need to be selected to train the network. Note that the trained network will not be capable of achieving the desired result if the value is too small. Similarly, a large value may lead to overfitting. Specific information is shown in detail in Table 3. For better evaluation of the performance of our model, we select the manually designed networks LeNet-5 [6], ResNet [8,40], CapsNet [41], and DropConnect [42], and the automatically designed network models EvoCNN [14], MetaQNN [12], IPPSO [15], psoCNN [19], and GeNet [43] as baseline models on the MNIST dataset, and compare them with our proposed HAGCNN model in terms of test-set error rate, and the number of parameters and running speed-related metrics are compared with our proposed model. Firstly, we compare the first convolutional neural network LeNet-5. Although they are now obsolete, a comparison is still necessary. Secondly, the EvoCNN automatic generation model, which is based on the GA, is chosen for comparison. In addition, MetaQNNbased reinforcement learning is used for performance comparison. Additionally, ResNet, CapsNet, and DropConnect are used as a benchmark model for comparison.
Furthermore, IPPSO is the first model to use the PSO algorithm for automatic CNN generation. It is mainly inspired by IP addresses, by mapping attribute data to the corresponding IP address ranges and processing them to obtain the optimal CNN model. Besides, the psoCNN model is used for comparison, combining CNN generation with the PSO algorithm by designing a novel velocity-position calculation. Additionally, it used 10 iterations on 20 particles for experiments, and the experimental results proved to have better experimental results.
VGGNet is a deep convolutional neural network developed by the Oxford University Computer Vision Portfolio together with researchers at Google DeepMind. In 2014, it was runner-up in the ILSVRC competition and winner of the localization project, with an error rate of 7.5 percent on top 5. Meanwhile, DenseNet starts with features and achieves better results and fewer parameters by making the best use of the feature.
NASNet, which no longer requires relevant experts to build convolutional network architectures with human knowledge, directly uses RNN to compute the hyperparameter and realize AI automatic learning, CoDeepNEAT, which optimizes the deep learning architecture through evolution. Through the extension of existing neuroevolutionary methods in terms of components, topology, and hyperparameters, the approach achieves results comparable to the best human designs in standard baselines in language modeling and object recognition.
Large-Scale Evolution modified the evolutionary algorithm so that it can be used for NAS, and it is worth noting that the neural network architecture it searches has no fixed network depth.

Experiment and Analysis
To have a clearer view of the effect of the proposed model, we analyze the effect of the proposed algorithm from five perspectives: experimental results, comparison experiment with baseline model, visualization of experimental results, comparison of the effect of random evolution and the proposed algorithm, and the change of the algorithm position with the number of iterations, respectively.

Experimental Results
We search through 20 particles with 10 iterations in the search space and train the network structure obtained from the search with 100 epochs and 200 epochs on the MNIST and CIFAR-10 datasets, respectively. We compare the performance of the best CNN structure searched by our proposed algorithm in terms of error rate, parameters, and search speed.
The experimental results show that the automatically generated CNN model reached an error rate of 0.31% on the MNIST dataset with parameters 1.33M and a search speed of 0.5 h/1 GPU.
Experimental results based on the CIFAR-10 dataset show that the automatically generated CNN model has an error rate of 6.82% with 10.7M parameters and a search speed of 12 h/1 GPU. To find a better architecture, the parameters used are different from those used in the experiment on the MNIST dataset. When we search the optimal network architecture for the MNIST dataset, we conduct 1-epoch training on the searched network. In CIFAR-10 dataset search architecture, the network accuracy after 1-epoch training can not guide the evolutionary computing algorithm to find a better structure, which makes the evolutionary computing algorithm fall into local optimization. Therefore, we train 20 epochs for each searched network and return the obtained accuracy as a fitness value to the evolutionary computing algorithm. Similarly, because it is more difficult to train the network on the CIFAR-10 dataset than on the MNIST dataset, it needs more rounds of training to obtain a satisfactory accuracy. Therefore, the optimal architecture that can be found at the end of the iteration of the evolutionary computing algorithm is trained 200 epochs.

Comparison Experiment with Baseline Model
On the MINIST dataset, we compare our obtained CNN model with the current CNN network-generated models in manual and automatic modes, respectively. As shown in Table 4, firstly, our model results in an error rate on the test set that exceeds the baseline LeNet-5 and ResNet, and it is also near to the results of the two advanced CapsNet and DropConnect. In the comparison with the automatically generated model, we can find that our proposed model is not only lower in error than IPPSO, psoCNN, and GeNet, but also lower in the number of parameters than the currently known IPPSO and psoCNN in terms of the number of parameters, and nearly three times the number of psoCNN parameters. Additionally, the speed of searching for the optimal structure is much higher than that of the three known NAS models. This indicates that our proposed algorithm can find a better architecture for the image classification task.
On the CIFAR-10 dataset, we compare the results of the proposed algorithm with those of current state-of-the-art neural architecture search algorithms. As shown in Table 5, although our proposed algorithm still has a large gap from these advanced search models in the item of error rate, we only beat VGGNet, GeNet, and CoDeepNEAT, but compared to MetaQNN, NASNet, and Large-Scale Evolution, we spend substantially less time to. Therefore, our proposed algorithm still has practical significance.

Visualisation of Results
To explore the trajectory trends of the particles in the evolutionary computation, we visualize the particles that varied with the number of iterations. We chose the principal component analysis (PCA) [48,49] method, which selects the two most important components from all the particle dimensions to form a three-dimensional landscape.
As shown in Figure 9, the x and y axes indicate the two extracted dimensions respectively, while the z axis indicates the accuracy of the network corresponding to that particle on the test set. This visualization shows the accuracy of the network obtained after the 1 epoch of training for different particle positions.
To investigate the effect of the number of iterations on the search algorithm, we delay the number of iterations to 30 and re-run the experiment. The results are shown in Figure 10. Under the result of 1-epoch training, the horizontal coordinate indicates the current number of iterations and the vertical coordinate indicates the best network accuracy on the test set that can be found in that iteration. When the number of iterations is 10, our proposed algorithm converges in the 6-th generation, and when the number of iterations is extended to 30, the algorithm still converges within 10 generations. This indicates that our proposed algorithm can converge to the optimal value quickly. This also means that our proposed algorithm is applicable to solve the neural architecture search problem.

Comparison Experiment between Random Evolution and Proposed Algorithm
As shown in Figure 11, Figure 11a shows the accuracy of the test set using the random evolution approach on the MNIST dataset using 20 particles with 10 iterations, and Figure 11b shows the accuracy of the test set obtained using the proposed model on the MNIST dataset under the same conditions. The comparison of the two subplots in the figure shows that our proposed CNN evolution model has a significant improvement, finally reaching 99.7% on the MNIST dataset, and using random evolution can only reach 96.5% on the MNIST dataset. We can also see from the figure that random search exhibits a high degree of disorder. The evolutionary computation algorithm, on the other hand, can use the information from the previous iteration to search for better network architecture in a more optimal direction. This shows the great advantage of evolutionary computing in searching neural network architectures. We believe that if we set up the algorithm using a larger number of iterations, we may have better evolutionary results.

Position Movement Analysis of Particles from One Generation to Another
To observe the patterns of the network architecture found by our proposed algorithm, we study the evolution process, how each particle changed in each generation, and compare the distance each particle moved in each generation. The result is shown in Figure 12. In this part of the experiment, we record the position of each particle after each iteration and calculated the distance of all particles from their positions after the previous iteration. According to the experimental data, in all 30 iterations, the average distance moved by each particle is 17.8, the maximum distance moved is 7.21, and the minimum distance moved is 1. On average, each particle moved 3.45 times during all iterations. This shows that the proposed algorithm can effectively explore the entire architecture space.

Conclusions and Future Works
In this paper, we present a novel deep architecture generation model based on the AO and GA. The framework performs optimal architectural searches in a specific search space. By combining hybrid binary coding and particle-layer locations, and adding skip connections and residual blocks to increase the range of searchable space, a high-accuracy and low-time-cost automatic generation model is finally obtained, and the computing cost of NAS has been reduced to an acceptable level. In future work, the inclusion of more block structures similar to residual blocks and the combination of search directions with a topology search can be considered to increase the possibility of searching for solutions. Additionally, the direct search can be performed with multi-objectives under the simultaneous consideration of objectives, such as time cost and delay rate, to enrich our search model. Meanwhile, we can consider applying the model to more areas for further research.