Optimal Design of Convolutional Neural Network Architectures Using Teaching–Learning-Based Optimization for Image Classification

Abstract: Convolutional neural networks (CNNs) have exhibited significant performance gains over conventional machine learning techniques in solving various real-life problems in computational intelligence fields, such as image classification. However, most existing CNN architectures were handcrafted from scratch and required significant amounts of problem domain knowledge from designers. A novel deep learning method abbreviated as TLBOCNN is proposed in this paper by leveraging the excellent global search ability of teaching–learning-based optimization (TLBO) to obtain an optimal design of network architecture for a CNN based on the given dataset with symmetrical distribution of each class of data samples. A variable-length encoding scheme is first introduced in TLBOCNN to represent each learner as a potential CNN architecture with different layer parameters. During the teacher phase, a new mainstream architecture computation scheme is designed to compute the mean parameter values of CNN architectures by considering the information encoded into the existing population members with variable lengths. New mechanisms for determining the differences between two learners with variable lengths and updating their positions are also devised in both the teacher and learner phases to obtain new learners. Extensive simulation studies report that the proposed TLBOCNN achieves symmetrical performance in classifying the majority of MNIST-variant datasets, displays the highest accuracy, and produces CNN models with the lowest complexity levels compared to other state-of-the-art methods due to its promising search ability.


Introduction
Various machine learning and deep learning models, such as feedforward neural networks (FNNs) [1], convolutional neural networks (CNNs) [2,3], and recurrent neural networks (RNNs) [4,5], were introduced in order to tackle different real-world tasks, such as object recognition [6][7][8], speech emotion recognition [9], fault detection [10][11][12][13][14][15], classification [16][17][18], and estimation problems [19]. In recent years, deep learning models, such as CNNs, have gained overwhelming popularity due to their superiority over human experts in tackling certain tasks [20]. CNNs consist of two major components, known as the feature extractor and the classifier, that enable them to complete assigned tasks effectively without requiring manual pre-processing of raw data. With proper training of a CNN, the convolution and pooling layers incorporated into the feature extractor can automatically extract meaningful features from the raw input data and then feed them into the fully connected layers of the classifier to perform the designated tasks with promising performance.
Optimal design of network architecture is one of the fundamental cornerstones that govern the network performance of a CNN. Despite having promising performances, the network architectures of most existing CNN models, such as GoogleNet [21], AlexNet [22], InceptionNet [21], VGG [23], DenseNet [24] and ResNet [25], are handcrafted by designers with extensive knowledge of problem domains [26]. These manual processes of designing CNN architectures are not only time consuming but also computationally expensive due to their trial-and-error nature. These manually designed network architectures might also have limited flexibility in handling different datasets with unique data distributions, and hence their performance may be compromised. Ideally, the optimal design of CNN network architectures should be automatically guided by the characteristics of given problems without requiring significant intervention from human experts to provide insights about specific problem domains. It is also notable that the majority of the manually handcrafted CNN models contain redundant trainable parameters that lead to complex networks and heavy computational efforts. Hence, an efficient algorithm with symmetrical performance in achieving good classification accuracy and constructing CNN architectures with low complexity levels is worthy of investigation.
To alleviate the drawbacks of trial-and-error design, different strategies were devised to systematically determine the optimal network architectures of a CNN. These network architecture design methods are divided into three categories: (a) reinforcement-learning-based methods [27], (b) gradient-based methods [28], and (c) metaheuristic-search-based methods [29]. Baker et al. [30] proposed a reinforcement-learning-based method known as MetaQNN that utilized ten graphical processing units (GPUs) running on the CIFAR-10 dataset for ten days. For the reinforcement-learning-based method proposed by Zoph and Le [31], 800 GPUs were used to train the optimal CNN network architecture with CIFAR-10 for 28 days. Although reinforcement-learning-based methods can deliver promising performances, they are not feasible for researchers with limited computational resources. Gradient-based methods, such as those proposed by Liu, Simonyan, and Yang [28], are more efficient than reinforcement-learning-based methods, but they lack strong theoretical support, and the CNN network architectures obtained were unstable. The construction of optimal CNN network architectures using gradient-based methods was also computationally expensive and required significant involvement from experts with rich domain knowledge. Finally, metaheuristic-search-based methods can design the optimal CNN network architectures by employing metaheuristic search algorithms (MSAs) without requiring any insights about specific problem domains. MSAs are population-based algorithms in which search operators inspired by different natural phenomena are used to iteratively locate the global optimum of an optimization problem. Some notable MSAs include the genetic algorithm (GA) [32], particle swarm optimization (PSO) [33], differential evolution (DE) [34], and teaching–learning-based optimization (TLBO) [35]. Given their appealing features, such as simple implementation, gradient-free characteristics, and strong global search ability, these MSAs are widely utilized to solve different real-world optimization problems [36][37][38][39][40][41][42][43][44][45][46].
Although the benefits of MSAs enable them to be naturally employed to optimize CNN network architectures, this optimization remains a challenging task because of some newly arisen issues. For instance, the optimal CNN network architecture required to solve a particular dataset is unknown prior to the task. It is crucial to design a proper solution-encoding strategy that can facilitate the searching of CNNs with different network architectures (i.e., in terms of depths, layer types, etc.) when handling different problems. Appropriate constraints must also be defined to avoid the construction of invalid CNN network architectures in solution spaces without compromising the ability of MSAs to find novel architectures. Existing MSAs were originally designed to solve global optimization problems with a specified dimensional size in a continuous search space. The optimization of CNN network architecture is more challenging than typical global optimization because it involves searching for solutions with different dimensional sizes (i.e., CNNs with different depths and combinations of layers) within the same solution space. Depending on the adopted solution-encoding strategy, appropriate modifications must be made to the original search operators of MSAs to accommodate the searching of CNNs with flexible types of network architectures in solution space. When using population-based MSAs to optimize CNN network architectures, another concern is the time and computational resources required to evaluate the fitness value of each candidate solution. A computationally efficient fitness evaluation process is needed to ensure the practicability of MSAs in designing optimal CNN network architectures. Finally, it is notable that only classical MSAs (e.g., GA, PSO, and DE) are employed in optimizing the network architectures of CNNs, despite the emergence, motivated by the "No Free Lunch Theorem" [47], of new MSAs inspired by different natural phenomena. While these new MSAs can deliver promising optimization performances when solving standard benchmark functions, it is essential to further investigate their potential and feasibility for handling increasingly challenging real-world problems, such as the optimization of CNN network architectures.
This paper aims to propose an effective and efficient network architecture design method to achieve a symmetrical tradeoff between classification accuracy and network complexity. In particular, a relatively new MSA known as TLBO is employed to automatically search for the optimal network architecture of a CNN based on a given dataset without requiring human intervention. An appropriate solution-encoding strategy and design constraints are first defined for TLBOCNN, enabling it to search for valid CNN network architectures with flexible sizes in the solution space. To ensure the practicability of TLBOCNN, a computationally efficient fitness evaluation process is employed to measure performance differences between the CNN network architectures represented by different TLBOCNN learners. Appropriate modifications are also proposed for the search operators of TLBOCNN during the teacher and learner phases, enabling the computation of the population mean and new solutions from existing TLBOCNN learners with different lengths.
The remaining sections of this paper are organized as follows. Related works and research contributions of the current work are presented in Section 2. The overall search mechanisms of the proposed TLBOCNN are described in Section 3. Extensive simulation studies conducted for the performance evaluation of the proposed TLBOCNN are reported in Section 4. Finally, the conclusion and future works are summarized in Section 5.

Related Works

Teaching-Learning-Based Optimization (TLBO)
TLBO is inspired by the teaching and learning processes of an actual classroom [35]. At the initialization stage, all learners are randomly generated with a population size of $N$ in the search space. Each nth learner is considered as a candidate solution for solving the given problem and defined as $X_n = [X_{n,1}, \ldots, X_{n,d}, \ldots, X_{n,D}]$, where $n \in [1, N]$ refers to the population index, and $d \in [1, D]$ and $D$ refer to the dimension index and the total dimensional size of the problem, respectively. Meanwhile, the quality or fitness level of each nth learner is denoted as $f(X_n)$.
During the teaching phase, the new solution of each nth learner can be obtained by learning from the difference between the best learner (i.e., teacher) and the population mean (i.e., mainstream knowledge of classroom), represented as $X^{teacher}$ and $X^{mean}$, respectively. The mainstream knowledge of classroom $X^{mean}$ is first expressed as:

$$X^{mean} = \frac{1}{N} \sum_{n=1}^{N} X_n$$

Symmetry 2022, 14, 2323

Given $X^{mean}$, the new solution of each nth learner, denoted as $X_n^{new}$, is calculated as:

$$X_n^{new} = X_n + r_1 \left( X^{teacher} - T_f X^{mean} \right)$$

where $r_1 \in [0, 1]$ is a random number generated from uniform distribution; $T_f \in \{1, 2\}$ is a teaching factor that determines the importance of $X^{mean}$ in updating each learner.
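The teacher phase can be illustrated with a minimal sketch. It assumes the standard TLBO rule, in which a learner moves by r1 times the difference between the teacher and T_f times the population mean, applied to fixed-length real-valued learners. The `sphere` objective, the population size, and the choice to draw a fresh r1 per dimension are illustrative assumptions and are not taken from the paper.

```python
import random


def population_mean(population):
    """Dimension-wise mean of all learners (the 'mainstream knowledge')."""
    n = len(population)
    dims = len(population[0])
    return [sum(learner[d] for learner in population) / n for d in range(dims)]


def teacher_phase_update(learner, teacher, mean, rng):
    """Propose X_new = X_n + r1 * (X_teacher - T_f * X_mean)."""
    t_f = rng.choice([1, 2])  # teaching factor T_f drawn from {1, 2}
    # Assumption: a fresh r1 in [0, 1) is drawn for every dimension.
    return [x + rng.random() * (x_t - t_f * x_m)
            for x, x_t, x_m in zip(learner, teacher, mean)]


def sphere(x):
    """Hypothetical minimization objective, used only for this demo."""
    return sum(v * v for v in x)


rng = random.Random(0)
population = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(4)]
teacher = min(population, key=sphere)  # best learner acts as the teacher
mean = population_mean(population)
candidate = teacher_phase_update(population[0], teacher, mean, rng)
print(candidate)
```

The candidate is only a proposal; whether it replaces the current learner depends on the fitness comparison performed at the end of the phase.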
For the learner phase of TLBO, a peer interaction mechanism is simulated to update each nth learner. Define $p \in [1, N]$ as the index of a randomly selected peer learner to be compared with each nth learner, where $p \neq n$. Let $f(X_n)$ and $f(X_p)$ be the fitness values of learners $X_n$ and $X_p$, respectively. For minimization problems, the new solution of nth learner $X_n^{new}$ can be obtained from the learner phase as:

$$X_n^{new} = \begin{cases} X_n + r_2 \left( X_n - X_p \right), & \text{if } f(X_n) < f(X_p) \\ X_n + r_2 \left( X_p - X_n \right), & \text{otherwise} \end{cases}$$

where $r_2 \in [0, 1]$ is a random number generated from uniform distribution. At the end of both the teacher and learner phases, the fitness value of $X_n^{new}$ for each nth learner is evaluated as $f(X_n^{new})$ and compared to $f(X_n)$. If $f(X_n^{new})$ is better than $f(X_n)$, the current $X_n$ is replaced by the new $X_n^{new}$. Otherwise, $X_n^{new}$ is discarded.
The knowledge enhancement of each TLBO learner through both teacher and learner phases is repeated until the termination criteria are satisfied. The teacher solution $X^{teacher}$ is then obtained as the best solution found for a given problem.
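The learner phase and the greedy replacement step can be sketched as follows, assuming the standard TLBO peer-interaction rule for minimization: the learner moves away from a worse peer and towards a better one. The `sphere` objective and all numeric settings are hypothetical and serve only to make the example runnable.

```python
import random


def sphere(x):
    """Toy minimization objective used only for illustration."""
    return sum(v * v for v in x)


def learner_phase_update(n, population, fitness, rng):
    """Propose a new position for learner n using a random peer p != n."""
    p = rng.choice([i for i in range(len(population)) if i != n])
    x_n, x_p = population[n], population[p]
    if fitness(x_n) < fitness(x_p):
        # Learner n is better than its peer: step away from the peer.
        return [a + rng.random() * (a - b) for a, b in zip(x_n, x_p)]
    # The peer is at least as good: step towards the peer.
    return [a + rng.random() * (b - a) for a, b in zip(x_n, x_p)]


def greedy_select(current, candidate, fitness):
    """Keep the candidate only if it strictly improves the fitness."""
    return candidate if fitness(candidate) < fitness(current) else current


rng = random.Random(42)
population = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(5)]
before = [sphere(x) for x in population]
for n in range(len(population)):
    candidate = learner_phase_update(n, population, sphere, rng)
    population[n] = greedy_select(population[n], candidate, sphere)
after = [sphere(x) for x in population]
print(before)
print(after)
```

Because of the greedy replacement, no learner's fitness can get worse from one iteration to the next.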

Convolutional Neural Networks (CNNs)
CNNs [48] are introduced as the combination of a feature-learning module with a trainable classifier, where the latter consists of at least one fully connected layer. Feature-learning modules aim to replace the manual feature-extracting processes used in conventional machine learning methods to minimize the error due to data pre-processing. Contrary to FNNs, CNNs are applied directly to the raw data. After learning meaningful features from raw data, these features are then fed into the subsequent layer of trainable classifiers. The architecture of a sequential CNN is illustrated in Figure 1, where the feature learning module consists of two convolutional layers and two pooling layers, whereas the trainable classifier has three fully connected layers. For each convolution layer, the filters with predefined filter width and filter height are first initialized. Then, the convolutional process is applied to the input images to generate feature maps. Each filter is first slid from the leftmost to the rightmost side of the input image with a step size defined as the stride width. The filter is then moved downward with a step size known as the stride height and slid through the input image from left to right again. This sliding process is repeated until the filter reaches the bottom-right of the input image and produces a complete feature map, in which each element is obtained as the sum of products of the filter elements and the corresponding elements of the input image that overlap with the filter. The number of filters required to produce a feature map is equal to the number of channels defined in the input images. The connection weights in the filter are the learnable parameters in the convolution layer, whereas the hyperparameters considered during the convolutional process include the width and height of filters, the number of filters, the number of feature maps, the width and height of stride, and the type of convolution.
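The sliding process described above can be sketched for a single input channel and a single filter. This is a minimal illustration assuming no padding ("valid" convolution); the function name `conv2d_single` and the toy image and kernel are hypothetical.

```python
def conv2d_single(image, kernel, stride_h=1, stride_w=1):
    """Slide one filter over one channel; each output is a sum of products."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride_h + 1  # number of vertical positions
    out_w = (img_w - k_w) // stride_w + 1  # number of horizontal positions
    feature_map = []
    for i in range(out_h):            # move downward by the stride height
        row = []
        for j in range(out_w):        # slide left to right by the stride width
            top, left = i * stride_h, j * stride_w
            row.append(sum(image[top + a][left + b] * kernel[a][b]
                           for a in range(k_h) for b in range(k_w)))
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_single(image, kernel))        # [[6, 8, 4], [12, 14, 8], [8, 10, 12]]
print(conv2d_single(image, kernel, 2, 2))  # [[6, 4], [8, 12]]
```

With a 4 × 4 input and a 2 × 2 filter, a stride of 1 yields a 3 × 3 feature map, while a stride of 2 yields a 2 × 2 feature map.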
The pooling layer is used for downsampling the feature maps produced by the convolution process to achieve local translation invariance. During the pooling process, a kernel is first initialized with the predefined kernel width, kernel height, and type of pooling. Two popular operations used in the pooling layers are maximum pooling and average pooling. For maximum pooling, the maximum values of elements in the patches of feature maps that overlap with the kernel are identified. Meanwhile, average pooling is used to calculate the average values of elements observed in the patches of feature maps that overlap with the kernel. The sliding process of the kernel is performed from the top-left to the bottom-right of the input feature maps based on the predefined stride width and stride height to obtain the downsampled feature maps. Contrary to the convolution layer, the pooling layer does not contain any learnable parameters. The hyperparameters involved in the pooling process include the kernel width and kernel height, the stride width and stride height, and the pooling type (i.e., maximum or average pooling).
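Maximum and average pooling can be sketched in the same style. The example assumes no padding; the function name `pool2d` and the toy feature map are hypothetical.

```python
def pool2d(feature_map, pool_h, pool_w, stride_h, stride_w, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    out_h = (in_h - pool_h) // stride_h + 1
    out_w = (in_w - pool_w) // stride_w + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride_h, j * stride_w
            # Collect the patch of the feature map covered by the kernel.
            patch = [feature_map[top + a][left + b]
                     for a in range(pool_h) for b in range(pool_w)]
            if mode == "max":
                row.append(max(patch))
            elif mode == "avg":
                row.append(sum(patch) / len(patch))
            else:
                raise ValueError("mode must be 'max' or 'avg'")
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 0],
               [4, 5, 6, 1],
               [7, 8, 9, 2],
               [0, 1, 2, 3]]
print(pool2d(feature_map, 2, 2, 2, 2, "max"))  # [[5, 6], [8, 9]]
print(pool2d(feature_map, 2, 2, 2, 2, "avg"))  # [[3.0, 2.5], [4.0, 4.0]]
```

The function holds no weights, which reflects the fact that pooling layers contribute no learnable parameters.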
The training of the CNN aims to minimize the errors between the predicted outputs of the network and the actual outputs stored in datasets. The trainable parameters of the CNN are optimized with gradient descent and backpropagation by minimizing the cross-entropy loss. A simple CNN, such as that illustrated in Figure 1, may consist of a few hundred thousand to millions of trainable parameters. Depending on the network architecture of the CNN, the training process could take up to several weeks even with the use of high-performance graphical processing units (GPUs). Given the time-consuming process of evaluating multiple CNNs, it is not feasible to design the network architecture of a CNN using a trial-and-error approach. It is crucial to develop efficient network architecture design methods that can automatically determine the optimal CNN network architectures based on given datasets without requiring rich expert domain knowledge.

Existing Metaheuristic-Search-Based Methods in Optimizing Neural Networks
Given their gradient-free characteristics, MSAs are envisioned as competitive solutions for solving challenging black-box optimization problems, such as the optimal design of neural network architectures. The idea of neuroevolution was conceived two decades ago, when MSAs were used to evolve the weights or architectures of small- or medium-sized artificial neural networks (ANNs). Due to the drawbacks of the backpropagation method, such as its high tendency of becoming trapped in local optima, MSAs were initially used to update the weights of ANNs with fixed network architectures [49]. Despite having greater exploration strength to address local optima issues, MSAs require longer durations to train the weights of ANNs than backpropagation does [50]. Sophisticated neuroevolution algorithms known as topology- and weight-evolving artificial neural networks (TWEANNs) were introduced to optimize the weights and architectures of ANNs simultaneously. Some popular TWEANNs include the Neuroevolution of Augmenting Topologies (NEAT) [51], Evolutionary Acquisition of Neural Topologies (EANT) [52], and Hypercube-Based Neuroevolution of Augmenting Topologies (HyperNEAT) [53]. A natural evolution concept of GA was adopted by NEAT [51] to evolve ANNs from simpler to more complicated architectures by increasing connection weights. Speciation was also incorporated to preserve the population diversity of NEAT while searching for more complex networks. It is infeasible to represent complex network architectures with high dimensional sizes using the direct encoding scheme of NEAT due to the excessive computational efforts required. EANT [52], with inner and outer layers, was also designed to evolve ANNs from simpler to more complex structures. The inner layer was used to govern the exploitation behavior of EANT by using an evolution strategy to determine the optimal weight parameters of the network. The outer layer of EANT was more explorative because a mutation strategy was incorporated to evolve the network architectures. HyperNEAT [53] used an indirect encoding strategy known as a connective compositional pattern producing network (CPPN) to represent complex network architectures more efficiently by considering the given problem geometry. HyperNEAT performs worse than humans in solving classification problems, and it was instead recommended as a feature extractor for other machine learning algorithms [54].
There are growing trends of applying MSAs to optimize the network architectures of complex deep neural networks (e.g., CNNs). PSO [33], inspired by the swarm behavior of animals in food searching, is a popular MSA used to find the optimal network architectures of CNNs. A PSO-based CNN (CNNPSO) was proposed in [3] to optimize the weights of a CNN for solving handwriting recognition problems with better accuracy. Although CNNPSO was able to solve the MNIST dataset [55] with a classification accuracy of 95% within four epochs, the processing time incurred was longer than that of a conventional CNN. A hybrid of PSO and stochastic gradient descent (PSO-SGD) was proposed in [56] to optimize the weights of a CNN by leveraging the explorative and exploitative behaviors of PSO and SGD, respectively. The connectivity weights of the CNN were initialized using PSO, and then a weight-training process was performed by SGD for a small number of iterations. The performance of PSO-SGD was evaluated using the image datasets MNIST [55], CIFAR-10 [57], and SVHN [58]. Despite outperforming most contemporary approaches in terms of classification accuracy, the computational efforts required by PSO-SGD to solve these datasets remain unknown, and the network architecture of PSO-SGD was not optimized. In [59], an indirect encoding strategy inspired by internet protocol (IP) was used by IPPSO to represent each particle as a potential network architecture. Each IPPSO particle can search for the optimal CNN network architecture within the predefined boundary limits while preserving its population diversity. The CNN architectures obtained by IPPSO have limited preset maximum lengths and are only used to solve three image datasets. The IP-based encoding strategy of IPPSO also required frequent conversion of parameters between binary and decimal values. Thus, the particles encoded as CNNs with deeper architectures tend to require longer computational time. In [60], psoCNN was designed to automatically search for deep neural network architectures to solve given classification problems. A direct encoding strategy and a novel velocity update operator were designed for psoCNN to search for the optimal CNN architectures with rapid convergence speed. Similarly, a PSO-based architecture optimization (PSOAO) algorithm was proposed in [61] to evolve the flexible convolutional auto-encoder (FCAE). An x-reference method was used by PSOAO to determine the differences between particles with variable lengths before updating the velocity and position of particles.
Genetic algorithm (GA) [32] and genetic programming (GP) [62] are two popular MSAs inspired by Darwin's theory of evolution and are widely used for CNN optimization. A CNN was proposed in [63] to solve detection problems. To address the premature convergence issues of the backpropagation method, the potential weights of CNNs were encoded into the chromosomes, and a standard GA was then used to train these connection weights. Despite producing a 92% success rate, the results obtained by this approach did not significantly outperform the backpropagation method. In [64], a human action recognition technique using GA and CNN was proposed. The initial weights of the CNN classifier were optimized with GA by minimizing the classification error. A gradient descent algorithm was applied to further train the CNN classifier during the fitness evaluation of GA. Despite having a good accuracy of 99.98% when classifying the UCF50 dataset [65], this approach only focused on the weight-updating process without optimizing the network architecture of the CNN. A multi-node evolutionary neural network for deep learning (MENNDL) was proposed in [66] by using GA to optimize the hyperparameters of a CNN. MENNDL can identify the visited regions of hyperparameter space based on the results obtained from the previous generation when solving the CIFAR-10 dataset [57]. A neural architecture design method known as evolving deep convolutional neural network (EvoCNN) was proposed in [67] to solve image classification problems. A variable-length gene encoding strategy was adopted by EvoCNN to represent CNNs with different depths. The connection weights of EvoCNN were also encoded with a novel representation scheme to prevent premature convergence. A self-adaptive mutation neural architecture search algorithm (SaMuNet) was designed in [68] to automatically design the optimal CNN architecture for a given problem without requiring expert knowledge. Three types of mutation strategies (i.e., adding, removing, and replacing) were introduced and selected adaptively by SaMuNet to evolve CNN architectures with better exploration strengths. A selection scheme based on semicomplete binary competition was introduced for SaMuNet to preserve the elite solutions during optimization. A GP approach was proposed in [69] to automatically construct CNN architectures and solve image classification problems with better accuracy. A direct encoding scheme inspired by Cartesian genetic programming (CGP) [70][71][72] was employed to represent the network structure and connectivity weights of the CNN with better flexibility. Although the GP approach performs better than its compared methods in solving the CIFAR-10 dataset [57], excessive computational efforts were required.
Differential evolution (DE) is another popular MSA used to optimize the weights or architectures of deep neural networks. A DE-based CNN (DECNN) that searches for optimal CNN architectures with a refined version of the IP encoding strategy was proposed in [73]. The refined IP encoding strategy of DECNN eliminated the constraint of maximum network depth by representing a layer and its corresponding parameters with a single two-byte IP address. Meanwhile, the overall information of the CNN model was stored in the position vector of DE, constructed from multiple two-byte IP addresses. Although DECNN can produce significantly better classification accuracy than IPPSO when solving six MNIST-variant datasets, both methods incurred high computational times due to the indirect encoding strategy used for network architecture representation. Another DE-based CNN (DE-CNN) was proposed in [74] for performing sentiment analysis in Arabic. Each parameter to be optimized (i.e., filter sizes, number of neurons, number of filters per convolutional filter size, initialization mode, and dropout rate) was stored in a dimensional component of the individual DE-CNN solution. Although DE-CNN performs better than its peers in terms of classification accuracy and computational time, it has limited flexibility in terms of the construction of network architectures. An improved DE-based CNN (IDECNN) was proposed in [75] by incorporating a variable-length direct encoding strategy to represent network information with better flexibility. Each solution of IDECNN was stored with its unique parameters that represented the length of the network, the parameters of each layer, and the sequence of layers. A new refined strategy was also used by IDECNN to effectively compute the differences between two encoded network structures during optimization. IDECNN exhibited better performance than 20 peer algorithms in terms of classification accuracy when solving eight image datasets.

Technical Contributions of Current Works
The research contributions of this paper can be summarized as follows:

• A new network architecture design method known as TLBOCNN is proposed to automatically discover the optimal network architecture of CNNs (i.e., number of layers, type of layers, kernel sizes, number of filters, and number of neurons) for image classification without requiring rich expert domain knowledge. To the best of the authors' knowledge, no existing studies or only limited works have employed TLBO for the optimization of CNN network architectures.
• TLBOCNN can accommodate the searching of CNN network architectures with flexible sizes by incorporating an appropriate solution-encoding strategy and design constraints for TLBO learners with variable lengths. These modifications not only prevent the construction of CNNs with invalid network architectures but also preserve the ability of TLBOCNN to discover novel network architectures. A computationally efficient fitness evaluation process is also incorporated into TLBOCNN to ensure the practicability of the proposed network architecture design method.
• A new mainstream architecture computation scheme is introduced in the teacher phase of TLBOCNN to determine the population mean by referring to all TLBO learners encoded as CNNs with different network architectures. In order to maintain the simplicity of TLBO, a new difference operator is first introduced in both the teacher phase and the learner phase to compare the differences between existing learners with unique network architectures, followed by the design of a new position update operator used to search for the new TLBO learners.
• Extensive simulation studies are conducted to evaluate the feasibility of the proposed TLBOCNN in automatically discovering the optimal network architectures of CNNs for nine popular datasets. The optimal CNN network architectures constructed by TLBOCNN are shown to have better classification performances than state-of-the-art works when solving the majority of datasets.

Functional Blocks Encoding Scheme
Generally, the optimal network architecture of a CNN (in terms of network depth, layer types, kernel size, number of filters, number of neurons, etc.) required to solve a particular dataset is unknown beforehand because the search for the best CNN network architecture should be guided by the problem characteristics. The incorporation of an effective and efficient solution encoding scheme into TLBOCNN is therefore crucial to give each learner better flexibility in searching for novel CNN network architectures that can solve different types of problems with the desired performance.
A variable-length direct-solution encoding scheme known as the functional blocks encoding scheme is incorporated into the proposed TLBOCNN. The position vector of a TLBOCNN learner is defined as a variable-length array that represents a potential CNN with a unique network architecture, where each of its dimensional components is encoded as a CNN functional block along with its hyperparameters. Referring to Figure 1, three typical CNN functional blocks known as the convolutional layer, pooling layer, and fully connected layer are considered when TLBOCNN is employed to automatically search for the optimal network architecture of the CNN. A functional block with the layer type of convolutional layer consists of hyperparameters such as the number of output filters and the kernel size. For both average and maximum pooling layers, the hyperparameters are the strides and the pooling size. The number of neurons is included in the functional block assigned as a fully connected layer. Figure 2 shows a CNN network architecture represented by a TLBOCNN learner encoded with a list of functional blocks that consists of three convolutional layers, two pooling layers, and two fully connected layers. During the fitness evaluation process, the details of the functional block contained in each dimension of the TLBOCNN learner are decoded and compiled into the corresponding CNN network architecture for training and testing.
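One possible way to represent such a variable-length learner is sketched below, using one small record per functional block. The class names, field names, the square-kernel simplification, and the particular layer order and hyperparameter values are all hypothetical. Only the block types and their listed hyperparameters, together with the block counts of the example (three convolutional, two pooling, two fully connected), follow the description above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ConvBlock:
    num_filters: int   # number of output filters
    kernel_size: int   # assumption: square kernel, kernel_size x kernel_size


@dataclass
class PoolBlock:
    pool_type: str     # "max" or "avg"
    pool_size: int
    stride: int


@dataclass
class FCBlock:
    num_neurons: int


Block = Union[ConvBlock, PoolBlock, FCBlock]


def decode(learner: List[Block]) -> List[str]:
    """Turn each functional block into a readable layer description."""
    layers = []
    for block in learner:
        if isinstance(block, ConvBlock):
            k = block.kernel_size
            layers.append(f"conv(filters={block.num_filters}, kernel={k}x{k})")
        elif isinstance(block, PoolBlock):
            layers.append(f"{block.pool_type}pool(size={block.pool_size}, "
                          f"stride={block.stride})")
        else:
            layers.append(f"fc(neurons={block.num_neurons})")
    return layers


# A learner is simply a list of blocks, so its length can vary freely.
learner = [
    ConvBlock(32, 3), PoolBlock("max", 3, 2),
    ConvBlock(64, 5), ConvBlock(64, 3), PoolBlock("avg", 3, 2),
    FCBlock(128), FCBlock(10),
]
for line in decode(learner):
    print(line)
```

In a real pipeline, `decode` would build layers of a deep learning framework instead of strings; strings keep the sketch dependency-free.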
Depending on the type of functional block stored in each dimensional component of a TLBOCNN learner, the feasible search ranges of its hyperparameters are defined to facilitate the search process of TLBOCNN within the solution space. For the convolutional (CV) layer, the number of output filters ($numF$) and the kernel size ($KS$) are bounded in the search ranges of $[numF^{min}, numF^{max}]$ and $[KS^{min}, KS^{max}]$, respectively. For both maximum pooling (MP) and average pooling (AP) layers, the pooling size (i.e., kernel width × kernel height) and stride size (i.e., stride width × stride height) are fixed at 3 × 3 and 2 × 2, respectively. For the fully connected (FC) layer, the number of hidden neurons ($numNeu$) is defined in the search range of $[numNeu^{min}, numNeu^{max}]$. Furthermore, the number of functional blocks ($numB$) assigned to each TLBOCNN learner can vary within $[numB^{min}, numB^{max}]$ to facilitate the search for CNNs with different network architectures (i.e., in terms of depths, types of layers, etc.) when handling different problems. The feasible search ranges of all parameters and hyperparameters considered by TLBOCNN when searching for the optimal CNN architecture are summarized in Table 1.
Apart from the boundary constraints in Table 1, appropriate design constraints are also introduced for TLBOCNN to prevent the construction of invalid CNN architectures without compromising its ability to discover novel architectures for different datasets. These design constraints are as follows: (a) the first functional block assigned to each learner must be a convolutional layer; (b) the last functional block must be a fully connected layer, and its number of output neurons must equal the number of output classes; (c) the maximum number of pooling layers assigned to each learner is restricted by the size of the input data. For instance, a maximum of three pooling layers is permissible for a CNN architecture that handles inputs with sizes of 28 × 28; (d) a fully connected layer cannot be inserted between the feature extraction modules (i.e., convolutional and pooling layers) because it can cause overfitting due to the tremendous increase in trainable parameters in the CNN architecture. If a fully connected layer is generated between the convolutional or pooling layers, all functional blocks after that fully connected layer are converted into fully connected blocks with different numbers of hidden neurons.
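The encoding and design constraints (a) to (d) can be illustrated with a minimal Python sketch. The `Block` class, its field names, and both helper functions are hypothetical names introduced here for illustration only. The pooling limit is derived under the assumptions that pooling uses no padding and that convolutions preserve the spatial size; the text does not state either assumption, but together they reproduce the limit of three pooling layers quoted for 28 × 28 inputs.

```python
from dataclasses import dataclass
from typing import List, Optional

POOL_SIZE, POOL_STRIDE = 3, 2  # fixed pooling settings stated in the text


@dataclass
class Block:
    """One functional block of a learner (hypothetical representation)."""
    kind: str                          # "CV", "MP", "AP", or "FC"
    num_filters: Optional[int] = None  # CV only
    kernel_size: Optional[int] = None  # CV only (k means k x k)
    num_neurons: Optional[int] = None  # FC only


def max_pooling_layers(input_size: int) -> int:
    """Count how many 3x3 / stride-2 poolings fit on a square input.

    Assumes unpadded pooling and size-preserving convolutions.
    """
    count, size = 0, input_size
    while size >= POOL_SIZE:
        size = (size - POOL_SIZE) // POOL_STRIDE + 1
        count += 1
    return count


def is_valid(blocks: List[Block], num_out: int, input_size: int) -> bool:
    """Check design constraints (a) to (d) for one encoded architecture."""
    if not blocks or blocks[0].kind != "CV":                        # (a)
        return False
    if blocks[-1].kind != "FC" or blocks[-1].num_neurons != num_out:  # (b)
        return False
    pools = sum(b.kind in ("MP", "AP") for b in blocks)
    if pools > max_pooling_layers(input_size):                      # (c)
        return False
    seen_fc = False
    for b in blocks:                                                # (d)
        if b.kind == "FC":
            seen_fc = True
        elif seen_fc:
            return False  # CV or pooling block placed after an FC block
    return True
```

Under these assumptions, a 28 × 28 input shrinks to 13, 6, and then 2 pixels per side after successive poolings, so a fourth pooling layer no longer fits.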

Population Initialization of TLBOCNN
An initial population of $N$ learners that represent different CNN architectures (in terms of network depth, layer types, kernel size, number of filters, number of neurons, etc.) is randomly generated at the beginning stage of TLBOCNN based on the functional blocks encoding scheme, boundary constraints, and design constraints. Define $numB_n$ as the network depth of the CNN corresponding to the nth learner, which is randomly generated within $[numB^{min}, numB^{max}]$ during the initialization process. A list variable denoted as blocks_list with the size of $numB_n$ is also initialized as an empty list to record the type of functional block assigned to each jth dimension of the TLBOCNN learner, where $j \in [1, numB_n]$. The first dimension ($j = 1$) and last dimension ($j = numB_n$) of blocks_list are assigned the convolutional and fully connected blocks, respectively, to ensure that a valid CNN architecture is generated. Assume that $m_n$ is the dimension index of blocks_list at which a fully connected layer is first assigned, where $m_n \in [2, numB^{max}]$. Once a fully connected block is recorded in the jth dimension of blocks_list where $j = m_n$, the subsequent dimensions of blocks_list with the indices of $j \in [m_n + 1, numB_n - 1]$ are also assigned as fully connected blocks to satisfy the design constraints.
The remaining dimensions of blocks_list with the indices of $j \in [2, m_n - 1]$ are randomly assigned either a convolutional block or a pooling block. Let $block\_type \in [0, 1]$ be a value randomly generated with uniform distribution. If $block\_type \le 0.5$, a convolutional block is assigned to the jth dimension of blocks_list. Otherwise, a pooling block is assigned. Notably, a rectified linear unit (ReLU) is used as the activation function for all selected layers. Depending on the type of functional block selected for each jth dimension of blocks_list, the corresponding hyperparameters are also randomly generated during the initialization process. For a convolutional block, the number of filters ($numF$) is randomly generated between $numF^{min}$ and $numF^{max}$, whereas its kernel size ($KS$) is randomly selected in the range of 3 × 3 to 7 × 7. For any jth dimension of blocks_list assigned as a pooling block, a parameter $pooling\_type \in [0, 1]$ is randomly generated with uniform distribution to select the pooling type. An average pooling block is chosen if $pooling\_type \le 0.5$, whereas a maximum pooling block is chosen when $pooling\_type > 0.5$. For fully connected blocks assigned in blocks_list with indices of $j \in [m_n, numB_n - 1]$, the numbers of hidden neurons are randomly generated as $numNeu \in [numNeu^{min}, numNeu^{max}]$. Finally, the number of output neurons for the fully connected block assigned in the last dimension of blocks_list is set equal to the number of output classes of the dataset, denoted as $numOut$.
Algorithm 1 presents the population initialization process of TLBOCNN. For each nth TLBOCNN learner, the functional block information of blocks_list is stored in the position vector $X_n.Blocks$ after the initialization process is completed. The fitness of each nth TLBOCNN learner (i.e., $X_n.Blocks$) is measured as its classification accuracy, denoted as $X_n.Acc$, based on the fitness evaluation process that is explained in the next subsection. The teacher solution of TLBOCNN, i.e., $X^{teacher}$, is obtained by identifying the initial population member with the best fitness value (i.e., the highest classification accuracy).
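The initialization of a single learner described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Algorithm 1: the dictionary layout and function names are hypothetical, the index $m_n$ of the first fully connected block is assumed to be drawn uniformly from $[2, numB_n]$ because the text does not specify how it is sampled, and kernel sizes are assumed to be integers from 3 to 7. The limit on the number of pooling layers is not enforced in this sketch.

```python
import random


def new_conv(rng, numF_min, numF_max):
    """Random convolutional block; kernel size k means k x k (3 to 7 assumed)."""
    return {"kind": "CV",
            "filters": rng.randint(numF_min, numF_max),
            "kernel": rng.randint(3, 7)}


def init_learner(rng, numB_min, numB_max, numF_min, numF_max,
                 numNeu_min, numNeu_max, num_out):
    """Return one learner as a list of functional blocks (requires numB_min >= 2)."""
    numB = rng.randint(numB_min, numB_max)   # network depth of this learner
    m = rng.randint(2, numB)                 # assumed sampling of first FC index
    blocks = [new_conv(rng, numF_min, numF_max)]   # j = 1 is always CV
    for j in range(2, numB + 1):             # 1-based indices, as in the text
        if j == numB:                        # last block: output layer
            blocks.append({"kind": "FC", "neurons": num_out})
        elif j >= m:                         # hidden fully connected blocks
            blocks.append({"kind": "FC",
                           "neurons": rng.randint(numNeu_min, numNeu_max)})
        elif rng.random() <= 0.5:            # block_type <= 0.5: convolution
            blocks.append(new_conv(rng, numF_min, numF_max))
        else:                                # otherwise a pooling block
            kind = "AP" if rng.random() <= 0.5 else "MP"
            blocks.append({"kind": kind, "pool": 3, "stride": 2})
    return blocks
```

A population is then obtained by calling `init_learner` $N$ times with the ranges of Table 1; the numeric ranges used in any trial call are placeholders, not the paper's settings.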

Fitness Evaluation of TLBOCNN
Fitness evaluation is essential for MSAs to measure the quality of a candidate solution when solving an optimization problem. For the proposed TLBOCNN, which aims to search for an optimal CNN architecture design, the fitness value of each learner is measured as the accuracy of its corresponding CNN architecture when classifying the given image dataset. Learners that produce CNN architectures with higher classification accuracies are superior and replace those with lower classification accuracies. Algorithm 2 presents the pseudocode used to perform fitness evaluation on the new learners generated via the initialization, teacher phase, and learner phase of TLBOCNN. The fitness evaluation of each nth TLBOCNN learner is divided into two major steps: training the decoded CNN model on the training dataset and evaluating the trained model on the validation dataset.

To measure the fitness of each nth TLBOCNN learner, the functional block information stored in the position vector $X_n.Blocks$ is decoded and compiled into a full-fledged CNN model. The trainable weights contained in all convolutional layers and fully connected layers of this CNN model are initialized using the He Normal weight initializer [20] and stored in a set variable $\Theta = \{\theta_1, \theta_2, \ldots\}$. The CNN model represented by each nth learner is trained using the Adam optimizer [76] for a predefined number of epochs, $e^{train}$, based on $Step^{train}$ batches of data from $Data^{train}$. In each kth training step of the CNN model, where $k = 1, \ldots, Step^{train}$, the corresponding cross-entropy loss is calculated as $f(\Theta, Data^{train}_{k})$ based on the current weights $\Theta$ and the kth batch of data $Data^{train}_{k}$. The new weights $\Theta^{new}$ are obtained by subtracting the product of the learning rate and the gradient of the cross-entropy loss, $\nabla_{\Theta} f(\Theta, Data^{train}_{k})$, from the current weights $\Theta$.

After the training process is completed, the fitness value of the trained CNN model represented by the nth learner is measured by computing its classification accuracy (i.e., $X_n.Acc$) on the validation dataset, denoted as $Data^{valid}$ with a size of $|Data^{valid}|$. Similarly, the evaluation of the trained CNN model is performed in multiple steps, $Step^{valid}$, obtained by dividing $|Data^{valid}|$ by batch_size. In each kth evaluation step of the trained CNN model, its classification accuracy is calculated based on the trained weights and the kth batch of $Data^{valid}$ before this value is stored in the list variable denoted as acc_list. Notably, the classification accuracy obtained by the trained CNN model differs in every kth evaluation step because a different batch of data is used. After all $Step^{valid}$ batches in $Data^{valid}$ are evaluated, the mean classification accuracy of the trained CNN model is computed from acc_list as $X_n.Acc$ to indicate the fitness value of the nth learner, i.e.,

$$X_n.Acc = \frac{\sum_{k=1}^{Step^{valid}} acc\_list[k]}{Step^{valid}} \qquad (7)$$

Evidently, the fitness evaluation process is the main bottleneck of the proposed TLBOCNN because each learner represents a potential CNN model with a unique architecture that must be trained with $Data^{train}$ before its classification accuracy can be obtained from $Data^{valid}$. It is not computationally feasible to perform full training on every potential CNN model with a large number of training epochs, $e^{train}$, especially for population-based MSAs that search for the optimal CNN architecture design over multiple iterations. This drawback can be addressed by training each potential CNN model with a smaller $e^{train}$ during fitness evaluation. Although the final classification accuracy of a potential CNN model cannot be measured accurately with a smaller $e^{train}$, the performance trends of all TLBOCNN learners during fitness evaluation can be observed, and these trends become the main consideration in assessing the quality of each learner. The potential CNN model represented by a TLBOCNN learner is more likely to have good final classification accuracy if it already performs better within the first several training epochs. Full training with a higher $e^{train}$ is only performed to measure the final classification accuracy once the optimal CNN architecture design is obtained after the termination of TLBOCNN. Notably, dropout and batch normalization can be added between the layers to prevent overfitting [67].
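The batch-wise fitness computation of Equation (7) can be illustrated with a small framework-free Python sketch. The function name and the plain-list inputs are hypothetical, and the predictions are taken as given rather than produced by a trained CNN. The number of steps is rounded up so that a final partial batch is still evaluated, which is an assumption because the text only says that $|Data^{valid}|$ is divided by batch_size.

```python
import math


def mean_batch_accuracy(y_true, y_pred, batch_size):
    """Average of per-batch accuracies over the validation set, as in Equation (7)."""
    n = len(y_true)
    steps = math.ceil(n / batch_size)   # assumed handling of a partial last batch
    acc_list = []
    for k in range(steps):
        lo, hi = k * batch_size, min((k + 1) * batch_size, n)
        correct = sum(t == p for t, p in zip(y_true[lo:hi], y_pred[lo:hi]))
        acc_list.append(correct / (hi - lo))   # accuracy of the kth batch
    return sum(acc_list) / steps
```

For example, with labels `[0, 1, 1, 0]`, predictions `[0, 1, 0, 0]`, and a batch size of 2, the per-batch accuracies are 1.0 and 0.5, so the returned fitness is 0.75. When the last batch is smaller than the others, this average of batch accuracies can differ slightly from the overall sample accuracy.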

Computation of Mainstream CNN Architecture
In the teacher phase of the original TLBO, the population mean (i.e., $X^{mean}$) is calculated using Equation (1) to describe the mainstream knowledge used to guide the population search and to update the solutions of learners. It is non-trivial to calculate the population mean of TLBOCNN because the $X_n.Blocks$ of each nth learner has a different length to represent a potential CNN model with a unique architecture (in terms of network depth, layer types, kernel size, number of filters, number of neurons, etc.). A novel mechanism is introduced in the teacher phase of TLBOCNN to calculate the mainstream CNN architecture based on the functional block information encoded in all learners with variable lengths.
Figure 3 shows the mechanisms used to construct a mainstream CNN architecture from a TLBOCNN population consisting of five learners (i.e., $N = 5$) that represent CNN models with different architectures. Define $X_n.Blocks[j]$ as the functional block information stored in the jth dimension of the nth TLBOCNN learner, where $n = 1, \ldots, N$ and $j = 1, \ldots, numB_n$. Let $X^{frequent}.Blocks[j]$ be a list variable used to record the most frequently occurring functional block in the jth dimension of all learners, where $j = 1, \ldots, numB^{frequent}$ and $numB^{frequent}$ is the largest network depth in the current population, i.e.,

$$numB^{frequent} = \max_{n \in \{1, \ldots, N\}} numB_n \qquad (8)$$

For instance, the second and fourth learners of the TLBOCNN population shown in Figure 3 have the largest network depth. Therefore, the total dimensional size of $X^{frequent}.Blocks$ is set as $numB^{frequent} = 6$. Let $I_{CV,j}$, $I_{Ave,j}$, $I_{Max,j}$, and $I_{FC,j}$ be the frequencies of the convolutional block, average pooling block, maximum pooling block, and fully connected block occurring in the jth dimension, respectively, where $j = 1, \ldots, numB^{frequent}$. For every jth dimension, the functional block with the highest frequency of occurrence is assigned to $X^{frequent}.Blocks[j]$. As shown in Figure 3, the average pooling block has the highest frequency of occurrence of $I_{Ave,j} = 2$ for $j = 4$, whereas the fully connected block appears most frequently at $j = 5$ with $I_{FC,j} = 4$.
Therefore, the average pooling block and fully connected block are assigned to $X^{frequent}.Blocks[4]$ and $X^{frequent}.Blocks[5]$, respectively. For any jth dimension in which more than one functional block appears with the highest frequency, such as $I_{Max,j} = I_{CV,j} = 2$ for $j = 2$ in Figure 3, random selection is performed to choose one of these most frequently appearing functional blocks for the corresponding component of $X^{frequent}.Blocks[j]$. Referring to $X^{frequent}.Blocks$, the mainstream CNN architecture used for guiding the population search of TLBOCNN is then derived as $X^{mean}.Blocks$. Suppose that $numB^{mean}$ is the network depth of the mainstream CNN architecture represented by $X^{mean}.Blocks$ and that it is calculated as:

$$numB^{mean} = round\left(\frac{1}{N}\sum_{n=1}^{N} numB_n\right) \qquad (9)$$

where $numB_n$ is the network depth of the CNN model represented by the nth TLBOCNN learner via $X_n.Blocks$ and $round(\cdot)$ is a rounding operator. According to Equations (8) and (9), $numB^{mean} \le numB^{frequent}$. The mainstream CNN architecture $X^{mean}.Blocks$ is obtained by extracting the first $numB^{mean}$ elements from $X^{frequent}.Blocks$, i.e.,

$$X^{mean}.Blocks = X^{frequent}.Blocks[1 : numB^{mean}] \qquad (10)$$

Apart from the type of functional block to be assigned in every jth dimension of the mainstream CNN architecture, i.e., $X^{mean}.Blocks[j]$, the hyperparameters of the selected functional block can also be calculated with the proposed mechanism. Supposing that a convolutional block is assigned to $X^{mean}.Blocks[j]$, the corresponding number of output filters (i.e., $numF^{mean}_{j}$) and kernel size (i.e., $KS^{mean}_{j}$) are calculated as:

$$numF^{mean}_{j} = round\left(\frac{1}{I_{CV,j}}\sum_{i_{CV,j}=1}^{I_{CV,j}} numF_{i_{CV,j}}\right) \qquad (11)$$

$$KS^{mean}_{j} = round\left(\frac{1}{I_{CV,j}}\sum_{i_{CV,j}=1}^{I_{CV,j}} KS_{i_{CV,j}}\right) \qquad (12)$$

where $i_{CV,j} = 1, \ldots, I_{CV,j}$ refers to the index of a learner that is assigned a convolutional block in the jth dimension, and $I_{CV,j}$ is the frequency of convolutional blocks occurring in the jth dimension. Meanwhile, if a maximum pooling block or average pooling block is assigned to $X^{mean}.Blocks[j]$, its pooling size and stride size are set as 3 × 3 and 2 × 2, respectively, according to Table 1. Finally, if a fully connected block is assigned to $X^{mean}.Blocks[j]$, the corresponding number of hidden neurons $numNeu^{mean}_{j}$ is calculated as:

$$numNeu^{mean}_{j} = round\left(\frac{1}{I_{FC,j}}\sum_{i_{FC,j}=1}^{I_{FC,j}} numNeu_{i_{FC,j}}\right) \qquad (13)$$

where $i_{FC,j} = 1, \ldots, I_{FC,j}$ refers to the index of a learner that is assigned a fully connected block in the jth dimension, and $I_{FC,j}$ is the frequency of the fully connected block occurring in the jth dimension. To satisfy the design constraints mentioned earlier, the first (i.e., $j = 1$) and last (i.e., $j = numB^{mean}$) blocks of the mainstream CNN architecture are assigned the convolutional and fully connected blocks, respectively. Furthermore, the number of output neurons assigned to the last fully connected block of the mainstream CNN architecture in $X^{mean}.Blocks[numB^{mean}]$ is set equal to the number of output classes of the dataset, i.e., $numNeu^{mean}_{numB^{mean}} = numOut$. The procedures used to generate the mainstream CNN architecture are summarized in Algorithm 3.

Algorithm 3: Computation of Mainstream CNN Architecture
Input: $P = \{X_1, \ldots, X_n, \ldots, X_N\}$, $N$, $numOut$
01: Calculate $numB^{frequent}$ using Equation (8) and initialize $X^{frequent}.Blocks \leftarrow \emptyset$;
02: for $j = 1$ to $numB^{frequent}$ do
03: Calculate $I_{CV,j}$, $I_{Ave,j}$, $I_{Max,j}$ and $I_{FC,j}$;
04: if more than one functional block has highest frequency of occurrence do
05: Randomly select one of these functional blocks and assign to $X^{frequent}.Blocks$
end for
Output: $X^{mean}.Blocks$
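The computation of the mainstream CNN architecture can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 3: learners are hypothetical lists of dictionaries, hyperparameter means are rounded to integers, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the rounding operator of the text.

```python
import random
from collections import Counter


def _mean_int(values):
    """Rounded mean; note that Python rounds exact halves to the even integer."""
    return round(sum(values) / len(values))


def mainstream_architecture(population, num_out, rng):
    """Build X_mean.Blocks from a list of learners (each a list of block dicts)."""
    depth_frequent = max(len(x) for x in population)                  # Eq. (8)
    depth_mean = round(sum(len(x) for x in population) / len(population))  # Eq. (9)
    frequent = []
    for j in range(depth_frequent):
        present = [x[j] for x in population if j < len(x)]
        counts = Counter(b["kind"] for b in present)
        top = max(counts.values())
        # Ties between equally frequent block types are broken at random.
        kind = rng.choice(sorted(k for k, c in counts.items() if c == top))
        same = [b for b in present if b["kind"] == kind]
        if kind == "CV":
            block = {"kind": "CV",
                     "filters": _mean_int([b["filters"] for b in same]),
                     "kernel": _mean_int([b["kernel"] for b in same])}
        elif kind == "FC":
            block = {"kind": "FC",
                     "neurons": _mean_int([b["neurons"] for b in same])}
        else:  # pooling blocks use the fixed 3x3 size and 2x2 stride
            block = {"kind": kind, "pool": 3, "stride": 2}
        frequent.append(block)
    mean_blocks = frequent[:depth_mean]                               # Eq. (10)
    # The first block is already CV because every learner starts with one;
    # the last block is forced to be the output layer.
    mean_blocks[-1] = {"kind": "FC", "neurons": num_out}
    return mean_blocks
```

With three learners of depths 4, 5, and 4, for example, the sketch gives $numB^{frequent} = 5$ and $numB^{mean} = 4$, so only the first four positions of the most-frequent list are kept.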

Computation of Differences between Two Learners
For the teacher phase in the original TLBO described in Equation (2), a new solution $X^{new}_{n}$ of each nth learner is updated based on the differences between the teacher solution $X^{teacher}$ and the mainstream knowledge of the population $X^{mean}$. It is not trivial to determine the differences between the CNN models represented by the teacher solution (i.e., $X^{teacher}.Blocks$) and the mainstream CNN architecture (i.e., $X^{mean}.Blocks$) of TLBOCNN because they tend to have different network architectures (in terms of network depth, layer types, kernel size, number of filters, number of neurons, etc.). Furthermore, the information encoded in each dimension of $X^{teacher}.Blocks$ and $X^{mean}.Blocks$ refers to functional block types (i.e., CV, MP, AP, and FC), which cannot be directly subtracted from each other.
Figure 4 illustrates the overall mechanisms used to measure the differences between CNN architectures represented by two TLBOCNN learners with variable lengths. Define L1 and L2 as two temporary list variables used to store the functional block information of $X^{teacher}.Blocks$ and $X^{mean}.Blocks$, respectively. The feature extraction module (i.e., FE) and fully connected layers (i.e., FC) of L1 and L2 are compared separately to ensure that the new CNN architectures obtained comply with the design constraints, i.e., a fully connected block cannot be inserted between the convolutional and pooling blocks within the feature extraction module. Let $L^{Diff}$ be a list variable used to store the differences between L1 and L2 (i.e., L1 − L2); it has the total dimensional size of $numB^{Diff}$, calculated as:

$$numB^{Diff} = \max\left(numB^{FE}_{L1}, numB^{FE}_{L2}\right) + \max\left(numB^{FC}_{L1}, numB^{FC}_{L2}\right) \qquad (14)$$

where $numB^{FE}_{L1}$ and $numB^{FE}_{L2}$ represent the total numbers of convolutional and pooling layers encoded in the feature extractor modules of L1 and L2, respectively, and $numB^{FC}_{L1}$ and $numB^{FC}_{L2}$ represent the total numbers of fully connected layers encoded in L1 and L2, respectively. For instance, Figure 4 shows that learner L1 has five convolutional and pooling layers in its feature extractor module (i.e., $numB^{FE}_{L1} = 5$) and two fully connected layers (i.e., $numB^{FC}_{L1} = 2$). Meanwhile, learner L2 has four convolutional and pooling layers in its feature extractor module (i.e., $numB^{FE}_{L2} = 4$) and three fully connected layers (i.e., $numB^{FC}_{L2} = 3$). Therefore, the list variable $L^{Diff}$ used to store the differences between L1 and L2 has the total dimensional size of $numB^{Diff} = 5 + 3 = 8$ according to Equation (14).
The following guidelines are used to compare the CNN architectures encoded in every jth dimension of learners L1 and L2 and to store the result in $L^{Diff}[j]$, where $j = 1, \ldots, numB^{Diff}$. When L1 and L2 have different functional block types encoded in the jth dimension, the information contained in L1[j] (i.e., the functional block type and its hyperparameters) is extracted and assigned to $L^{Diff}[j]$. For instance, if L1[j] = 'CV' and L2[j] = 'AP', then the jth dimension of $L^{Diff}$ is assigned the convolutional block, i.e., $L^{Diff}[j]$ = 'CV', along with its hyperparameters. $L^{Diff}[j]$ is assigned as '0' when L1 and L2 have the same functional block type encoded in the jth dimension, implying that the functional block in this dimension is unchanged when $L^{Diff}$ is used to calculate the CNN architecture of a new learner. When comparing L1 and L2 with different network depths, it is possible for L1 to have more functional blocks than L2 or vice versa. For any jth dimension, if L1 has more functional blocks than L2 in the feature extractor module or the fully connected layers, $L^{Diff}[j]$ is assigned as '+B' to imply that a new functional block should be added by referring to that of L1[j], where B can refer to a CV, AP, MP, or FC block. On the other hand, $L^{Diff}[j]$ is assigned as '-' to indicate the removal of an existing functional block when L1 has fewer functional blocks than L2 in the jth dimension of the feature extractor module or the fully connected layers. The overall procedures used to compare the differences between L1 and L2 are summarized in Algorithm 4.
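The comparison rules above can be sketched in Python as follows. This is an illustrative sketch and not the authors' Algorithm 4: each entry of the difference list is a hypothetical `(tag, block)` pair, where the tag `"set"` stands for copying the block of L1, `"0"` for no change, `"+"` for the '+B' case, and `"-"` for removal. The feature extractor and fully connected parts are compared separately, as described in the text.

```python
FE_KINDS = {"CV", "MP", "AP"}  # block types of the feature extractor module


def split_parts(blocks):
    """Separate a learner into its feature extractor and fully connected parts."""
    fe = [b for b in blocks if b["kind"] in FE_KINDS]
    fc = [b for b in blocks if b["kind"] == "FC"]
    return fe, fc


def diff_part(p1, p2):
    """Compare one part of L1 and L2 dimension by dimension."""
    out = []
    for j in range(max(len(p1), len(p2))):
        if j >= len(p2):                        # L1 has an extra block: '+B'
            out.append(("+", p1[j]))
        elif j >= len(p1):                      # L1 has fewer blocks: '-'
            out.append(("-", None))
        elif p1[j]["kind"] == p2[j]["kind"]:    # same type: '0'
            out.append(("0", None))
        else:                                   # different type: take L1[j]
            out.append(("set", p1[j]))
    return out


def learner_difference(l1, l2):
    """Return L_Diff = L1 - L2 with max(FE sizes) + max(FC sizes) entries."""
    fe1, fc1 = split_parts(l1)
    fe2, fc2 = split_parts(l2)
    return diff_part(fe1, fe2) + diff_part(fc1, fc2)
```

For two learners with the part sizes of Figure 4 (five and four feature extraction blocks, two and three fully connected blocks), the returned list has eight entries, in agreement with Equation (14).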
Symmetry 2022, 14, x FOR PEER REVIEW 17 of 37
On the other hand, $X^{new}_{n}.Blocks[j]$ only refers to the functional block of $X_n.Blocks[j]$ if $L^{Diff}[j]$ is an empty value. Algorithm 5 presents the pseudocode used to calculate $X^{new}_{n}.Blocks$ based on $L^{Diff}$ and $X_n.Blocks$. Any $X^{new}_{n}.Blocks[j]$ assigned an empty value in the jth dimension implies the absence of functional block information and can be eliminated. The numbers of AP and MP blocks contained in $X^{new}_{n}.Blocks$ must be adjusted based on the input size of the dataset. Excessive pooling layers must be removed from $X^{new}_{n}.Blocks$ one by one, starting from the last layer, if the new solution of the nth learner is found to have more pooling layers than allowed by the input size of the dataset. Therefore, the actual dimensional size of $X^{new}_{n}.Blocks$ (i.e., $numB^{new,actual}_{n}$) can be smaller than $numB^{new,max}_{n}$ obtained from Equation (15) after these post-processing steps.

if $L^{Diff}[j]$ = '0' and $X_n.Blocks[j]$ has a functional block then
06: Assign $X^{new}_{n}.Blocks[j]$ with the functional block information of $X_n.Blocks[j]$;
07: […]

[…] original solution of the nth learner is retained. A similar mechanism is used to determine if the new solution obtained by every nth learner can be used to replace the teacher solution.

Overall Framework of TLBOCNN
The overall framework of the proposed TLBOCNN is presented in Algorithm 8, where $\upsilon$ is a counter variable used to record the current iteration number and $\upsilon^{max}$ is the maximum iteration number used as the termination criterion of TLBOCNN. After loading the training and validation datasets (i.e., $Data^{train}$ and $Data^{valid}$), the initial population of TLBOCNN is generated using Algorithm 1. The new CNN architectures represented by TLBOCNN learners are iteratively generated via the teacher phase (Algorithm 6) and the learner phase (Algorithm 7) until the termination criterion of $\upsilon > \upsilon^{max}$ is satisfied. After completing the search process, $X^{teacher}$ is returned as the best solution found by TLBOCNN. As explained in the earlier subsection, the CNN architectures represented by all TLBOCNN learners are trained with a small epoch number $e^{train}$ during the fitness evaluation process (Algorithm 2) to prevent excessive computational overhead, but CNN models trained this briefly cannot deliver optimal performance in real-world applications. Therefore, a full training process with a larger epoch number $e^{train}_{full}$ is performed on the CNN architecture represented by $X^{teacher}.Blocks$ after TLBOCNN is terminated. The mechanisms of the full training process are the same as those of Algorithm 2 except that a larger $e^{train}_{full}$ is used to ensure the optimal performance of the CNN architecture obtained. Dropout and batch normalization can be added between layers to address overfitting. At the end of the full training process, information related to the fully trained CNN model represented by $X^{teacher}.Blocks$ (i.e., architecture, classification accuracy, and number of network parameters) is returned.
Algorithm 8: TLBOCNN
Input: $Data^{train}$, $Data^{valid}$, batch_size, $e^{train}$, $e^{train}_{full}$, $\upsilon$, $\upsilon^{max}$, $N$, $numB^{min}$, $numB^{max}$, $numF^{min}$, $numF^{max}$, $KS$, $numNeu^{min}$, $numNeu^{max}$, $numOut$
01: Load training dataset $Data^{train}$ and validation dataset $Data^{valid}$ from the directory;
02: Population initialization of $P = \{X_1, \ldots, X_n, \ldots, X_N\}$ with Algorithm 1;
03: for $\upsilon = 1$ to $\upsilon^{max}$ do
04: Perform teacher phase to update $P$ and $X^{teacher}$ with Algorithm 6;
05: Perform learner phase to update $P$ and $X^{teacher}$ with Algorithm 7;
06: end for
07: Perform full training on the best CNN architecture of $X^{teacher}.Blocks$ with $e^{train}_{full}$ using Algorithm 2;
08: Calculate the number of network parameters of the fully trained best CNN architecture represented by $X^{teacher}.Blocks$ and store it as $X^{teacher}.Params$;
Output: $X^{teacher}$
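The control flow of Algorithm 8 can be summarized with a short Python skeleton. It is only a sketch of the outer loop: the four callables are hypothetical stand-ins for Algorithms 1, 6, 7, and the full-training variant of Algorithm 2, and their signatures are assumptions made for illustration.

```python
def tlbocnn(init_population, teacher_phase, learner_phase, full_train, max_iter):
    """Outer search loop followed by full training of the best architecture.

    init_population() -> (population, teacher)
    teacher_phase(population, teacher) -> (population, teacher)
    learner_phase(population, teacher) -> (population, teacher)
    full_train(teacher) -> fully trained best solution
    """
    population, teacher = init_population()            # step 02
    for _ in range(max_iter):                          # steps 03 to 06
        population, teacher = teacher_phase(population, teacher)
        population, teacher = learner_phase(population, teacher)
    # Steps 07 and 08: only the best architecture is trained with the
    # larger epoch number; all earlier evaluations use the small e_train.
    return full_train(teacher)
```

Passing the phases as arguments keeps the expensive fitness evaluation inside the phase functions, so the skeleton itself stays independent of any deep learning framework.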

Experimental Design and Results Analysis

Image Datasets
In this section, the classification performance of the proposed TLBOCNN algorithm is evaluated using nine image datasets with different characteristics and compared to state-of-the-art peer classifiers. The selected image datasets are Modified National Institute of Standards and Technology (MNIST), MNIST with Rotated Digits (MNIST-RD), MNIST with Random Background (MNIST-RB), MNIST with Background Images (MNIST-BI), MNIST with Rotated Digits and Background Images (MNIST-RD+BI), Rectangles, Rectangles with Images (Rectangles-I), Convex, and Fashion. These image datasets are publicly available at http://www.iro.umontreal.ca/~lisa/icml2007data/ (accessed on 3 June 2022), while sample images of each selected dataset with a size of 28 × 28 are illustrated in Figure 6. Meanwhile, Table 2 summarizes the characteristics of the nine selected image classification benchmark datasets in terms of their input size, number of output classes, number of training samples, and number of testing samples.
MNIST [55] is a popular image dataset used to test the classification performance of machine learning and deep learning algorithms. Different mechanisms, such as rotation, random background noise, background images, and combinations of rotation and background images, are introduced into the original MNIST to produce MNIST-RD, MNIST-RB, MNIST-BI, and MNIST-RD+BI, respectively [77]. These MNIST variants contain irrelevant information and are useful for evaluating the generalization capability of classifiers. The Rectangles dataset is a collection of grayscale images of rectangle outlines with different sizes, and it aims to train machine learning or deep learning models to recognize the larger dimension of each rectangle (i.e., height or width). The Rectangles with Images (Rectangles-I) dataset is more challenging because it requires the trained models to identify whether the image patches are located inside the rectangle or in the background of the images. The Convex dataset is a collection of grayscale images of geometrical shapes, and the models are trained to recognize whether each shape is convex or non-convex. The Fashion dataset [78] contains a grayscale image collection of fashion products that can be categorized into 10 output classes: trousers, dresses, coats, tops, bags, sneakers, sandals, ankle boots, pullovers, and shirts. It is considered to be a more challenging dataset and is used to benchmark the classification performance of machine learning and deep learning algorithms.

Selection of Peer Algorithms and Simulation Settings
The well-established machine learning and deep learning models that have solved the nine selected datasets with promising classification accuracy are selected as the peer algorithms of the proposed TLBOCNN. In particular, 13 peer algorithms (RandNet-2 [79], LDANet-2 [79], CAE-1 [80], CAE-2 [80], ScatNet-2 [81], SVM+RBF [77], SVM+Poly [77], PCANet-2 [79], NNet [77], SAA-3 [77], DBN-3 [77], EvoCNN [67], and psoCNN [60]) are chosen for performance-comparative studies with TLBOCNN in solving eight datasets: MNIST, MNIST-RD, MNIST-RB, MNIST-BI, MNIST-RD+BI, Rectangles, Rectangles-I, and Convex. Another 15 peer algorithms are compared to TLBOCNN for the Fashion dataset: AlexNet [22], VGG [23], ResNet [25], SqueezeNet-200 [82], EvoCNN [67], psoCNN [60], 2C1P2F+Dropout, 2C1P2F, 3C2F, 3C1P2F+Dropout, GRU+SVM, GRU+SVM+Dropout, HOG+SVM, MLP 256-128-64, and MLP 256-128-100. The results of all peer algorithms are either extracted from the literature (e.g., https://github.com/zalandoresearch/fashion-mnist (accessed on 3 June 2022)) or simulated based on the original source codes provided by their authors. Notably, EvoCNN [67] and psoCNN [60] are the only metaheuristic-search-based algorithms that share similar concepts with the proposed TLBOCNN, i.e., the best network architectures for the given datasets are searched iteratively until the termination criteria are satisfied. All parameter settings of TLBOCNN are presented in Table 3, and their values are chosen based on the conventions of the deep learning and MSA communities. Simulations of TLBOCNN are conducted for 30 independent runs on a personal computer installed with Python 3.8.5 and an Nvidia GeForce RTX 3090 to obtain statistically meaningful results.

Table 4 presents the best classification accuracies produced by TLBOCNN and the other peer algorithms in solving the eight image datasets: MNIST, MNIST-RD, MNIST-RB, MNIST-BI, MNIST-RD+BI, Rectangles, Rectangles-I, and Convex. Only the test sets of these eight image datasets are used to obtain the best classification accuracies of all algorithms for comparing their generalization capabilities. The compared algorithms that solve each image dataset with the best and second-best results are indicated with boldfaced and underlined fonts, respectively, as shown in Table 4. The signs "(+)", "(−)", and "(=)" are also defined in Table 4 to indicate whether the best classification accuracy produced by the proposed TLBOCNN on the test set of a given image dataset is better than, worse than, or equal to that of its peer algorithm, respectively. Given that the results of most compared algorithms are extracted from the literature, "NA" in Table 4 implies that the result of a compared algorithm on a particular image dataset is unavailable. The performances of all compared algorithms in solving the eight selected image datasets are summarized as w/t/l and #BCA in Table 4. In particular, w/t/l implies that the optimal network architectures found by TLBOCNN outperform the peer algorithm in w datasets, tie with it in t datasets, and underperform it in l datasets. Meanwhile, #BCA represents the number of best classification accuracy values produced by each compared algorithm in solving the eight selected image datasets.
Table 4 reports that the proposed TLBOCNN produces the best classification accuracies of 99.55%, 96.44%, 98.06%, 97.13%, 83.64%, 99.99%, 97.25%, and 97.84% on the MNIST, MNIST-RD, MNIST-RB, MNIST-BI, MNIST-RD+BI, Rectangles, Rectangles-I, and Convex datasets, respectively. The mean classification accuracies produced by TLBOCNN for these eight datasets are 99.52%, 95.73%, 97.72%, 96.96%, 81.14%, 99.94%, 95.72%, and 97.53%, respectively. From Table 4, the proposed TLBOCNN is the best among all compared algorithms, solving the eight selected image datasets with the best classification accuracy. TLBOCNN is also observed to completely dominate RandNet-2, LDANet-2, CAE-1, CAE-2, SVM+RBF, SVM+Poly, PCANet-2, NNet, SAA-3, DBN-3, and psoCNN on these eight image datasets. Notably, even the mean classification accuracies produced by TLBOCNN outperform the other 13 peer algorithms on five out of eight image datasets (i.e., MNIST, MNIST-RD, MNIST-RB, MNIST-BI, and MNIST-RD+BI). For the Rectangles dataset, the proposed TLBOCNN, EvoCNN, and ScatNet-2 produce the same best classification accuracy of 99.99%, followed by psoCNN with a classification accuracy of 99.93%. Another interesting trend observed from Table 4 is that at least one of the metaheuristic-search-based methods (i.e., the proposed TLBOCNN, EvoCNN, and psoCNN) emerges as one of the top two performers on each of the eight selected image datasets. These observations verify the promising potential of applying MSAs to automatically discover optimal CNN network architectures that solve the given problems with promising performance and without human intervention. Furthermore, the superior classification performance of TLBOCNN compared to EvoCNN and psoCNN on the majority of image datasets implies that the proposed method incorporates more robust search mechanisms that achieve a better balance of exploration and exploitation when searching for more appropriate CNN network architectures for the given datasets.

Apart from the quantitative analysis in Table 4, the performances of TLBOCNN on the eight selected datasets are also analyzed qualitatively. Figure 7 presents boxplots of the distributions of test errors produced by TLBOCNN on these eight image datasets, showing that TLBOCNN consistently solves all image datasets with high classification accuracy. Another, more balanced evaluation of the optimal CNN architectures found by TLBOCNN is based on the receiver operating characteristic (ROC) curves and their corresponding area under curve (AUC) values, as shown in Figure 8. These ROC curves are constructed from the true positive and false positive rates of TLBOCNN under different threshold settings on the eight selected image datasets. The values of AUC range from 0 to 1, where higher values imply better performance of the classifier. An ideal classifier has an AUC value of 1, whereas a classifier that makes random guesses has an AUC value of 0.5. Meanwhile, an AUC value of 0 indicates a severe failure in the modelling process because the classifier predicts every positive class as a negative class and vice versa. Given the locations of the ROC curves and their corresponding AUC values, all of which are above 0.94, it is concluded that the optimal CNN network architectures found by TLBOCNN can reliably distinguish one class from the other classes for all selected image datasets.
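The interpretation of the AUC values above (1 for an ideal classifier, 0.5 for random guessing, 0 for fully inverted predictions) can be checked with a small sketch. It uses the rank-based formulation of AUC, i.e., the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and treating each class one-vs-rest in a multi-class setting is an assumption rather than a detail confirmed by the text.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
auc_ideal = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])     # positives always ranked higher
auc_inverted = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])  # positives always ranked lower
auc_constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based implementation would be preferable for full test sets.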
To further analyze the performance differences between TLBOCNN and the other peer algorithms on the eight selected image datasets, a set of non-parametric statistical analyses [83,84] was performed based on the results in Table 4. Both CAE-1 and CAE-2 are excluded due to the absence of their results for the Convex dataset. A Wilcoxon signed-rank test [84] is performed for pairwise comparison between TLBOCNN and the other 11 peer algorithms on the eight image datasets. The pairwise comparison results of R+, R−, p-value, and h-value are reported in Table 5. R+ and R− are the sums of ranks for which TLBOCNN outperforms and underperforms each of its peers, respectively. The p-value is the minimum significance level at which a difference between two algorithms can be detected; a p-value smaller than the threshold α = 0.05 indicates a significant difference between the two compared algorithms. The h-value is obtained from the p and α values to indicate whether TLBOCNN is significantly better than (h = "+"), not significantly different from (h = "="), or significantly worse than (h = "−") its peer algorithm over all eight selected image datasets. From Table 5, it can be concluded that the proposed TLBOCNN is significantly better than all peer algorithms, as indicated by the h-values of "+" for all pairwise comparisons. A Friedman test [83,84] is performed as a multiple comparisons analysis to evaluate the overall performance differences between TLBOCNN and the other 11 peer algorithms on the eight image datasets. From Table 6, all compared algorithms are ranked by the Friedman test based on their classification accuracies from best to worst, as follows: TLBOCNN, psoCNN, EvoCNN, LDANet-2, PCANet-2, ScatNet-2, RandNet-2, DBN-3, SAA-3, SVM+RBF, SVM+Poly, and NNet. Table 6 also reveals that the p-value of the Friedman test is lower than α = 0.05, implying significant global differences among the compared algorithms. Three post hoc statistical procedures [83], known as Bonferroni-Dunn, Holm, and Hochberg, are then performed to further analyze the concrete differences by assigning TLBOCNN as the control algorithm. The results of z-values, unadjusted p-values, and adjusted p-values (APVs) are presented in Table 7. All post hoc procedures confirm the significant improvement of TLBOCNN over NNet, SVM+Poly, SVM+RBF, SAA-3, and DBN-3 because their APVs are smaller than α = 0.05. The Holm and Hochberg procedures additionally verify the significant improvements of TLBOCNN over RandNet-2 and ScatNet-2.

Algorithm | Classification accuracy | Number of parameters
… | 92.60% | NA
ResNet-18 [25] | 94.90% | 11 M
VGG-16 [23] | 93.50% | 26 M
AlexNet [22] | 89.90% | 60 M
SqueezeNet-200 [82] | 90.00% | 500 k
MLP 256-128-64 | 90.00% | 41 k
MLP 256-128-100 | 88.33% | 3 M
EvoCNN [67] | 94.53% | 6.68 M
psoCNN [60] | 92.81% | 2.58 M

Although the classification accuracy produced by the proposed TLBOCNN for the Fashion dataset is slightly lower than those of ResNet-18, VGG-16, EvoCNN, and psoCNN, the performance differences are marginal, i.e., only 2.18, 0.78, 1.81, and 0.09 percentage points, respectively. On the other hand, the best network architecture found by TLBOCNN for the Fashion dataset has only 0.414 million parameters, and it is much less complicated than those of ResNet-18, VGG-16, EvoCNN, and psoCNN, which have total network parameter numbers of 11 million (i.e., 25.57 times higher), 26 million (i.e., 62.80 times higher), 6.68 million (i.e., 16.14 times higher), and 2.58 million (i.e., 6.23 times higher), respectively. Compared to ResNet-18, VGG-16, EvoCNN, and psoCNN, the proposed TLBOCNN therefore exhibits a better capability to achieve proper tradeoffs between classification accuracy and network complexity when designing network architectures for the given classification problems. In recent years, there has been growing demand for deploying deep learning algorithms on edge devices in various real-world applications, such as road condition monitoring systems, vehicle autopilot systems, and mobile devices. Most of these edge devices have limited computational power and battery capacity. Therefore, it is more desirable to deploy deep learning models that require less computing power and energy. The proposed TLBOCNN is envisioned as a potential solution for manufacturers given its promising capability to automatically search for less complex network architectures with good performance.
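The accuracy and complexity tradeoff discussed above can be recomputed with a short sketch. The accuracies and parameter counts are the rounded values quoted in the text for the Fashion dataset; the function and variable names are hypothetical. Because the inputs are rounded (e.g., 11 M for ResNet-18), the recomputed ratios can differ slightly from the figures quoted in the text.

```python
from typing import Dict, Tuple

# (accuracy in %, parameters in millions) as quoted for the Fashion dataset
REFERENCE: Tuple[str, float, float] = ("TLBOCNN", 92.72, 0.414)
PEERS = [
    ("ResNet-18", 94.90, 11.0),
    ("VGG-16", 93.50, 26.0),
    ("EvoCNN", 94.53, 6.68),
    ("psoCNN", 92.81, 2.58),
]


def tradeoff(peer_acc: float, peer_params: float,
             ref_acc: float, ref_params: float) -> Tuple[float, float]:
    """Return (accuracy gap in percentage points, parameter-count ratio)."""
    return round(peer_acc - ref_acc, 2), round(peer_params / ref_params, 2)


summary: Dict[str, Tuple[float, float]] = {
    name: tradeoff(acc, params, REFERENCE[1], REFERENCE[2])
    for name, acc, params in PEERS
}

for name, (gap, ratio) in summary.items():
    print(f"{name}: +{gap:.2f} points of accuracy for {ratio:.2f}x the parameters")
```

Reporting the ratio of parameter counts alongside the accuracy gap makes the tradeoff explicit: each peer gains at most a few percentage points while using several times more parameters.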
Other peer algorithms, such as AlexNet, 3C1P2F+Dropout, 2C1P2F+Dropout, and MLP 256-128-100, are observed to solve the Fashion dataset with worse accuracies than TLBOCNN, even though these four methods have more complex network architectures that consist of 60 million, 7.14 million, 3.27 million, and 3 million parameters, respectively. These observations reveal an important fact: most existing deep learning models designed with handcrafted approaches have excessive numbers of redundant parameters that significantly increase the computational effort without any notable improvement in classification accuracy. In contrast, TLBOCNN can solve the Fashion dataset with competitive performance without requiring any data augmentation techniques or complicated network architectures. Unlike most handcrafted deep learning networks, which might feature feedback or parallel connections, the optimal network architectures found by TLBOCNN are simpler because the learners are initialized with smaller network architectures that can converge at a faster rate. The promising classification accuracy of TLBOCNN also implies the possibility of using smaller network architectures to achieve state-of-the-art results.

Optimal Network Architecture Designed by TLBOCNN
The CNN network architectures designed by the proposed TLBOCNN to solve all nine selected image datasets with the highest classification accuracies are presented in Table 9. Accordingly, the optimal network architectures found by TLBOCNN for all image datasets consist of only one fully connected layer. This observation is consistent with recent findings in [85], i.e., a CNN network architecture with a single fully connected layer may produce better results than those with multiple fully connected layers. Table 9 also reveals that it is not always necessary to insert a pooling layer between two convolutional layers to solve certain image datasets (e.g., MNIST-RD, MNIST-RB, MNIST-BI, MNIST-RD+BI, Rectangles, and Convex) with the best classification accuracy. It is also possible to construct a CNN network architecture that solves a dataset with the lowest error without incorporating any pooling layer, as shown for MNIST and MNIST-BI. In other words, the search mechanisms incorporated into the proposed TLBOCNN can prevent the inclusion of redundant layers or parameters in network architectures if they are unable to offer any meaningful performance gains. This desirable characteristic supports the capability of the proposed TLBOCNN to automatically discover optimal network architectures for the given classification tasks without requiring any domain knowledge of the problems.

A variable-length encoding scheme is introduced in TLBOCNN to facilitate the representation of CNN network architectures with flexible sizes by learners with variable lengths. Design constraints are also introduced to prevent the construction of invalid network architectures without compromising the ability of TLBOCNN to search for novel network architectures. During the teacher phase of TLBOCNN, a novel mainstream architecture computation scheme is designed to determine the population mean by referring to all learners with different lengths. A new difference operator is also introduced in both the teacher and learner phases of TLBOCNN to compare two learners with variable lengths, followed by a new position update operator used to search for the new CNN models that are represented by the updated TLBO learners. The proposed TLBOCNN is compared to various state-of-the-art deep learning algorithms using nine different image datasets. Extensive simulation studies reveal that TLBOCNN performs significantly better than most peer algorithms by solving the selected image datasets with higher classification accuracies. TLBOCNN is also able to find network architectures that achieve proper tradeoffs between classification accuracy and network complexity when solving the given problems.
Despite the promising performances exhibited by TLBOCNN, it has some limitations. First, the current version of TLBOCNN only considers three types of functional blocks (i.e., convolutional, pooling, and fully connected layers) when searching for the optimal network architectures of a CNN. The potential of other, more sophisticated building blocks (e.g., ResNet, DenseNet, Inception, NASNet, etc.) to further enhance network performance is yet to be investigated with TLBOCNN. A recent study [68] revealed the feasibility of considering both ResNet and DenseNet blocks when designing CNN network architectures without having to make substantial modifications to the original search operators of MSAs. Second, the search operators of TLBOCNN are similar to those of the original TLBO except for the modifications made to handle variable-length learners. The original TLBO tends to suffer from premature convergence when solving complex problems because its search operators rely on historical information (i.e., the teacher solution and the population mean), which is updated less frequently in the later optimization stages. Similar challenges might be encountered by TLBOCNN when solving more complex real-world classification problems. Third, the network design method considered in the current study is a single-objective optimization problem in which classification accuracy is the only criterion used to evaluate the quality of the network architecture represented by each TLBOCNN learner. In practical scenarios, manufacturers must consider multiple criteria when selecting suitable network architectures for their applications. It is more desirable to have a network design method that can generate multiple network architectures so that manufacturers can make better decisions based on their current needs.

Referring to these limitations, some future works can be proposed as extensions of the current work. First, the proposed TLBOCNN can be further enhanced by considering other sophisticated building blocks, such as ResNet, DenseNet, Inception, and NASNet blocks, when it is used to design optimal CNN network architectures for the given classification problems. Second, further modifications can be introduced to the search operators of TLBOCNN to achieve a better balance of exploration and exploitation, hence reducing its likelihood of suffering from premature convergence and enhancing its ability to search for more promising network architectures when solving more complex real-world classification problems. Finally, it is also worth investigating the possibility of formulating network architecture design as a multi-objective optimization problem in which different conflicting requirements, such as classification accuracy, network complexity, and inference speed, are taken into account during the optimization process to produce multiple CNN network architectures with different characteristics.
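To illustrate what a multi-objective view of the design problem would return, the sketch below filters a set of models down to its Pareto front with respect to two objectives: maximize accuracy and minimize parameter count. This is not part of TLBOCNN; it is a minimal post hoc filter, the function name is hypothetical, and the input values are the rounded Fashion results quoted in this paper (accuracy in %, parameters in millions).

```python
from typing import Dict, List, Tuple

# Fashion dataset results quoted in the paper: (accuracy %, parameters in millions)
MODELS: Dict[str, Tuple[float, float]] = {
    "ResNet-18": (94.90, 11.0),
    "VGG-16": (93.50, 26.0),
    "AlexNet": (89.90, 60.0),
    "SqueezeNet-200": (90.00, 0.5),
    "MLP 256-128-64": (90.00, 0.041),
    "MLP 256-128-100": (88.33, 3.0),
    "EvoCNN": (94.53, 6.68),
    "psoCNN": (92.81, 2.58),
    "TLBOCNN": (92.72, 0.414),
}


def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of models not dominated by any other model.

    Model A dominates model B if A is at least as accurate and at most as
    large as B, and strictly better in at least one of the two objectives.
    """
    front = []
    for name, (acc, params) in models.items():
        dominated = any(
            a >= acc and p <= params and (a > acc or p < params)
            for other, (a, p) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(MODELS)
```

A multi-objective search would aim to produce such a set of mutually non-dominated architectures directly, leaving the final choice between accuracy and complexity to the practitioner.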

Figure 1. Typical network architecture of a sequential CNN.


Figure 2. Representation of a potential CNN network architecture by a TLBOCNN learner.

(a) training of the CNN model using training datasets and (b) evaluation of the trained CNN model using validation datasets. Detailed mechanisms of these two major steps are explained herein. Define $Data_{train}$ as the training dataset used to train the CNN model represented by each TLBOCNN learner; it has a size of $|Data_{train}|$.

Symmetry 2022, 37

Figure 3. Mechanisms used to construct mainstream CNN architecture of TLBOCNN with N = 5.

3.4.2. Computation of Differences between Two Learners
For the teacher phase in the original TLBO described in Equation (2), a new solution […] $L_{Diff}[j]$, where $j = 1, \ldots$ […]. When L1 and L2 have different functional block types encoded in the jth dimension, the information contained in $L1[j]$ (i.e., the functional block type and its hyperparameters) is extracted and assigned to $L_{Diff}[j]$. For instance, if $L1[j]$ holds a convolutional block and $L2[j]$ holds a different block type, then the jth dimension of $L_{Diff}$ is assigned the convolutional block along with its hyperparameters. $L_{Diff}[j]$ is assigned as '0' when both L1 and L2 have the same functional block type encoded in the jth dimension, implying that there is no change of functional block in this dimension when it is used to calculate the CNN architecture of a new learner. When comparing L1 and L2 with different network depths, it is possible for L1 to have more functional blocks than L2 or vice versa. For any jth dimension, if L1 has more functional blocks than L2 in the feature selector module or the fully connected layers, $L_{Diff}[j]$ is assigned as '+B' to imply that a new functional block should be added by referring to that of $L1[j]$, where B can refer to CV, AP, MP, or FC blocks. On the other hand, $L_{Diff}[j]$ is assigned as '-' to indicate the removal of an existing functional block when L1 has fewer functional blocks than L2 in the jth dimension of the feature selector module or the fully connected layers. The overall procedure used to compare the differences between L1 and L2 is summarized in Algorithm 4.
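The comparison rules described above can be sketched as follows. This is a simplified illustration and not the authors' Algorithm 4: each learner module is assumed to be a plain list of block-type strings ("CV", "AP", "MP", "FC"), hyperparameters are omitted, and the function name `diff_module` is hypothetical. The function handles one module at a time; under this assumption it would be applied separately to the feature module and to the fully connected layers.

```python
from itertools import zip_longest
from typing import List, Sequence


def diff_module(l1: Sequence[str], l2: Sequence[str]) -> List[str]:
    """Dimension-wise difference between two learners within one module.

    Rules, following the text:
      same block type in both learners   -> "0"   (no change)
      different block types              -> block type of l1
      l1 has a block where l2 has none   -> "+B"  (add block B taken from l1)
      l2 has a block where l1 has none   -> "-"   (remove the block)
    """
    result = []
    for b1, b2 in zip_longest(l1, l2):  # missing positions are filled with None
        if b1 is None:
            result.append("-")
        elif b2 is None:
            result.append("+" + b1)
        elif b1 == b2:
            result.append("0")
        else:
            result.append(b1)
    return result


# L1 is deeper than L2 in the feature module, shallower in the classifier.
feature_diff = diff_module(["CV", "MP", "CV"], ["CV", "CV"])
classifier_diff = diff_module(["FC"], ["FC", "FC"])
```

Encoding additions and removals explicitly ("+B" and "-") is what lets two learners of different depths be compared without padding them to a common length first.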

Figure 4. Graphical illustration of mechanisms used to calculate the differences between two learners, L1 and L2, that represent CNN models with different network architectures.


The following guidelines are used to determine the functional block information assigned to each jth dimension of $X_n^{new}.Blocks$ by comparing those of $L_{Diff}$ and $X_n.Blocks$. If $L_{Diff}[j]$ is assigned as '0' and $X_n.Blocks[j]$ is not empty, then $X_n^{new}.Blocks[j]$ inherits the functional block and hyperparameters of $X_n.Blocks[j]$. For instance, if $L_{Diff}[j]$ = '0' and $X_n.Blocks[j]$ holds a convolutional block, then the jth dimension of $X_n^{new}.Blocks$ is assigned that convolutional block along with its hyperparameters. On the other hand, $X_n^{new}.Blocks[j]$ is assigned an empty value if $L_{Diff}[j]$ is assigned as '0' and $X_n.Blocks[j]$ is empty. If $L_{Diff}[j]$ is assigned as '-', then $X_n^{new}.Blocks[j]$ is also assigned an empty value regardless of the presence of any functional block in $X_n.Blocks[j]$. When $L_{Diff}[j]$ is assigned as '+B', where B can refer to the functional blocks of CV, AP, MP, or FC, a new functional block as specified in B is added to $X_n^{new}.Blocks[j]$ regardless of the presence of any functional block in $X_n.Blocks[j]$. For instance, if $L_{Diff}[j]$ = '+AP' and $X_n.Blocks[j]$ is an empty value, then an average pooling block is added to $X_n^{new}.Blocks[j]$. It is also notable that $X_n^{new}.Blocks[j]$ always refers to the functional block assigned in $L_{Diff}[j]$ without considering the presence of any functional block in $X_n.Blocks[j]$. On the other hand, $X_n^{new}.Blocks[j]$ only refers to the functional block of $X_n.Blocks[j]$ if $L_{Diff}[j]$ is an empty value. Algorithm 5 presents the pseudocode used to calculate $X_n^{new}.Blocks$ based on $L_{Diff}$ and $X_n.Blocks$. Any $X_n^{new}.Blocks[j]$ assigned an empty value in the jth dimension carries no functional block information and can be eliminated. The numbers of AP and MP blocks contained in the new solution must be adjusted based on the input sizes of the datasets. Excessive pooling layers must be removed from $X_n^{new}.Blocks$ one by one, starting from the last layer, if the new solution of the nth learner is found to have more pooling layers than allowed by the sizes of the input datasets. Therefore, the actual dimensional size of $X_n^{new}.Blocks$ can be smaller than that of …

Figure 5. Mechanisms used to calculate $X_n^{new}.Blocks$ based on $L_{Diff}$ and $X_n.Blocks$.
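The position-update guidelines illustrated in Figure 5 can be sketched as follows. This is a simplified illustration and not the authors' Algorithm 5: blocks are assumed to be plain type strings ("CV", "AP", "MP", "FC") without hyperparameters, an empty dimension is represented by `None`, and the limit on pooling layers is passed in as a hypothetical `max_pool` argument because the rule that derives it from the input size is not given here.

```python
from itertools import zip_longest
from typing import List, Optional, Sequence

POOLING = ("AP", "MP")


def update_blocks(l_diff: Sequence[Optional[str]],
                  x_blocks: Sequence[str],
                  max_pool: int) -> List[str]:
    """Build the new block list of a learner from a difference vector.

    Rules, following the text:
      "0" or empty difference -> keep the learner's own block (may be empty)
      "-"                     -> empty value (block removed)
      "+B"                    -> new block B, whatever the learner had
      block type, e.g. "CV"   -> that block, whatever the learner had
    """
    candidate: List[Optional[str]] = []
    for d, x in zip_longest(l_diff, x_blocks):
        if d is None or d == "0":
            candidate.append(x)
        elif d == "-":
            candidate.append(None)
        elif d.startswith("+"):
            candidate.append(d[1:])
        else:
            candidate.append(d)

    # Dimensions holding an empty value carry no block and are eliminated.
    new_blocks = [b for b in candidate if b is not None]

    # Remove excess pooling layers one by one, starting from the last layer.
    excess = sum(b in POOLING for b in new_blocks) - max_pool
    for i in range(len(new_blocks) - 1, -1, -1):
        if excess <= 0:
            break
        if new_blocks[i] in POOLING:
            del new_blocks[i]
            excess -= 1
    return new_blocks


grown = update_blocks(["0", "MP", "+CV"], ["CV", "CV"], max_pool=1)
shrunk = update_blocks(["0", "-"], ["FC", "FC"], max_pool=1)
trimmed = update_blocks(["0", "+AP", "MP"], ["CV", "CV", "CV"], max_pool=1)
```

Because empty dimensions are dropped and surplus pooling layers are trimmed, the updated learner can end up shorter than either of its inputs, which matches the remark that the actual dimensional size of the new solution can shrink.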

Table 1. Feasible search ranges of parameters and hyperparameters.


The training process of the CNN is performed in multiple steps, $Step_{train}$, obtained by dividing $|Data_{train}|$ by the batch size $batch\_size$, as given in Equation (4).

Algorithm 1: Population Initialization
Input: N, numB_min, numB_max, numF_min, numF_max, KS, numNeu_min, numNeu_max, numOut
Initialize X_teacher.Blocks = ∅ and X_teacher.Acc = −Inf;
for n = 1 to N do
    Randomly generate numB_n ∈ [numB_min, numB_max] and m_n ∈ [2, numB_max − 1] for the n-th learner;
    …
            Assign blocks_list[j] with a convolutional block (CV). Randomly initialize numF between numF_min and numF_max, as well as KS between 3 × 3 and 7 × 7;
        else if j = numB_n then
            Assign blocks_list[j] with a fully connected block (FC) and set numNeu as numOut;
        else if m_n ≤ j ≤ numB_n − 1 then
            Assign blocks_list[j] with a fully connected block (FC) and randomly initialize numNeu between numNeu_min and numNeu_max;
        …
                Assign blocks_list[j] with a convolutional block (CV). Randomly initialize numF between numF_min and numF_max, as well as KS between 3 × 3 and 7 × 7;
            else if block_type > 0.5 then
                Randomly generate pooling_type ∈ [0, 1];
                if pooling_type ≤ 0.5 then
                    Assign blocks_list[j] with an average pooling block (AP), where the pool size and stride size are set as 3 × 3 and 2 × 2, respectively;
                else
                    Assign blocks_list[j] with a maximum pooling block (MP), where the pool size and stride size are set as 3 × 3 and 2 × 2, respectively;
    …
    Evaluate X_n.Blocks to obtain X_n.Acc with Algorithm 2;
    if X_n.Acc is better than X_teacher.Acc then /* Compare the accuracy of the n-th learner and the teacher */
        X_teacher.Blocks ← X_n.Blocks, X_teacher.Acc ← X_n.Acc; /* Update teacher */
    end if
end for
Output: P = {X_1, …, X_n, …, X_N}

Algorithm 2
Input: X_n.Blocks, Data_train, Data_valid, batch_size, e_train, …
Compile X_n.Blocks to a full-fledged CNN;
Calculate Step_train and Step_valid using Equations (4) and (6), respectively;
Initialize the weights Θ = {θ_1, θ_2, …} of the compiled CNN model with the He Normal initializer;
for e = 1 to e_train do /* Train the compiled CNN model for e_train epochs */
    for k = 1 to Step_train do
        Compute the f(θ, Data_train,k) of the CNN model based on its current Θ and Data_train,k;
        …
Initialize acc_list ← ∅ with the size of Step_valid;
for k = 1 to Step_valid do /* Evaluate the compiled CNN model using the validation dataset */
    Perform classification on Data_valid,k with the compiled CNN model represented by X_n.Blocks;
    Store the classification accuracy of the compiled CNN model on Data_valid,k into acc_list[k];
end for
Calculate X_n.Acc of the compiled CNN model represented by X_n.Blocks using Equation (7);
Output: X_n.Acc

Algorithm 5
…
    … and X_n.Blocks[j] has no functional block then
…
Remove excessive pooling layers from X_n^new.Blocks one by one, starting from the last layers, if it is found to have more pooling layers than allowed by the sizes of the input datasets.
Output: X_n^new.Blocks
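The block-assignment part of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dictionary layout of a block, the function name `init_learner`, and all numeric ranges in the usage lines are hypothetical (the settings of Table 3 are not reproduced here), kernel sizes are assumed to be square with an integer side between 3 and 7, and the steps lost from the pseudocode, as well as fitness evaluation with Algorithm 2, are left out.

```python
import random
from typing import Dict, List

Block = Dict[str, object]


def init_learner(rng: random.Random,
                 num_b_min: int, num_b_max: int,
                 num_f_min: int, num_f_max: int,
                 num_neu_min: int, num_neu_max: int,
                 num_out: int,
                 ks_min: int = 3, ks_max: int = 7) -> List[Block]:
    """Randomly generate the block list of one learner.

    The first block is convolutional, the last block is a fully connected
    output layer, blocks from position m to the second-last are hidden fully
    connected layers, and the rest are convolutional or pooling blocks.
    """
    if num_b_min < 2 or num_b_max < 3 or num_b_min > num_b_max:
        raise ValueError("need 2 <= num_b_min <= num_b_max and num_b_max >= 3")

    num_b = rng.randint(num_b_min, num_b_max)  # number of blocks of this learner
    m = rng.randint(2, num_b_max - 1)          # first position of hidden FC blocks

    def conv_block() -> Block:
        k = rng.randint(ks_min, ks_max)
        return {"type": "CV",
                "filters": rng.randint(num_f_min, num_f_max),
                "kernel": (k, k)}

    blocks: List[Block] = []
    for j in range(1, num_b + 1):              # 1-based index, as in the pseudocode
        if j == 1:
            blocks.append(conv_block())
        elif j == num_b:
            blocks.append({"type": "FC", "units": num_out})
        elif m <= j <= num_b - 1:
            blocks.append({"type": "FC",
                           "units": rng.randint(num_neu_min, num_neu_max)})
        elif rng.random() <= 0.5:              # block_type <= 0.5: convolution
            blocks.append(conv_block())
        else:                                  # block_type > 0.5: pooling
            pool_type = "AP" if rng.random() <= 0.5 else "MP"
            blocks.append({"type": pool_type, "pool": (3, 3), "stride": (2, 2)})
    return blocks


# Hypothetical settings, for illustration only.
rng = random.Random(0)
population = [init_learner(rng, 3, 10, 3, 256, 1, 300, 10) for _ in range(20)]
```

Passing an explicit `random.Random` instance keeps the sketch reproducible across the independent runs that a study of this kind requires.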

Table 2. Overview of the image datasets used for performance evaluation of TLBOCNN.


Table 3. Parameter settings of proposed TLBOCNN for performance analyses.

Table 5. Wilcoxon signed-rank test results between TLBOCNN and its peer algorithms.
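The R+ and R− columns of Table 5 can be illustrated with a minimal sketch of the signed-rank sums. The function name and the toy inputs are hypothetical and are not results from the paper. Zero differences are dropped here, which follows the classic form of the Wilcoxon test; this is an assumption, since other conventions split the ranks of zero differences between the two sums.

```python
from typing import List, Sequence, Tuple


def signed_rank_sums(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (R+, R-) for paired results of algorithm a versus algorithm b.

    R+ sums the ranks of datasets where a outperforms b and R- sums the ranks
    where a underperforms b. Ranks are assigned to absolute differences, with
    tied magnitudes sharing their average rank.
    """
    # Rounding guards against floating-point noise when comparing magnitudes.
    diffs: List[float] = [round(x - y, 6) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]

    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Hypothetical paired scores on four datasets (one tie, which is dropped).
r_plus, r_minus = signed_rank_sums([5, 7, 3, 9], [4, 9, 3, 6])
```

When one algorithm wins on all eight datasets, R+ equals 8 × 9 / 2 = 36 and R− equals 0, which is the most favourable outcome such a comparison can produce.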


Table 6. Average ranking and associated p-values produced through Friedman test.

Table 8 compares the performances of all algorithms on the Fashion dataset in terms of classification accuracy and network complexity, the latter represented by the total number of parameters. Accordingly, ResNet-18, VGG-16, EvoCNN, psoCNN, and TLBOCNN have emerged as the top five performers on the Fashion dataset, with classification accuracies of 94.90%, 93.50%, 94.53%, 92.81%, and 92.72%, respectively. Similarly to Table 4, the simulation results in Table 8 also verify the benefits of using MSAs to automatically search for optimal CNN network architectures that can solve different classification problems competitively without requiring rich expert domain knowledge.

Table 8. Classification accuracies and numbers of parameters produced by the proposed TLBOCNN and other algorithms to solve the Fashion dataset.