Monarch Butterﬂy Optimization Based Convolutional Neural Network Design


CNNs have become a fast-growing field in recent years, though their evolution started much earlier. In 1959, Hubel and Wiesel [9] published one of the most influential papers in this area. They conducted many different experiments with the aim of understanding how neurons in the visual cortex work. They found that the primary visual cortex in the brain has a hierarchical organization with simple and complex neurons, that visual processing always starts with simple structures such as oriented edges, and that the complex cells receive input from the lower-level simple cells.
The first example of an artificial neural network model of this kind, the Neocognitron, was introduced by Fukushima in 1980 [10]. The Neocognitron followed the same logic of simple and complex cells that was discovered by Hubel and Wiesel: Fukushima built up a hierarchy with alternating layers of simple and complex cells.

Convolutional Neural Networks and Hyperparameters' Optimization
CNNs consist of a sequence of different layers. The layers in the architecture mimic the visual cortex mechanism in the brain. The essential layers of a CNN are the convolutional layers, pooling layers, and dense layers. In every CNN, the first layer takes the input image and convolves it with filters, and the activation function is applied to the result. In this way, low-level features are extracted from the image, the resulting output becomes the input for the next layer, and each subsequent layer extracts more and more complex, higher-level features. The pooling layer is used for downsampling, typically by max or average pooling. At the end, the architecture has one or more dense layers operating on the flattened feature maps, and the final one classifies the image.
During network training, in the weight learning process, the loss function should be optimized. For this purpose, many optimizers have been proposed in the modern literature, such as stochastic gradient descent, momentum, Adam, RMSprop, AdaDelta, AdaGrad, and AdaMax [17][18][19]. Over-fitting occurs when the difference between the training accuracy and test accuracy is high; in other words, the network learns the specific training data, and the model is unable to generalize to new input data. To avoid this, different regularization techniques can be used. Some of the practical regularization methods are L1 and L2 regularization [20], dropout [21], drop connect [22], batch normalization [23], early stopping, and data augmentation.
The transfer function (activation function) that is applied to the convolved result is used for mapping the input to a non-linear output. Some of the most widely utilized transfer functions are sigmoid, tanh, and rectified linear unit (ReLU) [24]. ReLU represents the de facto standard in the transfer function choice, and its value is calculated as f (x) = max(x, 0).
The operation in the convolutional layer and the application of the activation function can be defined as follows:

z^[l]_{i,j,k} = a(w_k * x_{i,j} + b),

where the resulting feature map (activation map) is represented by z^[l]_{i,j,k}, w_k is the k-th filter, x_{i,j} denotes the input at the (i, j) location, and b is the bias term. The superscript l denotes the l-th layer, and the activation function is represented as a(·).
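As an illustration of the operation above, the following is a minimal NumPy sketch (our own illustration, not the paper's implementation) of a single-channel, single-filter convolution followed by the ReLU activation:

```python
import numpy as np

def relu(x):
    # ReLU transfer function: f(x) = max(x, 0)
    return np.maximum(x, 0)

def conv2d_relu(image, kernel, bias=0.0):
    # 'Valid' 2D convolution of a single-channel image with one filter,
    # followed by ReLU: z_{i,j} = a(w * x_{i,j} + b)
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias
    return relu(out)
```

Stacking such layers, each with many filters, yields the hierarchy of increasingly complex features described above; real frameworks implement the same operation far more efficiently.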
For further reading about general CNNs' foundations and principles, please refer to [12,25]. The accuracy of any CNN depends on its structure, which in turn depends on the values of variables known as hyperparameters [26]. Some of the hyperparameters include the number of convolutional and fully-connected (dense) layers, the number of kernels and kernel size of each convolutional layer, weight regularization in the dense layers, the batch size, the learning rate, the dropout rate, the activation function, etc. With the goal of establishing satisfying classification accuracy for each particular problem instance, a CNN with specific structure (design) should be found.
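For illustration, such a set of hyperparameters can be encoded as a candidate solution for an optimizer; the following sketch uses our own illustrative names and value ranges, not the exact encoding used in this paper:

```python
import random

# Hypothetical search space covering the kinds of hyperparameters named
# above; the concrete value ranges are illustrative assumptions.
SEARCH_SPACE = {
    "n_conv_layers": (1, 2, 3, 4),
    "n_kernels": (16, 32, 64, 128),
    "kernel_size": (3, 5, 7),
    "conv_activation": ("relu", "tanh", "sigmoid"),
    "n_dense_layers": (1, 2, 3),
    "dense_units": (64, 128, 256),
    "dropout_rate": (0.0, 0.25, 0.5),
    "batch_size": (32, 64, 128),
    "learning_rate": (1e-2, 1e-3, 1e-4),
}

def random_candidate(rng=random):
    # One candidate CNN design = one point in the search space.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
```

Even this small illustrative space already contains tens of thousands of distinct designs, which hints at why the full problem is treated as NP-hard.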
The universal rule for finding the optimal network structure for a given problem does not exist, and the only way to create a CNN that will perform satisfactorily for a specific task is by utilizing the "trial and error" approach. Unfortunately, with each trial, the generated network should be trained and tested, and these processes are very time consuming and resource intensive.
In accordance with the above, the process of generating an optimal (in most cases, near-optimal) CNN structure for a specific task represents one of the most important challenges in this domain, and it is known in the literature as the hyperparameter optimization problem [27]. Since there are many CNN hyperparameters, the difficulty of the problem grows exponentially as more of them are included in the optimization. That is why this challenge is categorized as an NP-hard task.
As in the case of any NP-hard challenge, the application of deterministic algorithms for tackling the CNN hyperparameters' optimization task is not feasible, and to solve it, it is necessary to employ stochastic methods and algorithms. Researchers world-wide have recognized the need for developing a framework that will generate, train, and test different CNN architectures with the goal of finding the one that is most suitable for a specific task by using metaheuristic approaches [28][29][30]. The process of automatically discovering the most suitable CNN structures (CNN hyperparameters' optimization) for a specific task by using evolutionary, as well as other nature-inspired metaheuristics is known in the modern computer science literature as "neuroevolution" [31].
As part of the research that is presented in this paper, we also tried to develop such a framework by taking into account various CNN hyperparameters and by utilizing the enhanced version of one promising and recent swarm intelligence metaheuristic. The details of swarm intelligence methods, as well as the relevant literature review are given in Section 2.

Research Question, Objectives, and Scope
The research proposed in this paper is based on and represents an extension of our previously conducted experiments and simulations from this domain [32][33][34]. In our most recent previously published research [34], we enhanced and adapted two swarm intelligence metaheuristics to generate CNN architectures automatically and tested them on an image classification task. In that research, during the optimization process, we utilized the following CNN hyperparameters: the number of convolutional layers with the number of filters and the kernel size for each layer, as well as the number and size of the dense layers.
Inspired by the approaches presented in [31], in the research proposed in this paper, with the objective of generating network structures that will perform a given classification task with higher accuracy than other networks, we included more CNN hyperparameters in the optimization process than previously presented works from this domain [29,30,34]: the number of convolutional layers along with the number of kernels, the kernel size, and the activation function of each convolutional layer; the pooling size; the number of dense (fully-connected) layers with the number of neurons, the connectivity pattern, the activation function, the weight regularization, and the dropout for each dense layer; as well as the batch size, the learning rate, and the learning rule as general CNN hyperparameters. We tried to generate state-of-the-art CNN structures by employing the hybridized monarch butterfly optimization (MBO) algorithm, since the original MBO [35] has proven to be a robust method for solving various NP-hard optimization challenges [36,37].
In our current, as well as in previous research with the MBO metaheuristics [38][39][40], we noticed some deficiencies, which were particularly emphasized in the exploration phase of the search process. Moreover, by conducting empirical simulations with the basic MBO, we also noticed that the exploitation phase, as well as the balance between intensification and diversification, could be further enhanced. For that reason, to tackle the CNN hyperparameter optimization problem, in this paper, we present a newly developed hybridized MBO metaheuristic that obtains significantly better performance in terms of convergence and results' quality than the original MBO approach. The developed hybrid MBO incorporates mechanisms from two well-known swarm algorithms, artificial bee colony (ABC) and the firefly algorithm (FA), to address the shortcomings of the original MBO.
In this paper, we present two sets of experiments. In the first group of simulations, the developed hybridized MBO is tested on a set of standard unconstrained benchmarks, and a comparative analysis is performed with the original MBO, as well as with one other state-of-the-art improved MBO approach whose results were published in a respected international journal [41]. We followed good research practice: before a newly developed method is applied to a practical problem, it should be tested on a wider benchmark set to evaluate its performance in more depth.
Afterwards, the proposed hybrid MBO was adapted and tested for the CNN design problem, and the results were compared with recent state-of-the-art approaches that were tested under the same experimental conditions and on the same image classification benchmark dataset [31]. Moreover, since an implementation of the original MBO for this problem has not been found in the modern literature, with the goal of a more precise evaluation of the proposed hybridized MBO's improvements over the original version, we also implemented the basic MBO for this problem and performed a comparative analysis. The generated CNN structures were evaluated on a well-known image classification domain, hand-written digit recognition. For testing purposes, we utilized the MNIST database. The main reason for using this database is that MNIST has been extensively reviewed and used for evaluating many methods, including the algorithms presented in [31], which enables a direct comparison with our proposed hybrid MBO approach.
According to the subject of research that was conducted for the purpose of this paper, the basic research question can be formulated as follows: Is it possible to generate state-of-the-art CNN architectures that will establish better classification accuracy than other known CNN structures by utilizing enhanced swarm algorithms and by taking into account more CNN hyperparameters in the optimization process?
The objective and scope of the proposed research are twofold. The primary objective is to develop an automated framework for "neuroevolution" based on swarm intelligence methods that will design and generate CNN architectures with superior performance (accuracy) for classification tasks. Besides our previously conducted research [32][33][34], other research has also addressed this NP-hard task [28][29][30][42,43]. However, as already noted above, contrary to all the enumerated works, in the research proposed in this paper, we included more hyperparameters in the optimization process, as in [31]. Incorporating more hyperparameters makes the search space exponentially larger and the optimization process correspondingly more demanding.
The secondary objective of the proposed research is to enhance the basic MBO algorithm by overcoming its deficiencies. The basic motivation for utilizing this method is that the MBO, even in its original implementation, obtains outstanding performance, and our basic assumption is that the MBO in the enhanced version can potentially take its place as one of the best nature-inspired algorithms.

Structure of the Paper
The remainder of this paper is organized as follows. Section 2 provides insights into similar research from the CNN design domain that can be found in the modern computer science literature, followed by Section 3, in which we present the original, as well as the proposed hybrid MBO metaheuristics along with the detailed explanations regarding the observed drawbacks of the basic MBO approach.
In order to establish a better structure for the paper, the details of the experiments (simulations) conducted are organized into two sections, Sections 4 and 5. First, in Section 4, we present the details of the simulation environment and the datasets that are utilized in the simulations for the unconstrained benchmarks, as well as for the CNN hyperparameters' optimization. Afterwards, the obtained results along with the visual representation and comparative analysis with other state-of-the-art methods for both types of experiments (unconstrained benchmark and practical CNN design) are given in Section 5. In the final Section 6, we provide a summary of the research conducted along with the scientific contributions and insights into the future work in this promising domain.

Swarm Intelligence Algorithms and Related Work
Swarm intelligence metaheuristics belong to a wider family of nature-inspired stochastic algorithms. Swarm algorithms' mechanisms that guide and direct the optimization process are inspired by natural systems, like colonies of ants, hives of bees, groups of bats, herds of elephants, etc. These methods start execution with a pseudo-random initial population, which is generated within the lower and upper boundaries of the search space. Afterwards, the initial population is improved in an iteration-based approach.
In each iteration, two major processes guide the search process: exploitation (intensification) and exploration (diversification). Intensification performs the search in the neighborhood of existing solutions, while exploration tries to investigate unknown areas of the search space. Exploitation and exploration equations are specific for each swarm algorithm, and they model an approximation of the real-world system. One of the major issues that has been addressed in many papers, in every swarm intelligence algorithm, is the exploitation-exploration trade-off [44,45].
The other group of algorithms, which also belongs to the family of nature-inspired methods, are evolutionary algorithms (EA). EA approaches model the process of biological evolution by applying selection, crossover, and mutation operators to the individuals of the population. In this way, only the fittest solutions manage to "survive" and propagate into the next generation of the algorithm's execution. The most prominent subcategories of EA are genetic algorithms (GA) [89], evolutionary programming (EP) [90], and evolution strategies [91].
Based on the literature survey, swarm intelligence and EA methods have been applied to the domain of CNN hyperparameters' optimization and neuroevolution with the goal of developing an automatic framework that will generate an optimal or near-optimal CNN structure for solving a specific problem. However, due to the complexity of this task and computer resource requirements, many of them have tried to optimize only a few CNN hyperparameters, while the values of the remaining parameters were set to be fixed (static).
Two interesting PSO-based algorithms for CNN design were presented in [27,92]. In [92], the basic PSO was improved with gradient penalties for generating the optimal CNN structure. The proposed method was validated on three emotion states of subjects collected through EEG signals and managed to obtain significant results. An orthogonal learning particle swarm optimization (OLPSO), presented in [27], was used for optimizing the hyperparameter values of the VGG16 and VGG19 CNNs applied to the domain of plant disease diagnosis. The OLPSO outperformed other state-of-the-art methods validated on the same dataset in terms of classification accuracy.
Four swarm intelligence algorithms, FA, BA, CS, and PSO, implemented and adapted for addressing the over-fitting problem, were presented in [43]. These approaches overcame the issue by establishing a proper selection of the dropout regularization parameter. All algorithms were tested on the well-known MNIST dataset for image classification, and satisfying accuracy was obtained. Another new PSO method for generating an adequate CNN architecture was shown in [26]. The canonical PSO for CNN (cPSO-CNN) managed to adapt to the variable ranges of CNN hyperparameters by improving the canonical PSO's exploration capability and redefining PSO's scalar acceleration coefficients as vectors. The cPSO-CNN was compared with seven outstanding methods on the same image classification task and obtained the best results in terms of classification accuracy, as well as computational cost. In [28], by applying a PSO-based method, the authors managed to generate CNNs with a better configuration than AlexNet for five image classification tasks. Hybrid statistically-driven coral reef optimization (HSCRO) for neuroevolution was proposed in 2020 [93]. This method was used for optimizing the VGG-16 model on two different datasets, CIFAR-10 and CINIC-10, and managed to generate CNNs with a lighter architecture along with an 88% reduction of the connection weights.
A method that successfully combines GAs and CNNs for the non-invasive classification of glioma using magnetic resonance imaging (MRI) was proposed in [94]. The authors developed an automatic framework for neuroevolution by evolving the architecture of a deep network using the GA method. Based on the results for the two test studies, the proposed method proved to be better than its competitors. In [95], the authors presented a method for generating a differentiable version of the compositional pattern producing network (CPPN), called the DPPN. Microbial GAs are used to create DPPNs with the goal of replicating CNN structures. Recently, an interesting project for automating deep neural network architecture design, called DEvol, was established [96]. DEvol supports a variable number of deep, as well as convolutional layers. According to the presented documentation, the framework managed to obtain a test error rate of 0.6% on the MNIST dataset, which represents a state-of-the-art result.

Proposed Method
As already stated in Section 1.2, for tackling the CNN hyperparameter optimization problem, in this paper, we propose a hybridized version of the MBO swarm intelligence metaheuristic. In this section, we first describe the original MBO algorithm and, afterwards, present the devised hybrid method. Moreover, in this section, we also give the details of the original MBO's drawbacks that our hybrid approach addresses.

Original Monarch Butterfly Optimization Algorithm
The first version of the MBO was proposed by Wang et al. in 2015 [35]. The algorithm was motivated by the migration process of monarch butterflies. In that first paper, the authors compared the MBO to five other metaheuristic algorithms and evaluated it on thirty-eight benchmark functions [35]. The MBO achieved better results than the other five metaheuristics on all instances.
Monarch butterflies live in two different regions, the northern USA and southern Canada, and they migrate to Mexico. They change their location through two mechanisms: the migration operator and the butterfly adjusting operator. These two operators define the search direction of each individual in the population.
The basic rules of the algorithm, which approximate the real system, are the following [35]:
1. The population of individuals resides in two different locations (Land 1 and Land 2);
2. The offspring are created in both places by utilizing the migration operator;
3. If a new individual has better fitness than its parent monarch butterfly, it replaces the old solution;
4. The solutions with the best fitness values remain unchanged for the next iteration.

Migration Operator
The entire population, denoted as SN (solution number), of monarch butterfly individuals (solutions) is divided into two sub-populations, Sub-population 1 (SP1) and Sub-population 2 (SP2). SP1 and SP2 correspond to the two lands where they are located, Land 1 and Land 2, respectively.
The size of SP1 is determined as ceil(p × SN), and SP2 contains the remaining SN − ceil(p × SN) individuals, where p represents the migration ratio of individuals in Sub-population 1 and ceil(·) rounds its argument up to the nearest integer.
The process of migration is executed in the following way:

x^{t+1}_{i,j} = x^t_{r1,j}, if r ≤ p,
x^{t+1}_{i,j} = x^t_{r2,j}, otherwise,

where the j-th element of the i-th individual at iteration t + 1 is denoted by x^{t+1}_{i,j}; x^t_{r1,j} and x^t_{r2,j} represent the locations of individuals r1 and r2, which are randomly selected from SP1 and SP2, respectively, at iteration t; and j corresponds to the j-th component.
The parameter r decides whether the j-th element of the new solution will be selected from Sub-population 1 or Sub-population 2. The value of r is calculated as the product of a random number between zero and one (rand) and the period of migration (peri):

r = rand × peri,

where the suggested peri value is 1.2 [35].
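The migration operator above can be sketched in Python as follows (an illustrative sketch of the stated rules, not the authors' code):

```python
import random

def migration_operator(sp1, sp2, p=5/12, peri=1.2, rng=random):
    # Build a new solution component by component: r = rand * peri
    # decides whether the j-th element comes from a random individual
    # in Sub-population 1 (r <= p) or in Sub-population 2 (r > p).
    dim = len(sp1[0])
    new = []
    for j in range(dim):
        r = rng.random() * peri
        donor = rng.choice(sp1) if r <= p else rng.choice(sp2)
        new.append(donor[j])
    return new
```

With the suggested values p = 5/12 and peri = 1.2, roughly a third of the components are, on average, inherited from Sub-population 1.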

Butterfly Adjusting Operator
The second mechanism that guides the individuals toward the optimum within the search space is the butterfly adjusting operator. In this process, if rand ≤ p, the position will be updated according to the following formula:

x^{t+1}_{i,j} = x^t_{best,j},

where the j-th parameter of the fittest solution at iteration t is denoted by x^t_{best,j}, and the monarch butterfly's updated position is indicated by x^{t+1}_{i,j}. Contrariwise, if the random number is greater than the migration ratio (rand > p), the position update proceeds according to the following formula:

x^{t+1}_{i,j} = x^t_{r3,j},

where x^t_{r3,j} indicates the j-th element of a randomly selected solution from Sub-population 2 in the current iteration.
Furthermore, if the uniformly distributed random number is greater than the adjusting rate (butterfly adjusting rate, BAR), the individual is updated based on the following equation:

x^{t+1}_{i,j} = x^{t+1}_{i,j} + α × (dx_j − 0.5),

where dx_j represents the walk step of the i-th individual; dx is obtained by Lévy flight:

dx = Levy(x^t_i).

The scaling factor α is calculated as follows:

α = S_max / t²,

where the upper limit of the walk step that an individual can take at a time is represented by S_max, and t denotes the current iteration. The parameter α is responsible for the right balance between intensification and diversification. If its value is larger, exploration (diversification) is predominant; on the other hand, if the value of α is smaller, the search process is executed in favor of intensification (exploitation).
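Putting the adjusting rules together, a compact sketch might read as follows (a simple heavy-tailed Cauchy draw stands in for the paper's Lévy flight; all names are ours):

```python
import math
import random

def levy_step(rng=random):
    # Heavy-tailed random step (standard Cauchy draw) used here as a
    # simple stand-in for the Levy() flight of the original MBO.
    return math.tan(math.pi * (rng.random() - 0.5))

def butterfly_adjusting(x_i, x_best, sp2, t, p=5/12, bar=5/12,
                        s_max=1.0, rng=random):
    # Adjust one solution: move components toward the best solution
    # (rand <= p), otherwise copy from a random SP2 individual and,
    # when another draw exceeds BAR, add a scaled Levy walk step.
    alpha = s_max / t ** 2          # scaling factor alpha = S_max / t^2
    new = list(x_i)
    for j in range(len(x_i)):
        if rng.random() <= p:
            new[j] = x_best[j]
        else:
            new[j] = rng.choice(sp2)[j]
            if rng.random() > bar:
                new[j] += alpha * (levy_step(rng) - 0.5)
    return new
```

Note how alpha = s_max / t² shrinks the Lévy perturbation as iterations advance, shifting the search from exploration to exploitation.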
The pseudo-code of the basic MBO version is shown in Algorithm 1.

Hybridized Monarch Butterfly Optimization Algorithm
Recently, we developed improved and hybridized versions of MBO. All devised implementations showed better performance than the original MBO, and they were successfully applied to one practical problem [39] and also evaluated against global optimization benchmarks [38,40].
As also noted in our previous research [38][39][40], the major drawbacks of the basic MBO include the lack of exploration power and an inadequate balance (trade-off) between intensification and diversification. Moreover, we recently conducted additional simulations with the original MBO and concluded that the exploitation phase, as well as the balance between exploration and exploitation, could also be further improved.
For that reason, similar to our previous MBO's improvements, we incorporated the ABC exploration mechanism, as well as the ABC's control parameter, which adjusts the intensification [97]. Furthermore, with the goal of facilitating the exploitation phase of the original MBO, we adopted the search equation from the FA metaheuristics, which proved to be very efficient [49]. Based on the performed hybridization, we named the newly devised approach MBO-ABC firefly enhanced (MBO-ABCFE).
To incorporate all changes, MBO-ABCFE utilizes three more control parameters than the original MBO algorithm. The first parameter is the exhaustiveness parameter (exh). In every iteration in which a solution cannot be improved, its trial value is incremented. When the trial of a particular solution reaches the threshold value of exh, a new random solution is generated according to the following formula:

x_{i,j} = lb_j + φ × (ub_j − lb_j),

where the upper and lower bounds of the j-th element are denoted by ub_j and lb_j, and φ is a random number drawn from the uniform distribution. The "exhausted" solution is then replaced with the newly generated random individual. In this way, the exploration capability of the original MBO is enhanced.
However, this approach can be dangerous. For example, if this exploration mechanism is triggered in late iterations of the algorithm's execution (with the assumption that the search has converged to the optimal region), then possibly good solutions may be replaced with random ones. To avoid this, we included the second control parameter, the discarding mechanism trigger (dmt). By utilizing the dmt parameter, we established the stop condition for ABC's exploration. In other words, if the current iteration number (t) is greater than the value of dmt, the ABC's exploration is not executed.
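The discarding mechanism described above, including the dmt stop condition, can be sketched as follows (function name and signature are our own):

```python
import random

def maybe_discard(solution, trial, t, lb, ub, exh=4, dmt=33, rng=random):
    # ABC-style exploration adopted in MBO-ABCFE: replace an
    # "exhausted" solution (trial >= exh) with a random one,
    # x_j = lb_j + phi * (ub_j - lb_j), but only while t <= dmt.
    if t > dmt or trial < exh:
        return solution, trial               # keep the current solution
    fresh = [lb_j + rng.random() * (ub_j - lb_j)
             for lb_j, ub_j in zip(lb, ub)]
    return fresh, 0                          # reset the trial counter
```

Once the iteration counter passes dmt, exhausted solutions are kept as-is, which protects good solutions found late in the run from being discarded.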
Moreover, it is also dangerous to perform "too aggressive" exploitation in the early phases of the algorithm's execution. If the algorithm has not yet converged to the optimal region, the search may get stuck in a suboptimal domain of the search space. To adjust the exploitation and to avoid the scenario of converging to suboptimal regions, we adopted one more parameter from the ABC metaheuristics, the modification rate (MR). This parameter is used within the search procedure of Sub-population 1, and it is applied only if the θ ≤ MR condition is met. The value of θ is a randomly generated number between zero and one. We note that the best value for MR of 0.8 was established in previous research [48,97]. In our implementation, this value is hard coded and cannot be adjusted by the user.
Finally, the third control parameter, the firefly algorithm process (FAP), which is included in the proposed implementation, is used to control whether or not the FA's search equation will be triggered.
The exploitation equation for the FA search is formulated as follows [49]:

x^{t+1}_i = x^t_i + β_0 · e^{−γ·r²} · (x^t_j − x^t_i) + α × (κ − 0.5),

where the randomization parameter α, as well as the parameters β_0 and γ, represent the basic FA search parameters, r denotes the distance between solutions i and j, and κ denotes a random number between zero and one. According to previous experiments, as well as the recommendations from the original FA paper [49], the values of α, γ, and β_0 were set to 0.5, 1.0, and 0.2, respectively, and they were hard coded, so they cannot be changed by the user. For more details on the firefly algorithm, refer to [49].
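The FA step can be sketched with the parameter values stated above (a sketch of the adopted search equation, not the authors' exact code):

```python
import math
import random

def firefly_move(x_i, x_j, alpha=0.5, gamma=1.0, beta0=0.2, rng=random):
    # x_i <- x_i + beta0 * exp(-gamma * r^2) * (x_j - x_i)
    #             + alpha * (kappa - 0.5),  kappa ~ U(0, 1)
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # squared distance
    beta = beta0 * math.exp(-gamma * r2)              # attractiveness
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]
```

The attractiveness term pulls the solution toward its neighbor more strongly the closer the two are, which is what makes this step well suited for exploitation.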
The parameter FAP is generated for each solution component, and its value is between zero and one. If its value is less than 0.5, the firefly search equation will be utilized; otherwise, the standard MBO search equations will be used.
Finally, we note that the MR parameter and the FA search equations are only utilized in Sub-population 1 of the algorithm.
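The interplay of MR and FAP in Sub-population 1 can be summarized by a small dispatcher (an illustrative sketch of the branching described above; labels are ours):

```python
def sp1_update_rule(theta, fap, mr=0.8):
    # Decide which rule updates one component of a Sub-population 1
    # solution in MBO-ABCFE: the FA equation if theta <= MR and
    # FAP < 0.5, the standard MBO migration if theta <= MR and
    # FAP >= 0.5; otherwise the component is not modified here.
    if theta <= mr:
        return "firefly" if fap < 0.5 else "mbo_migration"
    return "unchanged"
```

With MR = 0.8, roughly 80% of the components are updated in each pass, split between the FA and MBO rules by the FAP draw.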
Taking into account all the presented details regarding the proposed MBO-ABCFE metaheuristics, the pseudo-code is given in Algorithm 2, while the flowchart diagram is depicted in Figure 1.

Algorithm 2: Pseudo-code of the proposed MBO-ABCFE.

Randomly initialize the population of SN solutions (monarch butterflies); initialize the parameters: migration ratio (p), migration period (peri), adjusting rate (BAR), maximum step size (S_max), exhaustiveness parameter (exh), discarding mechanism trigger (dmt), and modification rate (MR)
Evaluate the fitness; set the iteration counter t to one and trial to zero; define the maximum number of iterations (MaxIter)
while t < MaxIter do
    Sort the solutions according to their fitness value
    Divide the whole population into two sub-populations (SP1 and SP2)
    for i = 1 to |SP1| (all individuals in Sub-population 1) do
        for j = 1 to D (all elements of the i-th individual) do
            Generate a random number θ between zero and one
            if θ ≤ MR then
                if FAP < 0.5 then
                    Generate a new component by using Equation (12)
                else
                    Generate rand (a random number), and calculate the value of r by using Equation (5)
                    if r ≤ p then
                        Choose a solution from SP1, and create the j-th element of the new solution by Equation (4)
    for i = 1 to |SP2| (all individuals in Sub-population 2) do
        for j = 1 to D (all elements of the i-th individual) do
            Generate rand (a random number), and calculate the value of r by using Equation (5)
            if r ≤ p then
                Create the j-th element of the new solution by utilizing Equation (6)
            else
                Choose an individual from SP2, and create the j-th element of the new solution by Equation (7)
                if r > BAR then
                    Apply Equation (8)

Simulation Setup
As noted in Section 1.3, two sections of this paper are devoted to experimental (practical) simulations. In this first experimental section, we show the control parameter setup of the proposed MBO-ABCFE and the dataset details utilized in unconstrained benchmark simulations and for the CNNs' neuroevolution (hyperparameter optimization) challenge.

Parameter Settings and Dataset for Unconstrained Simulations
The proposed hybridized swarm intelligence-based MBO-ABCFE algorithm was first applied to 25 unconstrained benchmark functions. In this way, we first wanted to establish the performance comparison and improvements of the original MBO approach on a wider set of functions.
The benchmark functions used in this paper were also used for testing the original MBO [35]. We also wanted to compare the proposed hybridized MBO with one other improved MBO implementation, the MBO with a greedy strategy and self-adaptive crossover operator (GCMBO) [41], so we included this approach in the comparative analysis as well. In these simulations, the algorithm was executed in 50 independent runs.
Additionally, the proposed method was compared with other metaheuristics (hybrid ABC/MBO (HAM) [98], ABC, ACO, and PSO). In this way, we wanted to establish a better validation of the proposed method by comparing it with other state-of-the-art metaheuristics, the results of which were presented in the modern literature. In these simulations, the algorithm was tested on 12 benchmark functions over 100 runs.
For the purpose of making a fair comparison between the proposed MBO-ABCFE and the original MBO and GCMBO algorithms, as well as with the other approaches, we performed the simulations under the same simulation conditions and on the same benchmark set as in [35,41,98].
The population size (SN) was set to 50, the migration ratio (p) to 5/12, the migration period (peri) to 1.2, the adjusting rate (BAR) to 5/12, and the maximum step size (S_max) to 1. The sizes of the two sub-populations, Land 1 and Land 2, were set to 21 and 29, respectively. The newly introduced control parameters of the hybridized MBO-ABCFE algorithm were adjusted as follows: the exhaustiveness (exh) was initialized to the value of four, and the discarding mechanism trigger (dmt) was set to 33. As noted in Section 2, the value of the FAP parameter was randomly generated for each solution component. The values of the ABC- and FA-specific parameters MR, α, γ, and β_0 were hard coded and set to 0.8, 0.5, 1.0, and 0.2, respectively.
Satisfying values of the control parameters exh and dmt were determined by executing the algorithm many times on the standard benchmarks. There is no universal rule for determining the optimal values of a metaheuristic's control parameters, and the "trial and error" approach is the only option.
However, in the case of the MBO-ABCFE algorithm, after many trials, we derived an expression for calculating the value of the exh parameter, in which round() represents a simple function that rounds the argument to the closest integer value. Moreover, in the case of MBO-ABCFE, we found that the value of the dmt parameter can be calculated by using the formula round(maxIter/1.5).
The values of ABC and FA specific parameters, the MR, α, γ, and β 0 , were taken from the original papers [49,97]. In these articles, the authors suggested optimal values of these parameters that had also been determined empirically.
We also note that in order to obtain satisfying results, the control parameters' values should be adjusted for a particular problem or type of problem. For example, if with one set of the control parameters' values, the algorithm establishes a promising result when tackling Problem A, it does not necessarily mean that with the same parameter adjustments, satisfying results will be accomplished for Problem B as well.
The complete list of control parameters is displayed in Table 1. In the original FA implementation [49], the trade-off between exploration and exploitation depended on the value of the α parameter, which was dynamically adjusted during the run of the algorithm. In the proposed MBO-ABCFE metaheuristics, we used the same approach by utilizing the equation from [49,50,70,71,99], where t_max denotes the maximum number of iterations in one run. At the early stages of a run, the value of α was greater (higher exploration power), and as the search progressed through the iterations, the value of α decreased (the exploitation power increased, and the exploration decreased). The formulations of the benchmark functions (dataset) utilized in the simulations are given in Table 2.

Table 2. Benchmark function details.

[Table 2 lists the ID, name, and mathematical definition of each benchmark function, including F1 (Ackley) and F4 (Dixon and Price).]
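For illustration, the dynamic adjustment of α described above can be sketched in Python. The paper's exact update equation is given in the cited sources [49,50,70,71,99]; the geometric decay used below is one common formulation from the firefly algorithm literature, so the decay factor here is an assumption of this sketch rather than the paper's own formula.

```python
def alpha_schedule(alpha0, t_max):
    """Geometric decay of the FA randomization parameter alpha.

    This is one common formulation from the firefly algorithm
    literature; treat it as an illustrative sketch, not the exact
    equation used in the paper.
    """
    delta = (1e-4 / 0.9) ** (1.0 / t_max)  # per-iteration decay factor < 1
    return [alpha0 * delta ** t for t in range(t_max + 1)]

# High alpha early (exploration), low alpha late (exploitation).
sched = alpha_schedule(0.5, 50)
```

Under this schedule, α starts at its initial value and decreases monotonically toward zero over the run, which matches the qualitative behavior described in the text.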
Parameters' Settings and Simulation Setup for CNNs' Neuroevolution Simulations
Since the original MBO has not been previously tested on the CNN neuroevolution problem, with the aim of comparing the performance of our proposed MBO-ABCFE with the original algorithm, we also implemented and adapted the basic version for CNNs' simulations. The same values of the algorithms' control parameters that were used in unconstrained simulations were utilized in CNNs' neuroevolution experiments for both methods, MBO and MBO-ABCFE.
It was already mentioned in the previous subsection that with one set of control parameters' values, metaheuristics were not able to tackle every NP-hard challenge successfully, and parameter adjustments should be made for each particular problem. There is always a trade-off; for example, when tweaking some algorithm parameters, the performance could be enhanced for Problem A, but at the same time, the performance may be degraded for Problem B. That is the basic assumption behind testing the algorithm first on a wider set of benchmark functions, before applying it to the concrete NP-hard challenge.
In this case, the proposed MBO-ABCFE approach with the control parameters' settings that were shown in the previous subsection managed to obtain satisfying results on a wider set of tests. Guided by this, we assumed that with the same parameter adjustment, MBO-ABCFE would also be able to obtain promising results when tackling CNNs' neuroevolution. Since CNNs' neuroevolution is a very resource intensive and time consuming task, we would need a sophisticated hardware platform to perform simulations with different MBO-ABCFE control parameter settings for this NP-hard challenge.
Thus, we note that the proposed MBO-ABCFE could potentially obtain even better results in tackling CNNs' neuroevolution with other control parameters' values; however, we will investigate this in our future research in this domain.
In order to reduce the computation time, the values of all hyperparameters were discretized and defined within lower and upper bounds. To determine the set of values for each hyperparameter, we used the Gray code substring. The same strategy was utilized in [31].
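As an illustrative sketch of the Gray-code-based discretization strategy adopted from [31], the standard binary-reflected Gray code and its inverse can be written as follows. The exact substring layout used in the experiments is not reproduced here, and the function names are illustrative.

```python
def to_gray(q: int) -> int:
    """Binary-reflected Gray code of a non-negative integer index."""
    return q ^ (q >> 1)

def from_gray(g: int) -> int:
    """Inverse transform: recover the integer index from its Gray code."""
    q = 0
    while g:
        q ^= g
        g >>= 1
    return q

# Adjacent indices differ in exactly one bit in the Gray domain, which
# makes Gray substrings convenient for encoding discretized
# hyperparameter ranges.
codes = [to_gray(q) for q in range(8)]
```

For example, the eight indices 0..7 used for the filter-size gene map to the Gray codes 0, 1, 3, 2, 6, 7, 5, 4.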
Furthermore, since the approach presented in [31] was used for direct comparative analysis with our proposed MBO-ABCFE, we included the same hyperparameters in CNNs' neuroevolution and used the same experimental environment setup as in [31].
In the following two subsections, we describe in detail the calculation of the convolutional layer and dense (fully-connected) layer sets of parameters, as well as the setup of the general CNN hyperparameters.
The filter size (f s) is determined as f s = q + 2, where q = 0, 1, ..., 7; accordingly, the set of possible filter sizes is f s = {2, 3, 4, 5, 6, 7, 8, 9}. The type of the activation function (a c) has two possible values, a c = q, where q is zero or one. If the value is zero, the ReLU function is utilized; otherwise, if q is equal to one, a linear activation function is selected.
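The decoding of the convolutional-layer genes just described can be sketched in a few lines of Python (function and variable names are illustrative, not from the paper):

```python
def decode_conv_layer(q_filter_size: int, q_activation: int):
    """Decode the convolutional-layer integer genes.

    f_s = q + 2 maps q in 0..7 onto the filter sizes {2, ..., 9};
    the activation gene selects ReLU (0) or linear (1).
    """
    assert 0 <= q_filter_size <= 7 and q_activation in (0, 1)
    filter_size = q_filter_size + 2
    activation = "relu" if q_activation == 0 else "linear"
    return filter_size, activation
```

For instance, the gene pair (0, 0) decodes to a 2 × 2 filter with ReLU, and (7, 1) to a 9 × 9 filter with a linear activation.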

Configuration of the Fully-Connected Layer
The second category of hyperparameters is the fully-connected layer (FC l), which contains six hyperparameters. The number of fully-connected layers is denoted by f c s and is determined as f c s = q + 1, where q is zero or one; accordingly, the set is f c s = {1, 2}. The connectivity pattern (cp) has three possible values, 0, 1, or 2: if the value is zero, an RNN layer is used; if the value is one, an LSTM layer is employed; and if the value is two, a dense layer is employed.
The number of hidden units is defined as n u = 2^(q+3), where n u represents the number of hidden units and the parameter q takes values between zero and seven, which results in the set n u = {8, 16, 32, 64, 128, 256, 512, 1024}. The type of activation function in the final layers (a f) has two possible values, a f = q, where q is zero or one. If the value is zero, the ReLU function is utilized; otherwise, if the value is equal to one, a linear activation function is selected.
Weight regularization (wr) is defined as wr = q, where q takes values from zero to three: if q is equal to zero, no regularization technique is used; if it is one, L1 regularization is applied; if q is two, L2 regularization is employed; and if it is three, L1L2 is utilized.
In the case of the dropout (d) hyperparameter, the formula d = q/2 is used. We implemented two options: if q is zero, the dropout method is not utilized, and if q is one, dropout is applied in the fully-connected layer with a hard-coded dropout rate of 0.5.
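The fully-connected-layer gene mappings above can be sketched as follows (an illustrative Python decoding; names are not from the paper):

```python
def decode_fc_layer(q_layers, q_pattern, q_units, q_act, q_reg, q_drop):
    """Decode the fully-connected-layer genes described in the text.

    Mappings: f_c_s = q + 1 -> {1, 2} layers; n_u = 2**(q + 3) ->
    {8, ..., 1024} hidden units; d = q / 2 -> dropout rate 0.0
    (disabled) or 0.5.
    """
    layers = q_layers + 1
    pattern = ("rnn", "lstm", "dense")[q_pattern]
    units = 2 ** (q_units + 3)
    activation = "relu" if q_act == 0 else "linear"
    regularization = ("none", "L1", "L2", "L1L2")[q_reg]
    dropout = q_drop / 2  # 0.0 means dropout disabled
    return layers, pattern, units, activation, regularization, dropout
```

For example, the gene vector (1, 2, 7, 0, 3, 1) decodes to two dense layers with 1024 hidden units, ReLU activation, L1L2 regularization, and a dropout rate of 0.5.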

Configuration of the General Hyperparameters
The set of general hyperparameters contains the batch size, learning rule, and learning rate. The size of the mini-batches is determined by the formula b s = 25 · 2^q, where q = 0, 1, 2, 3; as a result, the elements of the batch size set are b s = {25, 50, 100, 200}. The learning rule is selected among eight different optimizers and is defined as lr = q, where q can have a value between zero and seven: if q = 0, sgd is used; if q = 1, momentum; if q = 2, Nesterov; if q = 3, adagrad; if q = 4, adamax; if q = 5, Adam; if q = 6, adadelta; and if q = 7, rmsprop.
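The general-hyperparameter genes can be decoded in the same spirit (an illustrative sketch; the tuple ordering follows the listing in the text, and the names are not from the paper):

```python
# Learning rules in the order listed in the text (index = gene value q).
OPTIMIZERS = ("sgd", "momentum", "nesterov", "adagrad",
              "adamax", "adam", "adadelta", "rmsprop")

def decode_general(q_batch: int, q_rule: int):
    """Decode the general genes: b_s = 25 * 2**q yields the batch sizes
    {25, 50, 100, 200}, and q_rule indexes one of the eight optimizers."""
    return 25 * 2 ** q_batch, OPTIMIZERS[q_rule]
```

For instance, the gene pair (3, 4) decodes to a batch size of 200 trained with adamax.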

Benchmark Dataset
The well-known MNIST [100] benchmark dataset was used to evaluate the proposed MBO-ABCFE and the original MBO. MNIST is an extensive database of grayscale images of handwritten digits ranging from zero to nine. It includes 70,000 labeled images with a size of 28 × 28 pixels and is a subset of the even larger NIST Special Database. The original NIST images were size-normalized to fit a 20 × 20 pixel box while preserving their aspect ratio and were then centered in a 28 × 28 pixel image by using an anti-aliasing technique. In our simulation, we split the dataset into training, validation, and test sets: 50,000 samples were used for training, while the validation and test sets contained 10,000 images each. Examples of the digit images are shown in Figure 2.
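The 50,000/10,000/10,000 split can be sketched as a simple partition of shuffled index lists. This is an illustrative, standard-library-only sketch; the actual experiments loaded MNIST through Deeplearning4j, and the function name and seed are assumptions.

```python
import random

def split_indices(n_total=70_000, n_train=50_000, n_val=10_000, seed=0):
    """Partition dataset indices into train/validation/test splits with
    the sizes used in the paper (50k / 10k / 10k)."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)  # deterministic shuffle for the sketch
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_indices()
```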

Experimental Results and Discussion
In this section, first we present the result of the proposed MBO-ABCFE algorithm on unconstrained benchmark function experiments and the comparison with other metaheuristics. Afterward, the CNN design's experimental results are presented.
Basic MBO and the proposed hybridized MBO-ABCFE were implemented and developed in Java Development Kit 11 (JDK 11) in the IntelliJ IDE (Integrated Development Environment). To evaluate the proposed method, the CNN framework was implemented with the Deeplearning4j deep learning library. The simulation tests were performed on a machine with six NVIDIA GTX 1080 graphics processing units (GPUs), an Intel Core i7-8700K CPU, 32 GB of RAM, and the Windows 10 OS.
Since the CNNs' neuroevolution challenge is very resource intensive, to speed up the computations, we used CUDA technology and executed our code on the six GPUs in parallel instead of the CPU.

Experiments on Unconstrained Benchmark Functions
The unconstrained benchmark function experiment was conducted to test the performance of the proposed MBO-ABCFE algorithm on 25 standard benchmark problems with 20 and 40 dimensions. The obtained results were first compared to the original MBO [35] and one other variant of the MBO, the GCMBO [41] algorithm. Furthermore, the proposed method was compared with hybrid ABC/MBO (HAM), ABC, ACO, and PSO state-of-the-art approaches on 12 test functions with 20 dimensions.
For more details about the benchmark functions, please refer to Table 2. Due to the stochastic behavior and random nature of swarm intelligence algorithms, their performance cannot be judged based on a single run. Thus, the experimental results were calculated on average for 50 independent runs, and the additional comparison with four metaheuristics was calculated on average for 100 runs.
The obtained experimental results of the objective function and the comparison of statistical results in the comparative analysis with basic MBO and GCMBO on 20-dimensional test instances are shown in Table 3, while the results on 40-dimensional benchmarks are presented in Table 4. The comparison with four other state-of-the-art metaheuristics is given in Table 5. In all tables, the best solutions are shown in boldface. As can be seen from the presented tables, we took the best, mean, standard deviation, and worst metrics for performance evaluation, while in the third comparative analysis, we present the best and mean results for each of the compared algorithms.
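The four indicators can be computed directly from the objective values collected over the independent runs. A minimal sketch (since the benchmarks are minimization problems, "best" is the minimum and "worst" the maximum):

```python
import statistics

def run_metrics(objective_values):
    """Summarize objective values over independent runs with the four
    indicators reported in the tables: best, mean, std, and worst."""
    return {
        "best": min(objective_values),
        "mean": statistics.mean(objective_values),
        "std": statistics.stdev(objective_values),  # sample std deviation
        "worst": max(objective_values),
    }

m = run_metrics([1.0, 2.0, 3.0, 4.0])
```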
The results of the basic MBO and GCMBO were retrieved from [41]. In that paper, the authors used the number of fitness function evaluations (FFEs) as the termination condition (8000 FFEs for 20-dimensional and 16,000 FFEs for 40-dimensional tests) with 50 solutions in the population, and they executed 50 independent runs for each benchmark. We likewise show the results of our proposed MBO-ABCFE averaged over 50 independent runs in this comparative analysis (Tables 3 and 4).
Simulation results for HAM, ABC, ACO, and PSO were taken from [98], where the number of iterations (generations), set to 50, was used as the termination condition and the results were averaged over 100 independent runs. To make a fair comparison between HAM, ABC, ACO, PSO, and our proposed MBO-ABCFE in this comparative analysis (Table 5), we also show average results over 100 runs. We note that we could not perform the comparison for all 25 benchmarks, as in the comparative analysis with the basic MBO and GCMBO, since not all testing results were available in [98].
In all performed comparative analyses (Tables 3-5), we utilized 50 iterations as the termination condition, as in [98]. The results presented in Table 5 proved the effectiveness of MBO-ABCFE. The algorithm outperformed all other compared metaheuristic algorithms for both the best and mean indicators in seven out of 12 test functions. In the test instances with ID 11, 18, and 20, MBO-ABCFE was not able to establish the best values for the mean indicator; however, it showed the best performance compared to all other competitors for the best indicator. Only in the case of Griewank and Generalized Penalized Function 1 (benchmarks with ID 6 and 10, respectively) did HAM and ACO obtain better results than our proposed MBO-ABCFE.
It should be noted that when the results were averaged over 100 runs (Table 5), for some test instances, MBO-ABCFE established better values for the best and/or mean indicators than in simulations with 50 runs (Table 3). This meant that when tested with more runs, MBO-ABCFE performed even better.
We performed an analysis of the results obtained in the simulations with 100 runs and noticed that, in the test instances where the global optimum was reached, the algorithm managed to converge to the optimal solution in many runs, which in turn also improved the mean values. The implication is that MBO-ABCFE had strong exploration in early iterations (ABC's diversification mechanism), when it converged to the optimal domain of the search space, while in later iterations, due to FA's exploitation ability, it performed a fine-tuned search in and around the promising region of the search space.
Finally, in order to visually represent the improvements of the proposed MBO-ABCFE over the original MBO, we tested MBO-ABCFE with 8000 and 16,000 FFEs for the 20-dimensional and 40-dimensional benchmark instances, respectively, as was done with the basic MBO in [41]. The convergence speed graphs for some test instances are shown in Figure 3; the graphs were generated by using the same number of FFEs as in [41]. As a general conclusion for the unconstrained tests, the proposed MBO-ABCFE significantly improved the performance of the original MBO and also established better performance metrics than the improved GCMBO, as well as the other state-of-the-art metaheuristic approaches, whose results were retrieved from the most current computer science literature sources. Consequently, MBO-ABCFE is promising for real-world applications.

Convolutional Neural Network Design Experiment
In the second part of the simulation experiments, we optimized the CNN hyperparameter values by employing the original MBO algorithm and the proposed MBO-ABCFE metaheuristics. The MBO-ABCFE framework that was developed for tackling CNNs' neuroevolution was named MBO-ABCFE-CNN.
The parameters that were subject to optimization were divided into three categories. The first category included the hyperparameters of the convolutional layer; the second category included the hyperparameters of the fully-connected (dense) layer; and the third category consisted of general CNN hyperparameters.
In Table 6, we summarize the hyperparameter values for each category. For more details regarding the hyperparameters' setup, please refer to Section 4.2. In order to reduce the computation time, the values of all hyperparameters were discretized and defined within lower and upper bounds, and their detailed formulation is described in Section 4.
The population size (SN) in the optimization algorithm was set to 50 solutions, each corresponding to one candidate CNN structure (S). Each CNN structure is encoded as the set S = {C l, FC l, G h}, where C l, FC l, and G h denote the sets of convolutional, fully-connected (dense), and general hyperparameters, respectively. In the convolutional layer category (C l), each convolutional layer consists of nested hyperparameters (number of filters, filter size, activation function, and pooling layer size).
Similarly, in the fully-connected hyperparameter category, each FC-layer further contained hyperparameters, such as connectivity pattern, number of hidden units, activation function, weight regularization, and dropout. The third category, the general hyperparameters, had only one level, with three hyperparameters: batch size, learning rule, and learning rate.
Each individual from the population (one CNN instance with specific hyperparameters) represented a data structure that contained all values (attributes) for each hyperparameter and for each CNN layer. In Figure 4, we show the proposed scheme for encoding the CNN structure.
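The nested encoding described above can be sketched as a simple data structure (an illustrative Python sketch; the field names are assumptions, and the example values mirror the best architecture reported later in this section):

```python
def make_structure(conv_layers, fc_layers, general):
    """Encode one candidate CNN as the set S = {C_l, FC_l, G_h}: a list
    of per-convolutional-layer settings, a list of per-dense-layer
    settings, and the shared general hyperparameters."""
    return {"C_l": conv_layers, "FC_l": fc_layers, "G_h": general}

candidate = make_structure(
    conv_layers=[{"filters": 64, "filter_size": 5,
                  "activation": "relu", "pool": 2}],
    fc_layers=[{"pattern": "dense", "units": 1024, "activation": "relu",
                "regularization": "none", "dropout": 0.5}],
    general={"batch_size": 100, "rule": "adamax", "learning_rate": 1e-3},
)
```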

In the initialization phase of the algorithm, the initial population of CNN structures was generated randomly according to the equation S i,j = min j + rand · (max j − min j), where S i,j denotes the jth parameter of the ith solution in the population, rand is a random number between zero and one, and min j and max j are the lower and upper bounds of the jth parameter. During the optimization process, the algorithm searches for an optimal or near optimal solution and, after 100 iterations, generates optimal and/or near optimal CNN structures. Similarly to [31], we implemented a stop condition: if there was no improvement for 30 iterations, the algorithm stopped. By using the stop condition, we avoided wasting expensive computational resources.
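The random initialization within per-parameter bounds can be sketched as follows. This is an illustrative sketch; the rounding to integers reflects the fact that all hyperparameter genes in this encoding are discrete indices, and the function name and seed are assumptions.

```python
import random

def init_population(bounds, sn=50, seed=42):
    """Initialize SN candidate solutions inside the per-parameter bounds,
    following S_ij = min_j + rand * (max_j - min_j)."""
    rng = random.Random(seed)
    population = []
    for _ in range(sn):
        # One discrete gene per (lower, upper) bound pair.
        solution = [round(lo + rng.random() * (hi - lo)) for lo, hi in bounds]
        population.append(solution)
    return population

# Example bounds: filter-size gene (0..7), activation gene (0..1),
# batch-size gene (0..3).
pop = init_population([(0, 7), (0, 1), (0, 3)])
```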
The reported CNN structure after each run was the solution with the best fitness value. The objective function to be minimized was the classification error rate of the corresponding CNN architecture S. The fitness function was inversely proportional to the objective function, where F(S) denotes the fitness function of CNN structure S and f(S) indicates the objective function of the corresponding structure.
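A minimal sketch of this inverse relationship follows. The text states only that fitness is inversely proportional to the error rate, so the exact form F(S) = 1/f(S) and the guard for a zero error rate are assumptions of this sketch.

```python
def fitness(error_rate: float) -> float:
    """Fitness inversely proportional to the objective (classification
    error rate): lower error means higher fitness. The zero-error guard
    is an assumption of this sketch."""
    return 1.0 / error_rate if error_rate > 0 else float("inf")
```

Under this form, an architecture with a 0.34% error rate receives a higher fitness than one with 0.36%, so minimizing the objective and maximizing the fitness select the same structure.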
We used the same values of the metaheuristic control parameters in the CNN hyperparameter optimization as in the unconstrained benchmark function experiment; the summary of these parameters is presented in Table 1.
In order to speed up the computation, each structure was trained for five epochs with 50% of the data. After completion, the 20 best CNN architectures were fully trained for 30 epochs, and due to the stochastic behavior, the training process was repeated 20 times for each structure. The statistical results of the 20 best solutions (CNN architectures) of the MBO and MBO-ABCFE algorithms are presented in Tables 7 and 8, respectively. The boxplots of the error rate distributions of the best 20 solutions generated by the MBO and MBO-ABCFE algorithms are depicted in Figures 5 and 6, respectively.

Figure 5. Boxplot of the error rate distribution of the best 20 solutions generated by the MBO algorithm.

The minimum classification error rate of the 20 best CNN structures generated by the MBO algorithm ranged between 0.36% and 0.5%, with a median value of 0.44%; on the other hand, the maximum classification error rate ranged between 0.45% and 0.71%, with a median value of 0.575%. In the case of the proposed MBO-ABCFE algorithm, the minimum classification error rate of the 20 best CNN structures ranged between 0.34% and 0.47%, with a median value of 0.415%, and the maximum classification error rate ranged between 0.39% and 0.59%, with a median value of 0.545%.
The best architecture resulted in a 0.36% error rate on the test set. The optimized architecture consisted of two convolutional layers, with 64 filters in the first layer and 128 in the second. The filter size was 5 × 5 in the first layer and 3 × 3 in the second, and a 2 × 2 pooling layer followed each convolution. In the convolutional layers, as well as the FC layer, the ReLU activation function was used. The architecture had only one fully-connected layer before the classification layer, with 1024 hidden units. As regularization, only dropout with a rate of 0.5 was selected. The batch size was 100, and the structure was trained by the adamax optimizer with a learning rate α = 10^-3. The resulting best architecture is depicted in Figure 7. The structures generated automatically by metaheuristic algorithms may have the same or similar design as traditional, hand-crafted CNN structures. The main difference is that, when using evolutionary or other metaheuristics, hyperparameter optimization is performed automatically by the algorithm instead of manually by "trial and error", as in traditional approaches. In both cases (traditional and metaheuristic), the main layers are stacked on top of one another to form the full structure, and additional building blocks can be incorporated. The utilization of a metaheuristic approach allowed the evolution of a better accuracy rate without manual modification of hyperparameter values, which is required in the traditional approach.
As already noted, the state-of-the-art method that was presented and tested under the same experimental conditions [31] was utilized as a direct comparison with our proposed MBO-ABCFE. In [31], a neuroevolution framework using GA and grammatical evolution was proposed, which obtained a classification error rate on the MNIST dataset of 0.37%. Compared to this framework, the original MBO established slightly better performance, resulting in a 0.36% error rate, while the proposed MBO-ABCFE managed to classify the MNIST dataset with only a 0.34% error rate.
A comparative analysis between the proposed MBO-ABCFE metaheuristics, MBO, and the neuroevolution framework using GA and grammatical evolution is given in Table 9. With the goal of establishing a more objective validation of the proposed MBO-ABCFE-CNN framework and providing a more informative and extensive review for the evolutionary computation and deep learning communities, we also included in the comparative analysis the results of some traditional CNNs tested on the same dataset. All results were retrieved from the relevant literature sources.

Table 9. Comparative analysis of the classification error rate on the MNIST dataset.

Method                                                Error Rate (%)
CNN LeNet-5 [1]                                       0.95
CNN (2 conv, 1 dense, ReLU) with DropConnect [22]     0.57
CNN (2 conv, 1 dense, ReLU) with dropout [22]         0.52
CNN (3 conv maxout, 1 dense) with dropout [101]       0.45
CNN with multi-loss regularization [102]              0.42
Verbancsics et al. [103]                              7.9
EXACT [104]                                           1.68
DEvol [105]                                           0.60
Baldominos et al. [31]                                0.37
MBO-CNN                                               0.36
MBO-ABCFE-CNN                                         0.34

The visual representation of the comparative analysis between the proposed MBO-ABCFE-CNN framework and some of the methods from Table 9 is given in Figure 8. It can be concluded that the proposed MBO-ABCFE approach is very promising for the application of CNN design. The metaheuristic approach led to very good results, and it did not require expertise in the domain of convolutional neural networks for fine-tuning the hyperparameters. Since the design was done automatically, it saves the time and effort of the researchers conducting the experiments.

Conclusions
Convolutional neural networks, a fast growing field, have a wide range of applications in different areas and represent important machine learning methods. One of their major limitations is that designing a structure for a given problem requires fine-tuning of the hyperparameter values in order to achieve better accuracy. This process is very time consuming and requires a lot of effort and researchers' expertise in this domain.
In the literature survey, research that addresses this problem can be found. The general idea is to develop an automatic framework for generating CNN structures (designs) that perform specific classification tasks with high accuracy; recently, the term "neuroevolution" was proposed for generating such networks. Since CNN hyperparameters' optimization belongs to the group of NP-hard challenges, all of this research has proposed heuristic and/or metaheuristic methods for solving it.
The research proposed in this paper is an extension of our previously conducted simulations and experiments from this domain. However, in the research that was presented in this paper, with the objective to generate network structures that would perform a given classification task with higher accuracy than other networks, we included more CNN hyperparameters for the optimization process than previously presented works from this domain: the number of convolutional layers along with the number of kernels, the kernel size and activation function of each convolutional layer, the pooling size, the number of dense (fully-connected) layers with the number of neurons, the connectivity pattern, the activation function, weight regularization, the dropout for each dense layer, as well as the batch size, learning rate, and rule as general CNN hyperparameters.
We tried to generate state-of-the-art CNN structures by developing an automatic framework using hybridized MBO swarm intelligence metaheuristics, which was also proposed in this paper. By conducting practical simulations with the original MBO, we noticed some deficiencies that were particularly emphasized in its exploration process and in the established balance between intensification and diversification. Moreover, we concluded that the basic MBO's exploitation process could be further improved. To address these issues, we hybridized the original MBO with ABC and FA swarm algorithms. We first adopted the exploration mechanism and one parameter that adjusted intensification from the ABC metaheuristics. Second, we incorporated a very efficient FA search equation into our approach.
In compliance with the established practice in the scientific literature, the proposed hybridized MBO-ABCFE was firstly tested on a standard group of unconstrained benchmarks, and later, it was applied to the practical CNNs' neuroevolution problem. For the tests on unconstrained function instances, we performed comparative analysis with the original MBO, as well as with one other improved state-of-the-art MBO algorithm. Our proposed MBO-ABCFE proved to be a significantly better approach in terms of convergence (mean values) and also in the results' quality (best individuals).
In the second group of practical simulations, by using the proposed MBO-ABCFE, we developed the framework for CNN hyperparameters' optimization and performed tests on the standard MNIST dataset. Since the implementation of the original MBO for this challenge was not found in the literature survey, to establish a more precise comparative analysis, we also implemented original MBO for this problem. The proposed hybrid metaheuristics was compared with several other state-of-the-art methods that were tested on the same dataset and under the same experimental conditions, and we obtained better classification accuracy. Moreover, the original MBO managed to establish high classification accuracy.
The scientific contributions of the presented research can be summarized as follows:
• An automated framework for "neuroevolution" based on the hybridized MBO-ABCFE algorithm, which managed to design and generate CNN architectures with high performance (accuracy) for image classification tasks, was developed;
• In the CNN hyperparameters' optimization problem, we included more CNN hyperparameters than most previous works from this domain;
• We managed to enhance the original MBO approach significantly by performing hybridization with other state-of-the-art swarm algorithms; and
• The original MBO was implemented for the first time for tackling the CNNs' optimization challenge.
The detailed description of the experimental conditions along with the control parameters' values were presented in this paper.
Since the domain of the proposed research represents a very promising area, in our future work, we plan to continue research, experiments, and simulations with CNNs' design challenge. We will also try to improve other swarm intelligence algorithms and adjust them to tackle this problem. Moreover, we plan to invest significant effort into developing frameworks that will include even more CNN hyperparameters in the optimization process and perform tests on other datasets as well, like CIFAR-10, SEED, and Semeion Handwritten Digits.