Optimizing Convolutional Neural Network Hyperparameters by Enhanced Swarm Intelligence Metaheuristics

: Computer vision is one of the most frontier technologies in computer science. It is used to build artiﬁcial systems to extract valuable information from images and has a broad range of applications in various areas such as agriculture, business, and healthcare. Convolutional neural networks represent the key algorithms in computer vision, and in recent years, they have attained notable advances in many real-world problems. The accuracy of the network for a particular task profoundly relies on the hyperparameters’ conﬁguration. Obtaining the right set of hyperparameters is a time-consuming process and requires expertise. To approach this concern, we propose an automatic method for hyperparameters’ optimization and structure design by implementing enhanced metaheuristic algorithms. The aim of this paper is twofold. First, we propose enhanced versions of the tree growth and ﬁreﬂy algorithms that improve the original implementations. Second, we adopt the proposed enhanced algorithms for hyperparameters’ optimization. First, the modiﬁed metaheuristics are evaluated on standard unconstrained benchmark functions and compared to the original algorithms. Afterward, the improved algorithms are employed for the network design. The experiments are carried out on the famous image classiﬁcation benchmark dataset, the MNIST dataset, and comparative analysis with other outstanding approaches that were tested on the same problem is conducted. The experimental results show that both proposed improved methods establish higher performance than the other existing techniques in terms of classiﬁcation accuracy and the use of computational resources.


Introduction
Convolutional neural networks (ConvNets or CNN) are biologically-inspired structures and represent a special type of neural network. CNN is most commonly used to process 2D signals for digital image processing, but it can also be used to process 1D signals. ConvNets are very successful in different visual tasks, such as classifying images, detecting objects in pictures [1], segmenting images [2], generating image descriptions [3], recognizing faces [4], pose estimation [5], and others. LeCun et al. proposed the first CNN in 1998 [6]. It was successfully applied to digit recognition. AlexNet [7] was the first architecture that achieved significant results, popularized CNN, and brought a revolution in computer vision. AlexNet has a very similar architecture to LeNet, but it is much deeper. Some of the other well-known architectures are VGG [8], and more modern networks are GoogleNet [9], ResNet [10], DenseNet [11], and SENet [12].

Research objectives and paper's structure
The network's performance highly depends on the hyperparameters' configuration, and the most critical issue is that there is no general method for determining the optimal set of CNN's hyperparameters and for designing the networks' structure. Moreover, another challenge is that there is no general architecture that can be applied for every problem ; each particular problem instance requires an individually designed CNN architecture. Considering that, the technique for finding the right set of hyperparameters and defining the CNNs architecture is commonly known as "guesstimating" (estimating by using guess).
The process of guessing the right CNN architecture (value of hyperparameters) is both computational resource intensive and time consuming [13] and requires a trial and error approach. For this reason, an approach that is able to choose the right CNN architecture automatically for a particular problem should be devised. Based on the facts that the swarm intelligence algorithms proved to be robust and efficient methods in solving NP hard tasks like CNN hyperparameter optimization and that those methods have already successfully been applied to this problem, the basic motivation behind the research is to develop improved swarm intelligence approaches that will generate CNNs architectures that are able to outperform other architectures in terms of classification accuracy.
According to the statement above, the research question behind the presented research can be formulated as follows: Is it possible to further enhance the classification accuracy of generated CNN architectures by using swarm intelligence algorithms?
The primary objective, as well as the scope of the research that is presented in this paper are to develop automatically a method by using swarm intelligence algorithms for designing CNN architectures that are able to obtain greater performance than existing methods from the literature. The secondary objective is to improve existing versions of two swarm intelligence algorithms (the tree growth algorithm and the firefly algorithm) by enhancing the exploitation and/or exploration processes of the original approaches.
When a new method is devised, it is good practice to test it on a standard set of benchmark functions. Only in this way, the real performance metrics of efficiency and robustness of a particular method can be established. For this reason, we first tested the devised enhanced algorithms on a wider set of standard unconstrained benchmarks, and afterwards, we applied them to a practical NP hard CNN hyperparameter optimization challenge for the image classification task.
In the practical simulations, the hyperparameters that were taken into account during the optimization process were the number of convolution layers along with the number of kernels per layer and the kernel size and number of fully-connected (dense) layers along with the size of each fully-connected layer. We did not include the type of pooling layer, activation function, and regularization technique in the automatic search because of the limitations of available computing power.
To establish the performance of the proposed methods, we utilized simulation as the research methodology in a standard environment and with a standardized dataset.
The experiments were conducted on the MNIST benchmark dataset [70], which was comprised of handwritten digits from zero to nine. Each image was in grayscale, and it consisted of 28 × 28 pixels. The dataset consisted of 70,000 labeled samples. In the research, 50,000 samples were used for training, 10,000 samples for validation, and 10,000 samples for testing purposes. The grayscale images that had pixel values between zero and 255 were normalized in the range from zero to one.
The evaluation study results were compared with similar techniques that utilized metaheuristic algorithms and automatically designed the CNN architectures and were tested under the same experimental conditions [17]. We also performed comparative analysis with swarm approaches, tested under different conditions, that focused on dropout probability estimation [18] in order to further validate our approaches. Furthermore, the results were also compared to state-of-the-art architectures that were manually designed. In order to present unbiased performance comparison, we utilized the initial parameter settings in a framework similar to [17].
The scientific implications of the presented research are twofold: first, we managed to devise a framework that performed classification with greater accuracy than other state-of-the-art methods for this particular dataset, and secondly, we were able to improve basic versions of the tree growth and firefly algorithms.
We note that we have been researching this domain for some time and that the research presented in this paper is the extension of our previous research [71,72].
The remainder of this paper is as follows: Section 2 gives a concise explanation of the problem statement. Section 3 provides a description of the original swarm algorithms and the proposed enhanced versions. In Section 4, we present the details of the improved swarm implementations for the CNNs' design and the parameter setup. In Section 5, we show the results of the proposed method and the performance comparison study with other state-of-the-art algorithms, while Section 6 concludes the paper.

Problem formulation and related literature review
In this section, we first give a literature review of the metaheuristics applications for CNN optimization. Afterwards, we show a mathematical formulation of the model that was optimized in the experimental section by swarm intelligence algorithms.

Metaheuristics applications for CNN optimization
Based on the literature survey, it can be seen that both types of metaheuristics, evolutionary algorithms and swarm intelligence, have many applications and implementations in the domain of CNNs and computer vision. These methods were validated against various benchmark and practical datasets. For example, in [73], an approach that combined GA and CNNs was proposed for non-invasive classification of glioma using magnetic resonance imaging (MRI). The structure of the deep network was evolved by using GA, and in this way, a typical selection of neural network for a particular problem based on trial and error was avoided. The proposed method was validated against two case studies. In the first case study, 90.9% classification accuracy for three glioma grades was obtained. In the second study, glioma, meningioma, and pituitary tumor types were classified with 94.2% accuracy. A framework for automating the CNN design by using EAs implemented in the Python programming environment was shown in [74]. The framework utilized two EAs: GA and grammatical evolution. It was proven that the proposed framework managed to obtain better results than many state-of-the-art CNN architectures.
Another interesting method was presented in [20], where the authors proposed the gradient-priority PSO (GPSO) with gradient penalties for selecting the CNN structure. The proposed method was tested by using three types of emotion states for each subject and simultaneously collecting EEG signals corresponding to each emotion category. Experimental results proved that the GPSO managed to obtain competitive classification performance over the dataset. An orthogonal learning particle swarm optimization (OLPSO) algorithm that optimized hyperparameters' values for VGG16 and VGG19 CNNs that have been developed for the task of plant disease diagnosis by classifying the leaf images as healthy and unhealthy was presented in [14]. By performing practical simulations, the authors proved that their approach managed to obtain higher performance than other methods that were tested for the same datasets. Another state-of-the-art method was presented in [75]. The developed method fell into the group of hybrid stochastic training algorithms. The authors employed the relatively novel grasshopper optimization algorithm (GOA) for multilayer perceptrons (MLPs) neural networks (GOAMLP). The GOAMLP model was applied to five practical datasets: Parkinson's, diabetes, breast cancer, coronary heart disease, and orthopedic patients.
In [13], a novel variant of the PSO algorithm, named cPSO-CNN, was proposed for determining the CNN architecture. The cPSO-CNN method enhanced the canonical PSO's exploration ability and redefined PSO's scalar acceleration coefficients as vectors to better adapt to the variant ranges of CNN hyperparameters. It was proven that the cPSO-CNN established competitive performance in terms of accuracy and overall computation costs when compared to seven other state-of-the-art methods. One more PSO-based approach used for defining CNN architecture that is worth mentioning was presented in [19]. By using their PSO adaptation, the authors managed to obtain better parameter settings than those found by the AlexNet configuration for five different image datasets.

Mathematical model
In this part of this section, we establish the context of this research. This research aims to develop an efficient method for automatically optimizing the hyperparameters and designing a CNN structure for a specific image classification task.
More details regarding the CNNs layers, evolution, and functional principles can be retrieved from [20,76].
The performance of the CNN architecture highly depends on the hyperparameters' values. Different real-world visual tasks require different network structures; thus, each network has to be configured independently, and the appropriate set of hyperparameters varies from task to task. Finding the right set of hyperparameters for a given problem is a very time-consuming process, and it requires knowledge from the domains of deep learning and CNNs. To address this issue, we propose an enhanced swarm intelligent-based method to search automatically for the right set of hyperparameters to build a structure that will produce high classification accuracy and that can be used by users who do not have prior knowledge from the area and can apply it to a specific problem of interest.
Since the optimization of all CNN hyperparameters would be a very time and computational resource consuming process, in the proposed research, we optimized the following CNN hyperparameters: the number of convolution layers, the number of filters, the filter size, the number of FC layers, and the hidden units in the FC layer.
In this work, the optimal CNN architecture was encoded as a set of hyperparameters, which is defined as: where h conv denotes the set of convolutional layer parameters and h f c is the set of FC layer parameters. The set of convolutional layer parameters is defined as: where the number of convolutional layers is denoted by n. Each convolution layer in the ith layer contains a two-tuple: where the pairs k count and k size are the number of filters and the size of the filters of the ith layer, respectively. The set of the fully-connected layers is formulated as: where s represents the hidden unit number in the FC layer of the ith layer and m is the number of FC layers. By H, we denote the possible set of CNN architecture configurations. The aim is to find a configuration h ∈ H that minimizes the error rate of a given task. Thus, in our research, we took the classification error rate (err) as an objective function.
This problem belongs to the class of NP hard challenges, and therefore, exact (deterministic) optimization algorithms are not applicable. However, swarm intelligence algorithms showed very promising results in solving NP hard optimization problems, and that was the basic reason why we implemented and adapted swarm intelligence algorithms for CNN hyperparameters' optimization.

Proposed swarm intelligence approaches
In this section, we first describe the original tree growth algorithm [39] followed by our proposed enhanced version. Afterwards, the original procedure of the second metaheuristic algorithm, the firefly algorithm [31], is described along with our proposed hybrid firefly algorithm.

Original tree growth algorithm overview
The tree growth algorithm (TGA) is a novel nature-inspired metaheuristic optimization algorithm, proposed by Cheraghalipour et al. in 2018. The algorithm was tested on 30 standard benchmark functions, three engineering benchmarks, and two industrial engineering optimization problems; the obtained results showed the robust performance of the algorithm.
TGA was inspired by the competition of trees in a forest. Trees are competing for absorbing sunlight and food. Two main phases of the algorithm are exploration and exploitation, respectively. In the exploration phase, the trees move towards the sunlight, which allows exploring new places. In the exploitation phase, the trees are already satisfied with the light; hence, they move toward better nutrients in the roots, analogous to moving toward the local or global optimum. The population of trees is divided into four groups in the forest: 1. Group of the best trees; 2. Group of trees that are competing for light; 3. Remove or replace group; 4. Reproduction group.
The first group of trees is satisfied with the light source, and they compete for a food source. In the second group, trees move between the two closest best solutions to compete over light. In the third group, the worst trees are removed and replaced by new trees. In the last group, new plants are created by the best trees.
In the first step, the algorithm randomly generates the initial population of trees (solutions) within the lower and upper bounds; and the fitness value is evaluated for each solution. The initial population is generated according to the following formula: where x i,j represents the jth parameter of the ith solution in the population, rand is a random number drawn from a uniform distribution, and min j is the lower bound and max j the upper bound of the jth parameter.
In the next step, the population is sorted according to the value of the fitness, and the current best solution at the jth iteration is found. The global best solution is denoted by T j GB . The best solutions are assigned to the first group (N 1 ), and the solutions form this population conduct the local search by using the following formula: where T j+1 i is the new ith solution, T j i represents the ith solution in the jth iteration, the reduction rate of power is denoted by θ, and r is a random number between zero and one.
The greedy selection is applied to the newly generated solutions; if the new solution has a better fitness value than the current solution, the new solution replaces the current one; otherwise, we keep the current solution for the next generation.
The second-best solutions are assigned to N 2 subpopulation. Each solution from the N 2 group should be moved between the two closest solutions (from the first and second subpopulation) under different α angles. The distance between the solutions is calculated by the Euclidean distance as follows: where d i represents the distance of the ith solution, the current tree is denoted by T j N 2 , and the ith solution in the population is denoted by T j i . After calculating the distance, the current solution moves toward the two nearest solutions with the minimal distance. The linear combination of the two closest solutions is calculated as follows: where x 1 and x 2 are the first and second nearest solutions and λ is the control parameter, the values of which are between zero and one.
To move the solution between the nearest solutions under different α angles, the following formula is used: where the current solution is denoted by T j N 2 ; the angle of the ith solution is denoted by α i , and its value is between zero and one; y is the linear combination of the two adjacent solutions.
The third subpopulation N 3 contains the worst solutions from the population. These solutions are replaced by newly generated random solutions. N 3 is calculated as follows: (11) where the population size is denoted by N; N 1 and N 2 are the first and second subpopulations, respectively.
In the next step, the new population (N) is created by summing the first three groups (N 1 , N 2 , and N 3 ).
The final group N 4 consists of randomly generated new solutions. Each solution in N 4 is modified by using the mask operator with respect to the best solution from the first group (N 1 ) and the modified solutions merged with the population. The new population is sorted according to the fitness value, and the best N solutions are selected for the next iteration. The process is repeated until the termination criteria are met. In the final step, the best optimal solution is selected.

Enhanced tree growth algorithm
In every metaheuristics algorithm, it is essential to maintain the trade-off between the exploration and exploitation processes. The identified deficiencies of the original TGA metaheuristics refer to the insufficient exploitation power and to the inappropriate balance between exploitation and exploration. More about TGA's deficiencies can be found in [77,78].
In this work, we propose a modified TGA algorithm that enhances the exploitation capability of the original TGA with an additional operation in the algorithm's procedure. The proposed novel enhanced algorithm is named exploitation enhanced TGA (EE-TGA). The modification speeds up the convergence rate of the algorithm, and at the same time, the search process reaches the global optima faster.
In the best solution group (N 1 ), the first 40% of solutions in the subpopulation are scattered around the current best solution by using a normal distribution. The solution's position is updated according to the following formula: where T j+1 i is the new solution; T j GB denotes the current best solution; N(·) is the normal distribution, where the mean is the current best solution and the variance at iteration t is σ 2 . The variance determines the step size, and it is decreasing over the course of the iteration.
The distribution around the current best solution is calculated as follows: Due to the implementation of the normal distribution, the new solutions are closer to the best solution, which yields increasing the possibility to find the global optimum faster.
As mentioned above, the step size (σ) is decreasing as the algorithm approaches the maximum iteration and gets closer to zero, σ t → 0. The step size updates at every iteration t according to the following equation: where t is the current iteration, σ 0 indicates the initial step size, and t max is the maximum iteration. The pseudocode of the proposed EE-TGA optimization algorithm is shown in Algorithm 1.

Initialization:
Randomly generate the initial population of N trees x i , i = 1, 2, . . . , n Define the initial step size σ Set the number of maximum iterations (t max ) Set the current iteration (t) to 1 Calculate the fitness of each solution Sort the population according to the fitness value while t max do Perform local search by using Equation (13) If the new solution has a better fitness value than the current solution, replace the current solution with the new one Update the step size by utilizing Equation (15) end for Perform local search by using Equation (6) If the new solution has a better fitness value than the current solution, replace the current solution with the new one end for Calculate the distance between the solutions by using Equation (7) and Equation (8) Move solutions between the two nearest solutions by using Equation 9 and Equation (10) If the new solution has a better fitness value than the current solution, replace the current solution with the new one end for Remove N 3 worst solutions, and replace with new random solutions Create new population N by using Equation (12) Generate N 4 subpopulation and apply the mask operator Evaluate the fitness of each solution Sort the population according to the fitness value Select N best solutions for the next iteration end while Return the best solution In the presented pseudocode, t max represents the maximum number of iterations in one algorithm's run.

Original firefly algorithm overview
The firefly algorithm (FA) was proposed by Xin-She Yang in 2008. In this algorithm, the fireflies represent the solutions. The fireflies are attracted by other fireflies based on their brightness level, and each solution moves toward brighter solutions. The algorithm procedure follows three simple rules: 1. Each firefly is unisex and can be attracted by any other firefly; 2. The brightness determines the fireflies' attractiveness; as the distance become shorter, the attractiveness of fireflies increases, and they become brighter; 3. The fireflies' brightness determines the fitness function.
The solutions are generated randomly within the lower and upper bounds: where lb j and ub j are the lower and upper bound, respectively. The solution is denoted byx i,j ; and rand is a random number between zero and one. How close two solutions are to each other is calculated by the Euclidean distance, as follows: where r i,j indicates the distance between the ith and jth solutions and d denotes the dimension of the problem. The light intensity is calculated according to the following formula: where the source's light intensity is denoted by I 0 , the light intensity is denoted by I(r), where r indicates the distance, and γ represents the light absorption coefficient. The attractiveness level between two solution is evaluated by the following formula: where β indicates the attractiveness and the attractiveness at distance zero is denoted by β 0 . The solutions reach a new location in the search space according to the movement formula, which is defined as follows: where x t+1 i denotes the new position of the solution; x t i denotes the current position, which moves toward the brighter solution x j ; α is a control parameter, and its value is drawn from the Gaussian distribution; and rand represents a random number between zero and one.

Enhanced firefly algorithm
In the original FA, the exploitation and exploration balance mostly depends on the value of parameter α, which is in most studies dynamically adjusted during one run of the algorithm according to the following equation [49,[79][80][81]: where t max denotes the maximum number of iterations in one run. It means that at the beginning of a run, the value of α is greater (higher exploration power), and as the search progresses during iterations, the value of α is decreased (and thus, the exploration power decreases in favor of exploitation).
Despite the fact that the original FA exhibits relatively stable and efficient exploitation, according to the results of the performed simulations on standard unconstrained benchmarks, we concluded that the exploitation process of the original FA could be further enhanced, as well as the intensification-diversification trade-off by incorporating an additional procedure in the original FA algorithm.
Moreover, the original FA approach in some runs shows the behavior of premature convergence because the whole population converges towards the current best solutions. For this reason, we enhanced the exploration of the original FA by incorporating a procedure that introduced randomness in the population.
By incorporating two procedures: one that enhances the exploitation and one that improves exploration, we created the exploitation and exploration enhanced FA (E 3 -FA).
The first procedure, the one that enhances the exploitation power of the original FA, introduces a new randomization parameter in the algorithm p, which determines whether the position of a solution is being updated by Equation (20) or will be distributed around the current best solution by utilizing the normal distribution: where x t+1 i is the new solution; x t * ,j j denotes the current best solution; N(·) is the normal distribution, where the mean is the current best solution and the variance at iteration t is σ 2 . The variance determines the step size, and it is decreasing over the course of iterations.
The distribution around the current best solution is calculated as follows: By incorporating Equation (23), the new solutions are closer to the best solution, which yields increasing the possibility to find the global optimum faster.
As mentioned above, the step size (σ) is decreasing as the algorithm approaches the maximum iteration and gets closer to zero, σ t → 0. The step size updates at every iteration t according to the following equation: where t is the current iteration, σ 0 indicates the initial step size, and t max is the maximum iteration.
The second procedure (one that enhances exploration) operates in a way that at the end of each iteration, the solutions are sorted according to their fitness value, and the last 20% of worst solutions are replaced by a new random solution utilizing the initialization formula (Equation (16)). In this way, if an algorithm converges to the wrong part of the search space, new, random solutions are generated, and the trap of being stuck in suboptimal regions is avoided.
The pseudocode of the proposed E 3 -FA is given in Algorithm 2.

Algorithm 2 Pseudocode of the exploitation and exploration enhanced FA (E 3 -FA).
Objective function f (x) Randomly generate the initial population of N fireflies x i , i = 1, 2, . . . , n Calculate the light intensity by I i = f (x i ) Set light absorption coefficient γ Set the iteration counter t to zero, and define the number of maximum iterations t max Initialize the step size σ at t = 0 while t < t max do Move solution x i toward x j by using Equation (20)  In the presented pseudocode, t max represents the maximum number of iterations in one algorithm's run.

CNN optimization simulation environment, the algorithms' adaptations, and parameter setup
This section represents the first empirical section of this paper, and it is devoted to the simulation environment setup for CNN architecture simulations. In the first part of this section, we show the EE-TGA and E 3 -FA algorithms' adaptations and the solution encoding scheme for CNN hyperparameter optimization. The second part of this section gives detailed information regarding the environment, as well as the algorithms' control parameter values that were utilized in practical CNN experiments.
The same experimental conditions and setup were used in our previous research [71,72], as well as in [17].

Solutions' encoding and algorithms' adaptations
As was already mentioned in Section 1, CNN hyperparameters that were subject to optimization in the proposed model included the number of convolution layers along with the number of kernels per layer and the kernel size and number of fully-connected (dense) layers along with the size of each fully-connected layer. We did not include the type of pooling layer, activation function, regularization technique, and other CNN hyperparameters in the automatic search because of the limitations of available computing power.
A detailed description of the utilized mathematical model is given in Section 2.
The possible hyperparameter values that were subject to the optimization in this study were defined within the lower and upper bounds (these bounds are given in Subsection 4.2), with the aim to exclude infeasible solutions of CNN architectures and reduce the utilization of computer resources.
The population at iteration t consisted of the set of N CNN configuration solutions, and the set is defined as follows: where each configuration h consists of h conv and h f c . According to the mathematical formulation of the utilized model, given in Section 2, each solution in the population (network configuration h) consisted of variable length parameters. Since we did not include continuous parameters like the learning rate, all parameters took only discrete values.
The number of solution parameters was different for each solution in the population and depended on the number of convolutional layers (where each convolutional layer was further determined by two parameters: the number of kernels and the kernel size) and the number of FC network layers. For example, if a generated solution (network configuration) consisted of three convolutional layers and four FC layers, then such a solution would be encoding with 10 discrete parameters (3 · 2 + 4).
A visual representation of the solutions' encoding scheme is given in Figure 1. To decrease the number of possible network configurations and to reduce the search domain, the convolutional layers were sorted in descending order, first concerning the filter size (k size ) and then based on the number of filters (k count ). The value of s, which denotes the FC layer size, was utilized for sorting dense layers.
In the beginning, the initial population of solutions (convolutional neural network structures) was randomly generated within the defined lower and upper bounds by using the following equation: where h i,j represents the jth parameter of the ith solution in the population, rand is a random number drawn from a uniform distribution, and min j is the lower bound and max j the upper bound of the jth parameter. Due to the relatively large range of parameter values, we did not adapt the proposed approaches for discrete optimization, and we just rounded the results to the nearest integer. After generating the initial population, we evaluated the fitness value of each solution, and the current best solution in the population was determined. The fitness function that was utilized was simply inversely proportional to the classification error rate (err) of the respective CNN architecture h. For calculating the fitness of solution i, the following expression is used: where the fitness function of CNN structure h i is denoted by F(h i ) and err i represents the objective function of the concerned structure. The fitness calculation process utilized much computational resource because each fitness evaluation required a CNN to be developed, trained, and validated against the dataset. For this reason, within the evaluation procedure of both devised approaches, EE-TGA and E 3 -FA, we used classes from the deep learning programming library Deeplearning4j in order to generate and validate a network of a certain structure.
The high-level procedure of the CNN framework for both proposed algorithms is given in Algorithm 3. In the presented pseudocode, t max represents the maximum number of iterations in one algorithm's run.
The CNN optimization frameworks based on the EE-TGA and E 3 FA are named EE-TGA-CNN and E 3 FA-CNN, respectively.

Parameter setup
In the following few paragraphs, we briefly describe the EE-TGA and E 3 -FA control parameter values that were utilized in the simulation with CNN hyperparameter optimization. The same environment was utilized in [71,72].
In EE-TGA, the population size N was set to eight CNN structures, and subpopulations N 1 and N 2 were set to three. The algorithm was iterated in 20 iterations (t max = 20). In each iteration, two worst solutions were removed from the population, and instead of them, two new random solutions were generated (N3 = N − (N1 + N2)). In the fourth group (N 4 ), in each iteration, one new solution was generated and modified with the max operator. The values of the control parameters λ and θ were set to 0.5 and 0.2, respectively. The initial step size σ 0 was set to three.
In the second enhanced metaheuristic, E 3 -FA, the initial population size, maximum iteration, and the initial step size had the same values as in EE-TGA. The α randomization parameter was set to 0.5, the light absorption coefficient γ to 1.0, and the attractiveness at a distance of zero to 0.2. The metaheuristic control parameter setup of both enhanced algorithms is summarized in Table 1. The control parameters for the original TGA and FA were the same as for the enhanced versions, except the step size, which was not used in the basic TGA and FA implementations.  The search space of the number of convolutional layers was set within the range between zero and six; the lower and upper bounds of the number of fully-connected layers were set within the interval [1,4]; and the lower and upper boundaries of the filter size were adjusted between one and eight, respectively. The number of filters was set between one and 128, and the hidden unit number in the FC layer was bounded between values of 16 and 2048.
We restricted the possible set of CNNs hyperparameters' values to reduce the search space and to exclude the most infeasible CNN architectures. Lower and upper bounds for each hyperparameter were determined empirically and in this way, the proposed framework was much more efficient.
Moreover, we wanted to start the search process by generating less complex and, thus, less expensive CNN configurations. For this reason, in the initialization phase, we set an even narrower range of possible values for the number of convolutional layers (h conv ) and the number of FC layers (h f c ). Please refer to Table 2.  [16,2048] With the objectives of reducing the amount of parameters (weights), the computational burden, and to control overfitting, for each convolutional layer, we used a pooling (downsampling) layer of a fixed size of 2 × 2 along with the max pooling option. Further, we utilized the ReLU (rectified linear unit) activation function and employed the Adam optimizer with a fixed learning rate of θ = 0.0001 and the mean squared error (MSE) loss function. We used valid convolution and did not apply padding on the input volume. None of the regularization techniques were employed.
To make a better and unbiased comparative analysis, the initial CNN hyperparameter configuration was similar to the configuration in [17]. The network was trained in one epoch with a batch size of 54. The hyperparameters that were not subject to the optimization were set to fixed values, and they are presented in Table 2.
The TGA, FA, EE-TGA, and E 3 -FA metaheuristics were developed in Java SE 10 (18.3). The framework was developed and implemented by using deep learning programming library Deeplearning4j (https://deeplearning4j.org/), which was written for Java and Java Virtual Machine (JVM). To improve the execution speed of the performed study, due to the large utilization of computer resources, we adapted the framework for execution on a graphical processing unit (GPU) using NVIDIA's ™CUDA parallel computing platform and programming model. All experiments were performed on the computer platform with six NVIDIA GTX 1080 GPUs, each having 2560 ™CUDA cores.

Experimental results and discussion
The experimental section is divided into two parts. In the first part, we show the simulation results of our proposed metaheuristics for unconstrained benchmark functions along with comparative analysis with the basic versions, the control parameters' setup, and the mathematical formulations of the test functions.
When a modified method is created, its effectiveness should first be validated against a wider set of benchmarks, and the performance should be compared with the original method. This is the only way to establish a real improvement of the enhanced method over the original one.
In the second part, we describe the benchmark dataset that was used for the CNN hyperparameters' optimization, and we present the obtained experimental results along with a comparative analysis with other, similar methods, the results of which are retrieved from the modern computer science literature.

Performance evaluation of EE-TGA and E 3 -FA on standard unconstrained benchmarks
For the sake of easier following, this subsection is also divided into two parts. The first part presents the enhanced tree growth algorithm (EE-TGA) performance evaluation, and the second part shows the evaluation of the enhanced firefly algorithm (E 3 -FA) for standard unconstrained benchmarks along with respective comparative analysis with original versions.

Performance evaluation of EE-TGA
For the purpose of the EE-TGA evaluation, we tested the algorithm on ten standard benchmark functions. The details of the functions and their characteristics are described in Table 3 and Table 4, respectively. Table 3. Benchmark function details (EE-TGA simulations).

ID Function Name Function Definition
The parameters of the EE-TGA were set up according to the best results of the parameter tuning in the paper [39] to make an unbiased comparison. The maximum iteration was set to 250 for the problems with a large dimension. In contrast, for the problems with a small dimension, the maximum iteration was set to 50, and we obtained the test results from 30 independent trials of the algorithms. The population size N was set to 100, and the subpopulations N 1 , N 2 , and N 4 are set to 20, 20, and 30, respectively. The control parameter λ was set to 0.1, and θ had a value of 0.8. Specific control parameters of the EE-TGA were set as shown in Table 1.
The comparison between the original TGA and the EE-TGA is given in Table 5, where the results of the TGA algorithm were taken from the original TGA paper [39]. We took the best, mean, and standard deviation as performance metrics. In the presented table, the bold values indicated better results. We note that we have also performed tests with the basic TGA and that we got the same results as in [39].   From the results presented in Table 5, it can be clearly seen that EE-TGA obtained better solution quality in most tests by improving the exploitation of the original TGA.
Since the most significant performance difference between EE-TGA and the original TGA could be noticed in the F9benchmark, in Figure 2, we show the convergence speed graph for this problem of the best generated solutions. In the case of this test, the mean values generated by the EE-TGA were much better than the value obtained in the best run by the original TGA.

Performance evaluation of E 3 -FA
For the purpose of the E 3 -FA evaluation, we tested the algorithm on eight standard benchmark functions. The details of the functions and their characteristics are described in Table 6 and Table 7, respectively. Table 6. Benchmark function details (E 3 -FA simulations).

ID Function Name Function Definition
F1 Michalewicz Function F5 Rastrigin Function The parameter configuration of E 3 -FA was conducted based on the suggested control parameter values in [31]. The simulation of E 3 -FA was conducted in 100 independent runs, and the maximum number of iterations was set to 100. The population size was set to 40, α to 0.2, the light absorption coefficient (γ) to 1, and β 0 = 1. The E 3 -FA specific parameters were set as shown in Table 1.
Sample plots of the benchmark functions are depicted in Figure 3. The comparison of the FA and E 3 -FA is presented in Table 8, where the results of FA were taken from [82]. However, after conducting the simulations of the benchmark functions with FA, we got the same results as in [82]. As in the experiment with EE-TGA, we took the best, mean, and standard deviation as the performance metrics. The bold values in Table 8 refer to the best values. Similar to the case of EE-TGA simulations, based on the results presented in Table 8, the E 3 -FA approach managed to establish better exploitation and exploration than the original FA and, in most benchmark functions, obtained better results.
The comparative analysis of convergence speed between the FA and E 3 FA for F5 benchmark is given in Figure 4. From the presented figure, it can be seen that E 3 FA managed to obtain the optimum solution somewhere between the 80th and 85th iteration.

EE-TGA and E 3 FA benchmark simulations' overall conclusion
Based on the experimental output of the unconstrained functions' simulations, it could be concluded that in the cases of both enhanced implementations, the algorithms outperformed the original metaheuristics. In [31,39], TGA and FA were compared to other metaheuristic algorithms, and comparing those algorithms to our proposed approaches EE-TGA and E 3 -FA, it could be concluded that our algorithms outperformed most of the metaheuristics used for comparison purposes in [31,39].
In the case of EE-TGA, the global optimum was reached at seven functions. On the other hand, the TGA algorithm reached the global optimum only at five functions out of ten. The second enhanced algorithm, E 3 -FA, reached the global optimum on four unconstrained functions, while the original FA managed to do the same for only three functions. Hence, EE-TGA had a 63.6% success rate at reaching the global optimum versus 45.5% fir TGA. E 3 -FA reached the global optimum in 50% and the original FA in 38% test cases.

Convolutional neural network hyperparameter optimization experiments
In the second part of the experimental section, first we describe the benchmark dataset used for the purpose of CNN hyperparameter optimization and architecture design, then the simulation results are presented along with the comparative analysis with similar methods.

Benchmark dataset
The TGA, FA, and the proposed EE-TGA and E 3 -FA methods were tested and validated on the MNIST benchmark dataset [70]. The MNIST database is a large database of handwritten digits from zero to nine; each image is in grayscale; and it consists of 28 × 28 pixels. The dataset consists of 70,000 labeled samples; in the experiment, 50,000 examples were used for training, 10,000 examples for validation, and 10,000 examples for testing purposes. The grayscale images that had pixel values between zero and 255 were normalized in the range from zero to one. Samples of the digit images are shown in Figure 5.

Convolutional neural network hyperparameter optimization results and analysis
After testing the algorithms on the unconstrained benchmark functions, all four metaheuristics were applied to the CNN hyperparameter optimization problem. The performance of the method could not be assessed by the result of a single test, due to the random character of EE-TGA and E 3 -FA. More than one run with independent random population initialization should be made in order to estimate the performance of the introduced approaches. Accordingly, in this study, the results were collected in 10 independent experiments, where for each run, different pseudo-random seeds were employed. Within the each run, the best CNN structure was generated in 20 iterations. The best reported result was taken as the network configuration (solution) that obtained the lowest classification test error from all 10 runs.
The algorithms' control parameters, as well as the CNN hyperparameter range of values that were taken in the simulation are presented in detail in Subsection 4.2.
As already stated, in all conducted experiments, the objective function that was subject to the process of optimization was the classification test error rate.
The results of both proposed metaheuristics (EE-TGA and E 3 FA) were compared to the peer competitors that were tested under the same simulation conditions [17], the original implementations of the TGA [71] and FA [72] metaheuristics and with the same state-of-the-art network configurations that were manually designed.
In [17], the authors proposed a CNN design framework based on the evolutionary algorithm: evolutionary-based hyper-parameter optimization for independent CNNs (IEA-CNN). Furthermore, the authors developed an extension of their framework towards a joint optimization of a committee of CNNs with the goal of leveraging specialization and cooperation between individual networks (joint optimization for a committee of multiple CNNs (CEA-CNN)). A committee represents a set of K trained CNNs where the classification is performed by merging the individual CNN scores. In the proposed approach [17], the average of the decision values over K CNNs was employed.
In all experiments, the committee size was set to 34 (K = 34). Moreover, the authors conducted simulations with CEA-CNN by varying penalization k and found that the best CNN configurations were generated when k was set to two. The results with k set to 1, 2, and 3 were presented in [17].
According to the results presented in [17], CEA-CNN produced larger filter sizes than both our proposed approaches, which extended from 3 × 3 -4 × 4 to 8 × 8 -7 × 7. On the other hand, some of the best generated solutions by EE-TGA-CNN and E 3 FA consisted of three convolutional layers and one fully-connected layer with the filter sizes 2 × 2, 3 × 3, and 4 × 4, while the number of filters per convolutional layer ranged from 42 to 128, and the number of hidden units in the fully-connected layers were between 50 and 128. The network structures were significantly less complex than the architectures in [17].
We performed careful analysis of the empirical results generated from the conducted tests and concluded that the kernel size of the convolutional layers had a very significant influence on the network performance compared to the number of filters (kernels). The details of the five best performance network architectures (solutions) from the total of 80 generated solutions (eight solutions in the population · 10 runs) are summarized in Table 9. Some of generated network architectures by TGA-EE-CNN and E 3 FA-CNN are given in Figure 6.
Since the study presented in this paper was an extension of our previous study where we tested the basic TGA [71] and FA [72] metaheuristics for the same problem and under the same experimental conditions, we also included in the comparative analysis these two methods.
Within the comparative analysis, we also encompassed manually designed state-of-the-art networks, namely LeNet-5 [6], deeply supervised net [83], shallow CNN [84], recurrent CNN [85], and gated pooling CNN [86]. The EE-TGA-CNN approach resulted in the best accuracy and lowest test error rate, that is 0.19%; E 3 -FA-CNN resulted in the classification error rate of 0.21%; TGA-CNN 0.23% [71]; and the fourth approach, FA-CNN, was 0.23% [72]. IEA-CNN [17] obtained the best classification error rate of 0.34%, while CEA-CNN [17] generated the best architecture with a classification error rate of 0.24% with penalization k set to two. With k set to one and three, CEA-CNN managed to establish classification error rates of 0.26% and 0.28%, respectively [17].
On the other side, some manually designed CNN configurations established a much higher classification error rate. The highest error rate of 0.95% was produced by LeNet-5 [6], while the deeply supervised net [83] and shallow CNN [84] obtained error rates of 0.39% and 0.37%, respectively. The recurrent CNN, which managed to establish an error rate of 0.31%, and the gated pooling CNN with an error rate of 0.34% generated better results than the IEA-CNN approach.
In comparison to the peer competitors and other architectures that were manually designed, besides the high classification accuracy, the proposed enhanced metaheuristic techniques (TGA-CNN and E 3 FA-CNN) had fewer parameters and were computationally less expensive.
The results of the proposed methods with other comparable techniques that utilized metaheuristics algorithms and manually designed state-of-the-art architectures are presented in Table 10. We note once again that the best network configuration (solution) from all 10 independent runs is reported. Table 10. Comparative analysis with other metaheuristics. IEA-CNN, evolutionary-based hyper-parameter optimization for independent CNNs ; CEA-CNN, joint optimization for a committee of multiple CNNs.

Method
Test Error Rate (%) LeNet-5 [6] 0.95 Deeply Supervised Net [83] 0.39 Shallow CNN [84] 0.37 Recurrent CNN [85] 0.31 Gated Pooling CNN [86] 0.29 IEA-CNN [17] 0.34 CEA-CNN, k=1 [17] 0.26 CEA-CNN, k=2 [17] 0.24 CEA-CNN, k=3 [17] 0.28 FA-CNN [72] 0.23 TGA-CNN [71] 0.23 A visual representation of the comparative analysis is depicted in Figure 7. For the sake of better visibility, CEA-CNN results with k set to one and three are excluded from the graph. Since we implemented two methods in our current study and two in previous studies [71,72] and with the goal of establishing the visual representation of the quality of the proposed approaches, the convergence speed graph of the accuracy for TGA-CNN, FA-CNN, EE-TGA-CNN, and E 3 -FA-CNN is given in Figure 8. The results presented in the figure were taken from one of the best runs. From the presented figure, it was clear that EE-TGA-CNN and E 3 -FA-CNN established better convergence than the original TGA and FA algorithms. Moreover, the enhanced metaheuristics managed to converge to the best solutions after only 15 iterations.
In order to further access the solution quality of the proposed approaches, we also compared the EE-TGA-CNN and E 3 FA − CNN methods with another swarm intelligence-based CNN optimizer presented in [18]. However, the methods presented in [18] were tested under different experimental conditions, but for the same MNIST dataset. For this reason, comparative analysis with these approaches was not entirely realistic.
In [18], as one of the main drawbacks of CNN models, the over-fitting problem was addressed. In order to overcome this issue, the authors proposed four swarm intelligence methods (FA, BA, CS, and PSO) for proper selection of the regularization parameter dropout and reported the average accuracy over 20 independent runs.
Our first approach, EE-TGA-CNN, established an average accuracy over 10 independent runs of 99.18%, while E 3 FA-CNN for the same metric obtained a result of 99.16%. At the other hand, the basic FA and TGA metaheuristics that we implemented obtained an average accuracy of 99.13% and 99.14%, respectively. We note that the results reported for the FA in [18] were different from our paper since different experimental conditions were used.
The results of the comparative analysis between EE-TGA-CNN and E 3 FA-CNN and the approaches presented [18] are summarized in Table 11, while the visual representation is given in Figure 9. Table 11. Comparative analysis with other swarm algorithms. CS, cuckoo search.
Finally, in the experimental section of this paper, we wanted to give some information regarding the computational burden (costs) of the proposed approaches. The computational burden can be calculated in many ways, for example by determining the time and space complexity, computation time, number of fitness function evaluations, etc. The time complexity uses the number of elementary operations executed by the method, while the space complexity takes into account the memory required by an algorithm to run as a function of the input size. Computational time is a simple measurement in time units taken by the algorithm to perform, and the drawback of this approach is that it highly depends on the overall speed and performance of the computer platform.
However, in the case of metaheuristics, it is more suitable to use the number of fitness function evaluations, which was especially emphasized in the simulations that were presented in this paper. Every time the quality of the solution from the population is estimated (fitness calculation), the evaluation of a particular CNN is performed, which is a very time and resource consuming process. For this reason, inspired by [18], for calculating the computation burden of the proposed approaches, we used the metric of the number of evaluation function (fitness function) calls.
In both proposed approaches (EE-TGA-CNN and E 3 FA-CNN), as well as in the TGA-CNN and FA-CNN methods, the total number of function evaluations depended on the population size N and the maximum number of iterations (t max ) in one run. In the initialization phase, N fitness function evaluations were performed, and then, in each iteration, N more evaluations were executed (that is a total in one run N · t max , one for each solution from the population), and finally, at the end of a run, one final evaluation was executed. For calculating the computational burden, we used the following equation, as in [18]: Since in our simulations, we used values N = 8 and t max = 20 in each run of both proposed methods, as well as in TGA-CNN and FA-CNN, 169 fitness function calls were used (8 + 8 · 20 + 1). However, according to the simulation results, we concluded that EE-TGA-CNN and E 3 FA-CNN managed to establish the best CNN accuracy after only 15 iterations (please refer to Figure 8); in this case, the computational burden was 129 evaluation function calls (8 + 8 · 15 + 1).
The methods proposed in this paper had a higher computational burden than the state-of-the-art algorithms presented in [18] and significantly lower computational costs than the evolutionary approaches shown in [17].
As shown above, the proposed metaheuristics were tested only for an MNIST dataset. Moreover, the simulations with different control parameter adjustments could also be performed to further establish the validity and efficiency of the proposed approaches and potentially to obtain better results. This represents one of the most significant threats and limitations of the proposed research. Since the simulations took a long time to execute, we could not encompass more datasets in one set of the simulations, and we plan do this in some of our future research works in this domain.

Conclusions
In this paper, we proposed two enhanced versions of TGA and FA swarm intelligence metaheuristics for optimizing the CNN's hyperparameters. The first proposed approach (EE-TGA) enhanced the exploitation ability of the original TGA, while the second proposed metaheuristics (E 3 FA) improved both the exploration and exploitation of the basic FA.
Based on our previous experience with the TGA method, we came to the conclusion that the exploitation ability, as well as the intensification-diversification balance of this algorithm could be improved in order to establish better solution quality and convergence. We addressed this issue by scattering some of the newly generated solutions in each iteration around the current best solution by using a normal distribution. In this way, the convergence speed and exploitation-exploration balance of the original TGA version were improved.
Similarly, founded on our previous experience with the FA metaheuristic, despite the fact that the original FA exhibited relatively stable and efficient exploitation, according to the results of the performed simulations on the standard unconstrained benchmarks, we concluded that the exploitation process of the original FA could be further enhanced and also that the exploitation-exploration balance could be better adjusted. Moreover, the original FA approach in some runs showed behavior of premature convergence because the whole population converged towards the current best solutions. By incorporating two procedures: one that enhanced the exploitation and one that improved exploration, into the original, we addressed both issues.
When a modified/improved/hybridized method is created, its effectiveness should first be validated against a wider set of benchmarks, and the performance should be compared with the original method. This is the only way to establish a real improvement of the enhanced method over the original one. For that reason, we firstly performed simulations of the proposed EE-TGA and E 3 FA metaheuristics on standard unconstrained benchmarks and compared the generated solutions' quality and convergence with the original algorithms. In most cases, the methods proposed in this paper managed to obtain better results than the original implementations.
After conducting tests on standard benchmarks, we adapted and tested the proposed EE-TGA and E 3 FA CNNs' hyperparameter optimization. The EE-TGA-CNN and E 3 FA-CNN frameworks aimed to optimize the following CNN hyperparameters: the number of convolutional layers, the number of filters, filter size, the number of FC layers, and the the number of hidden units in the FC layers. The proposed frameworks searched for the right set of hyperparameters and automatically designed the CNN architecture for a given task. We did not include the type of pooling layer, activation function, regularization technique, and other CNN hyperparameters in the automatic search because of the limitations of available computing power.
The proposed metaheuristics were applied to the well-known MNIST benchmark dataset, and the obtained results were compared to other similar methods that were conducted under the same and similar conditions, as well as to state-of-the-art architectures. The results of both proposed frameworks (EE-TGA-CNN and E 3 FA-CNN) were compared to the peer evolutionary competitors that were tested under the same simulation conditions [17], the original implementations of the TGA [71] and FA [72] metaheuristics and with same of the state-of-the-art network configurations that were manually designed. Finally, we included one more comparative analysis with state-of-the-art swarm intelligence approaches [18], which were tested under slightly different experimental conditions.
The results proved that the proposed method was robust, achieved high classification accuracy, was competitive in terms of the employed computational resources, efficient in real-world classification tasks, and did not require the user to have any prior expertise in the domain.
The EE-TGA and E 3 FA metaheuristics were developed in Java SE 10 (18.3). The framework was developed and implemented by using the deep learning programming library Deeplearning4j (https://deeplearning4j.org/), which was written for Java and Java Virtual Machine (JVM). To improve the execution speed of the performed study, due to the large utilization of computer resources, we adapted the framework for the execution on the graphical processing unit (GPU) using NVIDIA's ™CUDA parallel computing platform and programming model. All experiments were performed on the computer platform with six NVIDIA GTX 1080 GPUs, each having 2560 ™CUDA cores.
The detailed description of the experimental conditions along with the control parameters' values were presented in this paper.
Based on the details of the presented research, the scientific contribution was twofold: we first managed to setup an automatic framework for CNNs' architecture generation that outperformed other well-known approaches and that could be used in practical challenges; secondly, we managed to improve the original FA and TGA implementations.
In our future research from this domain, first we want to test and improve other swarm algorithms and to adapt them for solving the CNN hyperparameter optimization problem. Furthermore, we plan to include other CNN hyperparameters in the optimization problem, such as the dropout regularization technique, learning rate, etc. Moreover, we plan to utilize other datasets as well to further establish and to try to improve the classification accuracy and to minimize the classification error. Some of the datasets that we plan to include in our future research are the following: the USPS, CIFAR-10, Semeion Handwritten Digits, and SEEDdatasets.