Performance of a Novel Chaotic Firefly Algorithm with Enhanced Exploration for Tackling Global Optimization Problems: Application for Dropout Regularization

Abstract: Swarm intelligence techniques have been created to respond to theoretical and practical global optimization problems. This paper puts forward an enhanced version of the firefly algorithm that corrects the acknowledged drawbacks of the original method, by means of an explicit exploration mechanism and a chaotic local search strategy. The resulting augmented approach was theoretically tested on two sets of bound-constrained benchmark functions from the CEC suites and practically validated for automatically selecting the optimal dropout rate for the regularization of deep neural networks. Despite their successful applications in a wide spectrum of different fields, one important problem that deep learning algorithms face is overfitting. The traditional way of preventing overfitting is to apply regularization; the first option in this sense is the choice of an adequate value for the dropout parameter. In order to demonstrate its ability in finding an optimal dropout rate, the boosted version of the firefly algorithm has been validated for the deep learning subfield of convolutional neural networks, with respect to five standard benchmark datasets for image processing: MNIST, Fashion-MNIST, Semeion, USPS, and CIFAR-10. The performance of the proposed approach in both types of experiments was compared with other recent state-of-the-art methods. To prove that there are significant improvements in the results, statistical tests were conducted. Based on the experimental data, it can be concluded that the proposed algorithm clearly outperforms the other approaches.


Introduction
Swarm intelligence is popular in the field of optimization. However, as the "no free lunch" theorem implies, no single algorithm performs best on all problems. Hence, many techniques inspired by the behaviors of living organisms have been developed and applied for theoretical and practical tasks, including function optimization, parameter and method calibration, and efficiency improvement in industrial scenarios.
The current paper introduces a modified version of the firefly algorithm (FA) and verifies its boosted abilities on global optimization tasks. The FA [1] is a well-known SI algorithm that has shown great promise in the field of optimization based on metaheuristics.
The proposed method is theoretically tested on two bound-constrained benchmark sets: (i) a set of chosen functions from the CEC test suites, with 10, 30, and 100 dimensions; and (ii) the challenging CEC2017 benchmark suite. The main contributions of this paper can be summarized as follows:
• A novel modified FA algorithm was implemented by specifically targeting the known flaws of the basic implementation of the FA approach;
• The devised algorithm was then utilized to help establish the proper dropout value and enhance CNN accuracy;
• Other well-known swarm intelligence metaheuristics were further investigated on the CNN dropout regularization challenge.
The rest of the paper is organized in the following manner. Section 2 describes the fundamental technologies used (swarm intelligence and CNN). Section 3 introduces the modified version of the algorithm, as well as the original one. Section 4 provides the results of the experiments. Section 5 deals with the optimization of the dropout parameter, and the final observations are given in Section 6.

Preliminaries and Related Works
Improving an existing solution by modifying an algorithm, i.e., combining it with another metaheuristic approach, yields good results in this field. Metaheuristic solutions are stochastic and typically inspired by certain processes in nature. These processes often come from group animal behaviors, in which animals work towards a common goal that is unachievable by working alone. This type of behavior exhibits group intelligence. The intellectual potential of a single unit of a species is not very high; on the contrary, in large groups, even simple organisms perform complex tasks successfully. The solutions inspired by these kinds of animals are metaheuristics and belong to the field of swarm intelligence, which has proven successful in solving NP-hard problems. This has been exploited in algorithm hybridization for improving machine learning algorithms; this type of combination is referred to as learnheuristics.
In this work, dropout regularization improvement was achieved by the previously mentioned methods. Swarm intelligence is a metaheuristic field that adapts the behavior of animals that move in swarms to algorithms used in the field of artificial intelligence [9,10]. The field of SI has wide application because it is efficient in solving NP-hard problems. SI methods have been frequently used to address different optimization tasks, both theoretical [11] and from various practical fields, including wireless sensor networks (WSNs) [12][13][14][15], task scheduling in the cloud, and edge computing [16,17]. Recently, one of the most important fields of interest has been the hybridization of SI and machine learning. The number of publications in this domain has increased drastically in recent years; some of the most prominent works include hyperparameter optimization [3,18,19], feature selection problems [2], time series prediction tasks, e.g., estimation of COVID-19 cases [6,20], and neural network training [21,22].
Hybridization of these algorithms yields the most benefits. With this approach, it is possible to significantly improve convergence times. SI algorithms apply a stochastic approach in the search for global optima, making them heavily reliant on the number of iterations. This process is divided into two phases, exploration and exploitation. In exploration, the focus is on globally sampling the search space, while exploitation refines promising local regions. These phases must be balanced, somewhat analogously to balancing the training and testing phases in machine learning. The SI goal is not necessarily to achieve the best possible solution, but rather to quickly provide a good sub-optimal one. The search for the best solution can be greatly enhanced by adding evolutionary principles to the algorithm. Evolutionary algorithms implement a mechanism that transfers knowledge from the previous population to the next one. This is achieved through mutation, crossover, and selection. Mutation keeps a unit from the previous generation, but modifies the value it carries; crossover combines two individuals; and selection retains the best units. This is a different approach compared to the random generation of a new population. Evolution-based swarms tend to provide faster convergence compared to classic population-based swarm algorithms, but can more easily get stuck in local optima.
The SI method proposed in this paper is an augmentation of the FA algorithm. The improved version was tested on several theoretical benchmark functions before being applied to dropout optimization in CNNs.
Humans are highly visual creatures and rely heavily on this sense, which shapes the kinds of input that can be used for machine learning. While the field has yielded tremendous results in big data and prediction-based insights, most of the ideas that employ AI require visual input. For the majority of adopters, who are non-professional individuals, the only contact with AI is through software that manipulates visual input. While these tasks are trivial, in regard to changing one's appearance, the true importance of this adoption lies in the previously mentioned nature of our species. Humans are not computational beings; the human species does not process absorbed information by labeling, tagging, and placing it into tables. This creates a limitation for accurately representing the obtained information in computational form. It is inefficient and too complex a process for an individual to translate the information obtained from a photograph into words (in a way that a program can process). As a result, CNNs have been widely applied because they excel in these types of tasks, including speech recognition, natural language processing, and computer vision. These models are modeled after the human nervous system [23][24][25]. The most recent applications include facial recognition [26][27][28][29], document analysis [30][31][32], image classification tasks in medicine as support for diagnostic processing and faster illness detection [33][34][35], analysis of climate change and extreme weather prediction [36,37], and many others. The CNN architecture is inspired by the animal visual cortex, which is built from layers that receive, segment, and integrate visual input. The output of each layer is the input for the next layer. During this process, the data get cleaner as they get deeper.
This means that data are simplified, making it easier to process further, while retaining all of the important features. An example of this behavior is edge forming on the first layer, the set of edges and corners on the second layer, sets of corners and contours and parts of objects on the third layer and, finally, the full object on the last layer. The convolution layer, pooling layer, and the fully connected layer, in that order, represent the anatomy of a CNN.
Firstly, the convolution layers apply the corresponding operations, which filter the data. It is important to emphasize that the filters are always smaller in size than the input. Widely used sizes are 3 × 3, 5 × 5, and 7 × 7. The convolution operation of the input vector is:

z_{i,j,k} = w_k^T x_{i,j} + b_k, (1)

where z_{i,j,k} denotes the output feature value of the k-th feature map at location (i, j), x_{i,j} is the input patch centered at location (i, j), w_k represents the k-th filter, and b_k is the bias.
The activation operation is:

a_{i,j,k} = g(z_{i,j,k}), (2)

where g(·) denotes a non-linear function applied to the convolution output. There are two types of pooling layers: global and local. The most widely used methods are max and average pooling.
The resolution is reduced through the pooling function:

y_{i,j,k} = pool(a_{m,n,k}), ∀(m, n) ∈ R_{i,j}, (3)

where R_{i,j} is the local neighborhood around location (i, j). Classification is performed by the fully connected layers. The softmax layer performs multi-class classification; in the case of binary classification, the logistic layer is used.
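To make the three layer operations concrete, the following is a minimal NumPy sketch of the convolution, activation, and pooling steps described above. It assumes a single input channel, one filter, 'valid' padding, ReLU as the non-linear function g(·), and non-overlapping max pooling; these choices are illustrative, not prescribed by the paper.

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution of one feature map: each output value is the
    filter response w^T * patch + b at that location."""
    h, v = w.shape
    out = np.empty((x.shape[0] - h + 1, x.shape[1] - v + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + h, j:j + v]) + b
    return out

def relu(z):
    """Element-wise non-linear activation g(.)."""
    return np.maximum(z, 0.0)

def max_pool(a, s=2):
    """Local max pooling with an s x s window and stride s: resolution
    is reduced by keeping only the strongest response per neighborhood."""
    h, v = a.shape[0] // s, a.shape[1] // s
    return a[:h * s, :v * s].reshape(h, s, v, s).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 single-channel input
z = conv2d(x, np.ones((3, 3)), b=0.0)         # 2x2 feature map
p = max_pool(relu(z), s=2)                    # 1x1 after pooling
```

With a 3 × 3 filter on a 4 × 4 input, the feature map shrinks to 2 × 2, and pooling halves each spatial dimension again, illustrating how the data are simplified while salient features are retained.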
As stated in Section 1, several techniques are used to avoid the overfitting issue; one of them is dropout regularization. This research focuses on optimizing the dropout probability (dp); hence, the dropout regularization is explained in the following paragraphs.
In light of the proposed CNN model, the dropout technique can be considered a new CNN layer. With this in mind, r denotes the activation or dropout of the M nodes in the observed layer. Every variable r_j is assigned the value 1 with probability 1 − p, independently. If the observed r_j has the value 1, then that unit remains in the network; if not, that particular unit is removed from the network along with all of its connections.
The draw for each unit is independent of the other cells in the network and follows the Bernoulli distribution, described by Equation (4):

r_j^{(L)} ~ Bernoulli(1 − p). (4)
With this in mind, it is possible to denote the outputs vector of a layer L with y^{(L)} during network training. After applying the dropout, the new outputs vector ỹ^{(L)} can be defined by Equation (5):

ỹ^{(L)} = r^{(L)} ∗ y^{(L)}, (5)

where ∗ denotes the element-wise product. At the end, during network testing, the weight matrix W is required to be scaled by the ratio p for averaging all 2^M possible networks that have dropped out. This step summarizes the main contribution of the regularization method, because only a single network needs to be tested, as shown in Equation (6):

W_test^{(L)} = p · W^{(L)}, (6)
where W^{(L)} denotes the weight matrix at layer L.
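The dropout mechanism just described can be sketched in a few lines of NumPy. The sketch follows the paper's convention, where p is the dropout probability (so each unit survives with probability 1 − p) and the weight matrix is scaled by p at test time; the seed and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_train(y, p):
    """Training pass: each unit is kept (r_j = 1) with probability 1 - p,
    independently; dropped units are zeroed out."""
    r = (rng.random(y.shape) >= p).astype(y.dtype)  # Bernoulli mask of kept units
    return r * y

def dropout_test_weights(W, p):
    """Testing pass: scale the weight matrix to average over the 2^M
    possible thinned networks, per the paper's convention."""
    return p * W
```

In a framework implementation, the mask would be resampled for every mini-batch, so a different thinned network is trained at each step.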

Proposed Method
This section first introduces the basic implementation of the FA metaheuristics, followed by a discussion of the known and observed flaws and drawbacks of the original version. Finally, a detailed description of the proposed modified method, devised specifically to overcome these flaws of the original algorithm, is provided.

The Original Firefly Algorithm
The FA metaheuristics, introduced by Yang [1], is motivated by the flashing and social characteristics of fireflies. Since the real-world natural system is relatively complex and sophisticated, the FA models it by using several approximation rules [1].
Brightness and attractiveness of fireflies are used for modeling fitness functions; attractiveness, in most typical FA implementations, depends on brightness, which is in turn determined by the objective function value. In the case of minimization problems, it is formulated as [1]:

I(x) ∝ 1 / f(x), (7)

where I(x) represents light intensity and f(x) denotes the value of the objective function at location x. Light intensity, and hence the attractiveness of an individual, decreases as the distance from the light source increases [1]:

I(r) = I_0 / r^2, (8)

where I(r) represents light intensity at distance r, while I_0 stands for the light intensity at the source. Furthermore, for modeling real natural systems, where the light is partially absorbed by its surroundings, the FA makes use of the γ parameter, which represents the light absorption coefficient. In most FA versions, the combined effect of the inverse square law for distance and the γ coefficient is approximated with the following Gaussian form [1]:

I(r) = I_0 · e^{−γ r^2}. (9)

Moreover, each firefly individual utilizes attractiveness β, which is directly proportional to the light intensity of a given firefly and also depends on the distance, as shown in Equation (10):

β(r) = β_0 · e^{−γ r^2}, (10)
where parameter β_0 designates attractiveness at distance r = 0. It should be noted that, in practice, Equation (10) is often replaced by Equation (11) [1]:

β(r) = β_0 / (1 + γ r^2). (11)

Based on the above, the basic FA search equation for a random individual i, which moves in iteration t + 1 to a new location x_i towards an individual j with greater fitness, is given as [1]:

x_i^{t+1} = x_i^t + β_0 e^{−γ r_{i,j}^2} (x_j^t − x_i^t) + α^t κ_i^t, (12)

where α stands for the randomization parameter, κ is a random number drawn from a Gaussian or uniform distribution, and r_{i,j} represents the distance between the two observed fireflies i and j. Typical values that establish satisfying results for most problems are β_0 = 1 and α ∈ [0, 1]. The r_{i,j} is the Cartesian distance, which is calculated by using Equation (13):

r_{i,j} = ||x_i − x_j|| = sqrt( Σ_{k=1}^{D} (x_{i,k} − x_{j,k})^2 ), (13)
where D marks the number of problem-specific parameters.
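A single firefly move, as described by the search equation and the Cartesian distance above, can be sketched as follows. The Gaussian attractiveness form and a uniform randomization term are assumed; the seed and parameter values are illustrative defaults.

```python
import numpy as np

rng = np.random.default_rng(7)

def fa_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.5):
    """Move firefly i one step towards a brighter firefly j:
    attraction decays with the squared Cartesian distance, and a
    scaled random perturbation keeps the search stochastic."""
    r = np.linalg.norm(x_i - x_j)                   # Cartesian distance
    beta = beta0 * np.exp(-gamma * r ** 2)          # Gaussian attractiveness
    kappa = rng.uniform(-0.5, 0.5, size=x_i.shape)  # randomization term
    return x_i + beta * (x_j - x_i) + alpha * kappa
```

With γ = 0 and α = 0, the attractiveness is exactly β_0 and the move lands directly on the brighter firefly, which is a quick sanity check on the implementation.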

Motivation for Improvements
Notwithstanding the outstanding performance of the original FA on many benchmarks [38] and practical challenges [39], findings of previous studies suggest that the basic FA shows some deficiencies in terms of insufficient exploration and an inadequate intensification-diversification balance [40][41][42]. The lack of diversification is particularly emphasized in early iterations, when, in some runs, the algorithm is not able to converge to optimal search space regions, and ultimately worse mean values are obtained. In such scenarios, the basic FA search procedure (Equation (12)), which primarily conducts exploitation, is not able to guide the search towards optimum domains. Conversely, when, in the initialization phase, random solutions are generated by chance in the optimal or near-optimal regions, the FA manages to obtain satisfying results.
Further, by analyzing the fundamental FA search equation (Equation (12)), it can be observed that it does not encompass an explicit exploration procedure. To address this issue, some FA implementations utilize a dynamic randomization parameter α, where this parameter is gradually decreased from its initial value α_0 towards the predefined threshold α_min, as shown in Equation (14). In this way, at the beginning of a run, exploration is more emphasized, while in later iterations, the balance between intensification and diversification moves towards exploitation [43]. However, based on extensive empirical simulations, it was deduced that the application of dynamic α alone is not enough to enhance the FA exploration abilities, and this mechanism only slightly alleviates the issue.
α^{t+1} = α^t · (1 − t/T), (14)

where t and t + 1 denote the current and next iteration, respectively, while T is the maximum iteration number in one run of the algorithm; α is decreased from α_0 until the α_min threshold is reached. It is also worth noting that previous studies show that the FA exploitation abilities are efficient in tackling various kinds of tasks, and the FA is known as a metaheuristic with robust exploitation capabilities [40][41][42].
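The dynamic randomization schedule can be sketched in a few lines. A multiplicative decay of the form α^{t+1} = α^t(1 − t/T), bounded below by α_min, is assumed here, as one common realization of the decrease from α_0 towards α_min described above.

```python
def dynamic_alpha(alpha, t, T, alpha_min=0.1):
    """One step of the dynamic randomization-parameter schedule:
    alpha shrinks each iteration but never drops below alpha_min."""
    return max(alpha_min, alpha * (1.0 - t / T))

# alpha decays from alpha_0 = 1.0 towards alpha_min over a short run
alpha = 1.0
for t in range(1, 11):
    alpha = dynamic_alpha(alpha, t, T=10)
```

Early iterations therefore use a large random step (favoring exploration), while later iterations settle near α_min (favoring exploitation).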

Novel FA Metaheuristics
A novel FA approach proposed in this study addresses the issues of the basic FA by assimilating the following procedures:
• an explicit exploration mechanism based on the solutions' exhaustiveness;
• a gBest chaotic local search (CLS) strategy.
Notwithstanding the outstanding exploitation capabilities of the original FA, the CLS mechanism further improves intensification, as shown in the empirical section of this manuscript.
Motivated by the proposed enhancements, the novel FA is named chaotic FA with enhanced exploration (CFAEE).

Explicit Exploration Mechanism
The goal of the explicit exploration procedure is to assure that the algorithm converges to the optimum part of the search space in early iterations, while in late phases of execution, it facilitates exploration around the parameter boundaries of the current best individual x*. To incorporate this behavior, each solution is modeled with an additional attribute, trial, which is incremented every time the solution cannot be improved by the basic FA search (Equation (12)). When the trial parameter of a particular solution reaches a predetermined limit value, the individual is replaced with a random solution drawn from within the boundaries of the search space, by utilizing the same procedure as in the initialization phase:

x_{i,j} = l_j + rand · (u_j − l_j), (15)

where x_{i,j} represents the j-th component of the i-th individual, u_j and l_j denote the upper and lower search boundaries of the j-th parameter, and rand is a uniformly distributed random number from the interval [0, 1]. A solution whose trial exceeds the limit is said to become exhausted. This idea, as well as the terminology, was adapted from the well-known ABC metaheuristics [44], which is known to have an efficient exploration mechanism [45].
Replacement of an exhausted solution with a pseudo-random individual stirs up search performance in early iterations, when the algorithm has not yet identified the proper parts of the search region. However, in later iterations, following the reasonable assumption that the optimal region has been found, this kind of replacement wastes function evaluations. For that reason, in later iterations, the random replacement procedure is changed to a guided replacement mechanism around the lower and upper parameter values of all solutions in the population:

x_{i,j} = Pl_j + rand · (Pu_j − Pl_j), (16)

where Pl_j and Pu_j represent the lowest and highest values of the j-th component over the entire population P.
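Both replacement variants can be sketched directly from the two update rules above: random replacement samples inside the global bounds, while guided replacement samples inside the bounds spanned by the current population. The seed is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_replacement(l, u):
    """Re-initialize an exhausted solution uniformly inside the
    global search boundaries, as in the initialization phase."""
    return l + rng.random(l.shape) * (u - l)

def guided_replacement(P):
    """Re-initialize inside the per-component lowest/highest values
    of the current population P (one row per solution)."""
    Pl, Pu = P.min(axis=0), P.max(axis=0)
    return Pl + rng.random(Pl.shape) * (Pu - Pl)
```

In early iterations, random replacement restores diversity; later, guided replacement keeps new individuals inside the region the swarm has already converged to, so no function evaluations are wasted far from the optimum.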

The gBest CLS Strategy
Chaos is a seemingly random phenomenon that exists in non-linear, deterministic systems and is highly sensitive to its initial conditions [46]. From a mathematical perspective, chaotic search is more efficient than ergodic search [47], because a vast number of sequences can be generated by only tweaking the initial values.
Although many chaotic maps exist in the modern literature, after conducting empirical experiments, it was concluded that, in the case of the proposed novel FA, the logistic map obtains the most promising results. It should be noted that the logistic map has been utilized in many swarm intelligence approaches [48][49][50].
The logistic map that the proposed method utilizes executes in K steps and is defined as:

σ_{i,j}^{k+1} = μ · σ_{i,j}^{k} · (1 − σ_{i,j}^{k}), (17)

where σ_{i,j}^{k} and σ_{i,j}^{k+1} represent the chaotic variable for the j-th component of the i-th solution in steps k and k + 1, respectively, and μ is the control variable. The initial value satisfies σ_{i,j} ∈ (0, 1) with σ_{i,j} ∉ {0.25, 0.5, 0.75}, and μ is set to 4, since this value was previously determined empirically [50].
The proposed method incorporates the global best (gBest) CLS strategy, because the chaotic search is performed around the x* solution. In each step k, a new candidate, denoted as x*', is generated with Equations (18) and (19), which are applied to each component j of x*:

x_j^σ = l_j + σ_j^k · (u_j − l_j), (18)

x_j^{*'} = (1 − λ) · x_j^* + λ · x_j^σ, (19)

where σ_j^k is determined by Equation (17) and λ is the dynamic shrinkage parameter that depends on the current fitness function evaluation (FFE) and on the maximum number of fitness function evaluations (maxFFE) in the run:

λ = (maxFFE − FFE) / maxFFE. (20)

By using dynamic λ, a better exploitation-exploration equilibrium around x* is established. In earlier phases of the execution, a wider search radius around x* is explored, while in later phases, fine-tuned exploitation is performed. The FFE and maxFFE can be replaced with t and T when the maximum number of iterations is taken as the termination condition.
In this way, by using the CLS strategy, an attempt is made to improve x* in K steps; if x*' obtains better fitness than x*, the CLS procedure is terminated and x* is replaced with x*'. However, if x* could not be improved in K steps, it is retained in the population.
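The gBest CLS loop just described can be sketched as follows, assuming the chaotic variable is mapped linearly into the search bounds before being blended with x* via the shrinkage parameter λ; the seed and the test function below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic_map(sigma, mu=4.0):
    """One logistic-map step of the chaotic variable."""
    return mu * sigma * (1.0 - sigma)

def gbest_cls(x_best, f, l, u, ffe, max_ffe, K=4):
    """Try to improve the global best x_best by chaotic search around it.
    lam shrinks as the run progresses: wide radius early, fine-tuning late.
    Returns the improved candidate, or x_best unchanged after K failed steps."""
    lam = (max_ffe - ffe) / max_ffe           # dynamic shrinkage parameter
    sigma = rng.random(x_best.shape)          # chaotic variables in (0, 1)
    best_fit = f(x_best)
    for _ in range(K):
        sigma = logistic_map(sigma)
        chaotic = l + sigma * (u - l)                  # map sigma into bounds
        cand = (1.0 - lam) * x_best + lam * chaotic    # candidate x*'
        if f(cand) < best_fit:                         # improvement: terminate
            return cand
    return x_best                                      # no improvement in K steps
```

Because the candidate replaces x* only when it strictly improves the fitness, the procedure can never worsen the global best, matching the retain-the-better rule above.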

Chaotic FA with Enhanced Exploration Pseudo-Code
In order to efficiently incorporate the exploration mechanism and the gBest CLS strategy into the original FA, a few things should be considered. First, as already suggested in Section 3.3.1, in the early phases of the execution, the random replacement mechanism should be conducted, while in later phases, the guided one generates better results. Second, in early iterations, the gBest CLS strategy would not generate significant improvements, because x* has likely not yet converged to the optimum region, and it would just waste FFEs.
To control the above-mentioned behavior, an additional control parameter ψ is included in the following way: if t < ψ, the exhausted solutions in the population are replaced with random ones (Equation (15)) and the gBest CLS is not executed; if t ≥ ψ, the guided replacement mechanism is executed (Equation (16)) and the gBest CLS is triggered.
Moreover, to fine-tune the basic FA search, the proposed method utilizes dynamic α, according to Equation (14).
Taking all of the above into account, the pseudo-code of the proposed CFAEE is summarized in Algorithm 1.

Algorithm 1 The CFAEE pseudo-code
Initialize main metaheuristics control parameters N and T
Initialize search space parameters D, u_j and l_j
Initialize CFAEE control parameters γ, β_0, α_0, α_min, K and ψ
Generate initial random population P_init = {x_{i,j}}, i = 1, 2, ..., N; j = 1, 2, ..., D using Equation (15)
while t < T do
    for i = 1 to N do
        for z = 1 to i do
            if I_z < I_i then
                Move solution z in the direction of individual i in D dimensions (Equation (12))
                Attractiveness changes with distance r as exp[−γr²] (Equation (10))
                Evaluate the new solution, replace the worse individual with the better one, and update the light intensity (fitness)
            end if
        end for
    end for
    if t < ψ then
        Replace all solutions for which trial = limit with random ones using Equation (15)
    else
        Replace all solutions for which trial = limit by guided replacement using Equation (16)
        for k = 1 to K do
            Perform gBest CLS around x* using Equations (17)-(19) and generate x*'
            Retain the better solution between x* and x*'
        end for
    end if
    Update α and λ according to Equations (14) and (20), respectively
end while
Return the best individual x* from the population
Post-process results and perform visualization

The CFAEE Complexity and Drawbacks
The number of FFEs can be taken as a metric to determine the complexity of a swarm intelligence algorithm, because the most computationally expensive part is the objective evaluation [38]. The basic FA evaluates the objective function in the initialization and in the solution updating phases. While updating solutions, according to Equation (12), the FA employs one main loop over T iterations and two inner loops going through N solutions [38].
Thus, including the initialization phase, the worst-case complexity of the basic FA metaheuristics is O(N) + O(N² · T). However, if N is relatively large, it is possible to use one inner loop by ranking the attractiveness or brightness of all fireflies using sorting algorithms; in this case, the complexity is O(N) + O(N · T · log(N)) [38].
The complexity of the proposed CFAEE is higher than that of the original FA, due to the application of the explicit exploration mechanism and the gBest CLS strategy. In the worst-case scenario, if limit = 0, all N solutions are replaced in every iteration, and the gBest CLS strategy is triggered throughout the whole run if ψ = 0. Assuming that the value of K is set to 4, the worst-case CFAEE complexity is then given as O(N) + O(T · (N² + N + 4)). However, in practice, the complexity is much lower, because of the limit and ψ control parameter adjustments.
Drawbacks of the proposed CFAEE over the original version involve the utilization of two additional control parameters, limit and ψ. However, by conducting empirical simulations, the values of these parameters can be determined relatively easily. Moreover, the employment of these two parameters is justified, because the CFAEE exhibits substantial performance improvements over the original FA, both for benchmark challenges and for the dropout regularization challenge from the machine learning domain, as shown in Sections 4 and 5.

Bound-Constrained Benchmark Simulations
The proposed novel FA was first rigorously tested on a set of standard bound-constrained benchmarks that encompass functions from the well-known Congress on Evolutionary Computation (CEC) benchmark suite and other notable instances. The first benchmark set consists of 18 carefully chosen complex uni-modal, multi-modal, and two-dimensional functions, with the goal of determining the convergence speed and exploration ability of the proposed method. Comparative analysis was performed with other state-of-the-art FA versions. The purpose of the second benchmark set, which includes challenging CEC2017 unconstrained functions, is to measure the robustness and efficiency of the proposed CFAEE against other state-of-the-art swarm intelligence metaheuristics.

Experimental Setup
Due to the stochastic nature of metaheuristics, the only practical way to determine proper control parameter values is by performing a "trial and error" approach on a wider set of theoretical problems, such as the extensively utilized bound-constrained benchmarks. Afterwards, the results over a set of independent runs are averaged, and the control parameters that obtain the best mean performance are utilized in further experiments. This is the usual practice for establishing proper control parameter values for novel and improved implementations of existing metaheuristic approaches [1,[51][52][53].
Following the above-mentioned firmly established practice, the optimal (or near-optimal) CFAEE control parameter setup was determined by conducting extensive simulations on classical unconstrained benchmarks. The goal was to find control parameter values that would, on average, over all test instances, accomplish satisfying results.
The CFAEE control parameter values utilized in both bound-constrained simulations are shown in Table 1. Since the CFAEE may utilize a different number of FFEs in each run, the maxFFE is used as the termination criterion instead of T. Expressions for calculating the values of the limit and ψ parameters were also determined empirically. Both bound-constrained experiments were executed in 50 independent runs, and all methods included in the comparative analysis were implemented for the purpose of this research. All algorithms were implemented in Python by using the core (built-in) modules, as well as specific data science and machine learning Python libraries: NumPy, SciPy, pandas, scikit-learn, Matplotlib (pyplot), and seaborn.
All experiments were conducted on a computer platform with an Intel® Core™ i7-8700K CPU and 32 GB of RAM, running the Windows 10 x64 operating system.

Benchmark Problem Set 1
The goal of the first bound-constrained simulation was to validate the convergence speed and exploration ability of the proposed method against other state-of-the-art FA approaches. The same 'opponent' algorithms and the same test beds as in [54] are included in the analysis; function details are given in Table 2. The state-of-the-art FA versions included in the comparative analysis are the following: the dynamic adaptive weight firefly algorithm (WFA) [55], the chaotic FA based on the logistic map (CLFA) [56], the Levy flights FA (LFA) [57], the variable step size firefly algorithm for numerical optimization (VSSFA) [58], and the dynamically adaptive firefly algorithm with global orientation (GDAFA) [54].
In [54], all of the above-mentioned FA approaches were tested with N = 20 and T = 1000 per run, which, in the worst case, yields a total of 400,040 FFEs (please refer to Section 3.3.4). However, it was empirically determined that not all N · N evaluations were executed in each iteration, and the best approximation would be FFE/2.5, which is around 160,000. Thus, in this research, the experiments provided in [54] were recreated with FFE = 160,000 for all methods, in order to establish a fair comparative analysis, because the proposed CFAEE utilizes more FFEs in each iteration than the other opponent methods. The basic control parameter setups for all FA versions are the same, as shown in Table 1; for their other specific parameters, please refer to [54].
It should be noted that, for all methods except the basic FA, results similar to those in [54] were obtained. In the conducted experiments, the basic FA with the dynamic parameter α (Equation (14)) was used, and much better results than those reported in [54] were obtained; the authors in [54] implemented a static FA approach, against which the other improved FA methods established better performance.
All simulations were conducted with 10, 30, and 100 dimensions (D = [10, 30, 100]) for benchmark function instances f1 to f15, and the comparative analysis results are summarized in Tables 3-5, respectively. Comparative analysis results for the two-dimensional functions (f16 to f18) are provided in Table 6. In all simulations, the best, worst, and mean values averaged over 50 runs are reported. The results in bold and slightly larger font denote the algorithm that showed the best result for that performance metric.
The overall conclusion from all presented results is that the two best methods are the proposed CFAEE and GDAFA. Benchmark instances with D = 10 are relatively easy to optimize, and both methods obtained optimum results in each run for all benchmarks. The most significant performance difference between the original FA and the other methods can be observed in the f14 test, where the basic version completely failed to converge to the optimum region. On the other hand, the basic FA showed very competitive results for the f3 benchmark.
When the benchmarks with D = 30 are considered, the proposed CFAEE again obtained superior results, leaving the GDAFA approach in second place. The superiority of CFAEE can be seen in the f5, f7, f8, and f13 benchmarks, where the difference between CFAEE (first), followed by GDAFA (second), and all other observed algorithms was the most significant. It is also worth noting that the basic FA implementation again performed well and exhibited competitive performance for test instances f1, f2, f5, f9, and f10, where it outperformed several other enhanced FA implementations.
When the most complex benchmarks (D = 100) are observed, the superiority of the proposed CFAEE can be seen once more. This is most obvious in test instances f7, f8, and f13, where the performance of the CFAEE (first), followed closely by GDAFA (second), was by far the best compared to all other algorithms. The GDAFA, on the other hand, performed very well in test instances f6, f9, and f14, finishing in first place, in front of the proposed CFAEE. Again, similarly to the D = 10 and D = 30 benchmarks, the basic FA implementation was very competitive, which can easily be seen for the f1 and f6 benchmarks, where its performance was close to that of CFAEE and GDAFA, while leaving the other enhanced FA implementations behind.
Finally, for the instances with only two dimensions (Table 6), all methods except FA and WFA managed to reach the optimum in all runs. These functions exhibit many local optima, and FA and WFA did not show satisfactory exploration ability in all runs. This deficiency of the basic FA is described in Section 3.2.
To make the performance differences clearer for the reader, the number of times each algorithm achieved the best result per benchmark and per performance indicator is counted in Table 7.
Further, to determine whether there is a statistically significant difference in the results, we applied the Wilcoxon signed-rank test to perform pair-wise comparisons between the proposed CFAEE and the other improved FA versions, as well as the original FA algorithm, for the 100-dimensional simulations (Table 5). Following the usual practice for determining whether results come from different distributions, a significance level of α = 0.05 was adopted. It should be noted that the results for D = 10 and D = 30 do not exhibit statistically significant differences, since low- and medium-dimensional problems are easy tasks for all methods included in the analysis.
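The pair-wise comparison procedure can be sketched with SciPy's `wilcoxon` (the two arrays below are purely illustrative stand-ins for the per-run objective values collected over the 50 independent runs):

```python
from scipy.stats import wilcoxon

ALPHA = 0.05  # significance level used in the paper

def compare_pairwise(runs_a, runs_b, alpha=ALPHA):
    """Pair-wise Wilcoxon signed-rank test between per-run objective
    values of two algorithms on the same benchmark instance."""
    _, p_value = wilcoxon(runs_a, runs_b)
    return p_value, p_value < alpha  # significant if p < alpha

# Illustrative per-run values (50 runs): algorithm A consistently better.
runs_a = [0.010 + 0.001 * i for i in range(50)]
runs_b = [0.020 + 0.002 * i for i in range(50)]
p, significant = compare_pairwise(runs_a, runs_b)
```

Because every paired difference favors the first algorithm, the two-sided p-value falls well below α = 0.05 and the null hypothesis of equal medians is rejected.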
Results of the Wilcoxon signed-rank test are summarized in Table 8. As can be seen from the presented table, the calculated p-value is less than the critical level α = 0.05 in all cases, so it can be concluded that the proposed CFAEE, on average, significantly outperforms all other approaches.

Table 2. Function details for benchmarks problem set I.

Convergence speed graphs of selected functions, averaged over 50 runs for all metaheuristics included in the comparative analysis in Table 5, are shown in Figure 1.

Benchmark Problem Set 2
The second bound-constrained validation of the proposed CFAEE was conducted on the very challenging CEC 2017 benchmark suite [59]. The suite is composed of 30 benchmarks divided into four groups: F1-F3 are unimodal, F4-F10 are multimodal, F11-F20 belong to the class of hybrid functions, while F21-F30 are very challenging composition functions. The last group combines properties of the unimodal, multimodal, and hybrid functions; moreover, these functions are shifted and rotated.
Test instance F2 was excluded from the test suite due to its unstable behavior [60], and these results are not reported. Basic details of the CEC 2017 instances are given in Table 9. Simulations were executed with 30-dimensional instances (D = 30), and the mean (average) and standard deviation (std) over 50 runs are reported. The proposed CFAEE is compared against the basic FA with dynamic α, the state-of-the-art improved Harris hawks optimization (IHHO) presented in [61], and other well-known efficient nature-inspired metaheuristics: HHO, DE, GOA, GWO, MFO, MVO, PSO, WOA, and SCA.
In this study, the same experimental setup as in [61] was recreated. The study in [61] reports results with N = 30 and T = 500. However, as in the first bound-constrained experiment, since the CFAEE utilizes more FFEs in each run, maxFFE is used as the termination criterion. All approaches included in the comparative analysis employ one FFE per solution in the initialization and update phases; therefore, to conduct an unbiased comparison, maxFFE was set to 15,030 (N + N · T). Control parameter adjustments of the opponent methods can be retrieved from [61].
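The FFE budget used as the termination criterion follows directly from the population size N and iteration count T, and the same formula reproduces the budget used later in the dropout experiments:

```python
def max_ffe(n_pop, n_iter):
    """Total FFE budget: one FFE per solution at initialization
    plus one FFE per solution per iteration, i.e., N + N * T."""
    return n_pop + n_pop * n_iter

# CEC 2017 comparative analysis setup (N = 30, T = 500)
assert max_ffe(30, 500) == 15_030
# Dropout regularization setup (N = 7, T = 10)
assert max_ffe(7, 10) == 77
```

Terminating all algorithms at the same maxFFE, rather than at the same iteration count, keeps the comparison unbiased when an algorithm spends extra evaluations per iteration.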
Comparative analysis results for the CEC 2017 benchmark suite are reported in Table 10. The best result for each performance indicator and instance is marked in bold. Moreover, if two or more algorithms tied for the best performance, these results are also underlined. Very similar results as in [54] were obtained, with subtle discrepancies due to the stochastic nature of metaheuristics.
Table 10 shows that the CFAEE obtained the best results on 21 functions: F1, F3, F5, F6, F7, F8, F11, F12, F13, F15, F17, F19, F20, F21, F22, F23, F25, F26, F28, F29, and F30. On some of these functions the CFAEE was tied with another algorithm for the best result; in such cases, both results are shown in bold and underlined. Specifically, the CFAEE tied with the IHHO on F3, F6, F19, F21, and F29, with the PSO on F11, and with the DE on F13 and F15. For function F9, the two best algorithms were MVO and PSO.
In a minority of cases, the CFAEE was outperformed by the IHHO or another algorithm. The IHHO was better only on functions F4 and F14. The remaining best solutions came from PSO, MVO, and DE: besides the previously mentioned tie between PSO and MVO on function F9, PSO was best on functions F10 and F16, while DE was better on F18, F24, and F27.
It is important to note that in no case was the original FA better than the improved CFAEE. For some functions, the CFAEE achieved vastly improved results, more than 1000 times better on F1. Large differences can also be seen on other functions, such as F12, F13, F18, and F30.
Considering all of the mentioned cases, there is no doubt that the proposed CFAEE is superior not only to the original FA, but also to every other tested algorithm; the introduced enhancements are therefore justified.
The Friedman test [62,63] and the two-way analysis of variance by ranks were performed to determine the significance of the differences between the novel CFAEE and the alternative methods used for comparison. This was conducted to further establish the statistical significance of the enhancements, beyond simply comparing the results. Tables 11 and 12 present the Friedman test ranks and the aligned Friedman test ranks, respectively, achieved by the 12 algorithms over the functions of the CEC 2017 set. As seen in Table 11, the proposed CFAEE achieves better performance than the 10 other algorithms as well as the original FA. The original HHO algorithm had an average ranking of 9.483, the modified IHHO 3.138, and the original FA 6.655. The CFAEE, with an average ranking of 1.551, was more than twice as good as the previous best performer, the IHHO.
Additionally, the Iman and Davenport test [64] was performed, since research [65] shows that this test can provide more precise results than the χ². A summary of the results of the Friedman and Iman and Davenport tests is given in Table 13.
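The Iman and Davenport statistic is a standard correction of the Friedman χ² statistic. Assuming k = 12 compared algorithms and N = 29 functions (the 30 CEC 2017 functions minus the excluded F2), the reported value can be reproduced from the formula:

```python
def iman_davenport(chi2_f, n_problems, k_algorithms):
    """Iman-Davenport correction of the Friedman statistic:
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F),
    distributed as F with (k - 1, (k - 1) * (N - 1)) degrees of freedom."""
    n, k = n_problems, k_algorithms
    return (n - 1) * chi2_f / (n * (k - 1) - chi2_f)

# Reported Friedman statistic chi2_r = 181.50 over 29 functions, 12 algorithms
ff = iman_davenport(181.50, n_problems=29, k_algorithms=12)
# -> approximately 36.96, matching the value reported in the text
```

The fact that this calculation recovers the reported statistic of about 36.95 supports the (N, k) = (29, 12) reading of the experimental setup.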
Upon completion of the calculations, the Iman and Davenport test statistic equals 36.95; compared against the corresponding F-distribution critical value of approximately 1.82, it is significantly higher, so this test also rejects H0. Furthermore, the Friedman statistic (χ²r = 181.50) is far larger than the χ² critical value at the significance level α = 0.05.
Consequently, the null hypothesis (H0) can be rejected, suggesting that the CFAEE performed vastly better than the rest of the tested algorithms. Since the null hypothesis was rejected by both statistical tests, the non-parametric post hoc Holm step-down procedure was also conducted; its results are presented in Table 14. In this procedure, all methods are sorted according to their p-values and compared with α/(k − i), where k and i represent the degrees of freedom and the algorithm number, respectively. In this study, α was set to 0.05 and 0.1. Moreover, it should be noted that the p-values are provided in scientific notation. The results given in Table 14 suggest that the proposed algorithm significantly outperformed all opponent algorithms at both significance levels.
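The Holm step-down procedure described above can be sketched in a few lines of plain Python (zero-based positions in the sorted order are an implementation assumption; the p-values below are hypothetical):

```python
def holm_step_down(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the p-values in ascending order and
    compare the i-th smallest with alpha / (k - i), where k is the number of
    comparisons and i = 0, 1, ... is the position in the sorted order.
    As soon as one hypothesis fails to be rejected, all remaining (larger)
    p-values are also retained."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    rejected = [False] * k
    for i, idx in enumerate(order):
        if p_values[idx] < alpha / (k - i):
            rejected[idx] = True
        else:
            break  # stop at the first non-rejection
    return rejected

# Hypothetical p-values for four competitor comparisons
pvals = [1e-6, 0.04, 1e-4, 0.2]
rejected = holm_step_down(pvals, alpha=0.05)  # -> [True, False, True, False]
```

Note that 0.04 would pass an unadjusted α = 0.05 threshold but fails the Holm-adjusted threshold of 0.05/2 = 0.025, illustrating how the procedure controls the family-wise error rate.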
Finally, to visualize the differences between the better-performing methods included in the comparison, the dispersion of results over 50 runs for selected benchmark instances is shown with box-and-whisker diagrams in Figures 2 and 3.

Dropout Estimation Simulations
In this section, an empirical study of the proposed CFAEE on the practical problem of dropout regularization in CNNs is presented. The basic experimental setup (problem modeling, control parameter setup, and dataset details) is presented first, followed by the obtained results, a comparative analysis with other metaheuristics-based methods, and a discussion.
For experimental purposes, two CNN structures with the default values provided by the Caffe library, which obtain modest performance on the employed datasets, were used. The purpose of the experiment was to further investigate the performance of metaheuristics in optimizing the dropout probability dp. The same experimental conditions as in [5] were utilized.
All metaheuristics, as well as the CNN framework, were developed in Python using its core and data science libraries (scikit-learn, NumPy, SciPy, along with pandas and matplotlib for visualization) and the Keras API. Experiments were conducted on an Intel® Core™ i7-8700K CPU with 64 GB of RAM, running Windows 10, with 6 × NVIDIA GTX 1080 GPUs.

Basic Experimental Setup
The study proposed in this manuscript utilizes a research setup similar to that in [5]. Four parameters that influence the CNN learning process were taken into consideration: the learning rate η, the L1 regularization (penalty, momentum) α, the L2 regularization (weight decay) λ, and the dropout probability dp. However, in all experiments the tuple (η, α, λ) was fixed, while the metaheuristic approaches attempted to optimize only the dp parameter. Therefore, this problem belongs to the group of global optimization challenges with a single parameter being optimized.
In the conducted simulations, two CNN architectures provided by the well-known Caffe library [66] examples (https://caffe.berkeleyvision.org/, accessed on 10 October 2021) were utilized, as in [5]. The first CNN architecture was used for the classification tasks on the MNIST, Fashion-MNIST, Semeion, and USPS datasets, while the second was employed for the CIFAR-10 challenge. The only differences from the original Caffe CNN designs are the following: an extra dropout layer was added, and for the Semeion and USPS simulations the kernel size was set to 3 × 3 instead of 5 × 5 (as provided in Caffe), due to the lower image resolutions.
A graphical representation of the utilized CNN structures, generated by the plot_model Keras function, is shown in Figure 4. The method was tested on five well-known image classification datasets: MNIST, Fashion-MNIST, Semeion, USPS, and CIFAR-10. The total number of instances per class in the training and testing sets for all datasets employed in the simulations is shown in Figure 5. Although some datasets are unbalanced in the original train and test sets (they do not have the same number of observations for each class), the original split was used in the experiments, and all metaheuristics were tested under the same experimental conditions. The only dataset that was not originally split into training and testing sets was Semeion; for the purpose of this study, it was manually divided into 400 observations for training and 993 for testing, as suggested in [5].
The training set for each dataset was further divided into training and validation subsets, while maintaining the same proportion of instances per class. Data preprocessing was not applied. The dataset details, in terms of the split, along with the training batch size (provided in parentheses), are shown in Table 15. The same configuration was employed in [5].
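The class-proportion-preserving split described above can be sketched in plain Python (a minimal illustration; the `stratified_split` helper and the toy labels are assumptions, while the actual split sizes and batch sizes follow Table 15):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices into train/validation subsets while keeping
    the same proportion of instances for each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_class[lbl].append(idx)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = int(round(val_fraction * len(idxs)))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

# Toy example: 100 samples, two classes with 60 and 40 instances
labels = [0] * 60 + [1] * 40
train, val = stratified_split(labels, val_fraction=0.2)
# Validation holds 12 class-0 and 8 class-1 samples: same 60/40 proportion
```

The same effect could be achieved with scikit-learn's `train_test_split(..., stratify=labels)`, which the authors' Python stack already includes.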
The values of the η, α, and λ parameters, as well as the number of training epochs, were set to the default values provided by the Caffe library, with the only exception being the Semeion dataset, where η was set to 0.001 (not the Caffe default) due to the smaller number of images. The dp, which is subject to optimization, can take any continuous value from the range [0, 1]. The parameter setup is summarized in Table 16.

Each solution in the population represents one candidate dp value. The fitness of a solution is calculated in the following way: a CNN with the given dp is generated, trained on the training set, and validated on the validation set with an early stopping condition (the early stopping is adjusted as 5% of the total number of training epochs); afterwards, the trained CNN is evaluated on the test set and the classification error rate Er is returned. The fitness is inversely proportional to Er: fit = 1/Er.
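The fitness mapping described above can be sketched as follows. The `train_and_evaluate` callable is a hypothetical stand-in for the actual Keras build/train/early-stop/test pipeline; only the inverse mapping fit = 1/Er and the 5% early-stopping rule come from the text:

```python
def fitness(dp, train_and_evaluate, total_epochs=100):
    """Fitness of a candidate dropout rate dp in [0, 1].

    train_and_evaluate(dp, patience) is assumed to build a CNN with
    dropout probability dp, train it with early stopping, and return
    the classification error rate E_r measured on the test set."""
    assert 0.0 <= dp <= 1.0
    patience = max(1, int(0.05 * total_epochs))  # early stopping: 5% of epochs
    error_rate = train_and_evaluate(dp, patience)
    return 1.0 / error_rate  # fitness inversely proportional to E_r

# Toy stand-in objective: error minimized near dp = 0.5 (purely illustrative)
toy = lambda dp, patience: 0.01 + (dp - 0.5) ** 2
assert fitness(0.5, toy) > fitness(0.1, toy)
```

Because the fitness is the reciprocal of the error rate, maximizing fitness is equivalent to minimizing Er, which lets a maximizing metaheuristic drive the classification error down.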
All metaheuristics were tested with a total of 77 FFEs. The study proposed in [5] evaluated methods with N = 7 and T = 10, which also yields a total of 77 FFEs (7 + 7 × 10).
To visualize the flow and design of the CNN dropout regularization experiment, the general CFAEE flowchart and the flowchart for fitness calculation are shown in Figure 6.

Results, Comparative Analysis, and Discussion
For the purpose of the study proposed in [5], the bat algorithm (BA) [67], cuckoo search (CS) [68], FA [1], and particle swarm optimization (PSO) [69] metaheuristics were implemented and tested. To put the metaheuristics-determined dp into context, results of the standard Caffe CNN with dropout (Dropout Caffe) and without dropout (Caffe) are also provided.
The CFAEE was tested with the same control parameter adjustments as in the bound-constrained experiments (Table 1). The control parameters of the other metaheuristic methods included in the analysis are summarized in Table 17.

The results in Table 18 clearly indicate the superior performance of the proposed CFAEE method regarding the dp value subjected to the optimization process. On the MNIST dataset, the proposed CFAEE method obtained a superior accuracy of 99.23% with a determined dp value of 0.516. All other metaheuristic approaches obtained dp values below the standard Dropout Caffe value of dp = 0.5. In this particular case, the results show that the dp value should be slightly greater than 0.5 in order to achieve better accuracy, and the proposed CFAEE method was the only one able to find such a value.
A similar conclusion can be derived for the Fashion-MNIST experiment. Most methods included in the analysis generated dp values lower than 0.5 and reported worse results than those achieved by the Dropout Caffe. However, BA, SSA, FA, and the proposed CFAEE obtained better accuracy than Dropout Caffe with dp > 0.5.
On the Semeion dataset, the proposed CFAEE again obtained the best accuracy of 98.46%, with a dp value of 0.719. It is clear that on this particular dataset the accuracy increases with dp values higher than the standard Dropout Caffe value of 0.5. The second best method was BA, which achieved an accuracy of 98.35% with dp = 0.692. The simple Caffe, which does not employ dropout (dp = 0), achieved 97.62% on this dataset, while the Dropout Caffe approach (dp = 0.5) achieved an accuracy of 98.14%.
A similar pattern can be seen on the USPS dataset. The proposed CFAEE again achieved the best accuracy of 96.8%, with an obtained dp value of 0.845. As with the previous datasets, increasing the dp value leads to better accuracy. The second best method on this dataset was BA, which achieved 96.45% with dp = 0.762. The improvement in accuracy over the standard Caffe and Dropout Caffe methods is significant, as the proposed CFAEE achieved an accuracy approximately 1% greater than Caffe and about 0.6% greater than Dropout Caffe.
Finally, the results on the CIFAR-10 dataset show a different pattern: if dp is larger than the standard Dropout Caffe value (dp = 0.5), performance starts to drop and accuracy decreases. In this case, the model drops out too many neurons and is not able to generalize well. At the same time, if dp is too small, performance also drops (similar to the standard Caffe, which utilizes dp = 0). It can be concluded that on the CIFAR-10 dataset the best performance is achieved for dp values slightly below 0.5. The proposed CFAEE method achieved the best accuracy of 72.32% with dp = 0.388, and it was the only method that found a dp value below 0.5, as all other metaheuristics determined dp values in the range [0.5, 1].
Finally, the original FA method showed only average performance, and the proposed CFAEE managed to substantially outscore its basic version in all tests. Therefore, as in the case of the bound-constrained benchmarks, the improvements over the original approach were also validated on the practical challenge of dropout regularization.
As was done for the bound-constrained benchmark problem set 1 (Section 4.2), a Wilcoxon signed-rank test was conducted to establish whether there were significant differences between the results of the proposed CFAEE and the other methods. The mean classification error rate over 20 independent runs and a critical level of α = 0.05 were used for the test.
The results of the Wilcoxon signed-rank test are shown in Table 19. The calculated p-values are in all cases less than the critical value α = 0.05, which implies that the proposed CFAEE, on average, substantially outperformed all other approaches.

Conclusions
The proposed manuscript introduced a novel FA approach that enhances both the exploration and exploitation processes of the original method. The CFAEE incorporates an explicit exploration mechanism and CLS, thereby suppressing the observed deficiencies of the original FA.
Following recent practice in the optimization field, the introduced CFAEE algorithm was first tested on recent CEC benchmark function sets, and the obtained results were compared with other modern metaheuristic methods tested under the same experimental environment. Additionally, statistical tests were executed and confirmed that the enhanced FA algorithm significantly outscored the other methods, including the original FA.
The second part of the experiment focused on applying the proposed CFAEE to a practical CNN problem: optimization of the dropout probability value. Dropout is crucial for overfitting prevention, which is an important challenge in the machine learning domain.
The CFAEE-driven CNN was tested on five standard datasets: MNIST, Fashion-MNIST, Semeion, USPS, and CIFAR-10. Furthermore, since the potential of metaheuristics for this type of challenge has not been investigated sufficiently, 10 other well-known swarm intelligence approaches were also implemented and tested on this problem. The accuracies achieved on those datasets indicate that the CFAEE has superior performance over the other methods, as well as a promising future in this area.
Accordingly, future work will focus on applying the proposed CFAEE method to other machine learning problems. Due to its promising performance, the CFAEE will be adapted for tackling other NP-hard problems, including challenges in wireless sensor networks and cloud computing. Finally, regularization in CNNs can be further addressed by utilizing the CFAEE to fine-tune the α and λ parameters, with the goal of obtaining even better classification accuracy. Moreover, variables of the convolutional layers, such as the size and depth of the filters, can be parameterized through the CFAEE instead of more classical metaheuristic algorithms [73].