Multi-Swarm Algorithm for Extreme Learning Machine Optimization

Many machine learning approaches are available and commonly used today; however, the extreme learning machine is appraised as one of the fastest and, additionally, relatively efficient models. Its main benefit is its speed, which makes it suitable for integration within products that require models making rapid decisions. Nevertheless, despite their large potential, extreme learning machines have not yet been exploited enough, according to the recent literature, and they still face several challenges that need to be addressed. The most significant downside is that the performance of the model heavily depends on the weights and biases allocated within the hidden layer. Finding appropriate values of these parameters for practical tasks represents an NP-hard continuous optimization challenge. The research proposed in this study focuses on determining optimal or near-optimal weights and biases in the hidden layer for specific tasks. To address this task, a multi-swarm hybrid optimization approach has been proposed, based on three swarm intelligence meta-heuristics, namely the artificial bee colony, the firefly algorithm, and the sine-cosine algorithm. The proposed method has been thoroughly validated on seven well-known classification benchmark datasets, and the obtained results are compared to other existing cutting-edge approaches from the recent literature. The simulation results point out that the suggested multi-swarm technique is capable of obtaining better generalization performance than the other approaches included in the comparative analysis in terms of accuracy, precision, recall, and F1-score indicators. Moreover, to show that combining two algorithms is not as effective as joining three, additional hybrids, generated by pairing every two of the methods employed in the proposed multi-swarm approach, were also implemented and validated against four challenging datasets.
The findings from these experiments also confirm the superior performance of the proposed multi-swarm algorithm. Sample code from the devised ELM tuning framework is available on GitHub.


Introduction
Extreme learning machine (ELM) represents one of the recent and promising approaches that can be applied to single hidden layer feed-forward artificial neural networks (SLFN). This approach was initially proposed in [1], and it introduced the concept that the input weight and bias values in the hidden layer are allocated in a random fashion, while the output weight values are computed by utilizing the Moore-Penrose (MP) pseudo-inverse [2]. ELMs have shown excellent generalization capabilities [3], and they are known to be very fast and efficient due to the fact that they do not require traditional training, which is one of the most time-consuming tasks when dealing with other types of neural networks. Following the experimental setup given in [2], the proposed method has been tested on seven benchmark datasets in order to provide a fair comparison of the results.
Moreover, to show that combining two algorithms is not as effective as joining three, in an additional set of experiments, hybrids generated by pairing every two of the methods employed in the proposed multi-swarm approach were also implemented and validated against four challenging imbalanced datasets.
The rest of the manuscript is structured in the following way. Section 2 provides a literature survey on ELM and swarm intelligence meta-heuristics. The description of the proposed multi-swarm approach is provided in Section 3. Section 4 describes the conducted experiments and exhibits the simulation findings on seven datasets together with the comparative analysis with similar approaches. Lastly, Section 5 delivers final observations, proposes future directions in this area, and concludes the paper.

Background
This section first introduces the ELM as one of the ML models. After that, a brief survey of swarm intelligence meta-heuristics is provided, together with the most common applications. Finally, an overview of the ELM models optimized with swarm intelligence meta-heuristic algorithms is given.

Extreme Learning Machine
Extreme learning machine (ELM) was proposed by Huang et al. [1] for single-hidden layer feed-forward neural networks (SLFNs). The algorithm randomly chooses the input weights and analytically determines the output weights of SLFNs. After the input weights and the hidden layer biases are chosen arbitrarily, SLFNs can simply be considered as a linear system, and the output weights can be analytically determined through a simple generalized inverse operation of the hidden layer output matrices. This algorithm provides a faster learning speed than traditional feed-forward network learning algorithms, while obtaining better generalization performance. Additionally, ELM tends to reach the smallest training error and the smallest norm of weights. The output weights are computed using the Moore-Penrose (MP) generalized inverse [9]. As shown in [10], the learning speed of ELM can be thousands of times faster than that of conventional learning algorithms, with better generalization performance than gradient-based learning models. Unlike traditional gradient-based learning algorithms, which only work for differentiable activation functions, the ELM learning algorithm can be used to train SLFNs with many non-differentiable activation functions.
For a set of training samples $\{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$ with $N$ samples and $m$ classes, the SLFN with $L$ hidden nodes and activation function $g(x)$ is expressed as in (1) [1], where $\mathbf{w}_i = [w_{i1}, \ldots, w_{in}]^T$ is the input weight vector, $b_i$ is the bias of the $i$th hidden node, $\boldsymbol{\beta}_i = [\beta_{i1}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes, $\mathbf{w}_i \cdot \mathbf{x}_j$ is the inner product of $\mathbf{w}_i$ and $\mathbf{x}_j$, and $\mathbf{t}_j$ is the network output with respect to input $\mathbf{x}_j$.
In this equation, H is the hidden layer output matrix of the neural network as explained in [11], while β is the output weight matrix.
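To make the training procedure above concrete, the following NumPy sketch implements a minimal ELM: input weights and biases drawn at random, a sigmoid activation (one common choice for g), and output weights computed with the MP pseudoinverse. The function names and the [-1, 1] initialization range are illustrative assumptions, not taken from the referenced implementations.

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=None):
    """Train a single-hidden-layer ELM: random input weights and biases,
    output weights via the Moore-Penrose pseudoinverse (beta = H^+ T)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # hidden output matrix H
    beta = np.linalg.pinv(H) @ T                             # output weight matrix
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because only the linear output layer is solved for, training reduces to a single pseudoinverse computation, which is the source of the speed advantage discussed above.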
The ELM has successfully been used in solving many practical problems, such as text categorization [12], face recognition [13], image classification [14], different medical diagnostics [15,16], and so on. Over time, researchers have presented various improvements of the original ELM. In [4], the authors propose a pruned ELM algorithm as a systematic and automated approach for designing an ELM classifier network. By considering the relevance of the hidden nodes to the class labels, the algorithm removes the irrelevant nodes from the initial large set of hidden nodes. Zhu et al., in their paper [5], presented an improved ELM that uses the differential evolution algorithm to tune the input weights, with the output weights computed by the MP generalized inverse. Experimental results show that this approach provides good generalization performance with more compact networks. Adopting this idea, in [17], the authors introduced a new kind of evolutionary algorithm based on PSO which, using the concepts of ELM, can train the network more suitably for some prediction problems. In order to deal with data with an imbalanced class distribution, a weighted ELM was proposed in [18], which is also able to generalize to balanced data. Recently, Alshamiri et al. presented in [2] a model for tuning the ELM by using two SI-based techniques: ABC and Invasive Weed Optimization (IWO) [19]. In this approach, the input weights and hidden biases are selected using ABC and IWO, and the output weights are computed using the MP generalized inverse.

ELM Tuning by Swarm Intelligence Meta-Heuristics
An extensive literature survey indicates that swarm intelligence meta-heuristics have not been sufficiently exploited for the optimization of the ELM. In addition to the already mentioned paper [2] that inspired this research, just a few approaches combining ELM and meta-heuristics have been published in the past several years. The research published in [68] proposed a hybrid PSO-ELM model and used it for flash flood prediction. The algorithm was tested on the geospatial database of a typhoon area and compared to traditional ML models; the obtained results showed that the PSO-ELM model was superior to the other ML models. An ELM optimized by PSO was also used in [69], where the authors applied the ELM to derive hydropower reservoir operation rules. The proposed method, named the class-based evolutionary extreme learning machine (CEELM), combined k-means clustering, used to separate the influential factors into clusters with simpler patterns, with an ELM optimized by PSO for identifying the complex input-output relationship of every cluster. According to the authors, CEELM showed excellent generalization capabilities.
Faris et al. [70] discussed the application of the salp swarm algorithm (SSA) for optimizing the ELM and improving its accuracy. The proposed approach was tested against ten benchmark datasets and compared to other popular training techniques. They concluded that the ELM hybridized with SSA outperforms the other approaches in achieved accuracy and obtains satisfactory prediction stability. Finally, an improved bacterial foraging optimization (BFO) algorithm was proposed in [71] and applied to the ELM optimization task. The obtained results once again indicated that it is possible to achieve similar or even better performance than other ML methods in a reduced amount of time.

Proposed Hybrid Meta-Heuristics
This section first introduces the basic implementations of the three algorithms used in the proposed research, namely ABC, FA, and SCA. Since each algorithm has specific deficiencies, a novel multi-swarm algorithm is proposed that combines the strengths of the individual algorithms and overcomes their individual flaws, creating synergy and achieving a complementary effect.

The Original ABC Algorithm
The artificial bee colony (ABC) algorithm was designed for continuous optimization problems and was inspired by the foraging behavior of honey bees [23,72]. ABC uses three control parameters and utilizes three classes of artificial bees: employed bees, onlookers, and scouts. Employed bees make up half of the colony. In this model, a food source represents a possible solution to the problem, and there is exactly one employed bee per food source. An employed bee performs the search process by examining the solution's neighborhood. An onlooker chooses a food source for exploitation based on the information gained from the employed bees. If a food source does not improve for a predetermined number of cycles, the scouts replace that food source with a new, randomly chosen one. The limit parameter controls this process [73].
The ABC algorithm, as an iterative algorithm, starts by associating each employed bee with a randomly generated food source. Each bee $x_i$ $(i = 1, 2, \ldots, N)$ is a $D$-dimensional vector, where $N$ denotes the size of the population. The initial population of candidate solutions is created using expression (4): $x_{i,j} = lb_j + rand(0, 1) \cdot (ub_j - lb_j)$, where $x_{i,j}$ is the $j$th parameter of the $i$th bee in the population, $rand(0, 1)$ is a random real number between 0 and 1, and $ub_j$ and $lb_j$ are the upper and lower bounds of the $j$th parameter, respectively. Naturally, $x$ here represents a different element than the training samples from Equation (1).
There are many formulations of the fitness function, but in most implementations, for maximization problems, fitness is simply proportional to the value of the objective function. In case the problem to be solved targets the minimization of a function denoted here by objFun, the task is converted for maximization using a modification such as in (5): $fit_i = 1/(1 + objFun_i)$ if $objFun_i \geq 0$, and $fit_i = 1 + |objFun_i|$ otherwise.
Each employed bee discovers a food source in its neighborhood and evaluates its fitness. The discovery of a new neighborhood solution is simulated with expression (6): $x_{i,j}^{new} = x_{i,j} + \varphi \cdot (x_{i,j} - x_{k,j})$, applied to each parameter $j$ with probability $MR$, where $x_{i,j}$ is the $j$th parameter of the old solution $i$, $x_{k,j}$ is the $j$th parameter of a neighboring solution $k$, $\varphi$ is a random number between 0 and 1, and $MR$ (modification rate) is a control parameter of the ABC algorithm.
If the fitness of the new solution is higher than the fitness of the old one, the employed bee continues the exploitation process with the new food source; otherwise, it retains the old one. Employed bees share information about the fitness of their food sources with the onlookers, and an onlooker selects a food source $i$ with a probability proportional to the solution's fitness, as given in (7): $p_i = fit_i / \sum_{n=1}^{N} fit_n$.
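For illustration, the fitness conversion of Equation (5), the neighborhood search of Equation (6), and the onlooker selection probabilities of Equation (7) can be sketched as follows; the helper names are hypothetical, and φ is drawn uniformly from [0, 1] as stated above.

```python
import numpy as np

def fitness(obj_val):
    # Common ABC conversion of a minimization objective into a maximization fitness
    return 1.0 / (1.0 + obj_val) if obj_val >= 0 else 1.0 + abs(obj_val)

def neighbor(x_i, x_k, mr, rng):
    """Candidate food source near x_i, steered by a randomly chosen neighbor x_k.
    Each parameter is modified with probability MR (modification rate)."""
    phi = rng.uniform(0.0, 1.0, size=x_i.shape)
    mask = rng.uniform(0.0, 1.0, size=x_i.shape) < mr
    return np.where(mask, x_i + phi * (x_i - x_k), x_i)

def onlooker_probs(fits):
    # Selection probability proportional to fitness
    fits = np.asarray(fits, dtype=float)
    return fits / fits.sum()
```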

The Original Firefly Algorithm
The firefly algorithm (FA) was introduced by Yang [24]. The proposed model uses the brightness and attractiveness of fireflies. Brightness is determined by the objective function value, while attractiveness depends on the brightness. This is expressed with Equation (8) [24]: $I(x) \propto f(x)$, where $I(x)$ represents the brightness and $f(x)$ denotes the value of the objective function at location $x$. Again, it is noted that the $x$ in the current subsection should not be mistaken for the representations in the previous subsections.
The attractiveness of a firefly decreases as the distance from the light source increases [24], following the inverse square law (9): $I(r) = I_0 / r^2$, where $I(r)$ represents the light intensity at distance $r$, and $I_0$ stands for the light intensity at the source. In order to model a real natural system, where the light is partially absorbed by its surroundings, the FA uses the $\gamma$ parameter, which represents the light absorption coefficient. The combined effect of the inverse square law and the $\gamma$ coefficient is approximated with the following Gaussian form (10) [24]: $I(r) = I_0 e^{-\gamma r^2}$. Moreover, each firefly individual utilizes attractiveness $\beta$, which is directly proportional to the light intensity of a given firefly and also depends on the distance, as shown in Equation (11): $\beta(r) = \beta_0 e^{-\gamma r^2}$,
where parameter $\beta_0$ designates the attractiveness at distance $r = 0$. It should be noted that, in practice, Equation (11) is often replaced by Equation (12) [24]: $\beta(r) = \beta_0 / (1 + \gamma r^2)$. Based on the above, the basic FA search equation for a random individual $i$, which moves in iteration $t + 1$ to a new location $x_i$ towards an individual $j$ with greater fitness, is given as (13) [24]: $x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_{i,j}^2} (x_j^t - x_i^t) + \alpha \kappa$, where $\alpha$ stands for the randomization parameter, $\kappa$ is a random number drawn from a Gaussian or uniform distribution, and $r_{i,j}$ represents the distance between the two observed fireflies $i$ and $j$. Typical values that establish satisfying results for most problems are $\beta_0 = 1$ and $\alpha \in [0, 1]$. The $r_{i,j}$ is the Cartesian distance, calculated by using Equation (14): $r_{i,j} = \|x_i - x_j\| = \sqrt{\sum_{k=1}^{D} (x_{i,k} - x_{j,k})^2}$, where $D$ marks the number of problem-specific parameters.
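The movement rule of Equations (13) and (14) can be sketched as follows; the uniform range assumed for the random component κ is an illustrative choice, since implementations vary.

```python
import numpy as np

def firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.5, seed=None):
    """Move firefly i towards a brighter firefly j: attraction term plus
    a small random step scaled by the randomization parameter alpha."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(x_i - x_j)                    # Cartesian distance r_ij
    beta = beta0 * np.exp(-gamma * r ** 2)           # attractiveness at distance r
    kappa = rng.uniform(-0.5, 0.5, size=x_i.shape)   # random component (assumed uniform)
    return x_i + beta * (x_j - x_i) + alpha * kappa
```

With γ = 0 and α = 0, the attraction term moves firefly i exactly onto firefly j, which makes the roles of the three terms easy to check.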

The Original SCA Method
The sine-cosine algorithm (SCA), proposed by Mirjalili [74], is based on a mathematical model of the sine and cosine trigonometric functions. The solutions' positions in the population are updated based on the outputs of the sine and cosine functions, which makes them oscillate around the best solution. The return values of these functions lie between −1 and +1, which is the mechanism that keeps the solutions fluctuating. The algorithm starts by generating a set of random candidate solutions within the boundaries of the search space in the initialization phase. Exploration and exploitation are then controlled throughout the execution by random adaptive variables.
The solutions' position update process is performed in each iteration by using Equations (15) and (16): $X_i^{t+1} = X_i^t + r_1 \sin(r_2) \cdot |r_3 P_i^* - X_i^t|$ (15) and $X_i^{t+1} = X_i^t + r_1 \cos(r_2) \cdot |r_3 P_i^* - X_i^t|$ (16), where $X_i^t$ and $X_i^{t+1}$ are the current solution's position in the $i$th dimension at the $t$th and $(t+1)$th iteration, respectively, $r_1$, $r_2$, and $r_3$ are pseudo-randomly generated numbers, $P_i^*$ denotes the destination point's position (the current best approximation of an optimum) in the $i$th dimension, and $|\cdot|$ represents the absolute value. The same notation as in the original manuscript where the method was initially proposed [74] is used in this manuscript.
These two equations are combined by using the control parameter $r_4$, as given in Equation (17): the sine form (15) is applied when $r_4 < 0.5$, and the cosine form (16) otherwise, where $r_4$ is a randomly generated number between 0 and 1. It is noted that, for every component of each solution in the population, new values of the pseudo-random parameters $r_1$-$r_4$ are generated.
The algorithm's search process is controlled by four random parameters, which influence the current and the best solution's positions. In order to converge towards the global optimum, a balance between exploration and exploitation is required. This is achieved by changing the range of the base functions in an adaptive manner. Exploitation is supported by the fact that the sine and cosine functions exhibit cyclic patterns, which allows repositioning around the best solution, while changes in the ranges of the sine and cosine functions allow the algorithm to search outside of the corresponding destinations. Furthermore, a solution's position should not overlap with the areas of other solutions.
For better quality of randomness, the values of parameter $r_2$ are generated within the range $[0, 2\pi]$, which supports exploration. The balance between diversification and exploitation is controlled by Equation (18): $r_1 = a - t \frac{a}{T}$,
where t is the current iteration, T represents the maximum number of iterations in a run, while a is a constant.
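A per-dimension sketch of the SCA position update, combining the sine and cosine forms through r4 and shrinking r1 according to Equation (18), might look as follows (names are illustrative):

```python
import numpy as np

def sca_update(x, p_best, t, T, a=2.0, seed=None):
    """One SCA update: for every dimension, draw fresh r2, r3, r4 and move
    around the destination point p_best; r1 decreases linearly with t."""
    rng = np.random.default_rng(seed)
    r1 = a - t * (a / T)                 # Eq. (18): balances exploration/exploitation
    new = np.empty_like(x)
    for i in range(x.size):
        r2 = rng.uniform(0.0, 2.0 * np.pi)
        r3 = rng.uniform(0.0, 2.0)
        r4 = rng.uniform(0.0, 1.0)
        trig = np.sin(r2) if r4 < 0.5 else np.cos(r2)
        new[i] = x[i] + r1 * trig * abs(r3 * p_best[i] - x[i])
    return new
```

Note that at t = T the factor r1 reaches zero, so the solutions stop moving, which is the intended end-of-run intensification behavior.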

Motivation and Preliminaries
The effectiveness of meta-heuristics in the optimization process largely depends on the efficiency of, and balance between, exploitation and exploration, which direct the search towards optimum (or sub-optimum) solutions. Additionally, according to the no free lunch (NFL) theorem, an optimizer without flaws does not exist, nor is there one which can render satisfying solutions for all kinds of NP-hard challenges. Therefore, every meta-heuristic suffers from some deficiencies, and each one also has distinctive advantages over the others.
One of the promising techniques for combining different meta-heuristics is hybridization [75,76]. If the right meta-heuristics are chosen as components of a hybrid method, the strengths of one approach compensate for the weaknesses of the other, and vice-versa. Hybrid meta-heuristics have proven to be efficient optimizers, and they have been validated against different problems [56,59,77-79].
In the modern literature, many taxonomies of hybrid meta-heuristics can be found; however, one of the most widely adopted is the one provided by Talbi [76]. According to [76], using the notion of hierarchical classification, hybrid algorithms can be differentiated into low-level (LLH) and high-level hybrids (HLH). In the case of LLH, a search function of one method is replaced with one that belongs to another optimization method. Conversely, HLH approaches are self-contained [76].
Further, both LLH and HLH can be executed in relay or teamwork mode [76]. The first mode executes in a pipeline manner, where the output of the first meta-heuristic is used as the input of the second, while in teamwork mode, the meta-heuristics evolve in parallel, cooperatively exploring and exploiting the search space.
The approach developed for the purpose of this research represents a combination of LLH and HLH and encompasses the well-known ABC, FA, and SCA meta-heuristics. These three meta-heuristics were chosen due to their complementary weaknesses and strengths, which make them promising candidates for hybridization.
Based on previous findings, the ABC algorithm has an efficient exploration mechanism that discards individuals which cannot be improved within a predefined number of iterations; however, it suffers from poor exploitation [73]. Conversely, both the FA and SCA meta-heuristics exhibit above-average intensification abilities, but they do not employ an explicit exploration mechanism, which leads to lower diversification capabilities [67,80]. The dynamic FA implementation controls the exploitation-exploration balance by shrinking parameter α throughout iterations, while the SCA uses the dynamic parameter r1. However, if the initially generated population is far away from the optimum, these dynamic parameters would only drive the search around the current solutions (novel solutions from other regions of the search space would not be generated), and when the termination condition is reached, in most cases local optimum solutions would be rendered. Additionally, regardless of the good intensification of FA and SCA, the search can be further boosted by combining their search expressions. This stems from the fact that FA and SCA employ different search equations: the FA uses the notion of distance between solutions, while the SCA employs trigonometric functions.
Motivated by the facts provided above, the proposed hybrid meta-heuristic first combines the FA and SCA algorithms in the form of an LLH with teamwork mode, and afterwards this approach is hybridized with the ABC meta-heuristic, forming an HLH teamwork-mode optimizer. The method proposed for the purpose of this research is therefore named multi-swarm ABC-FA-SCA (MS-AFS).

Overview of MS-AFS
In addition to combining the ABC, FA, and SCA meta-heuristics, the proposed MS-AFS also employs the following mechanisms:
• Chaotic and quasi-reflection-based learning (QRL) population initialization, in order to boost the search by redirecting solutions towards more favorable parts of the domain;
• An efficient learning mechanism between swarms, with the goal of combining the weaknesses and strengths of the different approaches more efficiently.
The concept of employing chaotic maps in meta-heuristic methods was first proposed by Caponetto et al. in [81]. The stochastic essence of the majority of meta-heuristic methods relies on random number generators. Nevertheless, several recent studies suggest that the search procedure could be improved if it were grounded in chaotic sequences [82,83].
Numerous chaotic maps exist, including circle, Chebyshev, logistic, sine, sinusoidal, tent, and many others. Extensive simulations conducted for the purpose of current, as well as previous research [63] with all the above-mentioned maps yielded the conclusion that the best results can be obtained by applying the logistic map, that was selected for implementation.
To establish chaotic-based population initialization, a pseudo-random number $\theta_0$ is generated as the seed for the chaotic sequence $\theta$ created by the logistic mapping (19): $\theta_{i+1} = \mu \theta_i (1 - \theta_i)$, $i = 1, 2, \ldots, N$, where $N$ denotes the population size, $i$ is the sequence number, and $\mu$ is the chaotic sequence control parameter. The $\mu$ was set to 4, as suggested in [84], while $0 < \theta_0 < 1$ and $\theta_0 \notin \{0.25, 0.5, 0.75\}$. Every parameter $j$ of each solution $i$ is mapped to the rendered chaotic sequences by Equation (20), where $X_i^c$ is the new position of individual $i$ after chaotic perturbations.

The QRL procedure was initially proposed in [85]. This approach implies the generation of quasi-reflexive-opposite solutions, following the logic that, if the original individual is positioned at a large distance from the optimum, a decent chance exists that the opposite solution could be located much nearer to the optimum.
When utilizing the QRL procedure described above, the quasi-reflexive-opposite individual $X^{qr}$ of the solution $X$ is created by applying the following expression (21) for every component $j$ of solution $X$: $X_j^{qr} = \mathrm{rnd}\left(\frac{LB + UB}{2}, X_j\right)$, where $\mathrm{rnd}\left(\frac{LB + UB}{2}, X_j\right)$ generates an arbitrary number from the uniform distribution within $\left[\frac{LB + UB}{2}, X_j\right]$, and $LB$ and $UB$ are the lower and upper search boundaries, respectively. This strategy is executed for each parameter of the observed solution $X$ in $D$ dimensions. Taking all of this into account, the population initialization of the proposed MS-AFS is summarized in Algorithm 1.
As can be observed from Algorithm 1, the size of the starting population P_start is N/2 individuals. In this way, the fitness function evaluations (FFEs) in the initialization phase are executed only N times, and no additional load, in terms of computational requirements, is imposed on the MS-AFS complexity.
After the initialization of population P by Algorithm 1, the N/2 worse solutions are chosen as the initial population (P1) of the first swarm (s1), while the remaining individuals (P2) are delegated to the second swarm (s2). The s2 is created by establishing an LLH with teamwork mode between the FA and SCA algorithms, while s1 is executed only by the ABC meta-heuristic. Because the ABC exhibits better exploration abilities and therefore has a higher chance of hitting favorable regions of the search domain, the worse N/2 individuals are chosen as the initial population of s1.
Algorithm 1 Population initialization of the proposed MS-AFS.
Step 1: Generate the starting population P_start of N/2 random solutions within the search boundaries.
Step 2: Randomly select two subsets of N/4 solutions from P_start for chaotic and QRL initialization, denoted as P_c and P_qrl, respectively.
Step 3: Extend P_c by applying chaotic sequences to each individual in P_c using expressions (19) and (20). The size of P_c after extension is N/2.
Step 4: Extend P_qrl by applying the QRL mechanism to each individual in P_qrl using expression (21). The size of P_qrl after extension is N/2.
Step 5: Calculate the fitness of all individuals from P_c and P_qrl.
Step 6: Sort all solutions from P_c ∪ P_qrl according to fitness.
Step 7: Select the N best solutions as the initial population P.
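Two building blocks of the initialization above, the logistic chaotic sequence of Equation (19) and the QRL operator of Equation (21), can be sketched as follows. This is a hedged illustration: the exact chaotic perturbation of Equation (20) is not reproduced, and the helper names are hypothetical.

```python
import numpy as np

def chaotic_seq(n, theta0=0.7, mu=4.0):
    """Logistic-map sequence: theta_{i+1} = mu * theta_i * (1 - theta_i)."""
    theta = np.empty(n)
    theta[0] = theta0
    for i in range(1, n):
        theta[i] = mu * theta[i - 1] * (1.0 - theta[i - 1])
    return theta

def qrl(x, lb, ub, rng):
    """Quasi-reflexive-opposite solution: every component drawn uniformly
    between the domain midpoint (lb + ub) / 2 and the original value."""
    mid = (lb + ub) / 2.0
    lo = np.minimum(mid, x)
    hi = np.maximum(mid, x)
    return rng.uniform(lo, hi)
```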
The s2 simply combines the search expressions of the FA and SCA algorithms (Equations (13) and (17), respectively): in each iteration, every individual is evolved either by the FA or by the SCA search. Finally, s1 and s2 execute independently, with each swarm evolving its own population of candidate solutions.
The s 1 and s 2 search processes are shown in Algorithms 2 and 3, respectively.
Algorithm 2 Search process of s1 (the ABC algorithm).
for each solution X_i do
    perform the employed bee phase according to Equation (6)
    perform the onlooker bee phase according to expressions (6) and (7)
end for
perform the scout bee phase (explicit exploration) according to expression (4)

Algorithm 3 Search process of s2 (LLH between FA and SCA).
for each solution X_i do
    if rand(0, 1) > 0.5 then
        evolve X_i by the FA search, expression (13)
    else
        evolve X_i by the SCA search, expression (17)
    end if
end for

However, as noted above, in order to facilitate the search, after ψ iterations the knowledge exchange mechanism (KEM), which shares information about the search region between s1 and s2, is triggered. It is executed in the following way in every subsequent iteration: if rand(0, 1) > kef, replace the worst solution of s1 (X_{w,s1}) with the best individual of s2 (X_{b,s2}), and vice-versa. This mechanism may, however, also cause problems: if the exchange of solutions between swarms is triggered too early and/or too frequently, the diversity of the swarms may be lost and local optimal solutions may be returned. This scenario is mitigated by two additional control parameters, ψ and kef. The kef (knowledge exchange frequency) controls how often the KEM is triggered once the condition t > ψ, where t is the current iteration counter, has been satisfied.
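The KEM step can be sketched as follows, assuming fitness is maximized, so the worst individual of a swarm is the one with the lowest fitness (function and parameter names are illustrative):

```python
import numpy as np

def maybe_exchange(pop1, fit1, pop2, fit2, t, psi, kef, rng):
    """After psi iterations, with the stated probability, swap each swarm's
    worst solution for the other swarm's best (and vice-versa)."""
    if t > psi and rng.uniform(0.0, 1.0) > kef:
        w1, b1 = np.argmin(fit1), np.argmax(fit1)
        w2, b2 = np.argmin(fit2), np.argmax(fit2)
        pop1[w1] = pop2[b2].copy()  # best of s2 replaces worst of s1
        pop2[w2] = pop1[b1].copy()  # best of s1 replaces worst of s2
    return pop1, pop2
```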
High-level inner workings of proposed MS-AFS are described in Algorithm 4.

Algorithm 4 High-level MS-AFS pseudo-code.
Initialize global parameters: t = 0, T, and N.
Initialize the control parameters of the ABC, FA, and SCA meta-heuristics.
Generate the initial population P according to Algorithm 1.
Determine the populations of s1 and s2 (P1 and P2, respectively).
while t ≤ T do
    Execute s1 according to Algorithm 2
    Execute s2 according to Algorithm 3
    if t > ψ then
        if rand(0, 1) > kef then
            Trigger the KEM mechanism
        end if
    end if
    t = t + 1
end while
Return X_best

Computational Complexity, MS-AFS Solutions' Encoding for ELM Tuning and Flow-Chart
Because the most computationally costly portion of the swarm intelligence algorithm is the objective evaluation [86], the number of FFEs may be used to assess the complexity of the method.
The proposed MS-AFS does not impose additional FFEs, not even in the initialization phase; therefore, in terms of FFEs, its complexity is given as O(MS-AFS) = O(N + N · T), since N FFEs are spent in initialization and N in each of the T iterations.

However, there is always a trade-off, and the proposed MS-AFS therefore also exhibits some limitations. The major drawback of the MS-AFS method is that the algorithm requires more control parameters: all three of its components, namely the ABC, FA, and SCA, have to be tuned with their respective control parameters. Nevertheless, as shown in Section 4, the proposed MS-AFS is significantly more efficient than the individual algorithms, which justifies the larger number of control parameters.

The plain ELM model is based on a random initial set of input weights and biases, and is consequently vulnerable to several performance drawbacks. More specifically, the plain ELM frequently requires a significant number of neurons, which may be unnecessary and/or sub-optimal. This increase in the number of neurons in the hidden layer can slow down the ELM response when previously unseen data are fed to the network inputs, rendering it impractical for numerous applications.
The proposed hybrid multi-swarm meta-heuristics and ELM model framework utilizes MS-AFS meta-heuristics to optimize the input weights and biases of the ELM model, while the number of neurons in the hidden layer was determined by a simple grid search. The MP generalized inverse has been used to obtain the output weights. Therefore, the proposed hybrid technique is named ELM-MS-AFS.
Each MS-AFS solution consists of nn · fs + nn parameters, where nn and fs denote the number of neurons in the hidden layer and the size of the input feature vector, respectively. For the sake of clarity, a flow-chart of the proposed ELM-MS-AFS is given in Figure 1.
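As an illustration of this encoding, a flat solution vector can be decoded back into the ELM input-weight matrix and bias vector as follows (the helper name and the layout, weights first and biases last, are assumptions):

```python
import numpy as np

def decode_solution(vec, fs, nn):
    """Split a flat vector of length nn * fs + nn into an (fs x nn)
    input-weight matrix and an nn-element hidden-bias vector."""
    W = vec[:nn * fs].reshape(fs, nn)  # one column of weights per hidden neuron
    b = vec[nn * fs:]                  # hidden-layer biases
    return W, b
```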

Experiments
This section first describes the datasets used in the experiments, followed by the metrics that were used to evaluate the results. Finally, this section provides the obtained results and their comparative analysis with other similar cutting-edge methods.

Datasets
The experiments in this research were performed on seven well-known UCI (University of California, Irvine) benchmark datasets, namely Diabetes, Heart Disease, Iris, Wine, Wine Quality, Satellite, and Shuttle, which can be retrieved from https://archive.ics.uci.edu/ml/datasets.php (accessed on 15 May 2022).
Their characteristics have been summarized in Table 1. The Pima Indians Diabetes dataset is utilized in diabetes diagnostics, to determine whether the patient is positive or not. The dataset comprises 768 patterns belonging to two distinct classes. The Heart Disease dataset comprises 270 patterns, with 13 attributes and two classes, which indicate whether the patient has a heart disease or not. The third dataset, the Fisher Iris dataset, consists of measurements of three flower species (viz. Setosa, Virginica, and Versicolor). The Iris dataset comprises three classes, each with fifty samples. The Wine dataset comprises 178 samples belonging to three sorts of wines; it was created through chemical analyses performed on wines produced from grapes grown in the same region of Italy, but by three different cultivators. The fifth dataset, Wine Quality, deals with sorts of the Portuguese "Vinho Verde" wines, where the quality of the wines is modeled by the results obtained with physiochemical testing. The Satellite image dataset comprises the multi-spectral pixel values located in 3 × 3 neighborhood areas of satellite images. This dataset is also available in the UCI repository (https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) (accessed on 15 May 2022)), where it is stated that it has seven classes; however, it actually has just six classes, as reported in Table 1. Finally, the seventh dataset, Shuttle, relates to the placement of the radiators on board the Space Shuttle, and it comprises 58,000 samples, with nine attributes, separated into seven classes.
All datasets have been divided into training and testing groups. The Satellite and Shuttle datasets are available with predetermined train and test subsets, and they were used accordingly. The Diabetes, Heart Disease, Iris, Wine, and Wine Quality datasets do not have predetermined training and testing subsets, as each of them comes as a single dataset. Therefore, these five datasets were subsequently separated into training and testing subsets by utilizing 70% of the data for training and 30% for testing. Since most of the datasets are imbalanced, the data were split in a stratified fashion to maintain the same proportions of class labels in the training and testing subsets as in the input dataset.
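The 70/30 stratified split can be sketched without external libraries as follows; this is a simplified illustration, and the actual experiments may rely on a library routine.

```python
import numpy as np

def stratified_split(y, test_frac=0.30, seed=None):
    """Return train/test index arrays that keep per-class proportions."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # all samples of this class
        rng.shuffle(idx)
        n_test = int(round(test_frac * idx.size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)
```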
Visualization of class distributions in the employed datasets is provided in Figures 2 and 3 for Diabetes, Disease, Iris, Wine, and Wine Quality before split into training and testing subsets and for Satellite and Shuttle with already predetermined training and testing groups, respectively.

Metrics
In order to evaluate the performance of the proposed MS-AFS, it is required to measure it accurately and precisely. The common approach to evaluating machine learning models is based on false positives (FP) and false negatives (FN), along with true positives (TP) and true negatives (TN), which are used to verify the classification accuracy, as defined by the general formula given in Equation (23).
By utilizing TP, TN, FP, and FN, the model's precision, recall (sensitivity), and F-measure can easily be determined by applying the formulas given in Equations (24)-(26):

Precision = TP/(TP + FP)

Recall (sensitivity) = TP/(TP + FN)

F-measure = 2 · Precision · Recall/(Precision + Recall)

The precision and recall measurements are especially important for imbalanced datasets.
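For illustration, the metrics above can be computed directly from the four confusion counts; the equation numbers in the comments follow the paper's numbering under the conventional definitions, and the counts themselves are hypothetical:

```python
def accuracy(tp, tn, fp, fn):        # general formula, Equation (23)
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):               # Equation (24)
    return tp / (tp + fp)

def recall(tp, fn):                  # Equation (25), also called sensitivity
    return tp / (tp + fn)

def f_measure(tp, fp, fn):           # Equation (26): harmonic mean of the two
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion counts for a binary classifier
tp, tn, fp, fn = 80, 85, 15, 20
print(accuracy(tp, tn, fp, fn))      # 0.825
```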

Experimental Results and Comparative Analysis with Other Cutting-Edge Meta-Heuristics
The performance of the suggested method has been evaluated by utilizing a similar experimental setup to that proposed in the referenced paper [2]. The proposed method has been validated and compared against the basic versions of the algorithms that were used to create the multi-swarm method: ABC [23], FA [24], and SCA [74]. Additionally, the elaborated algorithm has been compared to the bat algorithm (BA) [87], Harris hawks optimization (HHO) [88], the whale optimization algorithm (WOA) [27], and invasive weed optimization (IWO) [19], which were also used in [2]. It is important to note that all meta-heuristics included in the experiments were independently implemented by the authors, and these results are reported in the tables. Additionally, to emphasize that the meta-heuristics were applied to ELM tuning, each approach is shown with the prefix 'ELM'.
All meta-heuristics included in the comparative analysis were tested with the optimal (sub-optimal) parameter values suggested in their original papers. Values of the MS-AFS-specific parameters were determined empirically and were set as follows for all simulations: ψ = T/5 and k_ef = 0.6.
In paper [2], simulations were executed with 20 solutions in the population (N = 20) and the termination condition was limited to 100 iterations (T = 100). However, in this research, a lower number of neurons in the hidden ELM layer was employed for all observed datasets.
In the proposed research, a simple grid search was applied to determine the optimal (sub-optimal) number of neurons across all datasets. The search was performed over 10-200 neurons with a step size of 10, and it was observed that, on average over all datasets, the best performance was obtained with 30, 60, and 90 neurons. Therefore, in this research, simulations with 30, 60, and 90 neurons are conducted to evaluate the performance of the proposed ELM-MS-AFS model.
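The grid-search step can be illustrated with a minimal ELM sketch (random hidden weights and biases, output weights via the Moore-Penrose pseudoinverse); the synthetic data and helper functions below are our own stand-ins, not the paper's actual framework or datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, n_hidden):
    """Basic ELM: random hidden weights/biases, output weights via pinv."""
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden-layer output
    beta = np.linalg.pinv(H) @ Y                     # Moore-Penrose solution
    return W, b, beta

def elm_accuracy(model, X, y):
    W, b, beta = model
    pred = (np.tanh(X @ W + b) @ beta).argmax(axis=1)
    return (pred == y).mean()

# Synthetic two-class toy data, a stand-in for the UCI sets used in the paper
X = rng.standard_normal((300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]                                     # one-hot targets
Xtr, Ytr = X[:210], Y[:210]
Xte, yte = X[210:], y[210:]

# Grid search over hidden-layer sizes, 10-200 with step 10, as in the paper
best = max(range(10, 201, 10),
           key=lambda n: elm_accuracy(elm_train(Xtr, Ytr, n), Xte, yte))
print("best hidden size:", best)
```

Only the closed-form output weights beta are learned; the hidden layer stays random, which is exactly why the meta-heuristic tuning of those weights and biases pays off.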
However, in this research, all methods were tested by employing a substantially lower number of iterations than in [2]. All methods were tested with N = 20 and T = 20 in 50 independent runs, and the best, worst, and mean accuracy, along with the standard deviation, are reported in Tables 2-4 for 30, 60, and 90 neurons, respectively. The basic ELM was also tested on each dataset in 50 independent runs. The findings from Tables 2-4 demonstrate the superior performance of the meta-heuristics-based ELMs over the basic ELM. It can be noted that the plain ELM exhibited high standard deviations on all datasets, for 30, 60, and 90 neurons, which was expected as the weights are initialized in a random fashion, without any kind of "intelligence". The proposed ELM-MS-AFS approach produced the best results by far among the meta-heuristics-based ELMs. In the case of 30 neurons in the hidden layer, depicted in Table 2, ELM-MS-AFS obtained the best results in terms of best, worst, and mean accuracies on five datasets (Diabetes, Disease, Wine Quality, Satellite, and Shuttle), while being tied for first place on two occasions (Iris and Wine). Similar trends are observed in the case of 60 neurons (Table 3), where ELM-MS-AFS achieved the best results in terms of best, worst, and mean accuracies on four datasets (Diabetes, Disease, Wine Quality, and Satellite), and was tied for first position on the Iris and Wine datasets. ELM-MS-AFS also obtained the highest best accuracy in the case of the Shuttle dataset. Finally, in the experiments with 90 neurons in the hidden layer, shown in Table 4, the proposed ELM-MS-AFS obtained the best results on five datasets (Diabetes, Disease, Wine Quality, Satellite, and Shuttle), and was tied for first place on the Wine dataset.
Another interesting conclusion can be derived from the performance obtained with different numbers of neurons in the hidden layer. For example, for the ELM-MS-AFS approach, performance rises with an increased number of neurons on some datasets, as can be seen for the Disease dataset, where ELM-MS-AFS achieved an average accuracy of 91.80% with 30 neurons, 92.86% with 60 neurons, and 95.70% with 90 neurons. Similar patterns can be observed for the Satellite dataset. On the other hand, on the Diabetes dataset, ELM-MS-AFS achieved its best performance, an average accuracy of 84.63%, with 30 neurons; a drop to 81.60% in average accuracy can be seen with 60 neurons, followed by an increase to 83.55% with 90 neurons. Finally, for the Wine Quality dataset, ELM-MS-AFS achieved its best performance, an accuracy of 67.60%, with 60 neurons in the hidden layer. A further increase in the number of neurons did not result in increased accuracy, as the average accuracy drops to 67.40% when the network is enlarged to 90 neurons. This is a classic example of the over-fitting issue, where increasing the number of neurons reduces the generalization capabilities of the model and results in a network that learns the training data too well and under-performs on the test data.
As already noted above, for imbalanced datasets the accuracy metric alone is not sufficient to gain insight into classification results; therefore, in Tables 5-7, the macro-averaged precision, recall, and f1-score metrics, obtained by the ELMs tuned with the meta-heuristic approaches for the best run, are also shown for the experiments with 30, 60, and 90 neurons, respectively. All these metrics were extracted from the classification report.
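Macro averaging, as used in Tables 5-7, computes each metric per class and then averages them with equal class weights, which prevents majority classes from dominating the score. A minimal sketch (the labels below are hypothetical):

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and f1-score: each metric is
    computed per class and then averaged with equal class weights."""
    per_class = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(per_class)
    return tuple(sum(m) / n for m in zip(*per_class))

# Hypothetical three-class predictions, with class 2 as a minority class
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 0]
p, r, f = macro_metrics(y_true, y_pred)
```

Because every class contributes 1/n of the average, a poorly classified minority class pulls the macro score down just as much as a majority class would.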
In order to better visualize the performance and the convergence speed of the classification error rate of the proposed ELM-MS-AFS method, convergence graphs for all seven datasets, for the cases of 30, 60, and 90 neurons, are shown in Figure 4, together with the compared algorithms. It is evident that the proposed method converges much faster than the other approaches on most of the datasets. Additionally, it can be observed that the proposed MS-AFS has an initial advantage due to the chaotic and QRL initialization. Finally, the visualization of the obtained metrics is further shown in Figure 5, where the confusion matrices and precision-recall (PR) curves generated by the proposed ELM-MS-AFS for selected simulations are presented.

Statistical Tests
In this section, the findings of the statistical tests conducted for the simulations shown in Section 4.3 are presented, with the goal of establishing whether the performance improvements of the proposed ELM-MS-AFS over other state-of-the-art meta-heuristics are statistically significant.
All statistical tests were performed by taking the best values of all methods obtained in all three simulations, with 30, 60, and 90 neurons in the hidden layer. In order to determine whether the generated improvements are statistically significant, the Friedman Aligned test [89,90], a two-way variance analysis by ranks, has been employed. By analyzing the test results, a conclusion can be drawn as to whether there is a significant difference in results between the proposed ELM-MS-AFS and the other methods encompassed by the comparison. The Friedman Aligned test results for the eight compared algorithms on seven datasets are presented in Table 8.
The results presented in Table 8 statistically indicate that the proposed ELM-MS-AFS algorithm has superior performance when compared to the other seven algorithms, with an average rank value of 9.5. The second-best performance was achieved by the ELM-HHO algorithm, which scored an average rank of 24.36, while the ELM-IWO algorithm obtained an average rank of 27.64 in third place. The basic ELM-ABC, ELM-FA, and ELM-SCA meta-heuristics obtained average ranks of 34.21, 31.07, and 32.5, respectively. Additionally, the Friedman Aligned statistic (χ²_r = 18.49) is greater than the χ² critical value with seven degrees of freedom (14.07) at significance level α = 0.05. As a result, the null hypothesis (H0) can be rejected and it can be stated that the suggested ELM-MS-AFS achieved results that are significantly different from the other seven algorithms. Finally, a non-parametric post-hoc procedure, Holm's step-down procedure, was also conducted and is presented in Table 9. In this procedure, all methods are sorted according to their p-value and compared with α/(k − i), where k and i represent the degree of freedom (in this work k = 10) and the algorithm number after sorting according to the p-value in ascending order (which corresponds to rank), respectively. In this study, α is set to 0.05 and 0.1. Additionally, it is noted that the p-value results are provided in scientific notation.
The results given in Table 9 suggest that the proposed algorithm significantly outperformed all opponent algorithms at both significance levels, α = 0.1 and α = 0.05.
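For illustration, Holm's step-down procedure described above can be sketched as follows; the p-values here are hypothetical, not those reported in Table 9, and we index ranks from 0 so that the smallest p-value is compared against α/k:

```python
def holm_step_down(p_values, alpha=0.05, k=10):
    """Holm's step-down: sort p-values ascending and compare the i-th
    smallest (i = 0, 1, ...) against alpha / (k - i), following the
    paper's alpha / (k - i) rule with k = 10."""
    decisions = []
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    for i, (name, p) in enumerate(ordered):
        threshold = alpha / (k - i)          # threshold loosens as i grows
        decisions.append((name, p, threshold, p <= threshold))
    return decisions

# Hypothetical p-values for illustration (not the values from Table 9)
p_vals = {"ELM-ABC": 1.2e-4, "ELM-FA": 3.5e-4, "ELM-SCA": 2.1e-3,
          "ELM-BA": 8.0e-3, "ELM-HHO": 1.9e-2}
for name, p, thr, reject in holm_step_down(p_vals):
    print(f"{name}: p = {p:.1e}, threshold = {thr:.4f}, reject H0: {reject}")
```

Once one hypothesis fails to be rejected, all later (larger) p-values in the sorted order fail as well, which is the step-down character of the procedure.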

Hybridization by Pairs
Although the reasons for combining the ABC, FA, and SCA meta-heuristics in a multi-swarm approach are elaborated in Section 3.2.1, for the purpose of this research, additional methods were implemented to prove that combining two algorithms is not as effective as joining three. Therefore, the following high-level hybridization (HLH) teamwork mode optimizers were implemented: ABC-FA, ABC-SCA, and FA-SCA.
All methods have the same properties as the MS-AFS meta-heuristics: they employ chaotic and QRL population initialization and the KEM procedure controlled by the k_ef and ψ control parameters (for more details, please refer to Section 3.2.2). During the initialization phase, the N/2 worst individuals are included in population s1, which is guided by the ABC algorithm in the case of the ABC-FA and ABC-SCA approaches, and by the FA algorithm in the case of the FA-SCA method. It is also worth mentioning that all three additional hybrid methods have the same computational complexity as the MS-AFS.
It should be noted that the hybrid between FA and SCA is established as an HLH, not as an LLH as in the case of the MS-AFS, because only in this way can two populations controlled by different methods be generated. Alternatively, establishing an LLH between FA and SCA would not render a fair comparison with the MS-AFS, because the KEM procedure could not be implemented. Naturally, the three methods above can be combined in various other ways, but exploring all hybridization possibilities would go far beyond the scope of our research.
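Under the description above, the HLH teamwork structure can be sketched at a high level: two sub-populations, each evolved by its own optimizer, with a periodic exchange of solutions standing in for the KEM procedure. The update operators below are generic placeholders, not the actual ABC/FA/SCA equations, and the exchange rule is a deliberate simplification; the defaults psi = 4 and kef = 0.6 mirror ψ = T/5 (with T = 20) and k_ef = 0.6 from the experimental setup:

```python
import random

def hlh_teamwork(fitness, dim, n=20, t=20, psi=4, kef=0.6,
                 step_a=None, step_b=None, seed=1):
    """Simplified HLH teamwork sketch: two sub-populations evolved by
    different (placeholder) operators, with a periodic solution exchange
    every psi iterations as a stand-in for the KEM procedure."""
    rng = random.Random(seed)
    step_a = step_a or (lambda x: [v + rng.gauss(0, 0.1) for v in x])
    step_b = step_b or (lambda x: [v * 0.9 + rng.gauss(0, 0.05) for v in x])
    pop = sorted(([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)),
                 key=fitness)
    s1, s2 = pop[n // 2:], pop[:n // 2]      # s1 holds the N/2 worst individuals
    for it in range(1, t + 1):
        # Greedy update: keep whichever of current/candidate is fitter
        s1 = sorted((min(x, step_a(x), key=fitness) for x in s1), key=fitness)
        s2 = sorted((min(x, step_b(x), key=fitness) for x in s2), key=fitness)
        if it % psi == 0:                    # placeholder exchange step
            swap = int(kef * len(s1))
            s1[-swap:], s2[-swap:] = s2[:swap], s1[:swap]
    return min(s1 + s2, key=fitness)

# Minimizing the sphere function as a toy objective
best = hlh_teamwork(lambda x: sum(v * v for v in x), dim=5)
```

In the actual ELM-tuning context, each candidate solution would encode the hidden-layer weights and biases, and the fitness would be the classification error of the resulting ELM.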
The same experimental ELM tuning setup as in the basic experiment (Section 4.3) was established, and the same control parameter values as for ELM-MS-AFS were used for ELM-ABC-FA, ELM-ABC-SCA, and ELM-FA-SCA. The additionally implemented methods were validated only on three of the more challenging datasets from the previous experiment, Wine Quality, Satellite, and Shuttle, with 30, 60, and 90 neurons.
However, with the aim of gaining more insight into the performance of the proposed ELM-MS-AFS, one more challenging dataset was included in the current comparison. The newly utilized NSL-KDD dataset is an improved version of the KDD'99 dataset for network intrusion detection, and it has been widely used in the modern literature [91][92][93]. However, according to the authors' findings, the ELM has never been applied to this dataset before.
Predefined training and testing sets for NSL-KDD, as well as its description, can be retrieved from the following URL: https://unb.ca/cic/datasets/nsl.html (accessed on 15 May 2022). The dataset includes 148,517 instances in total, with 41 mixed numerical and categorical features and five classes. Class 0 represents normal network traffic (no intrusion), while the other four classes denote malicious types of network traffic (Probe, DoS, U2R, and R2L). For training the ELM, all categorical features were transformed into binary indicator attributes using a one-hot encoding (OHE) scheme, resulting in a dataset with 122 attributes; the other features are normalized. It should also be emphasized that the NSL-KDD dataset is highly imbalanced (Figure 6), and in the conducted experiments it was used as such.
Following the setup from the previous experiments, all hybrids were tested with N = 20 and T = 20 in 50 independent runs, and the best, mean, and worst accuracy, along with the standard deviation, for all four datasets with 30, 60, and 90 ELM neurons are reported in Table 10. Detailed performance indicators for the best run, in terms of macro-averaged precision, recall, and f1-score, are shown in Table 11. In both tables, the best achieved results are denoted in bold.
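The one-hot encoding step applied to NSL-KDD's categorical features can be illustrated as follows; the toy rows and feature names below are hypothetical, merely mimicking NSL-KDD's mix of numeric and categorical columns, not its actual schema:

```python
def one_hot_columns(rows, categorical_idx):
    """Expand categorical columns into binary indicator columns (OHE),
    leaving numeric columns unchanged."""
    # Collect the vocabulary of each categorical column
    vocab = {i: sorted({row[i] for row in rows}) for i in categorical_idx}
    encoded = []
    for row in rows:
        new_row = []
        for i, value in enumerate(row):
            if i in vocab:
                # One indicator column per category value
                new_row.extend(1 if value == v else 0 for v in vocab[i])
            else:
                new_row.append(value)
        encoded.append(new_row)
    return encoded

# Toy rows mimicking NSL-KDD's mixed features: (duration, protocol, flag)
rows = [(0.0, "tcp", "SF"), (1.0, "udp", "SF"), (2.0, "icmp", "REJ")]
encoded = one_hot_columns(rows, categorical_idx={1, 2})
# Each row now has 1 numeric + 3 protocol + 2 flag indicator columns
```

Applied to the real dataset, this kind of expansion is what grows the 41 original attributes into the 122 used for ELM training.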
Convergence speed graphs for all additional simulations are shown in Figure 7. From the provided simulation results, as well as from the convergence graphs, it can clearly be stated that the proposed ELM-MS-AFS on average exhibits superior performance over the ELM-ABC-FA, ELM-ABC-SCA, and ELM-FA-SCA hybrid meta-heuristics; therefore, the assumption that combining three approaches renders better performance than joining two methods is justified. It is also interesting to notice that, on average, when all simulations are taken into account, ELM-ABC-FA and ELM-FA-SCA are close in terms of performance, while ELM-ABC-SCA achieves slightly worse results. Finally, compared with the metrics established by the other state-of-the-art swarm approaches, shown in the tables in Section 4.3, all three hybrid meta-heuristics on average proved to be more efficient and robust optimizers than the standard, non-hybridized algorithms.
Additionally, since the NSL-KDD dataset is highly imbalanced, the PR curves for all four hybrid methods for the simulations with 30, 60, and 90 ELM neurons are shown in Figure 8. From this visualization, it can also be concluded that the ELM-MS-AFS on average manages to better classify the minority classes.

Conclusions
This paper proposes a novel approach to ELM optimization by swarm intelligence meta-heuristics. For this purpose, a novel multi-swarm algorithm has been implemented by combining three well-known algorithms: ABC, FA, and SCA. The goal of this hybrid method was to combine the strengths of each individual algorithm and compensate for their weaknesses. The new multi-swarm meta-heuristic has been named MS-AFS, and it was subsequently used to optimize the weights and biases in the ELM model. The number of hidden ELM neurons was not subjected to optimization, as a simple grid search was employed to determine the optimal number of neurons.
To validate the new ELM-MS-AFS technique, thorough simulations were conducted with seven UCI benchmark datasets, with 30, 60, and 90 neurons in the hidden layer. The results have been compared to the basic ELM and to seven other cutting-edge meta-heuristics-based ELMs. The proposed ELM-MS-AFS method has proven superior to the other methods included in the analysis, as confirmed by the statistical tests employed to determine the significance of the improvements of the proposed method.
Additionally, to prove that combining two algorithms is not as effective as joining three, hybrids generated by pairing two of the methods employed in the proposed multi-swarm approach were also implemented and validated against four challenging datasets. From the obtained simulation results, it was concluded that the proposed ELM-MS-AFS on average exhibits superior performance over the ELM-ABC-FA, ELM-ABC-SCA, and ELM-FA-SCA hybrid meta-heuristics; therefore, the assumption that combining three approaches renders better performance than joining two methods is justified.
Future research in this area will include extensive testing of the proposed ELM-MS-AFS approach on other benchmark and real-life datasets, and employing it in various application domains. Additionally, the number of neurons in the hidden layer will also be subjected to the optimization process. Finally, the proposed MS-AFS meta-heuristics will be tested and employed for solving NP-hard tasks in other domains, such as wireless sensor networks and cloud-based systems.

Data Availability Statement: All datasets used in this study are public and available on the UCI repository at the following URL: https://archive.ics.uci.edu/ml/datasets.php, accessed on 15 May 2022. Preprocessed datasets along with sample code are available at the following GitHub link: https://github.com/nbacanin/sensorsELM2022, accessed on 15 May 2022.